hw04

pdf

School

University of Alabama *

Course

404

Subject

Statistics

Date

Feb 20, 2024

Type

pdf

Pages

32

Uploaded by ChiefHeat13487

0.0.1 Question 1a

What is the granularity of the data (i.e., what does each row represent)?

Hint: Examine all variables present in the dataset carefully before answering this question!

Each row represents one hour of activity in the bike-sharing system: the rider counts recorded during that hour, not an individual ride or rider.
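A granularity claim like the one in Question 1a can be checked directly. The sketch below uses a tiny made-up table, and the date column name `dteday` is an assumption for illustration (only `hr`, `casual`, and `registered` appear in this notebook). If each row really is one hour, no (date, hour) pair should occur more than once.

```python
import pandas as pd

# Toy stand-in for the hourly table; 'dteday' is a hypothetical date column.
bike_toy = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-01", "2011-01-02"],
    "hr": [0, 1, 0],
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
})

# Count how many rows share each (date, hour) key.
rows_per_hour = bike_toy.groupby(["dteday", "hr"]).size()

# Hourly granularity means every key appears exactly once.
is_hourly = bool((rows_per_hour == 1).all())
print(is_hourly)
```

A `False` result would indicate either duplicate records or a finer granularity than one row per hour.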
0.0.2 Question 1b

For this assignment, we’ll be using this data to study bike usage in Washington, DC. Based on the granularity and the variables present in the data, what might some limitations of using this data be? What are two additional data categories/variables that one could collect to address some of these limitations?

Because each row aggregates an hour of activity, the data says nothing about individual trips. For example, there is no variable recording how long riders spent riding. One way to address this is to collect two additional variables: the average ride duration for casual users and the average ride duration for registered users.
0.0.3 Question 3a

Use the sns.histplot (documentation) function to create a plot that overlays the distribution of the daily counts of bike users, using blue to represent casual riders, and green to represent registered riders. The temporal granularity of the records should be daily counts, which you should have after completing question 2.c. In other words, you should be using daily_counts to answer this question.

Hints:
- You will need to set the stat parameter appropriately to match the desired plot.
- The label parameter of sns.histplot allows you to specify, as a string, how the plot should be labeled in the legend. For example, passing in label="My data" would give your plot the label “My data” in the legend.
- You will need to make two calls to sns.histplot.

Include a legend, xlabel, ylabel, and title. Read the seaborn plotting tutorial if you’re not sure how to add these.

After creating the plot, look at it and make sure you understand what the plot is actually telling us, e.g., on a given day, the most likely number of registered riders we expect is ~4000, but it could be anywhere from nearly 0 to 7000.

For all visualizations in Data 100, our grading team will evaluate your plot based on its similarity to the provided example. While your plot does not need to be identical to the example shown, we do expect it to capture its main features, such as the general shape of the distribution, the axis labels, the legend, and the title. It is okay if your plot contains small stylistic differences, such as differences in color, line weight, font, or size/scale.

In [57]:
    # Overlay the two daily-count distributions on a shared density scale
    sns.histplot(data=daily_counts['casual'], stat='density', color='blue', kde=True, label='casual')
    sns.histplot(data=daily_counts['registered'], stat='density', color='green', kde=True, label='registered')
    plt.xlabel('Rider Count')
    plt.ylabel('Density')
    plt.title('Distribution Comparison of Casual vs Registered Riders')
    plt.legend()

Out[57]: <matplotlib.legend.Legend at 0x7f3dd03c5e10>
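To see what a density-scaled histogram means in Question 3a, the following sketch uses NumPy's `np.histogram` with `density=True` on synthetic data (the numbers are made up, not taken from `daily_counts`). It is offered as an analogue of seaborn's `stat='density'`, under the assumption that both normalize the bars so that their total area is 1, which is what lets two groups of different sizes share one y-axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a column of daily rider counts.
counts = rng.normal(loc=4000, scale=1000, size=500)

# density=True rescales bar heights so that the total bar area is 1.
heights, edges = np.histogram(counts, bins=20, density=True)
widths = np.diff(edges)

# Area of each bar is height * width; summing gives the total area.
total_area = float(np.sum(heights * widths))
print(round(total_area, 6))
```

Because the heights are per-unit densities rather than counts, a tall bar does not by itself mean many days; it means many days per unit of rider count.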
0.0.4 Question 3b

In the cell below, describe the differences you notice between the density curves for casual and registered riders. Consider concepts such as modes, symmetry, skewness, tails, gaps, and outliers. Include a comment on the spread of the distributions.

The distribution of registered riders is roughly symmetric, with left and right tails of similar size. The distribution of casual riders is skewed to the right: its mode sits at the low end of the range, and a long tail extends toward high counts. The registered riders’ distribution is also more spread out than that of casual riders.
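The visual judgments in Question 3b (symmetric versus right-skewed) can be backed by a number. This is a minimal sketch on synthetic data, not on the assignment's `daily_counts`: a normal sample stands in for a symmetric distribution and an exponential sample for a right-skewed one. `Series.skew()` is pandas' sample skewness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic examples: one roughly symmetric, one with a long right tail.
symmetric = pd.Series(rng.normal(loc=4000, scale=1000, size=2000))
right_skewed = pd.Series(rng.exponential(scale=800, size=2000))

# Skewness near 0 suggests symmetry; clearly positive suggests right skew.
skew_sym = symmetric.skew()
skew_right = right_skewed.skew()

# In a right-skewed distribution the long tail pulls the mean above the median.
mean_above_median = right_skewed.mean() > right_skewed.median()
print(skew_sym, skew_right, mean_above_median)
```

Comparing the mean with the median is a quick second check that does not depend on reading the shape off a plot.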
0.0.5 Question 3c

The density plots do not show us how the counts for registered and casual riders vary together. Use sns.lmplot (documentation) to make a scatter plot to investigate the relationship between casual and registered counts. This time, let’s use the bike DataFrame to plot hourly counts instead of daily counts.

The lmplot function will also try and draw a linear regression line (just as you saw in Data 8). Color the points in the scatterplot according to whether or not the day is a working day (your colors do not have to match ours exactly, but they should be different based on whether the day is a working day).

Hints:
* Check out this helpful tutorial on lmplot.
* There are many points in the scatter plot, so make them small to help reduce overplotting. Check out the scatter_kws parameter of lmplot.
* Generate and plot the linear regression line by setting a parameter of lmplot to True. Can you find this in the documentation? We will discuss the concept of linear regression later in the course.
* You can set the height parameter if you want to adjust the size of the lmplot.
* Add a descriptive title and axis labels for your plot.
* You should be using the bike DataFrame to create your plot.
* It is okay if the scales of your x and y axis (i.e., the numbers labeled on the two axes) are different from those used in the provided example.

In [58]:
    sns.set(font_scale=1)  # Apply the seaborn theme; font_scale=1 keeps the default font size
    # fit_reg=True draws one regression line per hue level; small markers reduce overplotting
    sns.lmplot(data=bike, x='casual', y='registered', hue='workingday', fit_reg=True, height=6, scatter_kws={'s': 2, 'alpha': 1})
    plt.title('Comparison of Casual vs Registered Riders on Working and Non-working Days')

Out[58]: Text(0.5, 1.0, 'Comparison of Casual vs Registered Riders on Working and Non-working Days')
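When `hue` is set, the lmplot in Question 3c draws a separate regression line for each group. The sketch below reproduces that idea by hand on a made-up table, using `np.polyfit` as a stand-in least-squares fit; it is not a description of seaborn's internals. The toy values are chosen so that the working-day group follows registered = 4 * casual + 5 and the other group follows registered = casual + 2.

```python
import numpy as np
import pandas as pd

# Made-up hourly counts for two groups with different linear trends.
toy = pd.DataFrame({
    "casual":     [0, 10, 20, 30, 0, 10, 20, 30],
    "registered": [5, 45, 85, 125, 2, 12, 22, 32],
    "workingday": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Fit one straight line per group, as a per-hue regression would.
fits = {}
for flag, group in toy.groupby("workingday"):
    slope, intercept = np.polyfit(
        group["casual"].to_numpy(), group["registered"].to_numpy(), deg=1
    )
    fits[int(flag)] = (float(slope), float(intercept))

print(fits)
```

Comparing the fitted slopes numerically is a useful complement to the plot when overplotting makes the two lines hard to tell apart.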
0.0.6 Question 3d

What does this scatterplot seem to reveal about the relationship (if any) between casual and registered riders and whether or not the day is on the weekend? What effect does overplotting have on your ability to describe this relationship?

There appears to be a positive correlation between the number of casual and registered riders. The regression line relating registered counts to casual counts is steeper on working days than on non-working days. Overplotting makes this relationship harder to describe: so many points overlap that the two groups hide each other, and it is difficult to see where most of the data actually lies.
0.0.7 Question 4a (Bivariate Kernel Density Plot)

Generate a bivariate kernel density plot with workday and non-workday separated using the daily_counts DataFrame.

Hints: You only need to call sns.kdeplot once. Take a look at the hue parameter and adjust other inputs as needed.

After you get your plot working, experiment by setting fill=True in kdeplot to see the difference between the shaded and unshaded versions. Please submit your work with fill=False.

In [60]:
    # Set the figure size for the plot
    plt.figure(figsize=(12, 8))
    sns.kdeplot(data=daily_counts, x='casual', y='registered', hue='workingday')
    plt.title('Bivariate KDE Plot Comparison of Registered vs Casual Riders');
0.0.8 Question 4b

With some modification to your 4a code (this modification is not in scope), we can generate the plot above. In your own words, describe what the lines and the color shades of the lines signify about the data. What does each line and color represent?

Hint: You may find it helpful to compare it to a contour or topographical map as shown here.

Each contour line connects points where the estimated probability density is the same, much like an elevation line on a topographic map. The shade of a line indicates its density level, with darker shades corresponding to regions of higher density. Each color corresponds to one group of days (working days or non-working days).
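The "height" that the contours in Question 4b trace is an estimated density. The sketch below computes such a density by hand with a simple two-dimensional Gaussian kernel estimate on synthetic points. The helper `kde2d` and its single fixed bandwidth are assumptions made for illustration; seaborn chooses its bandwidth differently, so this shows the idea rather than reproducing its output.

```python
import numpy as np

def kde2d(points, query, bandwidth):
    """Hypothetical helper: average of isotropic Gaussian bumps, one per data point."""
    # Pairwise differences between query locations and data points.
    diffs = query[:, None, :] - points[None, :, :]   # shape (n_query, n_points, 2)
    sq_dist = np.sum(diffs ** 2, axis=-1)
    kernel = np.exp(-sq_dist / (2 * bandwidth ** 2))
    norm = 2 * np.pi * bandwidth ** 2                # normalizes each 2-D Gaussian
    return kernel.mean(axis=1) / norm

rng = np.random.default_rng(2)
# Synthetic (casual, registered) pairs clustered around (1000, 4000).
points = rng.normal(loc=[1000.0, 4000.0], scale=[200.0, 200.0], size=(300, 2))

# Evaluate the density at the cluster centre and at a far-away location.
query = np.array([[1000.0, 4000.0], [2000.0, 6000.0]])
dens_center, dens_far = kde2d(points, query, bandwidth=150.0)
print(dens_center > dens_far)
```

A contour line is then the set of locations where this estimated density takes one fixed value, so inner contours enclose the regions where observations are most concentrated.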
0.0.9 Question 4c

What additional details can you identify from this contour plot that were difficult to determine from the scatter plot?

Regions of high density are much easier to identify in the contour plot than in the overcrowded scatter plot, where overlapping points hide how many observations fall in each area.
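Another way to recover the dense regions that Question 4c says a scatter plot hides is to bin the points and count them. This sketch uses `np.histogram2d` on synthetic data (the centre values 500 and 3500 are made up) and is offered as an alternative technique to the kernel density estimate, not as part of the assignment's solution.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic point cloud with many overlapping observations.
casual = rng.normal(loc=500, scale=100, size=5000)
registered = rng.normal(loc=3500, scale=300, size=5000)

# Count points per cell of a 10 x 10 grid.
counts, x_edges, y_edges = np.histogram2d(casual, registered, bins=10)

# Locate the cell containing the most points.
ix, iy = np.unravel_index(np.argmax(counts), counts.shape)
densest_cell = (
    (x_edges[ix], x_edges[ix + 1]),
    (y_edges[iy], y_edges[iy + 1]),
)
print(densest_cell)
```

The cell counts give the same kind of information as the contours, at the cost of depending on where the bin edges fall.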
0.1 5: Joint Plot

As an alternative approach to visualizing the data, construct the following set of three plots where the main plot shows the contours of the kernel density estimate of daily counts for registered and casual riders plotted together, and the two “margin” plots (at the top and right of the figure) provide the univariate kernel density estimate of each of these variables. Note that this plot makes it harder to see the linear relationships between casual and registered for the two different conditions (weekday vs. weekend). You should be making use of daily_counts.

Hints:
* The seaborn plotting tutorial has examples that may be helpful.
* Take a look at sns.jointplot and its kind parameter.
* set_axis_labels can be used to rename axes on a seaborn grid plot. For example, if we wanted to plot a scatterplot with ‘Height’ on the x-axis and ‘Weight’ on the y-axis from some dataset stats_df, we could write the following:

    graph = sns.relplot(data=stats_df, x='Height', y='Weight')
    graph.set_axis_labels("Height (cm)", "Weight (kg)")

Note:
* At the end of the cell, we called plt.suptitle to set a custom location for the title.
* We also called plt.subplots_adjust(top=0.9) in case your title overlaps with your plot.

In [61]:
    ...
    plt.suptitle("KDE Contours of Casual vs Registered Rider Count")
    plt.subplots_adjust(top=0.9);

<Figure size 2400x1200 with 0 Axes>
0.2 6: Understanding Daily Patterns

0.2.1 Question 6a

Let’s examine the behavior of riders by plotting the average number of riders for each hour of the day over the entire dataset (that is, bike DataFrame), stratified by rider type.

Your plot should look like the plot below. While we don’t expect your plot’s colors to match ours exactly, your plot should have a legend in the plot and different colored lines for different kinds of riders, in addition to the title and axis labels.

In [62]:
    # Average casual and registered counts for each hour of the day
    average_counts = bike.groupby('hr')[['casual', 'registered']].mean()
    sns.lineplot(data=average_counts, x='hr', y='casual', label='casual')
    sns.lineplot(data=average_counts, x='hr', y='registered', label='registered')
    plt.xlabel('Hour of the Day')
    plt.ylabel('Average Count')
    plt.title('Average Count of Casual vs. Registered by Hour')
    plt.legend()

Out[62]: <matplotlib.legend.Legend at 0x7f3dd04fe450>
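The aggregation behind Question 6a can be checked on a table small enough to verify by hand. The values below are made up; the column names `hr`, `casual`, and `registered` match the ones used in the notebook. The sketch averages each rider type per hour and then reads off the hour with the highest registered average.

```python
import pandas as pd

# Made-up hourly records: two observations for each of three hours.
toy = pd.DataFrame({
    "hr":         [8, 8, 17, 17, 3, 3],
    "casual":     [10, 20, 60, 80, 1, 3],
    "registered": [300, 340, 400, 420, 5, 7],
})

# One row per hour, holding the mean count for each rider type.
average_counts = toy.groupby("hr")[["casual", "registered"]].mean()

# The hour whose average registered count is largest.
peak_hour_registered = int(average_counts["registered"].idxmax())
print(average_counts)
print(peak_hour_registered)
```

Note that after `groupby("hr")` the hour becomes the index of `average_counts`, sorted in ascending order, which is why a line plot over it runs left to right through the day.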
0.2.2 Question 6b

What can you observe from the plot? Discuss your observations and hypothesize about the meaning of the peaks in the registered riders’ distribution.

Registered riders show two peaks: one in the morning around 8 AM and one in the late afternoon, between 4 and 5 PM. These peaks align with common work commute hours, when people are either leaving for work or returning from it, which suggests that many registered riders use the bikes to commute. Casual riders show a single, broader peak at approximately 5 PM. The curve for casual riders is much flatter than the curve for registered riders, indicating that casual rides are spread more evenly through the day rather than concentrated around commute hours.
0.2.3 Question 7b

In our case, with the bike ridership data, we want 7 curves, one for each day of the week. The x-axis will be the temperature (as given in the 'temp' column), and the y-axis will be a smoothed version of the proportion of casual riders.

You should use statsmodels.nonparametric.smoothers_lowess.lowess just like the example above. Unlike the example above, plot ONLY the lowess curve. Do not plot the actual data, which would result in overplotting. For this problem, the simplest way is to use a loop.

You do not need to match the colors on our sample plot as long as the colors in your plot make it easy to distinguish which day they represent.

Hints:
* Start by plotting only one day of the week to make sure you can do that first. Then, consider using a for loop to repeat this plotting operation for all days of the week.
* The lowess function expects the y coordinate first, then the x coordinate. You should also set the return_sorted field to False.

You will need to rescale the normalized temperatures stored in this dataset to Fahrenheit values. Look at the section of this notebook titled ‘Loading Bike Sharing Data’ for a description of the (normalized) temperature field to know how to convert back to Celsius first. After doing so, convert it to Fahrenheit. By default, the temperature field ranges from 0.0 to 1.0. In case you need it, $\text{Fahrenheit} = \text{Celsius} \times \frac{9}{5} + 32$.

Note: If you prefer plotting temperatures in Celsius, that’s fine as well! Just remember to convert accordingly so the graph is still interpretable.

In [69]:
    from statsmodels.nonparametric.smoothers_lowess import lowess

    plt.figure(figsize=(10, 8))
    # BEGIN SOLUTION
    for day in bike['weekday'].unique():
        this_day = bike[bike['weekday'] == day].copy()
        # Rescale normalized temperature to Celsius (x 41), then convert to Fahrenheit
        this_day['temp'] = this_day['temp'] * 41 * 9 / 5 + 32
        # lowess takes y first, then x
        ysmooth = lowess(this_day['prop_casual'], this_day['temp'], return_sorted=False)
        sns.lineplot(x=this_day['temp'], y=ysmooth, label=day)
    plt.title("Temperature vs Casual Rider Proportion by Weekday")
    plt.xlabel("Temperature (Fahrenheit)")
    plt.ylabel("Casual Rider Proportion")
    plt.legend();
    # plt.savefig("images/curveplot_temp_prop_casual", bbox_inches='tight', dpi=300);
    # END SOLUTION
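The temperature rescaling in Question 7b is easy to get wrong, so it helps to isolate it. The sketch below wraps the two steps in a small function. The function name `normalized_to_fahrenheit` and the parameter `max_celsius` are hypothetical; the scale factor 41 is taken from the solution cell, which multiplies the normalized value by 41 before applying the Fahrenheit formula.

```python
def normalized_to_fahrenheit(temp_norm, max_celsius=41.0):
    """Hypothetical helper: normalized temperature (0.0 to 1.0) to Fahrenheit."""
    # Step 1: undo the normalization to recover degrees Celsius.
    celsius = temp_norm * max_celsius
    # Step 2: standard Celsius-to-Fahrenheit conversion.
    return celsius * 9 / 5 + 32


# Endpoints and midpoint of the normalized range.
for value in (0.0, 0.5, 1.0):
    print(value, normalized_to_fahrenheit(value))
```

Under this scaling, the normalized range 0.0 to 1.0 corresponds to 0 to 41 degrees Celsius, so the x-axis of the plot should span roughly 32 to 106 degrees Fahrenheit.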
0.2.4 Question 7c

What do you observe in the above plot? How is prop_casual changing as a function of temperature? Do you notice anything else interesting?

As temperature rises, the proportion of casual riders increases. The effect is particularly pronounced on weekends, where the curve climbs sharply with increasing temperature. A likely explanation is that people have more leisure time on weekends, so warm weekend days are when casual riders are most likely to go biking.
0.2.5 Question 8a

Imagine you are working for a bike-sharing company that collaborates with city planners, transportation agencies, and policymakers in order to implement bike-sharing in a city. These stakeholders would like to reduce congestion and lower transportation costs. They also want to ensure the bike-sharing program is implemented equitably. In this sense, equity is a social value that informs the deployment and assessment of your bike-sharing technology.

Equity in transportation includes: Improving the ability of people of different socio-economic classes, genders, races, and neighborhoods to access and afford transportation services and assessing how inclusive transportation systems are over time.

Do you think the bike data as it is can help you assess equity? If so, please explain. If not, how would you change the dataset? You may discuss how you would change the granularity, what other kinds of variables you’d introduce to it, or anything else that might help you answer this question.

Note: There is no single “right” answer to this question; we are looking for thoughtful reflection and commentary on whether or not this dataset, in its current form, encodes information about equity.

The current bike dataset lacks sufficient context for assessing equity: it records how many rides happen each hour but nothing about who is riding. To make it useful for this purpose, we should augment the dataset with information about the users, such as gender, race, and place of origin. Adding these variables would be a first step toward analyzing equity within the bike-sharing system.
0.2.6 Question 8b

Bike sharing is growing in popularity, and new cities and regions are making efforts to implement bike-sharing systems that complement their other transportation offerings. The goals of these efforts are to have bike sharing serve as an alternate form of transportation in order to alleviate congestion, provide geographic connectivity, reduce carbon emissions, and promote inclusion among communities.

Bike-sharing systems have spread to many cities across the country. The company you work for asks you to determine the feasibility of expanding bike sharing to additional cities in the US.

Based on your plots in this assignment, would you recommend expanding bike sharing to additional cities in the US? If so, what cities (or types of cities) would you suggest? Please list at least two reasons why, and mention which plot(s) you drew your analysis from.

Note: There isn’t a set right or wrong answer for this question. Feel free to come up with your own conclusions based on evidence from your plots!

The plot from Question 6a, which shows the hourly average count of registered and casual riders, reveals a notable pattern. The average count of registered riders peaks at typical commute hours, around 8 AM and 5 PM. This pattern suggests that biking can be a practical and cost-efficient alternative that helps reduce traffic congestion during busy commute periods.