hw4

pdf

School

University of Oregon *

Course

102

Subject

Statistics

Date

Apr 27, 2024

Type

pdf

Pages

8

Uploaded by MajorKookaburaMaster1051

hw4

April 26, 2024

[ ]: import otter
     grader = otter.Notebook()

1 Homework 4: Advanced operations in pandas

Due Date: 11:59PM on the date posted to Canvas

Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the homework, we ask that you write your solutions individually. If you do discuss the assignments with other students, please include their names below.

Collaborators: list collaborators here

Grading

Grading is broken down into autograded answers and free response.

For autograded answers, the results of your code are compared to provided and/or hidden tests. For autograded probability questions, the provided tests will only check that your answer is within a reasonable range.

For free response, readers will evaluate how well you answered the question and/or fulfilled the requirements of the question.

For plots, make sure to be as descriptive as possible: include titles, axes labels, and units wherever applicable.

[ ]: import numpy as np
     import pandas as pd
     import matplotlib
     import matplotlib.pyplot as plt
     import seaborn as sns
     'imports completed'

1.1 Introduction

The purpose of this module is to expand your ‘pandas’ skillset by performing various new and old operations on ‘pandas’ dataframes. A lot of these operations will be things you’ve done before in the datascience package, so you should reference the included notebook to translate between the two if need be.
You are expected to answer all relevant questions programmatically, i.e. use indexing and functions/methods to arrive at your answers. Your answers don’t need to be in one single line; you may use as many intermediate steps as you need.

1.1.1 Question 1

Reading in data from a file is made easy in the pandas package. We have included two datasets in your assignment folder to read in, ‘broadway.csv’ and ‘diseases.txt’.

Question 1.1 Read in broadway using pd.read_csv.

[ ]: broadway = ...
     broadway.head(6)

[ ]: grader.check("q1_1")

Question 1.2 Now read in the diseases dataset. Diseases is not a .csv but a .txt file, i.e. a plain-text file. Because it’s not .csv, we can’t assume that the values are comma separated. Fortunately, pd.read_csv can be used on any plain-text file. It may not parse the data correctly at first, but the output may reveal the character that separates entries. Identify the separator used in diseases.txt and use it to successfully read in your data with pd.read_csv.

[ ]: separator = ...
     diseases = pd.read_csv("diseases.txt", sep=...)
     diseases.head(6)

[ ]: grader.check("q1_2")

Question 1.3 Read in the DataFrame called nst-est2016-alldata.csv from the course Github. The url path to the repository is https://github.com/oregon-data-science/DSCI101/raw/main/data/. You should do this with pd.read_csv.

[ ]: pop_census = ...

[ ]: grader.check("q1_3")

This DataFrame gives census-based population estimates for each state on both July 1, 2015 and July 1, 2016. The last four columns describe the components of the estimated change in population during this time interval. For all questions below, assume that the word “states” refers to all 52 rows including Puerto Rico & the District of Columbia.

The data was taken from here. If you want to read more about the different column descriptions, click here!

The raw data is a bit messy - run the cell below to clean the DataFrame and make it easier to work with.
[ ]: # Don't change this cell; just run it.
     pop_sum_level = pop_census['SUMLEV'] == 40
     pop = pop_census[pop_sum_level]
     # grab a numbered list of columns to use
     columns_to_use = pop.columns[[1, 4, 12, 13, 27, 34, 62, 69]]
     pop = pop[columns_to_use]
     pop = pop.rename(columns={'POPESTIMATE2015': '2015',
                               'POPESTIMATE2016': '2016',
                               'BIRTHS2016': 'BIRTHS',
                               'DEATHS2016': 'DEATHS',
                               'NETMIG2016': 'MIGRATION',
                               'RESIDUAL2016': 'OTHER'})
     #pop['REGION'].unique()
     pop['REGION'] = pop['REGION'].replace({'1': 1, '2': 2, '3': 3, '4': 4, 'X': 0})
     pop.head(12)

1.1.2 Question 2 - Census data

Question 2.1 Assign us_birth_rate to the total US annual birth rate during this time interval. The annual birth rate for a year-long period is the total number of births in that period as a proportion of the population size at the start of the time period.

Hint: Which year corresponds to the start of the time period?

[ ]: us_birth_rate = ...
     us_birth_rate

[ ]: grader.check("q2_1")

Question 2.2 Assign movers to the number of states for which the absolute value (np.abs) of the annual rate of migration was higher than 1%. The annual rate of migration for a year-long period is the net number of migrations (in and out) as a proportion of the population size at the start of the period. The MIGRATION column contains estimated annual net migration counts by state.

[ ]: ...
     movers = ...
     movers

[ ]: grader.check("q2_2")

Question 2.3 Assign west_births to the total number of births that occurred in region 4 (the Western US).

Hint: Make sure you double check the type of the values in the region column, and appropriately filter (i.e. the types must match!).
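The hint about matching types can be illustrated on a small made-up table. This is only a sketch: the frame `toy` and its columns `group`, `start` and `events` are hypothetical and are not the census data. It shows that comparing a string-typed column against an integer matches no rows without raising an error, and that a rate is a count divided by the population at the start of the period.

```python
import pandas as pd

# Hypothetical toy data, not the census table from this assignment.
toy = pd.DataFrame({
    "group": pd.Series(["1", "2", "2", "1"], dtype=object),  # stored as strings
    "start": [100, 200, 400, 300],
    "events": [5, 10, 40, 15],
})

# Comparing strings to an integer matches no rows and raises no error.
wrong = toy[toy["group"] == 2]

# Inspect how the values are stored, then compare against the same type.
print(toy["group"].dtype)
print(type(toy["group"].iloc[0]))
right = toy[toy["group"] == "2"]

# A rate is a count divided by the population at the start of the period.
rate = right["events"].sum() / right["start"].sum()
print(len(wrong), len(right), rate)
```

If the column had held integers instead, the comparison against `2` would be the one that matches, so checking the stored type before filtering avoids an empty result in either direction.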
[ ]: west_births = ...
     west_births

[ ]: grader.check("q2_3")

Question 2.4 Assign less_than_west_births to the number of states that had a total population in 2016 that was smaller than the total number of births in region 4 (the Western US) during this time interval.

[ ]: less_than_west_births = ...
     less_than_west_births

[ ]: grader.check("q2_4")

Question 2.5 In the next question, you will be creating a visualization to understand the relationship between birth and death rates. The annual death rate for a year-long period is the total number of deaths in that period as a proportion of the population size at the start of the time period.

What visualization is most appropriate to see if there is an association between birth and death rates during a given time interval?

1. Line Graph
2. Scatter Plot
3. Bar Chart

Assign visualization below to the number corresponding to the correct visualization.

[ ]: visualization = ...

[ ]: grader.check("q2_5")

Question 2.6 In the code cell below, create a visualization that will help us determine if there is an association between birth rate and death rate during this time interval. It may be helpful to create an intermediate DataFrame here.

[ ]: # Generate your chart in this cell
     ...

1.1.3 Question 3 - The diseases dataset

The U.S., like many places, was once afflicted by many diseases (many of them viral) that are no longer prominent today due to the advent of vaccines. Some of them, such as Polio, have been effectively eradicated, while others, like Measles, affect so few individuals that they are largely irrelevant in the public health landscape. Notably, even though many of these diseases persist in the population (e.g. measles, mumps and rubella), they are sufficiently diluted by uninfected and/or vaccinated individuals to undermine any potential for an outbreak.

Question 3.1 How many different diseases are represented in this dataset?
[ ]: num_diseases = ...

[ ]: grader.check("q3_1")

Question 3.2 We have disease prevalence in terms of total individuals infected in a year in a state. The absolute magnitude of infected individuals can be helpful, but it’ll be easier to directly compare between diseases and states if we weight these values by total population. Create a new column in diseases called “incidence_per” representing the disease incidence (“number”) as a percent of the state’s population.

Hint: If the variable is represented as a percent, then it should be between 0 and 100.

[ ]: diseases["incidence_per"] = ...

[ ]: grader.check("q3_2")

Question 3.3 Using this new column you created, identify the disease that afflicted the greatest percentage of New York’s population in 1928. Provide your answer as a string.

[ ]: ...
     worst_ny_disease_1928 = ...
     worst_ny_disease_1928

[ ]: grader.check("q3_3")

Question 3.4 Between the years 1928 and 1938 inclusive, which U.S. state had the highest average incidence of polio as a percentage of its total population?

[ ]: ...
     worst_polio_state = ...
     worst_polio_state

[ ]: grader.check("q3_4")

Question 3.5 Identify the first year in which Polio was effectively eradicated in the US (fewer than 100 total cases).

[ ]: ...
     first_year_eradicated = ...
     first_year_eradicated

[ ]: grader.check("q3_5")

Measles is a highly infectious viral disease that, historically, was once one of the most prominent childhood illnesses globally.
Prior to the development of a vaccine for measles, it was more or less a fact of life for children. The disease was a constant blight that perpetuated itself in large boom-bust cycles of disease outbreaks. However, the first measles vaccine was approved for distribution in 1963, which would have dramatic consequences for the future of measles’ presence in the public-health landscape.

The $R_0$ of a disease represents how many people we can expect to be infected by a single contagious individual under average conditions in a uniformly susceptible population (no vaccinations or acquired immunity). Measles has an $R_0 = 18$, an incredibly high value that indicates it is among the most infectious diseases that affect humans. For reference, the $R_0$ for a typical year’s flu is 1.

[ ]: # Total measles cases per year across all states
     measles_sum = (diseases[diseases["disease"] == "MEASLES"]
                    .groupby("year")["number"]
                    .sum()
                    .reset_index())
     sns.lineplot(data=measles_sum, x="year", y="number")
     plt.ylabel("Number of Cases (US)")
     # Dashed line marks the 1963 approval of the first measles vaccine
     plt.axvline(x=1963, color="black", linestyle="dashed");

Clearly the measles vaccine was incredibly successful at reducing and eventually eliminating Measles outbreaks.

1.1.4 Question 4 - The broadway dataset

The broadway dataset contains all plays put into production on Broadway between the years 1990 and 2016.

[ ]: print(f"Over this time period there were {len(broadway['Show.Name'].unique())} different shows put on Broadway.")

That’s a lot of shows! Presumably there were some hits and some duds. Let’s separate the wheat from the chaff and identify those shows that performed the best. But how do we define best?

Question 4.1 Create a Series of plays in order of most to least total gross.

[ ]: broadway_grosses = ...

[ ]: grader.check("q4_1")
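One common way to build per-group totals ordered from largest to smallest is a groupby followed by sum and sort_values. The sketch below uses made-up titles and figures; the frame `toy` and its columns `title` and `gross` are hypothetical and do not reflect the columns of broadway.csv.

```python
import pandas as pd

# Hypothetical toy data: several rows per title, one gross figure per row.
toy = pd.DataFrame({
    "title": ["A", "B", "A", "C", "B"],
    "gross": [10.0, 5.0, 20.0, 7.0, 1.0],
})

# Selecting a single column after groupby yields a Series indexed by group.
totals = (
    toy.groupby("title")["gross"]
    .sum()
    .sort_values(ascending=False)
)
print(totals)
```

Selecting one column before aggregating is what makes the result a Series rather than a DataFrame; swapping `sum` for another aggregation changes the ranking criterion without changing the overall shape of the chain.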
Question 4.2 Now create a Series of plays in order of most to least average amount grossed per seat filled (Gross / Attendance).

[ ]: ...
     broadway_gross_seat = ...
     broadway_gross_seat

[ ]: grader.check("q4_2")

Question 4.3 Create a new variable representing date as a single continuous variable. This should combine year, month and day into a new column. Assume no leap years and that there are 30.44 days per month. Call this variable date_continuous.

Hint: Think about how you can convert months into days and days into the same units as years.

[ ]: broadway["date_continuous"] = ...
     broadway[["Show.Name", "date_continuous"]]

[ ]: grader.check("q4_3")

With this variable created, we can now identify the show that has had the longest tenure on Broadway. To do this, we’ll define our own function called span, which will return the difference between the maximum and minimum of a series. Using this function, we can find the length of time each show spent on Broadway and identify the longest running plays.

[ ]: def span(series):
         return max(series) - min(series)

[ ]: broadway_length = (broadway[["Show.Name", "date_continuous"]]
                        .groupby("Show.Name")
                        .agg(span)
                        .reset_index()
                        .rename({"date_continuous": "total_tenure_years"}, axis=1)
                        .sort_values("total_tenure_years", ascending=False)
                        )
     broadway_length.head()

Question 4.4 This is some handy information and we might find it useful to include the total tenure of a show in the original dataframe. Join the total tenure you just determined on the original broadway frame using the merge function, ensuring that the new column is called “total_tenure_years”. Be sure that there is no information lost from the original dataframe in your new, joined dataframe. You should reference the help file for merge if you need guidance (help(pd.merge)).

[ ]: broadway_merged = ...
     broadway_merged.head()
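A left join keeps every row of the left frame, which is one way to meet a "no information lost" requirement when attaching a per-group value back onto row-level data. The following is a minimal sketch on made-up frames; the names `shows` and `tenure` and their columns are hypothetical, not the assignment's data.

```python
import pandas as pd

# Hypothetical row-level data: one row per (name, week).
shows = pd.DataFrame({"name": ["A", "A", "B", "C"], "week": [1, 2, 1, 1]})

# Hypothetical per-group summary; note that "C" has no entry.
tenure = pd.DataFrame({"name": ["A", "B"], "tenure_years": [2.5, 0.4]})

# how="left" preserves all rows of `shows`; unmatched keys get NaN.
merged = pd.merge(shows, tenure, on="name", how="left")
print(merged)
```

Comparing `len(merged)` with the length of the original frame is a quick check that no rows were dropped or duplicated by the join.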
[ ]: grader.check("q4_4")

1.2 Submission

Make sure you have run all cells in your notebook in order. Then execute the following two commands from the File menu:

• Save and Checkpoint
• Close and Halt

Then upload your .ipynb file to the Canvas assignment HW4.

To double-check your work, the cell below will rerun all of the autograder tests.

[ ]: grader.check_all()