final_exam_math208_2021_Q12_sols.pdf
School
McGill University
Course
208
Subject
Industrial Engineering
Date
Apr 3, 2024
Question 1 [50 points]
library(tidyverse)
data(midwest)
midwest_modified<-midwest %>%
  select(county,state,popdensity,popwhite,popblack,
         popamerindian,popasian,popother,inmetro)
The data for this question comes from a modified version of the midwest dataset from the ggplot2 library.
str(midwest_modified)
tibble [437 × 9] (S3: tbl_df/tbl/data.frame)
$ county : chr [1:437] "ADAMS" "ALEXANDER" "BOND" "BOONE" ...
$ state : chr [1:437] "IL" "IL" "IL" "IL" ...
$ popdensity : num [1:437] 1271 759 681 1812 324 ...
$ popwhite : int [1:437] 63917 7054 14477 29344 5264 35157 5298 16519 13384 146506 ...
$ popblack : int [1:437] 1702 3496 429 127 547 50 1 111 16 16559 ...
$ popamerindian: int [1:437] 98 19 35 46 14 65 8 30 8 331 ...
$ popasian : int [1:437] 249 48 16 150 5 195 15 61 23 8033 ...
$ popother : int [1:437] 124 9 34 1139 6 221 0 84 6 1596 ...
$ inmetro : int [1:437] 0 0 0 1 0 0 0 0 0 1 ...
midwest_modified %>% slice(1:5) %>% select(county:popblack)
county     state  popdensity  popwhite  popblack
<chr>      <chr>       <dbl>     <int>     <int>
ADAMS      IL      1270.9615     63917      1702
ALEXANDER  IL       759.0000      7054      3496
BOND       IL       681.4091     14477       429
BOONE      IL      1812.1176     29344       127
BROWN      IL       324.2222      5264       547
5 rows
midwest_modified %>% slice(1:5) %>% select(county,popamerindian:popother)
county     popamerindian  popasian  popother
<chr>              <int>     <int>     <int>
ADAMS                 98       249       124
ALEXANDER             19        48         9
BOND                  35        16        34
BOONE                 46       150      1139
BROWN                 14         5         6
5 rows
The dataset contains population data from midwest counties in five states in the United States from an unspecified year. There are identifying variables for both the county (the name) and the state (the postal abbreviation). The variable popdensity is a measure of density (population per unspecified area units). The variable inmetro is equal to 1 if the county is classified as a metropolitan area and 0 otherwise. The other variables contain counts of population size within self-identified racial classifications.
CONTINUED ON NEXT PAGE
a. [5 pts] Write a line of code that will generate the following tibble (or data.frame) containing the highest population density from each state:
state  Highest_Pop_Den
<chr>            <dbl>
IL            88018.40
IN            34659.09
MI            60333.91
OH            54313.08
WI            63951.67
5 rows
Solution:
midwest_modified %>% group_by(state) %>% summarise(Highest_Pop_Den=max(popdensity))
state  Highest_Pop_Den
<chr>            <dbl>
IL            88018.40
IN            34659.09
MI            60333.91
OH            54313.08
WI            63951.67
5 rows
b. [5 pts] Write a line of code that adds a new column to the midwest_modified tibble called Metro where the elements of that column are equal to a string “Metro” if inmetro is equal to 1 and “NonMetro” if inmetro is equal to 0. The first five rows are given below for the county, state, inmetro and Metro columns:
county     state  inmetro  Metro
<chr>      <chr>    <int>  <chr>
ADAMS      IL           0  NonMetro
ALEXANDER  IL           0  NonMetro
BOND       IL           0  NonMetro
BOONE      IL           1  Metro
BROWN      IL           0  NonMetro
5 rows
Solution:
midwest_modified<-midwest_modified %>%
  mutate(Metro=ifelse(inmetro==1,"Metro","NonMetro"))
head(midwest_modified %>% select(county, state,inmetro,Metro),5)
county     state  inmetro  Metro
<chr>      <chr>    <int>  <chr>
ADAMS      IL           0  NonMetro
ALEXANDER  IL           0  NonMetro
BOND       IL           0  NonMetro
BOONE      IL           1  Metro
BROWN      IL           0  NonMetro
5 rows
c. [5 pts] Write a line of code that will generate the following tibble (or data.frame) containing the highest population density from each state for metropolitan and non-metropolitan counties separately, using the modified tibble from part (b).
dens_table
state  Metro     Highest_Pop_Den
<chr>  <chr>               <dbl>
IL     Metro           88018.397
IL     NonMetro         2309.320
IN     Metro           34659.087
IN     NonMetro         3090.375
MI     Metro           60333.914
MI     NonMetro         2250.645
OH     Metro           54313.077
OH     NonMetro         5484.214
WI     Metro           63951.667
WI     NonMetro         2343.750
1-10 of 10 rows
Solution:
dens_table<-midwest_modified %>% group_by(state,Metro) %>%
  summarise(Highest_Pop_Den=max(popdensity))
dens_table
state  Metro     Highest_Pop_Den
<chr>  <chr>               <dbl>
IL     Metro           88018.397
IL     NonMetro         2309.320
IN     Metro           34659.087
IN     NonMetro         3090.375
MI     Metro           60333.914
MI     NonMetro         2250.645
OH     Metro           54313.077
OH     NonMetro         5484.214
WI     Metro           63951.667
WI     NonMetro         2343.750
1-10 of 10 rows
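With two grouping variables, recent dplyr versions report that the summarised result is still grouped by state. The sketch below shows the .groups argument of summarise() on made-up data; the toy tibble and its values are hypothetical and are not taken from midwest.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only.
toy <- tibble(
  state      = c("IL", "IL", "IL", "IN", "IN"),
  Metro      = c("Metro", "NonMetro", "Metro", "Metro", "NonMetro"),
  popdensity = c(100, 20, 300, 50, 10)
)

# .groups = "drop" returns an ungrouped tibble and avoids the grouping message.
toy_max <- toy %>%
  group_by(state, Metro) %>%
  summarise(Highest_Pop_Den = max(popdensity), .groups = "drop")

toy_max
```

Leaving .groups out, as the solution does, gives the same numbers; only the grouping of the returned tibble differs.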
d. [5 pts] Assume the tibble from part (c) is called dens_table as above. Now write a line of code that produces a tibble which arranges the data above so that we have separate columns for “Metro” and “NonMetro”, as below:
state     Metro  NonMetro
<chr>     <dbl>     <dbl>
IL     88018.40  2309.320
IN     34659.09  3090.375
MI     60333.91  2250.645
OH     54313.08  5484.214
WI     63951.67  2343.750
5 rows
Solution:
result_tibble<-dens_table %>% pivot_wider(id_cols=state,names_from=Metro,
values_from=Highest_Pop_Den)
head(result_tibble)
state     Metro  NonMetro
<chr>     <dbl>     <dbl>
IL     88018.40  2309.320
IN     34659.09  3090.375
MI     60333.91  2250.645
OH     54313.08  5484.214
WI     63951.67  2343.750
5 rows
Now we will work with only a modified version of the population counts for each county.
e. [5 pts] Write a line of code to add a new variable to the data frame named HighDens which is equal to “High” if the population density for the county is higher than 1500 and “NotHigh” if the population density for the county is lower than 1500. Below are the first 5 rows of the data for the county, popdensity and HighDens columns:
county     popdensity  HighDens
<chr>           <dbl>  <chr>
ADAMS       1270.9615  NotHigh
ALEXANDER    759.0000  NotHigh
BOND         681.4091  NotHigh
BOONE       1812.1176  High
BROWN        324.2222  NotHigh
5 rows
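No solution for part (e) appears in this copy. One possible answer, offered as a sketch rather than the official solution, reuses the mutate()/ifelse() pattern from part (b). The label "NotHigh" (no space) is taken from the displayed output, and the five rows are copied from the listing above; the object name first_five is hypothetical.

```r
library(dplyr)

# First five counties, copied from the output shown for part (e).
first_five <- tibble(
  county     = c("ADAMS", "ALEXANDER", "BOND", "BOONE", "BROWN"),
  popdensity = c(1270.9615, 759.0000, 681.4091, 1812.1176, 324.2222)
)

# Assumption: a density of exactly 1500, which the question does not cover,
# is labelled "NotHigh".
first_five <- first_five %>%
  mutate(HighDens = ifelse(popdensity > 1500, "High", "NotHigh"))

first_five
```

On the full data, the same mutate() call would presumably be applied to midwest_modified and the result assigned back to it.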
Then we will compute the total number of people in each combination of state, Metro and HighDens using the code below:
pop_xtabs<-xtabs(
I(popwhite+popblack+popamerindian+popasian+popother)~
state+Metro+HighDens,data=midwest_modified)
pop_xtabs
, , HighDens = High
Metro
state Metro NonMetro
IL 9323624 405933
IN 3728008 689565
MI 7697643 354081
OH 8811604 1078957
WI 3004347 386892
, , HighDens = NotHigh
Metro
state Metro NonMetro
IL 250175 1450870
IN 234438 892148
MI 0 1243573
OH 98555 857999
WI 326825 1173705
f. [5 pts] What will the code pop_xtabs["IL",1,2] return as output?
Solution:
pop_xtabs["IL",1,2]
[1] 250175
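The index ["IL",1,2] selects by name on the first dimension and by position on the other two: the first level of Metro ("Metro") and the second level of HighDens ("NotHigh"). The sketch below rebuilds only the Illinois entries of the printed table as a plain array (the name il_slice is hypothetical) to show that positional and name-based indexing reach the same cell.

```r
# Illinois entries of pop_xtabs, copied from the printed output.
# Arrays fill with the first dimension varying fastest, then Metro, then HighDens.
il_slice <- array(
  c(9323624, 405933,    # HighDens = High:    Metro, NonMetro
    250175, 1450870),   # HighDens = NotHigh: Metro, NonMetro
  dim = c(1, 2, 2),
  dimnames = list(state = "IL",
                  Metro = c("Metro", "NonMetro"),
                  HighDens = c("High", "NotHigh"))
)

# Positional and name-based indexing pick the same cell.
il_slice["IL", 1, 2]
il_slice["IL", "Metro", "NotHigh"]
```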
g. [5 pts] Using only the pop_xtabs object above, write a line of code to find the total number of people at each level of HighDens (i.e. in “High” and “NotHigh” density areas) as below:
    High  NotHigh
35480654  6528288
Solution:
apply(pop_xtabs,3,sum)
    High  NotHigh
35480654  6528288
h. [10 pts] Using only the pop_xtabs object above, write a line of code that computes the total population in each combination of state and HighDens to return the output below:
HighDens
state High NotHigh
IL 9729557 1701045
IN 4417573 1126586
MI 8051724 1243573
OH 9890561 956554
WI 3391239 1500530
Solution:
apply(pop_xtabs,c(1,3),sum)
HighDens
state High NotHigh
IL 9729557 1701045
IN 4417573 1126586
MI 8051724 1243573
OH 9890561 956554
WI 3391239 1500530
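In both apply() calls, the second argument names the dimensions to keep, and everything else is summed over. The sketch below rebuilds pop_xtabs as a plain array from the printed output (the names pop_arr, by_dens, by_state_dens and same are hypothetical) and also checks the result against base R's margin.table().

```r
# pop_xtabs rebuilt as a plain array from the printed output.
states <- c("IL", "IN", "MI", "OH", "WI")
pop_arr <- array(
  c(9323624, 3728008, 7697643, 8811604, 3004347,   # High, Metro
    405933, 689565, 354081, 1078957, 386892,       # High, NonMetro
    250175, 234438, 0, 98555, 326825,              # NotHigh, Metro
    1450870, 892148, 1243573, 857999, 1173705),    # NotHigh, NonMetro
  dim = c(5, 2, 2),
  dimnames = list(state = states,
                  Metro = c("Metro", "NonMetro"),
                  HighDens = c("High", "NotHigh"))
)

# MARGIN lists the dimensions to keep; the others are summed over.
by_dens <- apply(pop_arr, 3, sum)               # keep HighDens only
by_state_dens <- apply(pop_arr, c(1, 3), sum)   # keep state and HighDens

# margin.table() computes the same marginal sums.
same <- all(by_state_dens == margin.table(pop_arr, c(1, 3)))

by_dens
by_state_dens
```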
i. [5 pts] Using only the pop_xtabs object above, write a line of code (or multiple lines of code) that computes the percentage of individuals in High and NotHigh density areas in each state as below:
HighDens
state High NotHigh
IL 85.11850 14.881500
IN 79.67977 20.320233
MI 86.62148 13.378518
OH 91.18149 8.818511
WI 69.32541 30.674588
Solution:
apply(pop_xtabs,c(1,3),sum)/apply(pop_xtabs,c(1),sum)*100
HighDens
state High NotHigh
IL 85.11850 14.881500
IN 79.67977 20.320233
MI 86.62148 13.378518
OH 91.18149 8.818511
WI 69.32541 30.674588
##OR
## prop.table(apply(pop_xtabs,c(1,3),sum),1)*100
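The first solution works because R recycles the vector of state totals down each column of the matrix, so row i is divided by the total for state i. The sketch below checks this on the state-by-HighDens totals printed in part (h); the object names totals, pct_divide and pct_prop are hypothetical.

```r
# State-by-HighDens totals copied from the output of part (h).
totals <- matrix(
  c(9729557, 4417573, 8051724, 9890561, 3391239,   # High
    1701045, 1126586, 1243573, 956554, 1500530),   # NotHigh
  nrow = 5,
  dimnames = list(state = c("IL", "IN", "MI", "OH", "WI"),
                  HighDens = c("High", "NotHigh"))
)

# rowSums() gives one total per state; the length-5 vector is recycled
# down each column, so every row is divided by its own state total.
pct_divide <- totals / rowSums(totals) * 100

# prop.table() with margin 1 computes the same row percentages.
pct_prop <- prop.table(totals, 1) * 100

round(pct_divide, 2)
```

Each row of the result adds to 100, which is a quick sanity check that the division ran along the intended dimension.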
END OF QUESTION 1
Question 2 [50 points]
We will re-use the same midwest_modified data that was used in Question 1, with all the modifications from the other question parts. The description is repeated below for your convenience.
str(midwest_modified)
spec_tbl_df [437 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ county : chr [1:437] "ADAMS" "ALEXANDER" "BOND" "BOONE" ...
$ state : chr [1:437] "IL" "IL" "IL" "IL" ...
$ popdensity : num [1:437] 1271 759 681 1812 324 ...
$ popwhite : num [1:437] 63917 7054 14477 29344 5264 ...
$ popblack : num [1:437] 1702 3496 429 127 547 ...
$ popamerindian: num [1:437] 98 19 35 46 14 65 8 30 8 331 ...
$ popasian : num [1:437] 249 48 16 150 5 ...
$ popother : num [1:437] 124 9 34 1139 6 ...
$ inmetro : num [1:437] 0 0 0 1 0 0 0 0 0 1 ...
$ Metro : chr [1:437] "NonMetro" "NonMetro" "NonMetro" "Metro" ...
$ HighDens : chr [1:437] "NotHigh" "NotHigh" "NotHigh" "High" ...
- attr(*, "spec")=
.. cols(
.. county = col_character(),
.. state = col_character(),
.. popdensity = col_double(),
.. popwhite = col_double(),
.. popblack = col_double(),
.. popamerindian = col_double(),
.. popasian = col_double(),
.. popother = col_double(),
.. inmetro = col_double(),
.. Metro = col_character(),
.. HighDens = col_character()
.. )
- attr(*, "problems")=<externalptr>
midwest_modified %>% slice(1:5) %>% select(county:popblack)
county     state  popdensity  popwhite  popblack
<chr>      <chr>       <dbl>     <dbl>     <dbl>
ADAMS      IL      1270.9615     63917      1702
ALEXANDER  IL       759.0000      7054      3496
BOND       IL       681.4091     14477       429
BOONE      IL      1812.1176     29344       127
BROWN      IL       324.2222      5264       547
5 rows
midwest_modified %>% slice(1:5) %>% select(county,popamerindian:HighDens)
county     popamerindian  popasian  popother  inmetro  Metro     HighDens
<chr>              <dbl>     <dbl>     <dbl>    <dbl>  <chr>     <chr>
ADAMS                 98       249       124        0  NonMetro  NotHigh
ALEXANDER             19        48         9        0  NonMetro  NotHigh
BOND                  35        16        34        0  NonMetro  NotHigh
BOONE                 46       150      1139        1  Metro     High
BROWN                 14         5         6        0  NonMetro  NotHigh
5 rows
The dataset contains population data from midwest counties in five states in the United States from an unspecified year. There are identifying variables for both the county (the name) and the state (the postal abbreviation). The variable popdensity is a measure of density (population per unspecified area units). The variable inmetro is equal to 1 if the county is classified as a metropolitan area and 0 otherwise. The other variables contain counts of population size within self-identified racial classifications.
a. [6 pts] Below are partially obscured code and three plots of the values of the log (base 10) of the population density for all counties:
p1<-ggplot(midwest_modified,aes(x=popdensity)) + geom_XXXXX(bins=30,fill="white",col="black") + ggtitle("Plot 1") + theme_bw() + scale_x_log10()
p2<-ggplot(midwest_modified,aes(x=popdensity)) + geom_YYYYY() + ggtitle("Plot 2") + theme_bw()+ scale_x_log10()
p3<-ggplot(midwest_modified,aes(x=popdensity)) + geom_ZZZZZ() + ggtitle("Plot 3") + theme_bw()+ scale_x_log10()
grid.arrange(grobs=list(p1,p2,p3),nrow=3,ncol=1)
Identify these three plots by name:
Plot 1 Plot 2 Plot 3
Solution:
Plot 1: Histogram
Plot 2: Density plot
Plot 3: Boxplot
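With the blanks filled in according to the solution, the code would presumably read as in the sketch below. It assumes ggplot2 and gridExtra are installed, uses the unmodified midwest data from ggplot2 (assumed to carry the same popdensity column), and uses bins = 30, the argument name that geom_histogram() documents for the number of bins.

```r
library(ggplot2)
library(gridExtra)

data(midwest)  # unmodified ggplot2 data set, used here as a stand-in

p1 <- ggplot(midwest, aes(x = popdensity)) +
  geom_histogram(bins = 30, fill = "white", col = "black") +
  ggtitle("Plot 1") + theme_bw() + scale_x_log10()
p2 <- ggplot(midwest, aes(x = popdensity)) +
  geom_density() +
  ggtitle("Plot 2") + theme_bw() + scale_x_log10()
p3 <- ggplot(midwest, aes(x = popdensity)) +
  geom_boxplot() +
  ggtitle("Plot 3") + theme_bw() + scale_x_log10()

# Stack the three plots vertically, as in the exam figure.
grid.arrange(grobs = list(p1, p2, p3), nrow = 3, ncol = 1)
```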
b. [10 pts] Now we make the same plots, but for each state. Do you believe there is evidence of an association between state and population density? In particular, do we see differences in the distributions of population density by state?
grid.arrange(grobs=list(p2,p3),nrow=2,ncol=1)
Solution:
Yes, there is moderate evidence of an association. In particular, Ohio seems to have more counties
with higher population densities than the other states. We also see that the spread for Michigan is much larger
than the other states.
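The code that redefines p2 and p3 by state is not shown in this copy. One way such per-state plots might be produced is sketched below; the choice of mapping state to colour for the densities and to the y axis for the boxplots is an assumption, as are the names p2_state and p3_state, and the unmodified ggplot2 midwest data stands in for midwest_modified.

```r
library(ggplot2)
library(gridExtra)

data(midwest)  # stand-in for midwest_modified

# Assumption: one density curve per state, distinguished by colour.
p2_state <- ggplot(midwest, aes(x = popdensity, col = state)) +
  geom_density() +
  ggtitle("Plot 2") + theme_bw() + scale_x_log10()

# Assumption: one horizontal boxplot per state.
p3_state <- ggplot(midwest, aes(x = popdensity, y = state)) +
  geom_boxplot() +
  ggtitle("Plot 3") + theme_bw() + scale_x_log10()

grid.arrange(grobs = list(p2_state, p3_state), nrow = 2, ncol = 1)
```

Overlaid density curves tend to become crowded with five groups, which is consistent with the remark in part (c) that the density plot is busy.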
c. [4 pts] Which plot(s) do you think best show the association between state and population density? Which plot(s) do you think do not show the association between state and population density as clearly? Explain your answer and reasoning in a few sentences.
Solution:
The boxplots and histograms probably are the best. The boxplots show the differences in central location better; the histograms show the differences in spread and shape better. The density plot is less useful since it is very busy and it is influenced by the outliers.
d. [5 pts] Which of the following plots could be used to assess the association between the popwhite and popblack variables? List all that apply (or say None if none would be appropriate).
A. 2-d density plot
B. Barplot
C. Boxplot
D. 2-d histogram
Solution:
A, D
We now would like to make plots to take a different look at the population variables. Unfortunately, the format of the midwest_modified data needs to be further changed so that we can use it in a ggplot.
e. [5 pts] Write a line of code that will create a new tibble, midwest_modified_new, which converts midwest_modified to “long” format, where each row contains a population count for a specific racial group, called Count, and the variable from which that count originated (e.g. popwhite), called Race_Variable, as well as the state, county, and Metro information for that population group. You should not include the columns for HighDens, inmetro or popdensity. The first 10 rows of the new tibble are below:
midwest_modified_new %>% slice(1:10)
county     state  Metro     Race_Variable  Count
<chr>      <chr>  <chr>     <chr>          <dbl>
ADAMS      IL     NonMetro  popwhite       63917
ADAMS      IL     NonMetro  popblack        1702
ADAMS      IL     NonMetro  popamerindian     98
ADAMS      IL     NonMetro  popasian         249
ADAMS      IL     NonMetro  popother         124
ALEXANDER  IL     NonMetro  popwhite        7054
ALEXANDER  IL     NonMetro  popblack        3496
ALEXANDER  IL     NonMetro  popamerindian     19
ALEXANDER  IL     NonMetro  popasian          48
ALEXANDER  IL     NonMetro  popother           9
1-10 of 10 rows
Solution:
midwest_modified_new<-midwest_modified %>% select(-popdensity,-inmetro,-HighDens) %>%
pivot_longer(cols=popwhite:popother,names_to="Race_Variable",
values_to="Count")
Below is a figure along with the code (partially obscured) which generated it.
ggplot(midwest_modified_new,aes(x=______,fill=______,y=______,
)) +
geom_XXXXXX(stat="identity") + ggtitle("Plot f") + theme_bw()
f. [5 pts]
What are the missing geometry and aesthetics that generated the figure on the previous page
(that is, what are the words that are missing in the code above for Plot f)?
Solution:
geom_bar is the geometry. The aesthetics are: aes(x=Metro, fill=Race_Variable,y=Count).
g. [5 pts] Note that the plot in part (f) is a bit difficult to use because it contains the counts, rather than the relative proportions. Write a line of code (or lines of code) to create a new tibble called metro_race_summaries which contains each racial population count and proportion relative to the level of the Metro variable as below:
metro_race_summaries
Metro     Race_Variable  Race_Count   Proportion
<chr>     <chr>               <dbl>        <dbl>
Metro     popamerindian       99145  0.002961743
Metro     popasian           538463  0.016085421
Metro     popblack          4672825  0.139590573
Metro     popother           668449  0.019968473
Metro     popwhite         27496337  0.821393790
NonMetro  popamerindian       50794  0.005952150
NonMetro  popasian            34210  0.004008801
NonMetro  popblack           144611  0.016945828
NonMetro  popother            36402  0.004265665
NonMetro  popwhite          8267706  0.968827556
1-10 of 10 rows
Solution:
metro_race_summaries<-midwest_modified_new %>%
  group_by(Metro,Race_Variable) %>%
  summarise(Race_Count = sum(Count)) %>% ungroup() %>%
  group_by(Metro) %>%
  mutate(Proportion=Race_Count/sum(Race_Count))
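The proportion step can be checked in isolation. The sketch below starts from the Race_Count values printed above, so it skips the summarise() step, and confirms that the grouped mutate() divides each count by the total for its own Metro level. The object names counts and props are hypothetical, and the trailing ungroup() is an optional addition.

```r
library(dplyr)

# Race_Count values copied from the printed metro_race_summaries table.
counts <- tibble(
  Metro = rep(c("Metro", "NonMetro"), each = 5),
  Race_Variable = rep(c("popamerindian", "popasian", "popblack",
                        "popother", "popwhite"), times = 2),
  Race_Count = c(99145, 538463, 4672825, 668449, 27496337,
                 50794, 34210, 144611, 36402, 8267706)
)

# Inside a grouped mutate(), sum(Race_Count) is the total for that Metro level.
props <- counts %>%
  group_by(Metro) %>%
  mutate(Proportion = Race_Count / sum(Race_Count)) %>%
  ungroup()

# Proportions should add to 1 within each Metro level.
props %>% group_by(Metro) %>% summarise(Total = sum(Proportion))
```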
h. [5 pts]
Using the tibble from (g), write a line of code that created the barplot below.
Solution:
ggplot(metro_race_summaries,
aes(x=Metro,fill=Race_Variable,y=Proportion,)) +
geom_bar(stat="identity") + ggtitle("Plot g") + theme_bw()
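As an aside, ggplot2 can rescale stacked bars to proportions itself through position = "fill", so a similar picture may be obtainable directly from counts. The sketch below is an alternative and not the exam's solution; it uses the Race_Count values printed in part (g), and the names counts and p_fill are hypothetical.

```r
library(ggplot2)

# Race_Count values copied from the printed table in part (g).
counts <- data.frame(
  Metro = rep(c("Metro", "NonMetro"), each = 5),
  Race_Variable = rep(c("popamerindian", "popasian", "popblack",
                        "popother", "popwhite"), times = 2),
  Race_Count = c(99145, 538463, 4672825, 668449, 27496337,
                 50794, 34210, 144611, 36402, 8267706)
)

# geom_col() plots the supplied y values as bar heights;
# position = "fill" rescales each stacked bar to a total height of 1.
p_fill <- ggplot(counts, aes(x = Metro, y = Race_Count, fill = Race_Variable)) +
  geom_col(position = "fill") +
  ylab("Proportion") + theme_bw()

p_fill
```

Computing the Proportion column explicitly, as the solution does, has the advantage that the numbers are also available as a table.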
i. [5 pts]
Based on the plot in part (h), would you conclude that the population distribution of race varies between Metro and NonMetro areas? Explain your answer in a few sentences.
Solution:
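Absolutely. White people make up a substantially larger share of the population in the NonMetro areas (about 97% versus about 82% in the Metro areas), and non-white groups make up a correspondingly smaller share.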