2023/11/3 08:04
Homework 9
file:///F:/UIUC/STAT420/Homework9.html
1/21
Homework 9
Wenbin Nie
Due 11/2/2023
Homework Instructions
Make sure to add your name to the header of the document. When submitting the assignment on
Gradescope, be sure to assign the appropriate pages of your submission to each Exercise.
The point value for each exercise is noted in the exercise title.
For questions that require code, please create or use the code chunk directly below the question and type your
code there. Your knitted pdf will then show both the code and the output, so that we can assess your
understanding and award any partial credit.
For written questions, please provide your answer after the indicated Answer
prompt.
You are encouraged to knit your file as you work, to check that your coding and formatting are done appropriately. This will also help you identify and locate any errors more easily.
Homework Setup
We’ll use the following packages for this homework assignment. We’ll also read in data from a csv file. To
access the data, you’ll want to download the dataset from Canvas and place it in the same folder as this R
Markdown document. You’ll then be able to use the following code to load in the data.
library(ggplot2)
library(faraway)
library(ISLR)

## Warning: package 'ISLR' was built under R version 4.3.2

library(car)

## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
##
## Attaching package: 'car'
## The following objects are masked from 'package:faraway':
##
##     logit, vif
Exercise 1: Formatting [5 points]
The first five points of the assignment will be earned for properly formatting your final document. Check that
you have:
included your name on the document
properly assigned pages to exercises on Gradescope
selected page 1 (with your name)
and this page for this exercise (Exercise 1)
all code is printed and readable for each question
all output is printed
generated a pdf file
Exercise 2: Scottish Hill Races [30 points]
For this exercise, we'll use the races.table dataset that includes information on record-winning times (minutes) for 35 hill races in Scotland, as reported by Atkinson (1986). The additional variables record the overall distance travelled (miles) and the height climbed in the race. Below, we are reading in the data from an online source. We do correct one error reported by Atkinson before beginning our analysis and adjust the height climbed to be recorded in thousands of feet.
Source: Atkinson, A. C. (1986). Comment: Aspects of diagnostic regression analysis (discussion of paper by Chatterjee and Hadi). Statistical Science, 1, 397-402.
url = 'http://www.statsci.org/data/general/hills.txt'
races.table = read.table(url, header=TRUE, sep='\t')
races.table[18,4] = 18.65
races.table$Climb = races.table$Climb / 1000
head(races.table)
##          Race Distance Climb   Time
## 1 Greenmantle      2.5 0.650 16.083
## 2    Carnethy      6.0 2.500 48.350
## 3 CraigDunain      6.0 0.900 33.650
## 4      BenRha      7.5 0.800 45.600
## 5   BenLomond      8.0 3.070 62.267
## 6    Goatfell      8.0 2.866 73.217
part a
Create a scatterplot matrix of the quantitative variables contained in the races.table dataset. Interpret this scatterplot matrix. Which variable do you think will be more important in predicting the record time of that race?
# Use this code chunk for your answer.
for_matrix1 = races.table[, 2:4]
pairs(for_matrix1)
Answer:
Distance appears to have the stronger linear relationship with Time, so it seems like Distance will be more important in predicting the record time.
part b
Fit a multiple regression model predicting the record time of a race from the distance travelled, the height climbed, and an interaction of the two variables. Report the summary of the model. What is the R^2 for this model? What does this suggest about the strength of the model?
# Use this code chunk for your answer.
lm1 = lm(Time ~ Distance * Climb, data = races.table)
summary(lm1)
##
## Call:
## lm(formula = Time ~ Distance * Climb, data = races.table)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -23.3078  -2.8309   0.7048   2.2312  18.9270
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -0.3532     3.9122  -0.090 0.928638
## Distance         4.9290     0.4750  10.377 1.32e-11 ***
## Climb            3.5217     2.3686   1.487 0.147156
## Distance:Climb   0.6731     0.1746   3.856 0.000545 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.35 on 31 degrees of freedom
## Multiple R-squared:  0.9806, Adjusted R-squared:  0.9787
## F-statistic: 521.1 on 3 and 31 DF,  p-value: < 2.2e-16
Answer:
The R^2 value of 0.9806 suggests that the model is strong: it explains 98.06% of the variance in record time based on the predictors, which include Distance, Climb, and their interaction.
part c
Interpret the first-order coefficient pertaining to Distance. Then, calculate the slopes for Distance for a race whose Climb is 0.3 (300 feet) and again for a race whose Climb is 3 (3000 feet).
# Use this code chunk for your answer.
coef_distance = 4.9290
coef_interaction = 0.6731
climb_value_1 = 0.3
slope_1 = coef_distance + (coef_interaction * climb_value_1)
climb_value_2 = 3
slope_2 = coef_distance + (coef_interaction * climb_value_2)
slope_1
## [1] 5.13093
slope_2
## [1] 6.9483
Answer:
The first-order coefficient for Distance is 4.929, indicating that for a race with no climb (Climb = 0), the record time increases by 4.929 minutes for each additional mile of distance. The slopes for Distance for a race whose Climb is 0.3 (300 feet) and for one whose Climb is 3 (3000 feet) are 5.13093 and 6.9483, respectively.
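As a check, these slopes can be recovered directly from the fitted model: with the interaction term, the slope for Distance at a fixed Climb equals the change in predicted Time for a one-mile increase in Distance. A minimal sketch, assuming lm1 from part b is in the workspace (the baseline distance of 10 miles and the helper name slope_at_climb are arbitrary choices of mine):

```r
# Distance slope at a fixed Climb = difference in predicted Time for a
# 1-mile increase in Distance (any baseline distance gives the same answer).
slope_at_climb = function(model, climb) {
  preds = predict(model, newdata = data.frame(Distance = c(10, 11), Climb = climb))
  unname(diff(preds))
}
slope_at_climb(lm1, 0.3)  # should match 5.13093
slope_at_climb(lm1, 3)    # should match 6.9483
```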
part d
Identify any influential points as defined in the lecture. Which of these observations, if any, are especially
influential based on their values? For these influential points, do they have high leverage, high standardized
residual, both, or neither?
# Use this code chunk for your answer.
cooks.distance(lm1)[cooks.distance(lm1) > (4/35)]

##        7       11       35
## 3.758307 2.704165 1.805942

hatvalues(lm1)[hatvalues(lm1) > (8/35)]

##         7        11        33        35
## 0.5207512 0.7182517 0.2379383 0.3261854

rstandard(lm1)[abs(rstandard(lm1)) > 2]

##         7        11        35
##  3.719559  2.059866 -3.862957
Answer:
Observations 7, 11, and 35 are influential based on their Cook's distances. These observations also exhibit both high leverage and high standardized residuals, which makes them particularly influential in the model.
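For comparison, R's built-in influence.measures() runs a battery of diagnostics (DFBETAS, DFFITS, covratio, Cook's distance, hat values) in one call; note that its default cutoffs differ from our course thresholds, so this is a cross-check rather than the course method. A sketch, assuming lm1 from part b:

```r
# $is.inf is a logical matrix: TRUE where a diagnostic flags an observation.
im = influence.measures(lm1)
which(apply(im$is.inf, 1, any))  # rows flagged by at least one measure
```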
part e
Refit the model from part b without any points that you identified as influential. Note: this is not something that
we should automatically do, but we will do it for now as a demonstration of how much our model may be
affected by these points! Print the coefficients for this model. How do they compare to the coefficients from the
model in part b?
Hint: Create a subset of your data that only includes those points that are not influential before fitting your data.
# Use this code chunk for your answer.
remove_obs = races.table[-c(7, 11, 35),]
lm2 = lm(Time ~ Distance * Climb, data = remove_obs)
summary(lm2)
##
## Call:
## lm(formula = Time ~ Distance * Climb, data = remove_obs)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11.1574  -2.7089   0.3387   2.2074  10.3180
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)      0.6141     3.3490   0.183 0.855828
## Distance         5.1003     0.6079   8.390 3.98e-09 ***
## Climb            1.8117     1.7531   1.033 0.310273
## Distance:Climb   0.7105     0.1663   4.272 0.000202 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.877 on 28 degrees of freedom
## Multiple R-squared:  0.9778, Adjusted R-squared:  0.9754
## F-statistic: 411.1 on 3 and 28 DF,  p-value: < 2.2e-16
Answer:
After removing the influential points, the new model (lm2) still fits well and predicts race times accurately. The changes in the coefficients show that these points did affect the results, but the refitted model is still reliable and valid for making predictions. We can see that the coefficient for Distance increased slightly (from about 4.93 to 5.10), the coefficient for Climb decreased substantially, and the intercept became positive.
part f
How much does this updated model affect our actual predictions for the response? Let’s create a scatterplot
that compares our fitted values from our original model to those from our newer model (influential points
removed).
Calculate and save each of the fitted values (for the original model and for the newer model) to their own named object in R. Note: If you are using the predict function, you can supply as an argument newdata = races.table since we will use all of the variables and all of the data.
Then, create a dataframe in R by providing your two named objects with fitted values as two arguments inside the data.frame function, and save the result to a new named object in R.
Now, create a scatterplot to compare the fitted values for each model. Include an appropriate title and axes
labels. All other formatting is optional and up to you!
It might be helpful to add a line with intercept 0 and slope 1 to represent what perfect matching would look like.
Finally, briefly comment on what this plot reveals. Would you say there are big differences in the predictions
made by each model, or would you say the predictions by each model are quite similar? Is this what you would
expect from the results in part d?
# Use this code chunk for your answer.
old = fitted(lm1)
old = old[-c(7, 11, 35)]
new = fitted(lm2)
df = data.frame(old, new)
ggplot(df, aes(x = old, y = new)) +
  geom_point() +
  labs(x = 'Old fitted values', y = 'New fitted values',
       title = 'Comparison of Fitted Values') +
  geom_abline(slope = 1, intercept = 0, color = 'red')
Answer:
The two models produce nearly the same predictions for the response variable, so I wouldn't say there is a big difference between the two models. This is what I expected, since they have similar R^2 values.
Exercise 3: Hospital SUPPORT Data: Unusual Observations [29 points]
For this exercise, we will use the data stored in hospital.csv
on Canvas. It contains a random sample of 580
seriously ill hospitalized patients from a famous study called “SUPPORT” (Study to Understand Prognoses
Preferences Outcomes and Risks of Treatment). As the name suggests, the purpose of the study was to
determine what factors affected or predicted outcomes, such as how long a patient remained in the hospital.
The variables in the dataset are:
Days - Day to death or hospital discharge
Age - Age on day of hospital admission
Sex - Female or male
Comorbidity - Patient diagnosed with more than one chronic disease
EdYears - Years of education
Education - Education level; high or low
Income - Income level; high or low
Charges - Hospital charges, in dollars
Care - Level of care required; high or low
Race - Non-white or white
Pressure - Blood pressure, in mmHg
Blood - White blood cell count, in gm/dL
Rate - Heart rate, in bpm
part a
Fit a model with Charges as the response, and with predictors of EdYears, Pressure, and Age.
# Use this code chunk for your answer.
hospital = read.csv("F:\\UIUC\\STAT420\\hospital.csv")
lm3 = lm(Charges ~ Age + EdYears + Pressure, data = hospital)
summary(lm3)
##
## Call:
## lm(formula = Charges ~ Age + EdYears + Pressure, data = hospital)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -70326 -41609 -26872   5233 477250
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  79567.4    21731.1   3.661 0.000274 ***
## Age           -643.0      211.1  -3.047 0.002421 **
## EdYears       1407.7      906.4   1.553 0.120937
## Pressure       -33.8      126.6  -0.267 0.789536
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79640 on 576 degrees of freedom
## Multiple R-squared:  0.02279, Adjusted R-squared:  0.0177
## F-statistic: 4.478 on 3 and 576 DF,  p-value: 0.00405
part b
Calculate the leverages for each observation in the dataset. How many observations have leverages above our
course threshold? Make a histogram of all leverages for the dataset. Does the course threshold seem to fall at
a good cutoff for this model?
# Use this code chunk for your answer.
n = length(resid(lm3))
p = length(coef(lm3))
n
## [1] 580
p
## [1] 4
2 * p/n
## [1] 0.0137931
hatvalues = hatvalues(lm3)
above = hatvalues[hatvalues > (2 * p/n)]
above

##          2         10         11         15         19         23         27
## 0.01596837 0.04627055 0.01570829 0.01542179 0.01583249 0.01598336 0.01543236
##         57        118        130        131        153        180        198
## 0.01794143 0.02054003 0.02644400 0.01671699 0.01481928 0.01456018 0.02423647
##        201        209        222        224        265        282        289
## 0.01524729 0.01754252 0.01489532 0.01509594 0.01745183 0.01441938 0.01450108
##        298        317        341        349        368        402        407
## 0.02035020 0.02056658 0.01486582 0.01575459 0.02041636 0.02053973 0.02234620
##        423        443        469        474        499        511        513
## 0.02178170 0.03114276 0.01465574 0.01648916 0.02107845 0.02138013 0.01414533
##        550        556        575        580
## 0.01689769 0.01688995 0.01496846 0.01570199
length(above)
## [1] 39
hist(hatvalues, breaks = 20)
Answer:
39 observations are above the threshold. Based on the histogram, the threshold seems to fall at a reasonable cutoff for this model.
part c
Calculate the standardized residuals for each observation in the dataset. How many observations are
designated as having a high standardized residual based on our course threshold? Generate a histogram of all
standardized residuals for the dataset. What is the shape of this histogram?
# Use this code chunk for your answer.
sr = rstandard(lm3)
above_sr = rstandard(lm3)[abs(rstandard(lm3)) > 2]
hist(rstandard(lm3))
length(above_sr)
## [1] 32
Answer:
32 observations are above our course threshold. The histogram is right-skewed.
part d
Calculate the Cook’s distance for each observation in the dataset. Print only those observations that are above
the threshold defined in lecture. After looking through these Cook’s distances by eye, the Cook’s distance for
what specific observations, if any, appear to be especially large? Finally, what is Cook’s distance used to
measure?
# Use this code chunk for your answer.
above_c = cooks.distance(lm3)[cooks.distance(lm3) > (4/n)]
above_c
##           2           3          14          15          16          24
## 0.030335886 0.022467402 0.014247049 0.017506191 0.049720108 0.007418391
##          26          34          35          38          39          53
## 0.049569896 0.015919473 0.039607688 0.012293791 0.025586612 0.060476097
##          58          67          74          75          77         111
## 0.045286259 0.019589190 0.010380495 0.009795234 0.007830997 0.015021677
##         191         197         204         205         218         224
## 0.035650610 0.007857909 0.045482460 0.010155139 0.013588456 0.014673965
##         249         252         257         290         327         351
## 0.008886250 0.036517316 0.007009659 0.012064199 0.007414654 0.011625540
##         368         402         479
## 0.038139902 0.055833796 0.025653720
hist(cooks.distance(lm3))
Answer:
Observations 53, 58, and 252, among others, appear especially large. Cook's distance measures the overall influence of an observation: how much the fitted values change when that observation is removed, combining its leverage and its residual.
part e
Generate the default plots in R. Then, interpret each of these plots.
# Use this code chunk for your answer.
plot(lm3)
Answer:
In the residuals vs. fitted plot, the residuals do not look randomly scattered around zero, so I don't think the linearity assumption is valid. The Q-Q plot suggests that the normality assumption is not valid. In the scale-location plot, the residuals are not randomly spread but accumulate around the red line, so constant variance is questionable. In the residuals vs. leverage plot, one observation is far beyond the Cook's distance lines, and several other observations have large Cook's distances too.
part f
In order to assess the fit of this model, calculate the value of the RMSE using leave one out cross validation.
# Use this code chunk for your answer.
sqrt(mean((resid(lm3)/(1-hatvalues(lm3)))^2))
## [1] 79951.01
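The one-liner above relies on the identity that a linear model's leave-one-out residual is e_i / (1 - h_i). As a sanity check, a brute-force version that actually refits the model n times should produce the same RMSE. This is a sketch with a hypothetical helper name (loocv_rmse), assuming the hospital data frame from part a:

```r
# Brute-force LOOCV: drop each row, refit, predict the held-out response.
# For lm fits this reproduces the hat-value shortcut exactly.
loocv_rmse = function(formula, data, response) {
  errs = sapply(seq_len(nrow(data)), function(i) {
    fit = lm(formula, data = data[-i, , drop = FALSE])
    data[[response]][i] - predict(fit, newdata = data[i, , drop = FALSE])
  })
  sqrt(mean(errs^2))
}
loocv_rmse(Charges ~ Age + EdYears + Pressure, hospital, "Charges")
```

This should agree with the 79951.01 printed above, at the cost of n model fits.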
Exercise 4: Hospital SUPPORT Data, Days Variable [21 points]
For this exercise, we will continue analyzing the hospital
dataset. We will focus in particular on whether we
should add the Days variable to the model from Question 3 (predicting Charges from EdYears, Pressure, and
Age).
part a
Calculate the R^2 measure of collinearity for the Days variable. What does this information tell us?
# Use this code chunk for your answer.
lm4 = lm(Charges ~ Age + EdYears + Pressure + Days, data = hospital)
vif_lm4 = vif(lm4)
vif_days = vif_lm4["Days"]
vif_days
##     Days
## 1.018054
lm_dropday = lm(Charges ~ . - Days, data = hospital)
summary(lm_dropday)$r.squared
## [1] 0.2551996
r2 = 1-1/vif_days
r2
##       Days
## 0.01773431
Answer:
The VIF for Days is close to 1, which suggests that I can confidently include the Days variable in the model without concern about multicollinearity. The R^2 measure of collinearity tells me the percent of the variation in the dropped variable that is explained by its linear relationship with the other predictors; it also tells us how much of the information in the dropped variable is already carried by the rest of the model.
part b
In this question, we'll create the partial correlation coefficient and the variable added plot for adding the Days variable to the Question 3 model.
To start, create and save the residuals for the two models needed for this calculation. Save both of the residuals to their own R objects.
Calculate the partial correlation coefficient for the considered predictor variable Days. Then, generate the variable added plot for this considered predictor variable. If you aren't sure how to create the variable added plot with R code, refer to the last part of Textbook Section 15.2.1 (just before the end of the section) for a model of the code. Make sure to include an appropriate title and axes labels.
What do the partial correlation coefficient and the variable added plot indicate about adding the Days variable to the model?
# Use this code chunk for your answer.
residuals_no_days = resid(lm3)
residuals_with_days = resid(lm4)
partial_corr_coef = cor(residuals_no_days, residuals_with_days)
partial_corr_coef
## [1] 0.7305195
avPlots(lm4)
Answer:
My partial correlation coefficient suggests a moderately strong positive relationship between the residuals of the model without the Days variable and the residuals of the model with the Days variable, while controlling for the other predictors in the model. From the plot we can see that the slope is positive and the points are scattered around the line; the slope of the line is also relatively steep.
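For reference, the textbook-style construction of the variable added plot regresses both the response and the candidate predictor on the remaining predictors and compares the two sets of residuals; the correlation between those residuals is the partial correlation of Charges and Days given the other predictors. A sketch, assuming the hospital data frame from Exercise 3 (the axis labels are my own choice):

```r
# Residuals of Charges and of Days, each adjusted for the other predictors.
res_y    = resid(lm(Charges ~ Age + EdYears + Pressure, data = hospital))
res_days = resid(lm(Days ~ Age + EdYears + Pressure, data = hospital))
cor(res_y, res_days)  # partial correlation of Charges and Days

plot(res_days, res_y,
     main = "Variable Added Plot for Days",
     xlab = "Days | Age + EdYears + Pressure",
     ylab = "Charges | Age + EdYears + Pressure")
abline(lm(res_y ~ res_days), col = "red")
```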
part c
Fit a linear model to the variable added plot. Hint: you can use the residuals directly in the lm function in the y and x locations without needing to create a new data frame.
Is the slope for this linear model significantly different from 0? What does that suggest in terms of adding the Days variable to the model?
# Use this code chunk for your answer.
lm_variable_added = lm(residuals_with_days ~ residuals_no_days)
summary(lm_variable_added)
##
## Call:
## lm(formula = residuals_with_days ~ residuals_no_days)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -325244   -7587    3664   12132  198241
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)       1.637e-11  1.647e+03    0.00        1
## residuals_no_days 5.337e-01  2.075e-02   25.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39660 on 578 degrees of freedom
## Multiple R-squared:  0.5337, Adjusted R-squared:  0.5329
## F-statistic: 661.4 on 1 and 578 DF,  p-value: < 2.2e-16
Answer:
The slope coefficient for residuals_no_days is estimated to be approximately 0.5337, and based on a p-value far below any typical significance level, the slope is highly significant. Overall, this suggests that adding the Days variable to the model would substantially improve the model's fit.
part d
Generate an ANOVA test between our two models, and print the resulting table. Which model do you prefer, and what does that indicate about the slope for the Days variable?
# Use this code chunk for your answer.
anova(lm3,lm4)
##   Res.Df          RSS Df    Sum of Sq        F     Pr(>F)
## 1    576 3.653322e+12 NA           NA       NA         NA
## 2    575 1.949627e+12  1 1.703695e+12 502.4676 1.88366e-80
Answer:
The F-statistic is large, with a p-value of about 1.9e-80, indicating that model 2 (with Days) is a significantly better fit than model 1. I prefer model 2, which indicates that the slope for the Days variable is significantly different from 0.
part e
Calculate the Variance Inflation Factors for the four predictor variables, including Days. What do we use the
Variance Inflation Factor to help identify? What variables (if any) indicate a cause for concern? Explain.
# Use this code chunk for your answer.
vif_lm4
##      Age  EdYears Pressure     Days
## 1.028723 1.024975 1.015065 1.018054
Answer:
We use the VIF to help identify multicollinearity. All four VIFs are close to 1, so there is no variable to be concerned about regarding multicollinearity in this model.
Exercise 5: Credit Data [15 points]
For this exercise, use the Credit data in the ISLR package. Use the following line of code to remove the ID variable, which is not useful for modeling.

data(Credit)
Credit = subset(Credit, select = -c(ID))

Use ?Credit to learn about this dataset.
Our goal is to try to predict how much credit card Balance an individual has based on other information about them and their credit levels.
We will take a very systematic approach – it's not necessarily a "correct" approach, but it should help us make appropriate modeling decisions.
Do the following:
First, let's create a full model that includes all predictors.
Then compute the VIFs of the predictors in this model.
You should notice there is clear collinearity between two predictors; run two more models, one with one of these predictors removed, and the other with the other predictor removed.
Using R^2 from these models, determine which of these two collinear predictors offers the weaker contribution. Identify in the white space below which of these two predictors you are dropping from the model.
Finally, calculate the R^2 measure of collinearity for each of these two variables, first from their VIFs and then by fitting two more models, predicting the variable of interest from the other predictor variables.
# Use this code chunk for your answer.
full_model = lm(Balance ~ ., data = Credit)
#summary(full_model)
vif_full = vif(full_model)
#vif_full
model_without_Limit = lm(Balance ~ . - Limit, data = Credit)
model_without_Rating = lm(Balance ~ . - Rating, data = Credit)
#summary(model_without_Limit)
#summary(model_without_Rating)
vif(model_without_Limit)
## GVIF Df GVIF^(1/(2*Df))
## Income 2.784966 1 1.668822
## Rating 2.730561 1 1.652441
## Cards 1.019639 1 1.009772
## Age 1.051135 1 1.025249
## Education 1.013503 1 1.006729
## Gender 1.005848 1 1.002920
## Student 1.022092 1 1.010986
## Married 1.032237 1 1.015991
## Ethnicity 1.027285 2 1.006753
vif(model_without_Rating)
## GVIF Df GVIF^(1/(2*Df))
## Income 2.774623 1 1.665720
## Limit 2.709488 1 1.646052
## Cards 1.008299 1 1.004141
## Age 1.051328 1 1.025343
## Education 1.013501 1 1.006728
## Gender 1.005827 1 1.002909
## Student 1.022750 1 1.011311
## Married 1.032113 1 1.015929
## Ethnicity 1.026691 2 1.006607
vif_rating = 2.730561
vif_limit = 2.709488
r2_collinearity_limit = 1 - 1/vif_limit
r2_collinearity_rating = 1 - 1/vif_rating
r2_collinearity_limit
## [1] 0.6309266
r2_collinearity_rating
## [1] 0.6337749
model_limit = lm(Limit ~ Income + Rating + Cards + Age + Education + Gender + Student + Married + Ethnicity, data = Credit)
summary(model_limit)
##
## Call:
## lm(formula = Limit ~ Income + Rating + Cards + Age + Education +
##     Gender + Student + Married + Ethnicity, data = Credit)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -394.10 -100.71   10.46  104.20  340.12
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        -367.2272    52.1086  -7.047 8.35e-12 ***
## Income                0.1493     0.3622   0.412   0.6804
## Rating               14.8891     0.0817 182.237  < 2e-16 ***
## Cards               -72.0724     5.6333 -12.794  < 2e-16 ***
## Age                  -0.1450     0.4547  -0.319   0.7499
## Education             3.7662     2.4643   1.528   0.1272
## GenderFemale         -0.3002    15.3350  -0.020   0.9844
## StudentYes          -48.7662    25.7481  -1.894   0.0590 .
## MarriedYes          -34.4446    15.9339  -2.162   0.0312 *
## EthnicityAsian       25.9677    21.7997   1.191   0.2343
## EthnicityCaucasian    2.8399    18.8858   0.150   0.8805
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 152.8 on 389 degrees of freedom
## Multiple R-squared:  0.9957, Adjusted R-squared:  0.9956
## F-statistic:  9065 on 10 and 389 DF,  p-value: < 2.2e-16
model_rating = lm(Rating ~ Income + Limit + Cards + Age + Education + Gender + Student + Married + Ethnicity, data = Credit)
summary(model_rating)
##
## Call:
## lm(formula = Rating ~ Income + Limit + Cards + Age + Education +
##     Gender + Student + Married + Ethnicity, data = Credit)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -22.9337  -7.1775  -0.5242   6.3086  27.8718
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)        26.6256084  3.4394669   7.741 8.59e-14 ***
## Income              0.0307343  0.0241424   1.273   0.2038
## Limit               0.0663856  0.0003643 182.237  < 2e-16 ***
## Cards               4.8756963  0.3740563  13.035  < 2e-16 ***
## Age                 0.0053018  0.0303635   0.175   0.8615
## Education          -0.2515127  0.1645509  -1.528   0.1272
## GenderFemale        0.0939119  1.0239560   0.092   0.9270
## StudentYes          3.1405884  1.7198352   1.826   0.0686 .
## MarriedYes          2.3115248  1.0638932   2.173   0.0304 *
## EthnicityAsian     -1.8296226  1.4553333  -1.257   0.2094
## EthnicityCaucasian -0.1910695  1.2610642  -0.152   0.8796
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.2 on 389 degrees of freedom
## Multiple R-squared:  0.9958, Adjusted R-squared:  0.9957
## F-statistic:  9136 on 10 and 389 DF,  p-value: < 2.2e-16
Answer:
There is clear collinearity between Limit and Rating. Comparing the two reduced models, the model without Rating offers the weaker contribution.
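Since the reduced-model summaries were commented out in the chunk above, the R^2 values that this comparison rests on can be printed directly (assuming full_model, model_without_Limit, and model_without_Rating from that chunk):

```r
# The reduced model whose R^2 falls furthest below the full model lost the
# stronger of the two collinear predictors; the other one is the weaker.
summary(full_model)$r.squared
summary(model_without_Limit)$r.squared
summary(model_without_Rating)$r.squared
```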