# Homework 9

Wenbin Nie

Due 11/2/2023

# Homework Instructions

Make sure to add your name to the header of the document. When submitting the assignment on Gradescope, be sure to assign the appropriate pages of your submission to each Exercise. The point value for each exercise is noted in the exercise title.

For questions that require code, please create or use the code chunk directly below the question and type your code there. Your knitted pdf will then show both the code and the output, so that we can assess your understanding and award any partial credit.

For written questions, please provide your answer after the indicated Answer prompt. You are encouraged to knit your file as you work, to check that your coding and formatting are done appropriately. This will also help you identify and locate any errors more easily.

# Homework Setup

We'll use the following packages for this homework assignment. We'll also read in data from a csv file. To access the data, you'll want to download the dataset from Canvas and place it in the same folder as this R Markdown document. You'll then be able to use the following code to load in the data.

```r
library(ggplot2)
library(faraway)
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
## 
## Attaching package: 'car'
## The following objects are masked from 'package:faraway':
## 
##     logit, vif
```
# Exercise 1: Formatting [5 points]

The first five points of the assignment will be earned for properly formatting your final document. Check that you have:

- included your name on the document
- properly assigned pages to exercises on Gradescope
- selected page 1 (with your name) and this page for this exercise (Exercise 1)
- all code is printed and readable for each question
- all output is printed
- generated a pdf file

# Exercise 2: Scottish Hill Races [30 points]

For this exercise, we'll use the races.table dataset that includes information on record-winning times (minutes) for 35 hill races in Scotland, as reported by Atkinson (1986). The additional variables record the overall distance travelled (miles) and the height climbed in the race. Below, we are reading in the data from an online source. We correct one error reported by Atkinson before beginning our analysis, and we adjust the height climbed to be recorded in thousands of feet.

Source: Atkinson, A. C. (1986). Comment: Aspects of diagnostic regression analysis (discussion of paper by Chatterjee and Hadi). *Statistical Science*, 1, 397-402.

```r
url = 'http://www.statsci.org/data/general/hills.txt'
races.table = read.table(url, header = TRUE, sep = '\t')
races.table[18, 4] = 18.65
races.table$Climb = races.table$Climb / 1000
head(races.table)
```

|   | Race        | Distance | Climb | Time   |
|---|-------------|----------|-------|--------|
| 1 | Greenmantle | 2.5      | 0.650 | 16.083 |
| 2 | Carnethy    | 6.0      | 2.500 | 48.350 |
| 3 | CraigDunain | 6.0      | 0.900 | 33.650 |
| 4 | BenRha      | 7.5      | 0.800 | 45.600 |
| 5 | BenLomond   | 8.0      | 3.070 | 62.267 |
| 6 | Goatfell    | 8.0      | 2.866 | 73.217 |

## part a

Create a scatterplot matrix of the quantitative variables contained in the races.table dataset. Interpret this scatterplot matrix. Which variable do you think will be more important in predicting the record time of that race?
```r
# Use this code chunk for your answer.
for_matrix1 = races.table[, 2:4]
pairs(for_matrix1)
```

[Scatterplot matrix of Distance, Climb, and Time]

Answer: In the scatterplot matrix, Distance shows the strongest and most linear relationship with Time, so Distance seems like the more important predictor of record time.

## part b

Fit a multiple regression model predicting the record time of a race from the distance travelled, the height climbed, and an interaction of the two variables. Report the summary of the model. What is the R^2 for this model? What does this suggest about the strength of the model?

```r
# Use this code chunk for your answer.
lm1 = lm(Time ~ Distance * Climb, data = races.table)
summary(lm1)
```
```
## 
## Call:
## lm(formula = Time ~ Distance * Climb, data = races.table)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.3078  -2.8309   0.7048   2.2312  18.9270 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.3532     3.9122  -0.090 0.928638    
## Distance         4.9290     0.4750  10.377 1.32e-11 ***
## Climb            3.5217     2.3686   1.487 0.147156    
## Distance:Climb   0.6731     0.1746   3.856 0.000545 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.35 on 31 degrees of freedom
## Multiple R-squared:  0.9806, Adjusted R-squared:  0.9787 
## F-statistic: 521.1 on 3 and 31 DF,  p-value: < 2.2e-16
```

Answer: The R^2 value of 0.9806 suggests that the model is strong. It explains 98.06% of the variance in record time using Distance, Climb, and their interaction.

## part c

Interpret the first-order coefficient pertaining to Distance. Then, calculate the slopes for Distance for a race whose Climb is 0.3 (300 feet) and again for a race whose Climb is 3 (3000 feet).

```r
# Use this code chunk for your answer.
coef_distance = 4.9290
coef_interaction = 0.6731

climb_value_1 = 0.3
slope_1 = coef_distance + (coef_interaction * climb_value_1)

climb_value_2 = 3
slope_2 = coef_distance + (coef_interaction * climb_value_2)

slope_1
## [1] 5.13093
slope_2
## [1] 6.9483
```

Answer: Because the model includes a Distance:Climb interaction, the first-order coefficient for Distance (4.929) is the slope for Distance when Climb = 0: for a race with no climb, the record time is predicted to increase by 4.929 minutes for each additional mile of distance. The slopes for Distance for a race whose Climb is 0.3 (300 feet) and for a race whose Climb is 3 (3000 feet) are 5.13093 and 6.9483, respectively.
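As a side check, the same slopes can be computed directly from the fitted model object instead of from retyped, rounded estimates; a minimal sketch, with `distance_slope` as a helper defined here for illustration:

```r
# A sketch: compute the Distance slope at a given Climb from coef(lm1),
# avoiding hard-coded (rounded) estimates.
distance_slope = function(model, climb) {
  unname(coef(model)["Distance"] + coef(model)["Distance:Climb"] * climb)
}
distance_slope(lm1, 0.3)  # should be about 5.131
distance_slope(lm1, 3)    # should be about 6.948
```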
## part d

Identify any influential points as defined in the lecture. Which of these observations, if any, are especially influential based on their values? For these influential points, do they have high leverage, high standardized residual, both, or neither?

```r
# Use this code chunk for your answer.
cooks.distance(lm1)[cooks.distance(lm1) > (4/35)]
##        7       11       35 
## 3.758307 2.704165 1.805942
hatvalues(lm1)[hatvalues(lm1) > (8/35)]
##         7        11        33        35 
## 0.5207512 0.7182517 0.2379383 0.3261854
rstandard(lm1)[abs(rstandard(lm1)) > 2]
##         7        11        35 
##  3.719559  2.059866 -3.862957
```

Answer: Observations 7, 11, and 35 are influential based on their Cook's distances, and all three are especially influential given how large those values are. These observations exhibit both high leverage and high standardized residuals, which is what makes them particularly influential in the model. (Observation 33 has high leverage but is not flagged as influential by Cook's distance.)

## part e

Refit the model from part b without any points that you identified as influential. Note: this is not something that we should automatically do, but we will do it for now as a demonstration of how much our model may be affected by these points! Print the coefficients for this model. How do they compare to the coefficients from the model in part b?

Hint: Create a subset of your data that only includes those points that are not influential before fitting your data.

```r
# Use this code chunk for your answer.
remove_obs = races.table[-c(7, 11, 35), ]
lm2 = lm(Time ~ Distance * Climb, data = remove_obs)
summary(lm2)
```
```
## 
## Call:
## lm(formula = Time ~ Distance * Climb, data = remove_obs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1574  -2.7089   0.3387   2.2074  10.3180 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.6141     3.3490   0.183 0.855828    
## Distance         5.1003     0.6079   8.390 3.98e-09 ***
## Climb            1.8117     1.7531   1.033 0.310273    
## Distance:Climb   0.7105     0.1663   4.272 0.000202 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.877 on 28 degrees of freedom
## Multiple R-squared:  0.9778, Adjusted R-squared:  0.9754 
## F-statistic: 411.1 on 3 and 28 DF,  p-value: < 2.2e-16
```

Answer: After removing the influential points, the new model (lm2) still fits well and predicts race times accurately. The changes in the coefficients show that the influential points did affect the estimates: the coefficient for Distance increases slightly (from about 4.93 to 5.10), the coefficient for Climb drops by about half (from 3.52 to 1.81), the interaction coefficient barely changes, and the intercept switches from negative to positive.

## part f

How much does this updated model affect our actual predictions for the response? Let's create a scatterplot that compares our fitted values from our original model to those from our newer model (influential points removed).

Calculate and save each of the fitted values (for the original model and for the newer model) to their own named object in R. Note: If you are using the predict function, you can supply as an argument newdata = races.table since we will use all of the variables and all of the data.

Then, create a dataframe in R by providing your two named objects with fitted values as two arguments inside the data.frame function, and save the result to a new named object in R.

Now, create a scatterplot to compare the fitted values for each model. Include an appropriate title and axes labels. All other formatting is optional and up to you! It might be helpful to add a line with intercept 0 and slope 1 to represent what perfect matching would look like.

Finally, briefly comment on what this plot reveals. Would you say there are big differences in the predictions made by each model, or would you say the predictions by each model are quite similar? Is this what you would expect from the results in part d?

```r
# Use this code chunk for your answer.
old = fitted(lm1)
old = old[-c(7, 11, 35)]
new = fitted(lm2)
df = data.frame(old, new)
ggplot(df, aes(x = old, y = new)) +
  geom_point() +
  labs(x = 'Old', y = 'New', title = 'Comparison') +
  geom_abline(slope = 1, intercept = 0, color = 'Red')
```

[Scatterplot of fitted values from lm2 against fitted values from lm1, with the reference line y = x]
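Since the images do not survive in this extract, a quick numeric companion to the plot is to summarize the absolute differences between the two sets of fitted values; a minimal sketch reusing the `old` and `new` objects above (output not shown):

```r
# A sketch: how far apart are the two models' fitted values for the
# 32 races kept in both models?
summary(abs(old - new))
max(abs(old - new))  # largest single disagreement, in minutes
```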
Answer: The two models are producing nearly the same predictions for the response variable, so I wouldn't say there is a big difference between the two models. This is what I expected, since they have similar R^2 values.

# Exercise 3: Hospital SUPPORT Data: Unusual Observations [29 points]

For this exercise, we will use the data stored in hospital.csv on Canvas. It contains a random sample of 580 seriously ill hospitalized patients from a famous study called "SUPPORT" (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment). As the name suggests, the purpose of the study was to determine what factors affected or predicted outcomes, such as how long a patient remained in the hospital. The variables in the dataset are:

- Days - Day to death or hospital discharge
- Age - Age on day of hospital admission
- Sex - Female or male
- Comorbidity - Patient diagnosed with more than one chronic disease
- EdYears - Years of education
- Education - Education level; high or low
- Income - Income level; high or low
- Charges - Hospital charges, in dollars
- Care - Level of care required; high or low
- Race - Non-white or white
- Pressure - Blood pressure, in mmHg
- Blood - White blood cell count, in gm/dL
- Rate - Heart rate, in bpm
## part a

Fit a model with Charges as the response, and with predictors of EdYears, Pressure, and Age.

```r
# Use this code chunk for your answer.
hospital = read.csv("F:\\UIUC\\STAT420\\hospital.csv")
lm3 = lm(Charges ~ Age + EdYears + Pressure, data = hospital)
summary(lm3)
## 
## Call:
## lm(formula = Charges ~ Age + EdYears + Pressure, data = hospital)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -70326 -41609 -26872   5233 477250 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  79567.4    21731.1   3.661 0.000274 ***
## Age           -643.0      211.1  -3.047 0.002421 ** 
## EdYears       1407.7      906.4   1.553 0.120937    
## Pressure       -33.8      126.6  -0.267 0.789536    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79640 on 576 degrees of freedom
## Multiple R-squared:  0.02279, Adjusted R-squared:  0.0177 
## F-statistic: 4.478 on 3 and 576 DF,  p-value: 0.00405
```

## part b

Calculate the leverages for each observation in the dataset. How many observations have leverages above our course threshold? Make a histogram of all leverages for the dataset. Does the course threshold seem to fall at a good cutoff for this model?

```r
# Use this code chunk for your answer.
n = length(resid(lm3))
p = length(coef(lm3))
n
## [1] 580
p
## [1] 4
2 * p/n
## [1] 0.0137931
```
```r
# Note: this name masks the hatvalues() function, but calls like
# hatvalues(lm3) still resolve to the function.
hatvalues = hatvalues(lm3)
above = hatvalues[hatvalues > (2 * p/n)]
above
##          2         10         11         15         19         23         27 
## 0.01596837 0.04627055 0.01570829 0.01542179 0.01583249 0.01598336 0.01543236 
##         57        118        130        131        153        180        198 
## 0.01794143 0.02054003 0.02644400 0.01671699 0.01481928 0.01456018 0.02423647 
##        201        209        222        224        265        282        289 
## 0.01524729 0.01754252 0.01489532 0.01509594 0.01745183 0.01441938 0.01450108 
##        298        317        341        349        368        402        407 
## 0.02035020 0.02056658 0.01486582 0.01575459 0.02041636 0.02053973 0.02234620 
##        423        443        469        474        499        511        513 
## 0.02178170 0.03114276 0.01465574 0.01648916 0.02107845 0.02138013 0.01414533 
##        550        556        575        580 
## 0.01689769 0.01688995 0.01496846 0.01570199
length(above)
## [1] 39
hist(hatvalues, breaks = 20)
```

[Histogram of leverages]

Answer: 39 observations are above the threshold. Based on the histogram, the threshold seems to fall at a reasonable cutoff for this model.
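For intuition about what hatvalues() returns, the leverages are the diagonal entries of the hat matrix H = X(X'X)^{-1}X'; a minimal sketch, assuming the lm3 fit above:

```r
# A sketch: recompute the leverages from the hat matrix by hand and
# confirm they match R's hatvalues().
X = model.matrix(lm3)                 # 580 x 4 design matrix
H = X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix
all.equal(unname(diag(H)), unname(hatvalues(lm3)))  # should be TRUE
```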
## part c

Calculate the standardized residuals for each observation in the dataset. How many observations are designated as having a high standardized residual based on our course threshold? Generate a histogram of all standardized residuals for the dataset. What is the shape of this histogram?

```r
# Use this code chunk for your answer.
sr = rstandard(lm3)
above_sr = sr[abs(sr) > 2]
hist(sr)
length(above_sr)
## [1] 32
```

[Histogram of standardized residuals]

Answer: 32 observations are above our course threshold. The histogram is right-skewed.

## part d

Calculate the Cook's distance for each observation in the dataset. Print only those observations that are above the threshold defined in lecture. After looking through these Cook's distances by eye, the Cook's distance for what specific observations, if any, appear to be especially large? Finally, what is Cook's distance used to measure?
```r
# Use this code chunk for your answer.
above_c = cooks.distance(lm3)[cooks.distance(lm3) > (4/n)]
above_c
##           2           3          14          15          16          24 
## 0.030335886 0.022467402 0.014247049 0.017506191 0.049720108 0.007418391 
##          26          34          35          38          39          53 
## 0.049569896 0.015919473 0.039607688 0.012293791 0.025586612 0.060476097 
##          58          67          74          75          77         111 
## 0.045286259 0.019589190 0.010380495 0.009795234 0.007830997 0.015021677 
##         191         197         204         205         218         224 
## 0.035650610 0.007857909 0.045482460 0.010155139 0.013588456 0.014673965 
##         249         252         257         290         327         351 
## 0.008886250 0.036517316 0.007009659 0.012064199 0.007414654 0.011625540 
##         368         402         479 
## 0.038139902 0.055833796 0.025653720
hist(cooks.distance(lm3))
```

[Histogram of Cook's distances]

Answer: The Cook's distances for observations 53 (0.060), 402 (0.056), 16 (0.050), 26 (0.050), and 58 (0.045) appear especially large. Cook's distance measures the overall influence of an observation on the fitted model (roughly, how much the fitted values change when that observation is deleted), so it is used to identify influential observations.

## part e

Generate the default plots in R. Then, interpret each of these plots.
```r
# Use this code chunk for your answer.
plot(lm3)
```
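As an optional formatting sketch: plot(lm3) draws its four diagnostic plots one at a time; to view them together, the device can be split into a 2x2 grid first:

```r
# A sketch: arrange all four default diagnostic plots in one 2x2 grid.
par(mfrow = c(2, 2))  # 2 rows x 2 columns of plots
plot(lm3)
par(mfrow = c(1, 1))  # restore the default single-plot layout
```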
[The four default diagnostic plots for lm3: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]

Answer: In the Residuals vs Fitted plot, the residuals do not scatter randomly around zero, so the linearity assumption does not appear valid. The Q-Q plot suggests that the normality assumption is not valid either. In the Scale-Location plot, the residuals are not randomly spread but accumulate around the red line. In the Residuals vs Leverage plot, one observation falls far beyond the Cook's distance lines, and several other observations have large Cook's distances too.
## part f

In order to assess the fit of this model, calculate the value of the RMSE using leave-one-out cross validation.

```r
# Use this code chunk for your answer.
# LOOCV RMSE via the leave-one-out shortcut: e_i / (1 - h_ii)
sqrt(mean((resid(lm3) / (1 - hatvalues(lm3)))^2))
## [1] 79951.01
```

# Exercise 4: Hospital SUPPORT Data, Days Variable [21 points]

For this exercise, we will continue analyzing the hospital dataset. We will focus in particular on whether we should add the Days variable to the model from Question 3 (predicting Charges from EdYears, Pressure, and Age).

## part a

Calculate the measure of collinearity for the Days variable. What does this information tell us?

```r
# Use this code chunk for your answer.
lm4 = lm(Charges ~ Age + EdYears + Pressure + Days, data = hospital)
vif_lm4 = vif(lm4)
vif_days = vif_lm4["Days"]
vif_days
##     Days 
## 1.018054
# R^2 for predicting Charges from everything except Days, for reference
lm_dropday = lm(Charges ~ . - Days, data = hospital)
summary(lm_dropday)$r.squared
## [1] 0.2551996
r2 = 1 - 1/vif_days
r2
##       Days 
## 0.01773431
```

Answer: The measure of collinearity, 1 - 1/VIF, gives the percent of the variation in the candidate variable that is explained by its linear relationship with the other predictors; equivalently, it tells us how much of the information in the candidate variable is already contained in the other predictors. Here it is only about 1.8%, so I can confidently include the Days variable in the model without worrying about multicollinearity.
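As a cross-check, since VIF = 1/(1 - R^2), the same value of about 0.0177 should appear as the R^2 from regressing Days on the other three predictors; a minimal sketch:

```r
# A sketch: the collinearity R^2 for Days is the R^2 from regressing Days
# on the other predictors in lm4; it should match 1 - 1/VIF (about 0.0177).
summary(lm(Days ~ Age + EdYears + Pressure, data = hospital))$r.squared
```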
## part b

In this question, we'll create the partial correlation coefficient and the variable added plot for adding the Days variable to the Question 3 model.

To start, create and save the residuals for the two models needed for this calculation. Save both of the residuals to their own R objects. Calculate the partial correlation coefficient for the considered predictor variable Days.

Then, generate the variable added plot for this considered predictor variable. If you aren't sure how to create the variable added plot with R code, refer to the last part of Textbook Section 15.2.1 (just before the end of the section) for a model of the code. Make sure to include an appropriate title and axes labels.

What do the partial correlation coefficient and the variable added plot indicate about adding the Days variable to the model?

```r
# Use this code chunk for your answer.
residuals_no_days = resid(lm3)
residuals_with_days = resid(lm4)
partial_corr_coef = cor(residuals_no_days, residuals_with_days)
partial_corr_coef
## [1] 0.7305195
avPlots(lm4)
```

[Added-variable plots for lm4 from avPlots]
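For comparison, the textbook construction of the added-variable plot pairs the residuals of Charges regressed on the other predictors with the residuals of Days regressed on those same predictors; by the Frisch-Waugh-Lovell result, the slope of a line fit through that plot equals the Days coefficient in lm4. A sketch of that construction:

```r
# A sketch of the textbook added-variable construction for Days.
e_charges = resid(lm(Charges ~ Age + EdYears + Pressure, data = hospital))
e_days    = resid(lm(Days ~ Age + EdYears + Pressure, data = hospital))

plot(e_days, e_charges,
     main = "Variable Added Plot for Days",
     xlab = "Residuals of Days ~ others",
     ylab = "Residuals of Charges ~ others")
abline(lm(e_charges ~ e_days), col = "red")

coef(lm(e_charges ~ e_days))[2]  # slope of the added-variable fit
coef(lm4)["Days"]                # should match the slope above
```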
Answer: The partial correlation coefficient of about 0.73 suggests a moderate-to-strong positive relationship between the residuals of the model without the Days variable and the residuals of the model with the Days variable, while controlling for the other predictors in the model. From the plot we can see the slope is positive and relatively steep, with the points clustered around the line. Both indicate that adding the Days variable to the model is worthwhile.

## part c

Fit a linear model to the variable added plot. Hint: you can use the residuals directly in the lm function in the y and x locations without needing to create a new data frame. Is the slope for this linear model significantly different from 0? What does that suggest in terms of adding the Days variable to the model?

```r
# Use this code chunk for your answer.
lm_variable_added = lm(residuals_with_days ~ residuals_no_days)
summary(lm_variable_added)
## 
## Call:
## lm(formula = residuals_with_days ~ residuals_no_days)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -325244   -7587    3664   12132  198241 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.637e-11  1.647e+03    0.00        1    
## residuals_no_days 5.337e-01  2.075e-02   25.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39660 on 578 degrees of freedom
## Multiple R-squared:  0.5337, Adjusted R-squared:  0.5329 
## F-statistic: 661.4 on 1 and 578 DF,  p-value: < 2.2e-16
```

Answer: The slope coefficient for residuals_no_days is estimated to be approximately 0.5337, and its p-value is far below any typical significance level, so the slope is significantly different from 0. Overall, this suggests that adding the Days variable to the model has a substantial impact on the model's fit.

## part d

Generate an ANOVA test between our two models, and print the resulting table. Which model do you prefer, and what does that indicate about the slope for the Days variable?

```r
# Use this code chunk for your answer.
anova(lm3, lm4)
```
2023/11/3 08:04 Homework 9 file:///F:/UIUC/STAT420/Homework9.html 18/21 Res.Df <dbl> RSS <dbl> Df <dbl> Sum of Sq <dbl> F <dbl> Pr(>F) <dbl> 2 575 1.949627e+12 1 1.703695e+12 502.4676 1.88366e-80 2 rows Answer: The F-statistic is significantly greater than 1, with a p-value less than 2.2e-16 indicating that model 2 is a significantly better fit than model 1 part e Calculate the Variance Inflation Factors for the four predictor variables, including Days. What do we use the Variance Inflation Factor to help identify? What variables (if any) indicate a cause for concern? Explain. # Use this code chunk for your answer. vif_lm4 ## Age EdYears Pressure Days ## 1.028723 1.024975 1.015065 1.018054 Answer: We use if for identifying multicollinearity. There is no varaible to concern about multicollinearity in this model. Exercise 5: Credit Data [15 points] For this exercise, use the Credit data in the ISLR package. Use the following line of code to remove the ID variable, which is not useful for modeling. data(Credit) Credit = subset(Credit, select = -c(ID)) Use ?Credit to learn about this dataset. Our goal is to try to predict how much credit card Balance an individual has based on other information about them and their credit levels. We will take a very systematic aproach – it’s not necessarily a “correct” approach, but it should help us make appropriate modeling decisions. Do the following: First, let’s create a full model that includes all predictors. Then compute the VIFs of the predictors in this model. You should notice there is clear collinearity between two predictors; run two more models, one with one of these predictors removed, and the other with the other predictor removed. Using from these models, determine which of these two collinear predictors offers the weaker contribution. Identify in the white space below which of these two predictors you are dropping from the model. Finally, calculate the measure of collinearity for each of these two variables first from their VIFs and then by fitting two more models, predicting the variable of interest from the other predictor variables.
```r
# Use this code chunk for your answer.
full_model = lm(Balance ~ ., data = Credit)
#summary(full_model)
vif_full = vif(full_model)
#vif_full

model_without_Limit = lm(Balance ~ . - Limit, data = Credit)
model_without_Rating = lm(Balance ~ . - Rating, data = Credit)
#summary(model_without_Limit)
#summary(model_without_Rating)

vif(model_without_Limit)
##               GVIF Df GVIF^(1/(2*Df))
## Income    2.784966  1        1.668822
## Rating    2.730561  1        1.652441
## Cards     1.019639  1        1.009772
## Age       1.051135  1        1.025249
## Education 1.013503  1        1.006729
## Gender    1.005848  1        1.002920
## Student   1.022092  1        1.010986
## Married   1.032237  1        1.015991
## Ethnicity 1.027285  2        1.006753
vif(model_without_Rating)
##               GVIF Df GVIF^(1/(2*Df))
## Income    2.774623  1        1.665720
## Limit     2.709488  1        1.646052
## Cards     1.008299  1        1.004141
## Age       1.051328  1        1.025343
## Education 1.013501  1        1.006728
## Gender    1.005827  1        1.002909
## Student   1.022750  1        1.011311
## Married   1.032113  1        1.015929
## Ethnicity 1.026691  2        1.006607

vif_rating = 2.730561
vif_limit = 2.709488
r2_collinearity_limit = 1 - 1/vif_limit
r2_collinearity_rating = 1 - 1/vif_rating
r2_collinearity_limit
## [1] 0.6309266
r2_collinearity_rating
## [1] 0.6337749
```
```r
model_limit = lm(Limit ~ Income + Rating + Cards + Age + Education + Gender +
                   Student + Married + Ethnicity, data = Credit)
summary(model_limit)
## 
## Call:
## lm(formula = Limit ~ Income + Rating + Cards + Age + Education + 
##     Gender + Student + Married + Ethnicity, data = Credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -394.10 -100.71   10.46  104.20  340.12 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -367.2272    52.1086  -7.047 8.35e-12 ***
## Income                0.1493     0.3622   0.412   0.6804    
## Rating               14.8891     0.0817 182.237  < 2e-16 ***
## Cards               -72.0724     5.6333 -12.794  < 2e-16 ***
## Age                  -0.1450     0.4547  -0.319   0.7499    
## Education             3.7662     2.4643   1.528   0.1272    
## GenderFemale         -0.3002    15.3350  -0.020   0.9844    
## StudentYes          -48.7662    25.7481  -1.894   0.0590 .  
## MarriedYes          -34.4446    15.9339  -2.162   0.0312 *  
## EthnicityAsian       25.9677    21.7997   1.191   0.2343    
## EthnicityCaucasian    2.8399    18.8858   0.150   0.8805    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 152.8 on 389 degrees of freedom
## Multiple R-squared:  0.9957, Adjusted R-squared:  0.9956 
## F-statistic:  9065 on 10 and 389 DF,  p-value: < 2.2e-16

model_rating = lm(Rating ~ Income + Limit + Cards + Age + Education + Gender +
                    Student + Married + Ethnicity, data = Credit)
summary(model_rating)
```
```
## 
## Call:
## lm(formula = Rating ~ Income + Limit + Cards + Age + Education + 
##     Gender + Student + Married + Ethnicity, data = Credit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.9337  -7.1775  -0.5242   6.3086  27.8718 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        26.6256084  3.4394669   7.741 8.59e-14 ***
## Income              0.0307343  0.0241424   1.273   0.2038    
## Limit               0.0663856  0.0003643 182.237  < 2e-16 ***
## Cards               4.8756963  0.3740563  13.035  < 2e-16 ***
## Age                 0.0053018  0.0303635   0.175   0.8615    
## Education          -0.2515127  0.1645509  -1.528   0.1272    
## GenderFemale        0.0939119  1.0239560   0.092   0.9270    
## StudentYes          3.1405884  1.7198352   1.826   0.0686 .  
## MarriedYes          2.3115248  1.0638932   2.173   0.0304 *  
## EthnicityAsian     -1.8296226  1.4553333  -1.257   0.2094    
## EthnicityCaucasian -0.1910695  1.2610642  -0.152   0.8796    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.2 on 389 degrees of freedom
## Multiple R-squared:  0.9958, Adjusted R-squared:  0.9957 
## F-statistic:  9136 on 10 and 389 DF,  p-value: < 2.2e-16
```

Answer: There is clear collinearity between Limit and Rating. Of the two, Rating offers the weaker contribution, so Rating is the predictor I am dropping from the model.
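One caveat on the numbers above: the 1 - 1/VIF values of about 0.63 use the VIFs from the reduced models, while model_limit and model_rating regress each variable on all of the other predictors (including the other collinear one), which is why their R^2 values are about 0.996. The corresponding full-model VIFs should therefore be roughly 1/(1 - 0.996), i.e. over 200; a minimal sketch to check (output not shown):

```r
# A sketch: full-model generalized VIF rows for Limit and Rating; both
# should be very large, consistent with the R^2 values near 0.996 in the
# two regressions above.
vif_full[c("Limit", "Rating"), ]
```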