Week 9 Sklearn Fish
pptx
keyboard_arrow_up
School
St. John's University *
*We aren’t endorsed by this school
Course
243
Subject
Statistics
Date
Feb 20, 2024
Type
pptx
Pages
7
Uploaded by DukeMeerkatPerson605
SciKit-Learn
Linear Regression
Linear Regression
●
Linear regression is a supervised machine learning algorithm ●
Target variable modeled on independent variables
●
Can be between univariate, multivariate
●
This lab demonstrates Linear regression using Sklearn
Import Libraries
We’re going to not just model, but display our data.
Step 1: Download the following libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Load and Explore Data
Use the Fish dataset. Choose the same two columns you used for previous lab.
Step 2: Load the data using pandas. It’s always a good idea to run df.head() after you load and at least describe(). Make sure the data looks as expected
Step 3: Plot the scatter plot
Seaborn has an lmplot which can display a scatter plot and draw a regression line. Use the following parameters: ci = None, line_kws={"color": "red"}. This will remove the confidence interval, and generate a red fit line. Compare it to the Statsmodels line. Is it close?
Generate the variables and Fit the model
●
Step 4: SKlearn works with Arrays. You’ll need to convert your X and Y variables into 1D numpy arrays. There’s many ways to do this. Try reshape(-
1,1)
●
Step 5: We’ve imported the train_test_split module. This lets us perform a split for training purposes.
○
Data is split into training dataset, used to model the data
○
Testing dataset used to check accuracy
●
The code here is a little bit tough. The code is in the notes section of this slide deck. Run this code.
●
Note the score - that’s how accurate your X variable is at predicting your Y.
Visualize the model
Step 6 Sklearn uses regr.predict to predict the values for the predicted line. This line should be the best fit. Your line may look very similar to the OLS line you generated above. This will depend on the variables you chose. I chose Length1 and Weight. My accuracy was 81%. My line was nearly straight. Therefore, it looks almost the same as this model.
y_pred = regr.predict(X_test)
plt.scatter(X_test, y_test, color ='b')
plt.plot(X_test, y_pred, color ='r')
plt.show()
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Metrics
We already know our accuracy, that’s contained in the variable regr.score. Similar to the way Statsmodels had a model fit, we can also generate some model statistics with Sklearn. Run the below to generate some model statistics. The RMSE or root mean squared error is the most important. Think of it as the average error of the model. The smaller the better. What was the RMSE of your model?
from sklearn.metrics import mean_absolute_error,mean_squared_error
mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_test,y_pred=y_pred) #default=True
rmse = mean_squared_error(y_true=y_test,y_pred=y_pred,squared=False)
print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
Related Documents
Related Questions
IConsider the following multiple linear regression model and the Excel
print out of its regression results:
Beer = Bo + BIEDUC + B2AGE + BAGE? + BAGENDER + BERACE+ BEGENDER*RACE+ E, where
Beer is monthly beer consumption (ounces), EDUC is years of education. We have 2 qualitative variables:
gender and race. Gender takes 2 values, GEN=1 if the person is male and GEN=0 for females. The
variable race also takes 2 values, RACE=1 if the person is white and RACE=0 if the person is not white.
SUMMARY OUTPUT
Regression Statistics
R Square
Adjusted R Square
Standard Error
Observations
???
0.4684
???
40
ANOVA
df
MS
Regression
Residual
???
319.3
64.8
???
???
???
8.43
Total
???
597.5
Coefficients
Standard Error
Intercept
-150.254
107.397
EDUC
-16.7755
8.4579
75.45905
-1.72456
AGE
37.3261
AGE
0.5397
GEN
238.9424
81.6054
RACE
123.7404
103.1804
GEN. RACE
76.4308
51.0670
a. Calculate the missing numbers (???).
b.Interpret the parameter of RACE (123.74).
c. Is the parameter of RACE (Bs) significant?…
arrow_forward
A researcher conducts a multiple regression with Y as the dependent variable and X1,
X2, X3 and X4 as explanatory variables. Using the regression output below, fully
describe this model and discuss important parts of the output. What is the predicted
value of Y if X1 = 3, X2 = 15, X3 = 7 and X4 = 0.003?
%3D
SUMMARY OUTPUT
Regression Staistics
Muliple R
R Square
Adjusted R Square
Standard Emor
Observations
0.7236
0.5236
0.5159
5.3928
252
ANOVA
Significance F
1. 10662E-38
SS
MS
Regression
Residual
1973 9392
29.0820
67.8749
7895.7567
7183.2599
4
247
Total
251
15079.0166
Upper 95%
33.4049
Coefficients
Standard Eror
t Stat
Pvalue
2.2273 0.026830873
Lower 95%
7.9594
2.0508
Intercept
X1
17.7278
1.5583
0.2750
5.6662
4.05265E-08
1.0166
2.0999
X2
1.8376
0.1997
9.1999
1.4442
-74708
-3721 4324
1.55861E-17
2.2310
X3
55100
-5.5348
7.94036E-08
-3.5492
X4
-3.1079
1887 8435
-0.0016
0.998687788
3715 2166
arrow_forward
We have data on Lung Capacity of persons and we wish
to build a multiple linear regression model that predicts
Lung Capacity based on the predictors Age and
Smoking Status. Age is a numeric variable whereas
Smoke is a categorical variable (0 if non-smoker, 1 if
smoker). Here is the partial result from STATISTICA.
b*
Std.Err.
of b*
Std.Err.
N=725
of b
Intercept
Age
Smoke
0.835543
-0.075120
1.085725
0.555396
0.182989
0.014378
0.021631
0.021631
-0.648588
0.186761
Which of the following statements is absolutely false?
A. The expected lung capacity of a smoker is expected
to be 0.648588 lower than that of a non-smoker.
B. The predictor variables Age and Smoker both
contribute significantly to the model.
C. For every one year that a person gets older, the lung
capacity is expected to increase by 0.555396 units,
holding smoker status constant.
D. For every one unit increase in smoker status, lung
capacity is expected to decrease by 0.648588 units,
holding age constant.
arrow_forward
Independent variable data is listed in cells B2 through B100, and dependent variable data is in cells C2 through C100. Which spreadsheet function would calculate the slope of a linear regression model of this data?
Group of answer choices
=SLOPE(B2:B100,C2:C100)
=SLOPE(C2:C100,B2:B100)
=SLOPE(B2,C2)
=SLOPE(C2,C100,B2,B100)
arrow_forward
A group of Maternal and Child Health public health practitioners are interested in the relationship between bacterial vaginosis (BV) and a number of negative health outcomes. Suppose the research team gathers information on a group of participants, and constructs a multiple linear regression model looking at the relationship between BV and depression, controlling for maternal age. The following is a computerized output displaying the results of their analysis.Parameter Intercept Maternal Age DepressionEstimate StandardError tValue Pr>|t|0.2186206635 -.0046496845 0.19124124150.06635040 0.00221338 0.031518843.29 0.0010 -2.10 0.0360 6.07 <.0001
A) What are the dependent and independent variables in this investigation?B) Based on the information above, was the research team justified in controlling for maternal age in this population? Why or why not?C) Write out the model in symbols. Round to 3 decimal places.D) Is there a significant association between BV and depression?
arrow_forward
Corvette, Ferrari, and Jaguar produced a variety of classic cars that continue to increase in value. The data showing the rarity rating (1–20) and the high price ($1000s) for 15 classic cars is contained in the Excel Online file below. Construct a spreadsheet to answer the following questions.
Open spreadsheet
Develop a scatter diagram of the data using the rarity rating as the independent variable and price as the dependent variable. Does a simple linear regression model appear to be appropriate?
A simple linear regression model _________appearsdoes not appear to be appropriate.
Develop an estimated multiple regression equation with rarity rating and as the two independent variables.
(to whole numbers)
What is the value of the coefficient of determination? Note: report between 0 and 1.
(to 3 decimals)
What is the value of the test statistic?
(to 2 decimals)
What is the -value?
(to 4 decimals)
Consider the nonlinear relationship shown by equation .…
arrow_forward
Wanting to study the effect of exercise on preventing the common cold, a researcher collects 500 test subjects and randomly assigns them to a treatment group instructed to exercise and a control group instructed to not exercise. He later records the number of colds for each group. This is an example of:
linear regression
an experiment
an observational study
independence
arrow_forward
A residual plot is a plot in which the residuals are plot ted against the value of
the explanatory variable x.
When a residual plot exabits a noticeable pattern, the variables do not have a
linear relationship, and the least square regression line should not be used.
When a residual plot exabits no noticeable pattern, the least square regression
line may be used to describe the relationship between the variables.
For each of the following residual plots, determine whether a linear model is
appropriate.
a.
Residual
0.5
2.0
1.5
1.0
0505
0
-0.5
-1.0
-1.5
-2.0
-2
The residual plot in (a)
[Select]
square regression line
[Select]
b.
Residual
19
15
10
5
0
-5
-10-
-15
-10
10
15
and thus the least-
The residual plot in (b) does not exabit a noticeable pattern and thus the least
-square regression line regression line may be used
arrow_forward
Suppose a multiple regression model is fitted into a variable called model. Which
Python method below returns residuals for a data set based on a multiple regression
model? Select one.
model.residualsvalues
O model.residvalues
model.residuals
model.resid
arrow_forward
The residual plot for a linear regression model is shown below. Assess the fit of the linear model, and justify your answer.
The line is a good fit because the points on the residual plot have a clear pattern.
The line is a good fit because the points on the residual plot do not have any noticeable pattern.
The line is not a good fit because the points on the residual plot do not have any noticeable pattern.
The line is not a good fit because the points on the residual plot have a clear pattern.
arrow_forward
Pls help ASAP. Pls show all work.
arrow_forward
Pls help ASAP. Pls show all work.
arrow_forward
Tire pressure (psi) and mileage (mpg) were recorded for a random sample of seven cars of thesame make and model. The extended data table (left) and fit model report (right) are based on aquadratic model
What is the predicted average mileage at tire pressure x = 31?
arrow_forward
SEE MORE QUESTIONS
Recommended textbooks for you

Elementary Linear Algebra (MindTap Course List)
Algebra
ISBN:9781305658004
Author:Ron Larson
Publisher:Cengage Learning

Algebra and Trigonometry (MindTap Course List)
Algebra
ISBN:9781305071742
Author:James Stewart, Lothar Redlin, Saleem Watson
Publisher:Cengage Learning

College Algebra
Algebra
ISBN:9781305115545
Author:James Stewart, Lothar Redlin, Saleem Watson
Publisher:Cengage Learning


Glencoe Algebra 1, Student Edition, 9780079039897...
Algebra
ISBN:9780079039897
Author:Carter
Publisher:McGraw Hill

Functions and Change: A Modeling Approach to Coll...
Algebra
ISBN:9781337111348
Author:Bruce Crauder, Benny Evans, Alan Noell
Publisher:Cengage Learning
Related Questions
- IConsider the following multiple linear regression model and the Excel print out of its regression results: Beer = Bo + BIEDUC + B2AGE + BAGE? + BAGENDER + BERACE+ BEGENDER*RACE+ E, where Beer is monthly beer consumption (ounces), EDUC is years of education. We have 2 qualitative variables: gender and race. Gender takes 2 values, GEN=1 if the person is male and GEN=0 for females. The variable race also takes 2 values, RACE=1 if the person is white and RACE=0 if the person is not white. SUMMARY OUTPUT Regression Statistics R Square Adjusted R Square Standard Error Observations ??? 0.4684 ??? 40 ANOVA df MS Regression Residual ??? 319.3 64.8 ??? ??? ??? 8.43 Total ??? 597.5 Coefficients Standard Error Intercept -150.254 107.397 EDUC -16.7755 8.4579 75.45905 -1.72456 AGE 37.3261 AGE 0.5397 GEN 238.9424 81.6054 RACE 123.7404 103.1804 GEN. RACE 76.4308 51.0670 a. Calculate the missing numbers (???). b.Interpret the parameter of RACE (123.74). c. Is the parameter of RACE (Bs) significant?…arrow_forwardA researcher conducts a multiple regression with Y as the dependent variable and X1, X2, X3 and X4 as explanatory variables. Using the regression output below, fully describe this model and discuss important parts of the output. What is the predicted value of Y if X1 = 3, X2 = 15, X3 = 7 and X4 = 0.003? %3D SUMMARY OUTPUT Regression Staistics Muliple R R Square Adjusted R Square Standard Emor Observations 0.7236 0.5236 0.5159 5.3928 252 ANOVA Significance F 1. 10662E-38 SS MS Regression Residual 1973 9392 29.0820 67.8749 7895.7567 7183.2599 4 247 Total 251 15079.0166 Upper 95% 33.4049 Coefficients Standard Eror t Stat Pvalue 2.2273 0.026830873 Lower 95% 7.9594 2.0508 Intercept X1 17.7278 1.5583 0.2750 5.6662 4.05265E-08 1.0166 2.0999 X2 1.8376 0.1997 9.1999 1.4442 -74708 -3721 4324 1.55861E-17 2.2310 X3 55100 -5.5348 7.94036E-08 -3.5492 X4 -3.1079 1887 8435 -0.0016 0.998687788 3715 2166arrow_forwardWe have data on Lung Capacity of persons and we wish to build a multiple linear regression model that predicts Lung Capacity based on the predictors Age and Smoking Status. Age is a numeric variable whereas Smoke is a categorical variable (0 if non-smoker, 1 if smoker). Here is the partial result from STATISTICA. b* Std.Err. of b* Std.Err. N=725 of b Intercept Age Smoke 0.835543 -0.075120 1.085725 0.555396 0.182989 0.014378 0.021631 0.021631 -0.648588 0.186761 Which of the following statements is absolutely false? A. The expected lung capacity of a smoker is expected to be 0.648588 lower than that of a non-smoker. B. The predictor variables Age and Smoker both contribute significantly to the model. C. For every one year that a person gets older, the lung capacity is expected to increase by 0.555396 units, holding smoker status constant. D. For every one unit increase in smoker status, lung capacity is expected to decrease by 0.648588 units, holding age constant.arrow_forward
- Independent variable data is listed in cells B2 through B100, and dependent variable data is in cells C2 through C100. Which spreadsheet function would calculate the slope of a linear regression model of this data? Group of answer choices =SLOPE(B2:B100,C2:C100) =SLOPE(C2:C100,B2:B100) =SLOPE(B2,C2) =SLOPE(C2,C100,B2,B100)arrow_forwardA group of Maternal and Child Health public health practitioners are interested in the relationship between bacterial vaginosis (BV) and a number of negative health outcomes. Suppose the research team gathers information on a group of participants, and constructs a multiple linear regression model looking at the relationship between BV and depression, controlling for maternal age. The following is a computerized output displaying the results of their analysis.Parameter Intercept Maternal Age DepressionEstimate StandardError tValue Pr>|t|0.2186206635 -.0046496845 0.19124124150.06635040 0.00221338 0.031518843.29 0.0010 -2.10 0.0360 6.07 <.0001 A) What are the dependent and independent variables in this investigation?B) Based on the information above, was the research team justified in controlling for maternal age in this population? Why or why not?C) Write out the model in symbols. Round to 3 decimal places.D) Is there a significant association between BV and depression?arrow_forwardCorvette, Ferrari, and Jaguar produced a variety of classic cars that continue to increase in value. The data showing the rarity rating (1–20) and the high price ($1000s) for 15 classic cars is contained in the Excel Online file below. Construct a spreadsheet to answer the following questions. Open spreadsheet Develop a scatter diagram of the data using the rarity rating as the independent variable and price as the dependent variable. Does a simple linear regression model appear to be appropriate? A simple linear regression model _________appearsdoes not appear to be appropriate. Develop an estimated multiple regression equation with rarity rating and as the two independent variables. (to whole numbers) What is the value of the coefficient of determination? Note: report between 0 and 1. (to 3 decimals) What is the value of the test statistic? (to 2 decimals) What is the -value? (to 4 decimals) Consider the nonlinear relationship shown by equation .…arrow_forward
- Wanting to study the effect of exercise on preventing the common cold, a researcher collects 500 test subjects and randomly assigns them to a treatment group instructed to exercise and a control group instructed to not exercise. He later records the number of colds for each group. This is an example of: linear regression an experiment an observational study independencearrow_forwardA residual plot is a plot in which the residuals are plot ted against the value of the explanatory variable x. When a residual plot exabits a noticeable pattern, the variables do not have a linear relationship, and the least square regression line should not be used. When a residual plot exabits no noticeable pattern, the least square regression line may be used to describe the relationship between the variables. For each of the following residual plots, determine whether a linear model is appropriate. a. Residual 0.5 2.0 1.5 1.0 0505 0 -0.5 -1.0 -1.5 -2.0 -2 The residual plot in (a) [Select] square regression line [Select] b. Residual 19 15 10 5 0 -5 -10- -15 -10 10 15 and thus the least- The residual plot in (b) does not exabit a noticeable pattern and thus the least -square regression line regression line may be usedarrow_forwardSuppose a multiple regression model is fitted into a variable called model. Which Python method below returns residuals for a data set based on a multiple regression model? Select one. model.residualsvalues O model.residvalues model.residuals model.residarrow_forward
- The residual plot for a linear regression model is shown below. Assess the fit of the linear model, and justify your answer. The line is a good fit because the points on the residual plot have a clear pattern. The line is a good fit because the points on the residual plot do not have any noticeable pattern. The line is not a good fit because the points on the residual plot do not have any noticeable pattern. The line is not a good fit because the points on the residual plot have a clear pattern.arrow_forwardPls help ASAP. Pls show all work.arrow_forwardPls help ASAP. Pls show all work.arrow_forward
arrow_back_ios
SEE MORE QUESTIONS
arrow_forward_ios
Recommended textbooks for you
- Elementary Linear Algebra (MindTap Course List)AlgebraISBN:9781305658004Author:Ron LarsonPublisher:Cengage LearningAlgebra and Trigonometry (MindTap Course List)AlgebraISBN:9781305071742Author:James Stewart, Lothar Redlin, Saleem WatsonPublisher:Cengage LearningCollege AlgebraAlgebraISBN:9781305115545Author:James Stewart, Lothar Redlin, Saleem WatsonPublisher:Cengage Learning
- Glencoe Algebra 1, Student Edition, 9780079039897...AlgebraISBN:9780079039897Author:CarterPublisher:McGraw HillFunctions and Change: A Modeling Approach to Coll...AlgebraISBN:9781337111348Author:Bruce Crauder, Benny Evans, Alan NoellPublisher:Cengage Learning

Elementary Linear Algebra (MindTap Course List)
Algebra
ISBN:9781305658004
Author:Ron Larson
Publisher:Cengage Learning

Algebra and Trigonometry (MindTap Course List)
Algebra
ISBN:9781305071742
Author:James Stewart, Lothar Redlin, Saleem Watson
Publisher:Cengage Learning

College Algebra
Algebra
ISBN:9781305115545
Author:James Stewart, Lothar Redlin, Saleem Watson
Publisher:Cengage Learning


Glencoe Algebra 1, Student Edition, 9780079039897...
Algebra
ISBN:9780079039897
Author:Carter
Publisher:McGraw Hill

Functions and Change: A Modeling Approach to Coll...
Algebra
ISBN:9781337111348
Author:Bruce Crauder, Benny Evans, Alan Noell
Publisher:Cengage Learning