

Theoretical Overview

Suppose we have a set of data consisting of ordered pairs and we suspect the x and y coordinates are related. It is natural to try to find the best line that fits the data points. If we can find this line, then we can use it to make all sorts of other predictions. In this project, we're going to use several functions to find this line using a technique called least squares regression. The result will be what we call the least squares regression line (or LSRL for short).

 

In order to do this, you'll need to program a statistical computation called the correlation coefficient, denoted by r in statistical symbols:
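The formula itself is not reproduced in the text here. A standard form of the sample correlation coefficient, consistent with the later use of sample means and sample standard deviations, is sketched below. The symbols n (number of pairs), x̄ and ȳ (sample means), and s_x and s_y (sample standard deviations) are assumed notation, not taken from the original.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```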

 

 

 

NOTE: The equation is written with the index starting at 1, but Python lists start at index 0.

 

Once you have the correlation coefficient, you use it along with the sample means and sample standard deviations of the x and y-coordinates to compute the slope and y-intercept of your regression line via these formulas:
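The formulas are not reproduced in the text here. The usual least squares relations, written with slope b and y-intercept a (assumed symbol names), are likely what is intended, with the line written intercept-first as in the sample output:

```latex
b = r \, \frac{s_y}{s_x},
\qquad
a = \bar{y} - b \, \bar{x},
\qquad
\hat{y} = a + b x
```

Here r is the correlation coefficient, s_x and s_y are the sample standard deviations, and x̄ and ȳ are the sample means of the x- and y-coordinates.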

 

 

Tasks:

In this project, you must read the x- and y-coordinate pairs in from a data file of unknown length. Each line in the file must contain both coordinates, separated by whitespace, as shown here. In addition, you must use functions in this project, splitting the work up into smaller components and reinforcing your skills with parameter passing and lists.
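One possible way to read such a file is sketched below. The name `readfile` and its single file-name argument come from the test code later in the assignment; skipping blank or incomplete lines is an assumption, not a stated requirement.

```python
def readfile(filename):
    """Read whitespace-separated "x y" pairs, one per line, into parallel lists."""
    x_vals, y_vals = [], []
    with open(filename) as data_file:
        # Iterating over the file handles any number of lines.
        for line in data_file:
            fields = line.split()  # splits on spaces and tabs alike
            if len(fields) < 2:
                continue  # assumption: ignore blank or incomplete lines
            x_vals.append(float(fields[0]))
            y_vals.append(float(fields[1]))
    return x_vals, y_vals
```

Because the loop runs until the file is exhausted, the length of the file never needs to be known in advance.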

 

You are required to create the following functions:

 

| # | Role | Function’s Objective | Input Parameters | Output | Return Values |
|---|------|----------------------|------------------|--------|---------------|
| 1 | Input | Read the input file and store the x- and y-coordinates in parallel lists. | N/A | N/A | List of x-coordinates; list of y-coordinates |
| 2 | Process | Compute the mean of the data set. | List of data | N/A | The mean of the data in the list |
| 3 | Process | Compute the standard deviation of the data set. | List of data; the mean of the data | N/A | The standard deviation of the data in the list |
| 4 | Process | Compute the correlation coefficient. Call the mean and standard deviation functions where needed. | List of x-coordinates; list of y-coordinates | N/A | The correlation coefficient of the input lists |
| 5 | Process | Compute the slope. Call the mean, standard deviation, and correlation coefficient functions where needed. | List of x-coordinates; list of y-coordinates | N/A | The slope of the line |
| 6 | Process | Compute the y-intercept. Call the mean, standard deviation, and correlation coefficient functions where needed. | List of x-coordinates; list of y-coordinates | N/A | The y-intercept of the line |
| 7 | Output | Display a mathematical representation of a line to the screen. | The y-intercept of the line; the slope of the line | The y-intercept and slope of the line | N/A |
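The processing functions listed above (2 through 6) could be sketched as follows. This is only one possible layout: `calcSlope` and `calcYint` (including its third `slope` argument) follow the names used in the test code, while `calcMean`, `calcStdDev`, and `calcCorrelation` are hypothetical names. The sketch assumes the sample standard deviation (dividing by n - 1) and the standard formula r = (1/(n-1)) Σ ((x - x̄)/s_x)((y - ȳ)/s_y).

```python
import math


def calcMean(data):
    """Arithmetic mean of a list of numbers."""
    return sum(data) / len(data)


def calcStdDev(data, mean):
    """Sample standard deviation (assumption: divide by n - 1)."""
    squared_diffs = sum((value - mean) ** 2 for value in data)
    return math.sqrt(squared_diffs / (len(data) - 1))


def calcCorrelation(x_vals, y_vals):
    """Sample correlation coefficient r of two parallel lists."""
    x_bar, y_bar = calcMean(x_vals), calcMean(y_vals)
    s_x, s_y = calcStdDev(x_vals, x_bar), calcStdDev(y_vals, y_bar)
    # zip pairs the lists element by element, so no explicit index is needed.
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(x_vals, y_vals))
    return total / (len(x_vals) - 1)


def calcSlope(x_vals, y_vals):
    """Slope b = r * s_y / s_x."""
    s_x = calcStdDev(x_vals, calcMean(x_vals))
    s_y = calcStdDev(y_vals, calcMean(y_vals))
    return calcCorrelation(x_vals, y_vals) * s_y / s_x


def calcYint(x_vals, y_vals, slope):
    """Y-intercept a = y_bar - b * x_bar."""
    return calcMean(y_vals) - slope * calcMean(x_vals)
```

For points that lie exactly on y = 1 + 2x, such as x = [1, 2, 3, 4] and y = [3, 5, 7, 9], these functions should give a slope of approximately 2 and a y-intercept of approximately 1, up to floating-point rounding.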

 

Use the following code as your main block; it is also used for testing:

 

if __name__ == "__main__":
    # Expected results from calculations
    expectedSlopes = [-0.59, 3884.98, -6.24]
    expectYIntercepts = [1173.21, -25433.81, 152.06]
    resultCount = 0  # Used for looping through results

    # Rename these files to be your 3 input files, may need full path
    for f in ["data1.txt", "data2.txt", "data3.txt"]:
        # Read in data
        x_vals, y_vals = readfile(f)
        # Calculate the slope
        slope = round(calcSlope(x_vals, y_vals), 2)
        assert slope == expectedSlopes[resultCount], "Got {} but expected slope of {}".format(slope, expectedSlopes[resultCount])
        # Calculate the y-intercept
        y_int = round(calcYint(x_vals, y_vals, slope), 2)
        assert y_int == expectYIntercepts[resultCount], "Got {} but expected Y Intercept of {}".format(y_int, expectYIntercepts[resultCount])
        # Output the regression line
        output_line(y_int, slope)

        resultCount += 1

 

NOTE: The statistics module can be used to find the mean and standard deviation.  

Sample Screen Output

Regression line: y = 1166.93 + -0.586788x
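A minimal sketch of the output function that reproduces this format is shown below. The name `output_line` and its argument order come from the test code; the helper `format_line` and the use of the `g` format code (six significant digits) are assumptions chosen to match the sample line.

```python
def format_line(y_int, slope):
    """Build the regression-line text, intercept first (hypothetical helper)."""
    # "{:g}" keeps up to six significant digits and drops trailing zeros.
    return "Regression line: y = {:g} + {:g}x".format(y_int, slope)


def output_line(y_int, slope):
    """Display the regression line on the screen; returns nothing."""
    print(format_line(y_int, slope))


if __name__ == "__main__":
    output_line(1166.93, -0.586788)
    # prints: Regression line: y = 1166.93 + -0.586788x
```

Keeping the string construction in a separate helper makes the format easy to check without capturing screen output.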

 

Testing:

When you are finished, test your program with four different input files:

  • Data File 1
  • Data File 2 (density in pounds per cubic foot vs. stiffness in pounds per square inch of particleboards; taken from p. 391 of Probability and Statistics for Scientists and Engineers, 6th ed., Walpole/Myers/Myers)
  • Data File 3 (daily rainfall in 0.01 cm vs. air pollution particulate removed in mcg/cum; taken from p. 365 of Walpole/Myers/Myers)
  • A data file you've created yourself. Ideally this will be something in the context of your major. Provide information on where the data came from.

 

What to Submit:

  1. The code
  2. Sample runs for each Data File.
  3. Data File 4 and a brief description of where you found it and what the data represents.