Assignment2_pdf_merged.pdf — University of California, San Diego, Computer Science 176 (26 pages; uploaded by MasterJellyfish4274 on coursehero.com, May 14, 2024)
logistic_regression
January 28, 2024
1 ECE 285 Assignment 2: Logistic Regression
For this part of the assignment, you are tasked to implement a logistic regression algorithm for multi-class classification and test it on the CIFAR10 dataset.

You should run the whole notebook and answer the questions in the notebook.

TO SUBMIT: PDF of this notebook with all the required outputs and answers.
[1]:
# Prepare Packages
import numpy as np
import matplotlib.pyplot as plt
from utils.data_processing import get_cifar10_data

# Use a subset of CIFAR10 for KNN assignments
dataset = get_cifar10_data(
    subset_train=5000,
    subset_val=250,
    subset_test=500,
)

print(dataset.keys())
print("Training Set Data Shape: ", dataset["x_train"].shape)
print("Training Set Label Shape: ", dataset["y_train"].shape)
print("Validation Set Data Shape: ", dataset["x_val"].shape)
print("Validation Set Label Shape: ", dataset["y_val"].shape)
print("Test Set Data Shape: ", dataset["x_test"].shape)
print("Test Set Label Shape: ", dataset["y_test"].shape)

dict_keys(['x_train', 'y_train', 'x_val', 'y_val', 'x_test', 'y_test'])
Training Set Data Shape:  (5000, 3072)
Training Set Label Shape:  (5000,)
Validation Set Data Shape:  (250, 3072)
Validation Set Label Shape:  (250,)
Test Set Data Shape:  (500, 3072)
Test Set Label Shape:  (500,)
2 Logistic Regression for multi-class classification
A logistic regression algorithm has these hyperparameters:

• Learning rate - controls how much we change the current weights of the classifier during each update. We set it at a default value of 0.5, and later you are asked to experiment with different values. We recommend looking at the graphs and observing how the performance of the classifier changes with different learning rates.

• Number of epochs - an epoch is a complete iterative pass over all of the data in the dataset. During an epoch, for each sample in the training set, we predict a label using the classifier and then update the weights of the classifier according to the linear classifier update rule. We evaluate our models after every 10 epochs and save the accuracies, which are later used to plot the training, validation, and test accuracy vs. epoch curves.

• Weight decay - regularization can be used to constrain the weights of the classifier and prevent their values from blowing up. Regularization helps combat overfitting. You will use the `weight_decay` term to introduce regularization in the classifier.

The only way a logistic-regression-based classification algorithm differs from a linear regression algorithm is that in the former we additionally pass the classifier outputs through a sigmoid function, which squashes the output into the (0, 1) range. These values then represent the probabilities of the sample belonging to particular classes.
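As a quick standalone illustration of the sigmoid squashing described above (this snippet is not part of the assignment code):

```python
import numpy as np

def sigmoid(z):
    # squashes any real-valued classifier score into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-4.0, 0.0, 4.0])   # raw linear-classifier outputs
probs = sigmoid(scores)               # ≈ [0.018, 0.5, 0.982]
```

A score of 0 maps to probability 0.5, and large positive or negative scores saturate toward 1 or 0 without ever reaching them.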
2.0.1 Implementation (40%)

You need to implement the logistic regression method in algorithms/logistic_regression.py. The formulation follows the lecture (consider binary classification for each of the 10 classes, with labels -1/1 for not belonging/belonging to the class). You need to fill in the sigmoid function, the training function, and the prediction function.
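The pieces you are asked to fill in can be sketched as follows. This is a hypothetical reference sketch, not the course solution: the class name and constructor signature mirror how `Logistic` is used in the cells below, but the internals (one-vs-rest gradient updates on the ±1-label logistic loss, with weight decay) are an assumption about the lecture's formulation.

```python
import numpy as np

class Logistic:
    """Sketch of algorithms/logistic_regression.py: one binary logistic
    classifier per class, trained with +/-1 labels (one-vs-rest)."""

    def __init__(self, num_classes, learning_rate, epochs, weight_decay):
        self.num_classes = num_classes
        self.lr = learning_rate
        self.epochs = epochs
        self.weight_decay = weight_decay
        self.w = None  # shape (num_classes, D + 1), filled in by train()

    def sigmoid(self, z):
        # squash scores into (0, 1); clip to avoid overflow in exp
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def train(self, X, y, weights):
        self.w = weights
        N = X.shape[0]
        for _ in range(self.epochs):
            for c in range(self.num_classes):
                t = np.where(y == c, 1.0, -1.0)      # +/-1 labels per class
                scores = X @ self.w[c]
                # gradient of the mean logistic loss log(1 + exp(-t*s))
                grad = -(X.T @ (t * self.sigmoid(-t * scores))) / N
                grad += self.weight_decay * self.w[c]  # L2 regularization
                self.w[c] -= self.lr * grad
        return self.w

    def predict(self, X):
        # pick the class whose binary classifier is most confident
        return np.argmax(self.sigmoid(X @ self.w.T), axis=1)
```

The `train`/`predict` interface (weights passed in and returned, so training can resume across evaluation steps) matches how the notebook's `train()` helper calls it.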
[2]:
# Import the algorithm implementation (TODO: Complete the Logistic Regression in algorithms/logistic_regression.py)
from algorithms import Logistic
from utils.evaluation import get_classification_accuracy

num_classes = 10  # Cifar10 dataset has 10 different classes

# Initialize hyper-parameters
learning_rate = 0.01  # You will be later asked to experiment with different learning rates and report results
num_epochs_total = 200  # Total number of epochs to train the classifier
epochs_per_evaluation = 10  # Epochs per step of evaluation; We will evaluate our model regularly during training

N, D = dataset["x_train"].shape  # Get training data shape, N: Number of examples, D: Dimensionality of the data

weight_decay = 0.00002

x_train = dataset["x_train"].copy()
y_train = dataset["y_train"].copy()
x_val = dataset["x_val"].copy()
y_val = dataset["y_val"].copy()
x_test = dataset["x_test"].copy()
y_test = dataset["y_test"].copy()

# Insert additional scalar term 1 in the samples to account for the bias as discussed in class
x_train = np.insert(x_train, D, values=1, axis=1)
x_val = np.insert(x_val, D, values=1, axis=1)
x_test = np.insert(x_test, D, values=1, axis=1)
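The bias trick applied by `np.insert` above can be checked on a tiny toy array (hypothetical data, not the CIFAR-10 matrices):

```python
import numpy as np

x = np.arange(6, dtype=float).reshape(3, 2)  # 3 toy samples, D = 2 features
x_aug = np.insert(x, 2, values=1, axis=1)    # insert a column of ones at index D
# x_aug has shape (3, 3); its last column is all ones, acting as the bias input
```

With the constant 1 appended to every sample, the bias becomes just one more weight in each classifier's (D + 1)-dimensional weight vector.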
[3]:
# Training and evaluation function -> Outputs accuracy data
def train(learning_rate_, weight_decay_):
    # Create a logistic regression object
    logistic_regression = Logistic(
        num_classes, learning_rate_, epochs_per_evaluation, weight_decay_
    )
    # Randomly initialize the weights and biases
    weights = np.random.randn(num_classes, D + 1) * 0.0001
    train_accuracies, val_accuracies, test_accuracies = [], [], []
    # Train the classifier
    for _ in range(int(num_epochs_total / epochs_per_evaluation)):
        # Train the classifier on the training data
        weights = logistic_regression.train(x_train, y_train, weights)
        # Evaluate the trained classifier on the training dataset
        y_pred_train = logistic_regression.predict(x_train)
        train_accuracies.append(get_classification_accuracy(y_pred_train, y_train))
        # Evaluate the trained classifier on the validation dataset
        y_pred_val = logistic_regression.predict(x_val)
        val_accuracies.append(get_classification_accuracy(y_pred_val, y_val))
        # Evaluate the trained classifier on the test dataset
        y_pred_test = logistic_regression.predict(x_test)
        test_accuracies.append(get_classification_accuracy(y_pred_test, y_test))
    return train_accuracies, val_accuracies, test_accuracies, weights
[4]:
import matplotlib.pyplot as plt

def plot_accuracies(train_acc, val_acc, test_acc):
    # Plot Accuracies vs Epochs graph for all the three
    epochs = np.arange(0, int(num_epochs_total / epochs_per_evaluation))
    plt.ylabel("Accuracy")
    plt.xlabel("Epoch/10")
    plt.plot(epochs, train_acc, epochs, val_acc, epochs, test_acc)
    plt.legend(["Training", "Validation", "Testing"])
    plt.show()
[5]:
# Run training and plotting for default parameter values as mentioned above
t_ac, v_ac, te_ac, weights = train(learning_rate, weight_decay)

[6]:
plot_accuracies(t_ac, v_ac, te_ac)
print("Logistic Regression")

[plot: training/validation/testing accuracy vs. epoch]
Logistic Regression
2.0.2 Try different learning rates and plot graphs for all (20%)

[7]:
# Initialize the best values
best_weights = weights
best_learning_rate = learning_rate
best_weight_decay = weight_decay

# TODO
# Repeat the above training and evaluation steps for the following learning rates and plot graphs
# You need to try 3 learning rates and submit all 3 graphs along with this notebook pdf to show your learning rate experiments
learning_rates = [0.01, 0.1, 1]
weight_decay = 0.0  # No regularization for now
# FEEL FREE TO EXPERIMENT WITH OTHER VALUES. REPORT OTHER VALUES IF THEY ACHIEVE A BETTER PERFORMANCE
# for lr in learning_rates: Train the classifier and plot data
# Step 1. train_accu, val_accu, test_accu = train(lr, weight_decay)
# Step 2. plot_accuracies(train_accu, val_accu, test_accu)

max_test_accu = 0
max_val_accu = 0
for learning_rate in learning_rates:
    train_accu, val_accu, test_accu, weights = train(learning_rate, weight_decay)
    plot_accuracies(train_accu, val_accu, test_accu)
    if max_val_accu < max(val_accu):
        max_val_accu = max(val_accu)
        max_test_accu = max(test_accu)
        best_learning_rate = learning_rate
        best_weights = weights

print(
    f"maximum validation accuracy: {max_val_accu} and test accuracy: {max_test_accu} at Learning Rate: {best_learning_rate}"
)
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Related Questions
Polynomial regression is a form of nonlinear regression that describes nonlinear relationships in a dataset.
There are several advantages to linear regression, mainly high accuracy.
For your assignment, you will build a polynomial regression model in Python.
The data is in a .CSV file that has the following information types.
Position
Level
Salary
Business Analyst
1
45000
Junior Consultant
2
50000
Senior Consultant
3
60000
Manager
4
80000
Country Manager
5
110000
Region Manager
6
150000
Partner
7
200000
Senior Partner
8
300000
C-level
9
500000
CEO
10
1000000
Using this data, our model should be able to predict the value of an employee candidate given their years of experience.
The Python file must demonstrate the prediction of employee salary based on years of experience.
arrow_forward
Use import sys. Use the fastfood.csv file to complete the following assignment. Create a file, fastfood.py, that loads the .csv file and runs a regression predicting calories from total_fat, sat_fat, cholesterol, and sodium, in that order. Add a constant using sm.add_constant(data).
Note: you will not need to upload the .csv to CodeGrade because I have pre-loaded it, but you will need to read in the data.
Then, print the following to two decimals
print(model.mse_total.round(2))
print(model.rsquared.round(2))
print(model.params.round(2))
print(model.pvalues.round(2))
arrow_forward
Write python code to do the followings
1. Read the attached file "Breast_cancer_dataset.co and store all its columns
(except classification) inta a variable (X), and read column "classilication" into a
variable (V). Note that if Classification-1 means patient is Healthy, and
Classification=2 means patient has Breast cancer
2. Use the package below to train a logistic regression model to learn to predict
whether a patient has breast cancer or not using the variables X and Y.
from sklearn.linear model iaport Logistietegression
3 Predict the class of a patient. Choose any patient from the input file
"Breast cancer_dataset.cs,
4. Compute error in whichever way you prefer.
S. Use your model to show the feature/attribute that has the highest impact on
Breast cancer. Print the name of the attribute. Explain your findings in one line.
The assignment is out of 5 marks. Each one of the above points weighs one grade.
Any unnecessary (or extra) lines of code will deduct grades.
arrow_forward
9. Packages outside of base Python never come with any Python installation.
True
False
10. ANOVA is an omnibus test
True
False
11. OLS can be used for regression and ANOVA
True
False
12.The following will compare differences in traffic across days
mulicomp.pairwise_turkeyhsd(df.day, df.trafic)
True
False
13. Statsmodels generally requires ___ matrices
statsmodels doesn't require matrices
4
2
3
14. Match the following
Independent/predictor variables
Dependent/response variables
1.
Endogenous
2.
Exogenous
15. Boxplots are useful in comparing mean differences across groups
True
False
16. Scatterplots are useful for comparing mean differences across groups
True
False
arrow_forward
KNN is a technique used to estimate new values based on the similarity of known ones.
In this assignment, your company wants you to estimate the selling price of a customer's building
The price you calculate will be given to the customer as the company selling price recommendation.
You decide to use Data Science techniques such as the K-Nearest Neighbor.(KNN)
You will need to:
Import the necessary libraries from your program. (You can use the model class sklearn.neighbors.KNeighborsClassifier, part of the package sci-kit-learn 1.1.1 (Links to an external site) or any other.
Train/test the model with the data included in the module (cal_housing.tgz).
The house you need to estimate the value for has the following properties:
longitude: 120.75latitude: 39.34housingMedianAge: 35.5total rooms: 260totalBedrooms:120 population:540households: 12medianIncome:1.8 K BuildingValue: ?
What is the recommended price?
You need to provide the code, properly commented.
You could use…
arrow_forward
Using Pandas library in python - Calculate student grades project
Pandas is a data analysis library built in Python. Pandas can be used in a Python script, a Jupyter Notebook, or even as part of a web application. In this pandas project, you’re going to create a Python script that loads grade data of 5 to 10 students (a .csv file) and calculates their letter grades in the course. The CSV file contains 5 column students' names, score in-class participation (5% of final grade), score in assignments (20% of final grade), score in discussions (20% of final grade), score in the mid term (20% of final grade), score in final (25% of final grade). Create the .csv file as part of the project submission
Program Output
This will happen when the program runs
Enter the CSV file
Student 1 named Brian Tomas has a letter grade of B+
Student 2 named Tom Blank has a letter grade of C
Student 3 named Margo True has a letter grade of A
Student 4 named David Atkin has a letter grade of B+
Student 5 named…
arrow_forward
Use an appropriate scikit-learn library we used in class to create y_train, y_test, X_train and X_test by splitting the data into 70% train and 30% test datasets.
Set random_state to 4 and stratify the subsamples so that train and test datasets have roughly equal proportions of the target's class labels Standardise the data using StandardScaler library
arrow_forward
Please written by computer source
Assignment 4 In this assignment you will be using the dataset released by The Department of Transportation. This dataset lists flights that occurred in 2015, along with other information such as delays, flight time etc.
In this assignment, you will be showing good practices to manipulate data using Python's most popular libraries to accomplish the following:
cleaning data with pandas make specific changes with numpy handling date-related values with datetime Note: please consider the flights departing from BOS, JFK, SFO and LAX.
Each question is equally weighted for the total grade.
import os
import pandas as pd
import pandas.api.types as ptypes
import numpy as np
import datetime as dt
airlines_df= pd.read_csv('assets\airlines.csv')
airports_df = pd.read_csv('assets\airports.csv')
flights_df_raw = pd.read_csv('assets\flights.csv', low_memory = False)
Question 1: Data Preprocessing
For this question, perform the following:
remove rows with…
arrow_forward
Course: Data Mining
Language: R
Suppose that a hospital tested the age and body fat data for 18 randomly selected adults and the results are provided to you as follows,
Ages: 13 15 16 16 19 20 20 21 22 22 25 25 25 25 30 33 33 35 35 35 35 36 40 45 46 52 70
BodyFat: 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
Answer the following:
Normalize the two attributes based on z-score normalization.
Calculate the correlation coefficient (Pearson’s).
Are these two attributes positively or negatively correlated?
Computer their covariance.
*you can use R for calculation. Report all the results in the HW file.
arrow_forward
PYTHON PROGRAMMING
Files are here: http://www.cse.msu.edu/~cse231/Online/Projects/Project05/
This exercise is about data manipulation.
In this project, we will process agricultural data, namely the adoption of different Genetically Modified (GM) crops in the U.S. The data was collected over the years 2000-2016.
In this project, we are interested in how the adoption of different GM food and non-food crops has been proceeding in different states. We are going to determine the minimum and maximum adoption by state and the years when the minimum and maximum occurred.
Assignment Specifications:
The data files The data files that you use are yearly state-wise percentage plantings for each type of crop:
• alltablesGEcrops.csv: the actual data from the USDA.
• data1.csv: data modified to test this project.
• data2.csv: data modified to test this project, but you do not get to see it
Input: This is real data so it is messy and you have to deal with that. Here are the constraints.…
arrow_forward
Sorting objects in the real world https://docs.oracle.com/javase/8/docs/api/java/util/LinkedList.html
There are 5000 people living in the town. Every day they have new COVID-19 cases.
When people show symptom, they go to the hospital and put themselves in the waiting list for testing. A new person is added at the end of the list. Due to the lack of testing kit, all in the list cannot be tested. Hospital has to sort them and select a few. Since the elderly is very weak to the COVID-19, every midnight the doctors sort the people in the list by their age to decide who is taking the test for the next day depending on the availability of testing kit.
Input to the program has the form where the first line indicates how many days they will do the operation. For each day, the input starts with the day number, along with the following patient list where each element represents the name of patients and the age. The input ends with the number of available testing kits.
The output display, at…
arrow_forward
Python Machine Learning
Can you tell me what are the outliners in this dataset?
from sklearn.datasets import load_wine
print(load_wine().DESCR)
X = load_wine().data
y = load_wine().target
f_names = load_wine().feature_names
import pandas as pd
df = pd.DataFrame(X, columns=f_names)
df['TARGET'] = y
df.head()
df['Type'] = y
df.head()
df.info()
df.isna().sum().sum()
df.describe()
import matplotlib.pyplot as plt
df.hist(figsize=(20,12), layout=(4,8))
plt.tight_layout
arrow_forward
Computer Science
Write a python program that reads the data file https://archive.ics.uci.edu/ml/machine-learning-databases/eventdetection/CalIt2.data and finds the total count of outflow and the total count of inflow. The attributes in the file are as follows:
1. Flow ID: 7 is out flow, 9 is in flow
2. Date: MM/DD/YY
3. Time: HH:MM:SS
4. Count: Number of counts reported for the previous half hour Rows: Each half hour time slice is represented by 2 rows: one row for the out flow during that time period (ID=7) and one row for the in flow during that time period (ID=9) Hint: # Importing the dataset dataset = pd.read_csv('CalIt2.data')
https://archive.ics.uci.edu/ml/machine-learning-databases/event-detection/CalIt2.data
this link should work.
arrow_forward
Please fill out the remaining codes in #TO-DO for Python:
import numpy as npimport pandas as pd
from sklearn.datasets import load_bostonfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import mean_squared_errorfrom sklearn.linear_model import LinearRegression
class MyLinearRegression:theta = None
def fit(self, X, y, option, alpha, epoch):X = np.concatenate((np.array(X), np.ones((X.shape[0], 1), dtype=np.float64)), axis=1)y = np.array(y)if option.lower() in ['bgd', 'gd']:# Run batch gradient descent.self.theta = self.batchGradientDescent(X, y, alpha, epoch)elif option.lower() in ['sgd']:# Run stochastic gradient descent.self.theta = self.stocGradientDescent(X, y, alpha, epoch)else:# Run solving the normal equation.self.theta = self.normalEquation(X, y)def predict(self, X):X = np.concatenate((np.array(X), np.ones((X.shape[0], 1), dtype=np.float64)), axis=1)if isinstance(self.theta, np.ndarray):# TO-DO: Implement predict().
return y_predreturn None
def…
arrow_forward
2 CVS uses a simple text-based rule to identify overlaps during a merge: There is an overlap if the same line was
changed in both versions that are being merged. If no such line exists, then CVS decides there is no conflict and
the versions are merged automatically. For example, assume a file contains a class with three methods, a().
b(), and c(). Two developers work independently on the file. If they both modify the same lines of code, say the
first line of method a(), then CVS decides there is a conflict. Explain why this approach will fail to detect
certain types of conflict. Provide an example in your answer.
arrow_forward
Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. In this
section, we will be looking at some of the preprocessing steps involved with analysing data.
We will be using the 'cars.csv' dataset [ this csv file is provided inside your lab zipfile download ]
Dataset resource: https://www.kaggle.com/abineshkumark/carsdata
About the dataset: Cars Data has Information about 3 brands/make of cars. Namely US, Japan, Europe. Target of the data set to find the brand of a car using the parameters such as horsepower, Cubic inches, Make year,
etc.
Look out for the 'YOUR CODE HERE:' comment in the following cells
3.a) Loading a .csv file ( 2 points)
# Loading a .csv file
# Use a pandas fuction to read a csv file (cars.csv)
# Store the csv file as a pandas dataframe called 'df'
# YOUR CODE HERE
Python
# Viewing a sample of the dataframe…
arrow_forward
Storing tabular data as pandas dataframe:
Data preprocessing is one of the steps in machine learning. The pandas library in python is suitable to deal with tabular data. Create a variable ‘emissions’ and assign to it the following data (Table 1) as padas DataFrame. Create an excel file ‘emissions_from_pandas.xlsx’ from the ‘emissions’ variable using python.
Table 1. Particulate matter (PM) emissions (in g/gal) for 15 vehicles driven at low altitude and another 15 vehicles driven at high altitude.
Low Altitude
High Altitude
1.50
7.59
1.48
2.06
2.98
8.86
1.40
8.67
3.12
5.61
0.25
6.28
6.73
4.04
5.30
4.40
9.30
9.52
6.96
1.50
7.21
6.07
0.87
17.11
1.06
3.57
7.39
2.68
1.37
6.46
iloc[] method:
(b) Using the .iloc[] method, we can access any part of the dataframe. Run the following commands and show the outputs:
emissions.head()
emissions.iloc[0,0]
emissions.iloc[1,1]
emissions.iloc[0:2,0:2]…
arrow_forward
Write a script or a program that reads a text file containing a TF-IDF weights matrix defined in milestone
4, and two additional parameters that are documents' identifiers. Your program should return the cosine
similarity value of those two documents.
sample command to run: python cosine_similarity.py D1 D2
sample output: 0.6 (this is just a number)
arrow_forward
Declare a struct Employee with fields Name as C-String, eID as int, Saraly as double (with eID as key). Suppose you have a file Salary.dat which has some number of records in it (of type Employee). Now read a key from user and modify its matching record's name with zahid iqbal
arrow_forward
Study the scenario and complete the question(s) that follow:
Suppose that there are items that need to be processed, in the order which they are provided. There are times that some of the packages need to be chosen to be processed first
Source: Mbela, K. (2020),
Create a method that allows the user to choose the package according to its element number and prioritises it. Make sure that there is correct error handling in your function.
arrow_forward
TODO: Polynomial Regression with Ordinary Least Squares (OLS) and Regularization
*Please complete the TODOs. *
!pip install wget
import osimport randomimport tracebackfrom pdb import set_traceimport sysimport numpy as npfrom abc import ABC, abstractmethodimport traceback
from util.timer import Timerfrom util.data import split_data, feature_label_split, Standardizationfrom util.metrics import msefrom datasets.HousingDataset import HousingDataset
class BaseModel(ABC): """ Super class for ITCS Machine Learning Class"""
@abstractmethod def fit(self, X, y): pass
@abstractmethod def predict(self, X): pass
class LinearModel(BaseModel): """ Abstract class for a linear model
Attributes ========== w ndarray weight vector/matrix """
def __init__(self): """ weight vector w is initialized as None """ self.w = None
# check if the matrix is 2-dimensional. if not, raise an…
arrow_forward
the below is an example of diabetes dataset
import matplotlib.pyplot as pltimport numpy as npfrom sklearn.datasets import load_diabetesfrom sklearn import linear_model
d = load_diabetes()d_X = d.data[:, np.newaxis, 2]dx_train = d_X[:-20]dy_train = d.target[:-20]dx_test = d_X[-20:]dy_test = d.target[-20:]
lr = linear_model.LinearRegression()lr.fit(dx_train, dy_train)
mse = np.mean((lr.predict(dx_test) - dy_test) **2)lr_score = lr.score(dx_test, dy_test)
print(lr.coef_)print(mse)print(lr_score)plt.scatter(dx_test, dy_test)plt.plot(dx_test, lr.predict(dx_test), c='r')plt.show()
arrow_forward
astfoodStats Assignment Description
For this assignment, name your R file fastfoodStats.R
For all questions you should load tidyverse, openintro, and lm.beta. You should not need to use any other libraries.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(openintro))
suppressPackageStartupMessages(library(lm.beta))
The actual data set is called fastfood.
Continue to use %>% for the pipe. CodeGrade does not support the new pipe.
Round all float/dbl values to two decimal places.
All statistics should be run with variables in the order I state
E.g., "Run a regression predicting mileage from mpg, make, and type" would be:
lm(mileage ~ mpg + make + type...)
To access the fastfood data, run the following:
fastfood <- openintro::fastfood
Create a correlation matrix for the relations between calories, total_fat, sugar, and calcium for all items at Sonic, Subway, and Taco Bell, omitting missing values with na.omit().
Assign the…
arrow_forward
Please send me answer within 10 mins I will upvote your answer for must.
Python: Run a paired bootstrap test to compare the means e1 and e2. I was able to make a bootstrap test to find whether or not two datasets have significantly different means but I'm not sure what to add to make it a paired bootstrap. The paired bootstrap should look exactly like this but with one added component. What would need to be added to make it a paired bootstrap?
e1_2 = e1_sub + e1_e2_com
e2_2 = e2_sub + e1_e2_com
# create array to hold bootstrap mean differences
nbootstraps = 10000
bs_mean_diffs = np.zeros(nbootstraps)
# take bootstrap samples many times
for ii in range(nbootstraps):
# choose which indices will be used from e1_2 and e2_2
inds1 = np.random.randint(0,len(e1))
inds2 = np.random.randint(0,len(e2))
# create your bootstrap samples
bs_e1 = e1_2[inds1]
bs_e2 = e2_2[inds2]
# measure their difference and store it
bs_mean_diffs[ii] = bs_e1.mean()- bs_e2.mean()
# take the absolute value of…
arrow_forward
In The Readline Technique, on page 179, you learned how to read somefiles from the Time Series Data Library. In particular, you learned aboutthe Hopedale data set, which describes the number of colored fox fur peltsproduced from 1834 to 1842. This file contains one value per year perline.a. Write an outline in English of the algorithm you would use to readthe values from this data set to compute the average number of peltsproduced per year.b. Translate your algorithm into Python by writing a function namedhopedale_average that takes a filename as a parameter and returns theaverage number of pelts produced per year.
arrow_forward
"In this exercise we will use code that we already have. Use the fucntion you created to compute the precision of a model, remember that the precision of a model is:$precision=(tp/(tp+fp))$ where tp is the number of true positives and fp is the number of false positives. The file performance.txt has data from models that were used in an experiment, the data has the model id, number of true positives and number of false positives for each model. Remember again that the model is good if its precision is above 0.75.\n", "\n", "Open the file, read each model and print if the model is good or bad. At the end, print the id of the best model and its precision. Be sure to use a function to compute the precision of each model."
arrow_forward
Octave assignment 1-Introduction to Octave
Introduction
This assignment is meant to be a gentle introduction to Octave, the free version of Matlab. It
assumes that you have no prior coding experience.
Objectives
Download Octave and run it or use https://octave-online.net/
Learn the basics of the Octave GUI.
• Learn how to create a short executable file called an m-file (.m extension) and run it.
• Learn what a data type is.
• Learn how to declare variables of different data types.
• Learn how to create matrices.
• Learn how to use several of Octave’s functions for creating objects with random
numbers.
Instructions
1.) Create a file call it with the form exercise_1_first name_last name. Include the
underscores in your file name. At the top of the file add the comment "“My first Octave
assignment. I'm so excited, I just can't hide it."
2.) Create the following variables
a = 2.3;
b = -87.3;
A = [1,2; 4,5];
Create a matrix 2 × 2 B using the rand() function.
Create two random complex…
arrow_forward
A dataset in R has a column "icecream" whose responses are a factor with 3 levels: "strawberry," "chocolate," and "vanilla" (1, 2, and 3 respectively). We want to create a new column that simply classifies chocolate or non-chocolate (1 for chocolate, 0 for non-chocolate). How do we code this?
arrow_forward
Consider the following a file (movies.txt) with the following list of movies (comma separated list).
SNO, Name, NoOfPeopleLiked
1, The Shawshank Redemption, 77
2, The Godfather, 20
3, Into The Wild, 35
4, The Dark Knight, 55
5, 12 Angry Men, 44
6, Schindler's List, 33
7, The Lord of the Rings: The Return of the King, 25
8, Pulp Fiction, 23
9, The Good, the Bad and the Ugly, 32
10, The Lord of the Rings: The Fellowship of the Ring, 56
arrow_forward
There is a famous dataset in R called "iris." It should already be loaded# in R for you. If you type in > ?iris you can see some documentation. Familiarize# yourself with this dataset.
# Now obtain the mean of the first 4 variables, by species, but# using only one function call.
#use R studio
arrow_forward
SEE MORE QUESTIONS
Recommended textbooks for you
Database System Concepts
Computer Science
ISBN:9780078022159
Author:Abraham Silberschatz Professor, Henry F. Korth, S. Sudarshan
Publisher:McGraw-Hill Education
Starting Out with Python (4th Edition)
Computer Science
ISBN:9780134444321
Author:Tony Gaddis
Publisher:PEARSON
Digital Fundamentals (11th Edition)
Computer Science
ISBN:9780132737968
Author:Thomas L. Floyd
Publisher:PEARSON
C How to Program (8th Edition)
Computer Science
ISBN:9780133976892
Author:Paul J. Deitel, Harvey Deitel
Publisher:PEARSON
Database Systems: Design, Implementation, & Manag...
Computer Science
ISBN:9781337627900
Author:Carlos Coronel, Steven Morris
Publisher:Cengage Learning
Programmable Logic Controllers
Computer Science
ISBN:9780073373843
Author:Frank D. Petruzella
Publisher:McGraw-Hill Education