1.pdf

School: Agnes Scott College
Course: MISC
Subject: Computer Science
Date: Dec 6, 2023
Type: pdf
Uploaded by mahzuzarahaman

ISyE DDA – Computer Project #1 (Team Project)

This is a TEAM project. It is designed so that students in a team can help each other with R or Python coding. Only ONE report submission is needed, from a team representative.

1. GLM Mini-Project: There are two parts to this task.

Problem and Data Background: Locate one “GLM data set” with more than 60 data rows. It is okay to use simulations to generate data. Please use a data set different from the ones presented in past project studies or lecture notes. If any team wants to use existing data, please randomly select 10 data points and make minor modifications to create a new data set. Please show your data modification details in your report.

Training Data and Testing Data: Please randomly select 10% of the data points to serve as “testing data”. Use the remaining data (excluding the testing data) as the “training data” in the data analysis below. Note that we are training two models, one from Part A and the other from Part B.

Report: In the report, please provide
[1] details of the data source, a few rows of training data, and the steps used in selecting the testing data;
[2] data analysis steps and results for Part A, including the model, parameter estimates, standard errors, and a model evaluation metric value (e.g., AIC);
[3] interpretation and comments for the computer printouts beyond the information offered in [2];
[4] calculation of
    MSE(train-data) = ∑ over all training cases of (prediction error)^2 / #Training-data
    MSE(test-data) = ∑ over all testing cases of (prediction error)^2 / #Testing-data
These metrics describe the prediction quality of the two models created in Part A and Part B on the “training data” and the “testing data”. Offer comparison comments on the two models via these metrics and the AIC value (from modeling the training data) provided in the computer printouts;
[5] Part A’s software code and the steps for using the software tools in Part B, in an Appendix.

Remarks:
<1> If students decide to focus on logistic regression, then besides the above MSEs, please also compute the following misclassification rate, %MISCLASS.
%MISCLASS(training) = #misclassified cases in the training data / total #training data,
where a misclassified case is defined as follows:
(a) If the data Y_i = 1 and the prediction E(Y_i)_hat = π_i_hat is larger than 0.5 (a given “decision threshold”), then there is no misclassification. Otherwise, record a case of misclassification. Note that this case misclassifies 1 as 0.
(b) If the data Y_i = 0 and the prediction E(Y_i)_hat = π_i_hat is less than 0.5 (a given “decision threshold”), then there is no misclassification. Otherwise, record a case of misclassification. Note that this case misclassifies 0 as 1.
(c) In some studies, the two kinds of misclassification above are treated differently. However, to simplify students’ work, let us treat them the same. That is, the total number of misclassified cases counts the misclassifications from both (a) and (b) TOGETHER.
<2> %MISCLASS(testing) = #misclassified cases in the testing data / total #testing data.

----------------------------------------------------------------------------------------------------------------
Tasks:

Part A:

Apply GLM regression using a “syntax-based” software (e.g., R, Python, Matlab) to analyze this data set.

Part B (This part is new for Fall 2023): This part is a joint project with your partner team identified in the MP-1 Task B studies.

Task #1: Please apply the tool from your MP-1 Task B (Category <1> or <2>, according to your team arrangements) to analyze the same data set studied in Part A.

Notes: Your TA Charles W. Bauer <cbauer32@gatech.edu> has written instructions and case studies to guide students on how to use Azure systems to analyze data. Please check with your TA Jay (Hyen Jay Lee) <hyenjay12@gatech.edu> for how to use AutoML to analyze data. Please see the instruction files in Canvas’ CP-1 directory. Note that AutoML is unforgiving of mistakes in data import and is tricky to use. Please try both tools to see whether the instructions are clear. Please send the TAs, with a copy to me, any inquiries about using these two AI systems and any suggestions for improving the instructions. My email address is <JCLU@isye.gatech.edu>. Please include “ISyE DDA CP-1” in the email subject.

Remarks on the “required data size” when using AutoML:
<1> AutoML uses an Artificial Neural Network (ANN) to model regression data (including Normal linear regression data, GLM data and nonlinear regression data).
<2> AutoML requires 1000+ data rows (i.e., sample size is larger than 1000) in model training.
<3> If your training data’s sample size is not larger than 1000, use the following “perturbation” method to extend your data to fix this issue.


[1] Calculate the standard deviation of each X-variable, one at a time. When calculating these standard deviations, combine the training and testing data. The same standard deviations are used for simulating both the training and the testing data below.

[2] Use the following method to “simulate” new data (Y_i, X_1i*, X_2i*, …, X_pi*), where (Y_i, X_1i, X_2i, …, X_pi), without the stars on the X-variables, is the ORIGINAL DATA in the i-th row, i = 1, 2, …, n, and n is the sample size. The X-variables with stars are the “simulated” data, obtained as follows.

Use a Normal distribution with mean X_1i and standard deviation 0.05*sd(X_1i) to simulate a new x-value, X_1i*, for the i-th case and the X_1 variable. Here, “sd(X_1i)” is the standard deviation of the data column for the X_1 variable. Repeat the same process to simulate X_2i*, …, X_pi* for this i-th case. Note that the outcome Y_i is assigned, unchanged, to the newly simulated data row (X_1i*, X_2i*, …, X_pi*). Thus, we obtain a new row of data, (Y_i, X_1i*, X_2i*, …, X_pi*).

If the sample size of your ORIGINAL TRAINING DATA is 200, FIVE replications of each data row (i.e., 5*200 = 1000) are needed. Thus, repeat the above process 5 times for the i-th row of the original training data. Then, repeat this process for all the ORIGINAL training data rows, i = 1, 2, …, n*, where n* is the training-data sample size.

[3] To expand the testing data, repeat the same process as described above for the testing data, using the same standard deviations obtained in Step [1].
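The perturbation steps [1] to [3] above can be sketched in Python with NumPy as follows. This is only an illustrative sketch, not the course's reference code: the toy data, the helper name `perturb`, the use of the sample standard deviation (`ddof=1`), and the choice to replicate the testing rows the same number of times as the training rows are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2023)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in data: 200 training rows, 22 testing rows, p = 2 predictors.
X_train = rng.normal(size=(200, 2)) * [1.0, 10.0]
y_train = rng.integers(0, 2, size=200)
X_test = rng.normal(size=(22, 2)) * [1.0, 10.0]
y_test = rng.integers(0, 2, size=22)

# Step [1]: one standard deviation per X column, from training and testing data combined.
# ddof=1 (sample standard deviation) is an assumption; the handout does not specify it.
sd = np.vstack([X_train, X_test]).std(axis=0, ddof=1)


def perturb(X, y, sd, n_rep, rng, frac=0.05):
    """Step [2]: replicate every row n_rep times and draw each new X value
    from Normal(mean = original value, sd = frac * column sd).
    The outcome y is copied unchanged to every replicate."""
    X_rep = np.repeat(np.asarray(X, dtype=float), n_rep, axis=0)
    y_rep = np.repeat(np.asarray(y), n_rep)
    noise = rng.normal(loc=0.0, scale=frac * sd, size=X_rep.shape)
    return X_rep + noise, y_rep


# Number of replications needed to reach 1000 training rows (5 when n = 200).
n_rep = int(np.ceil(1000 / len(X_train)))

X_train_big, y_train_big = perturb(X_train, y_train, sd, n_rep, rng)

# Step [3]: same process for the testing data, reusing the Step [1] standard deviations.
X_test_big, y_test_big = perturb(X_test, y_test, sd, n_rep, rng)

print(X_train_big.shape, X_test_big.shape)
```

Because every replicate keeps its original outcome Y_i, the expanded rows are not independent observations, which is worth keeping in mind when interpreting metrics computed on them.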

Task #2: Besides using the MSE(training-data) and MSE(testing-data) described in Report item [4] above to evaluate the prediction quality of the constructed model, explore whether the software tool provides any metric for evaluating model quality. If your Y-outcome data are binary (0 or 1), follow Remarks <1> and <2> above to calculate %MISCLASS(training) and %MISCLASS(testing). Comment on these values in your studies.

Task #3: Comment on the overall experience (e.g., pros and cons) of using the software tools studied in Part A and Part B to analyze your GLM data.

2. Mini-Project for Implementing the Gauss-Newton Algorithm for NLR:

[1] Locate one “NLR data set” in the field of your interest. It is okay to use simulations to generate the needed data. Please use a data set different from the ones presented in past project studies or lecture notes. If any team wants to use existing data, please randomly select 10 data points and make minor modifications to create a new data set. Please show your data modification details in your report.

[2] Go through the steps for implementing the Gauss-Newton method using MLRs to find the LSE estimates of the NLR parameters. Follow the lecture notes to document the NLR-LSE steps. Note that students should NOT employ NLR software to analyze this data set. The key is to use MLRs to obtain the NLR-LSE estimates and to provide NLR predictions.

[3] Provide results similar to typical NLR computer printouts, and make interpretations and comments on the results.

[4] (This part is new for Fall 2023) Use an NLR function to build an NLR model. Compare its computer printouts (e.g., model functional form, parameter estimates, and model evaluation metrics such as standard errors of estimates and AIC) against those obtained in [3] above. Comment on their similarities and differences (if there are any).

[5] Please include software code in an Appendix.

----------------------------------------------------------------------------------------------------------------
Sample Solutions for Problem #1 (GLM):

R-Computer Printouts:
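Separately from the sample solutions, item [2] of the NLR mini-project (Gauss-Newton via repeated MLR fits) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration: the model form y = a*exp(b*x), the simulated data, the starting values, and the function names. Each iteration linearizes the model around the current estimate, regresses the current residuals on the Jacobian columns (an MLR without intercept), and adds the fitted coefficients to the estimate. No step-size control is included, so reasonable starting values are assumed.

```python
import numpy as np


def f(x, theta):
    """Hypothetical NLR mean function: a * exp(b * x)."""
    a, b = theta
    return a * np.exp(b * x)


def jacobian(x, theta):
    """Partial derivatives of f with respect to a and b, one column each."""
    a, b = theta
    e = np.exp(b * x)
    return np.column_stack([e, a * x * e])


def gauss_newton(x, y, theta0, max_iter=50, tol=1e-10):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        resid = y - f(x, theta)            # current residuals
        J = jacobian(x, theta)             # "X matrix" of the linearized model
        # MLR step: least-squares regression of the residuals on the Jacobian columns.
        delta, *_ = np.linalg.lstsq(J, resid, rcond=None)
        theta = theta + delta
        if np.max(np.abs(delta)) < tol:    # stop once the update is negligible
            break
    return theta


# Simulated data (assumed for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(0, 2, 40)
true_theta = np.array([2.0, 0.8])
y = f(x, true_theta) + rng.normal(0, 0.05, size=x.size)

theta_hat = gauss_newton(x, y, theta0=[1.5, 0.6])

# Approximate standard errors from the final linearization:
# sigma^2 * (J'J)^{-1}, with sigma^2 = SSE / (n - number of parameters).
resid = y - f(x, theta_hat)
J = jacobian(x, theta_hat)
sigma2 = resid @ resid / (x.size - theta_hat.size)
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(J.T @ J)))

print("estimates:", theta_hat)
print("std. errors:", se)
```

For item [4], these estimates and standard errors could then be compared against the output of a dedicated NLR routine fitted to the same data.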
