HW 1 Submission
Georgia Institute of Technology — Course 6501 (Industrial Engineering)
Dec 6, 2023 · 6 pages · Uploaded by CaptainCoyoteMaster1037
Question 2.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a classification model would be appropriate. List some (up to 5) predictors that you might use.

I work at a local Walmart stocking meat and produce, and I believe a classification model would be appropriate for sorting the stock in the backroom of my location. Currently, the space lacks proper organization of meat and produce; a classification model could help determine where each item should be placed. The most useful predictors would be: type of meat or produce, expiration date, weight, quantity in the box, and opened vs. unopened. These predictors would determine exactly where to position the stock in the backroom. Each location is encoded with a numbering system such as 999/999/999 (section/shelf/slot on shelf), and there are currently 25 sections in total. The classification model would use the predictors above to assign each item of stock to a specific section.

Question 2.2
The files credit_card_data.txt (without headers) and credit_card_data-headers.txt (with headers) contain a dataset with 654 data points, 6 continuous and 4 binary predictor variables. It has anonymized credit card applications with a binary response variable (last column) indicating whether each application was positive or negative. The dataset is the "Credit Approval Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Credit+Approval), without the categorical variables and without data points that have missing values.

1. Using the support vector machine function ksvm contained in the R package kernlab, find a good classifier for this data. Show the equation of your classifier, and how well it classifies the data points in the full data set. (Don't worry about test/validation data yet; we'll cover that topic soon.)
When running the model, I wanted to understand the meaning of the C value for ksvm because it can affect the model's accuracy. My understanding is that C tells the SVM optimizer how much we want to avoid misclassifying each training point. A lower C value produces a wider-margin hyperplane that may misclassify some points, while a larger C value produces a narrower-margin hyperplane that does a substantially better job of classifying all the training points correctly. My initial assumption was that a higher C value would be better, because we want to predict as accurately as possible whether someone should or should not get a credit card; giving a credit card to someone who should not receive one could be costly. To see how accuracy behaves, I tested a wide range of C values using a for loop over C from 1 to 501 in steps of 100. Across all iterations the accuracy remained the same (86.39%); the only difference I noticed was in the resulting formula. [Reference Appendix 1]
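The C sweep described above can be sketched as follows. This is a hedged Python analogue using scikit-learn's SVC with a linear kernel standing in for kernlab::ksvm with vanilladot, and synthetic data standing in for the credit card dataset (which is not reproduced here), so the printed accuracies will not match the 86.39% reported above:

```python
# Sketch of a C-value sweep, analogous to looping kernlab::ksvm over C in R.
# Synthetic data replaces credit_card_data.txt for illustration only.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

for C in range(1, 502, 100):  # C = 1, 101, 201, 301, 401, 501
    model = SVC(kernel="linear", C=C).fit(X, y)
    acc = model.score(X, y)  # accuracy on the full (training) set, as in the write-up
    print(f"C = {C:4d}  accuracy = {acc:.4f}")
```

As in the experiment above, scoring on the same data used for fitting often makes the accuracy insensitive to C over a wide range.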
Next I tried a significantly smaller C-value range and a significantly larger one. A smaller range (0.1 to 5 in steps of 0.1) still reported the same accuracy (86.39%). [Reference Appendix 2] A larger range (1 to 5000 in steps of 1000) reported a slightly lower score of 86.23%; at significantly larger C values the score starts to drop. [Reference Appendix 3] I then tried a range of 500 to 1000 in steps of 100: as soon as C reaches 600, the accuracy drops slightly to 86.23%, so the highest I would set C is 500. [Reference Appendix 4]

Next I focused on testing specific points to see what predictions my model makes, since the instructions state, "if C is too large or too small, they'll almost all be the same (all zero or all one) and the predictive value of the model will be poor." After looking at various C values in the range 1 to 500, the predictions all seem to be the same. This could be because we are using the same data for both training and testing. After further experimentation, I selected C = 100: it still produces the higher accuracy of 86.39%, yields a balance of 0 and 1 predictions, and matches my initial assumption (selecting a higher C value to predict more correctly). The resulting classifier equation is:

credit_card = -0.081 - 0.001*A1 - 0.001*A2 - 0.002*A3 + 0.003*A8 + 1.005*A9 - 0.003*A10 + 0.0002*A11 - 0.001*A12 - 0.001*A14 + 0.106*A15

2. You are welcome, but not required, to try other (nonlinear) kernels as well; we're not covering them in this course, but they can sometimes be useful and might provide better predictions than vanilladot.

I decided to try three other kernels from the list — polydot, tanhdot, and besseldot — to see how the accuracy differs.
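An equation like the one above can be read off a fitted linear SVM from its weights and intercept. The sketch below shows this with scikit-learn, where coef_ and intercept_ expose the values directly (in kernlab one would instead compute the weights from the model's xmatrix and coef slots and negate b); the data and A1, A2, ... labels are synthetic placeholders:

```python
# Sketch: recovering the linear classifier equation w0 + w·x = 0 from a fitted
# linear SVM. Synthetic data and generic predictor names A1..A5 for illustration.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
model = SVC(kernel="linear", C=100).fit(X, y)

w = model.coef_[0]        # one weight per predictor
w0 = model.intercept_[0]  # intercept term
terms = " ".join(f"{c:+.3f}*A{i + 1}" for i, c in enumerate(w))
print(f"decision(x) = {w0:+.3f} {terms}")
```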
I kept the same C value that I chose as the optimum for the ksvm model with vanilladot. The polydot model maintained the same accuracy as the vanilladot kernel, the besseldot kernel reported a higher accuracy of 92.5%, and the tanhdot kernel reported a lower accuracy of 72.2%.
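Comparing kernels at a fixed C amounts to swapping the kernel argument, as in the sketch below. Note the hedges: scikit-learn has no Bessel kernel, so "rbf" stands in for besseldot, "sigmoid" is the tanhdot analogue, and "poly" corresponds to polydot; synthetic data again replaces the credit card dataset, so the printed accuracies are illustrative only:

```python
# Sketch: comparing SVM kernels at a fixed C, analogous to changing the
# kernel= argument of kernlab::ksvm ("rbf" stands in for besseldot,
# "sigmoid" for tanhdot, "poly" for polydot). Synthetic data.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

for kernel in ["linear", "poly", "sigmoid", "rbf"]:
    acc = SVC(kernel=kernel, C=100).fit(X, y).score(X, y)
    print(f"{kernel:8s} accuracy = {acc:.4f}")
```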
Of the three models, I would say the besseldot kernel is the most accurate with C set to 100.

3. Using the k-nearest-neighbors classification function kknn contained in the R kknn package, suggest a good value of k, and show how well it classifies the data points in the full data set. Don't forget to scale the data (scale=TRUE in kknn).

I have previous experience with this model from my bachelor's program at GSU, so I went through the process of partitioning the data and testing different values of k. After running the model, I noticed that the accuracy is very low for almost every value of k from 1 to 50. [Please see Appendix 5] When I reviewed other students' results in the discussion post, I had to think about why this could be the case. The biggest difference is that I partitioned the data into training and validation sets, while other students followed the guidance from the TA and the instructions, which do not require partitioning. With my results, k = 5 provides the best kNN prediction accuracy of 66.7%. The results suggest the model could be underfitting, meaning it is not able to classify the points correctly.
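The k sweep with scaling described above can be sketched as follows. This is a hedged scikit-learn analogue of kknn: StandardScaler stands in for scale=TRUE, a train/validation split mirrors the partitioning approach described, and synthetic data replaces the credit card dataset, so the printed best k and accuracy are illustrative only:

```python
# Sketch: sweeping k for a k-nearest-neighbors classifier with scaled features,
# analogous to kknn(..., scale=TRUE) in R. Synthetic data, illustrative split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

best_k, best_acc = None, 0.0
for k in range(1, 51):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    acc = knn.fit(X_tr, y_tr).score(X_val, y_val)  # validation accuracy
    if acc > best_acc:
        best_k, best_acc = k, acc
print(f"best k = {best_k}, validation accuracy = {best_acc:.4f}")
```

Scaling matters here because kNN is distance-based: without it, predictors on larger numeric ranges dominate the neighbor computation.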
Appendix (feel free to enlarge any images to view better)

Appendix 1:
Appendix 2:
Appendix 3:
Appendix 4:
Appendix 5: