HW 1 Submission
Georgia Institute of Technology, Course 6501 (Industrial Engineering)
Dec 6, 2023
Uploaded by CaptainCoyoteMaster1037
Question 2.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a
classification model would be appropriate. List some (up to 5) predictors that you might use.
I work at a local Walmart stocking meat and produce. A classification model would be appropriate for sorting the stock in the store's backroom, which currently lacks proper organization of meat and produce. A classification model could help determine where each item should be placed. The most useful predictors would include: type of meat or produce, expiration date, weight, quantity in box, and opened vs. unopened. These predictors would determine exactly where to position the stock in the backroom. Each location is identified by a numbering system of the form 999/999/999 (section/shelf/slot on shelf), and there are currently 25 sections in total. The classification model would use the predictors above to assign each item of stock to a specific section.
Question 2.2
The files credit_card_data.txt (without headers) and credit_card_data-headers.txt (with headers) contain a dataset with 654 data points, 6 continuous and 4 binary predictor variables. It has anonymized credit card applications with a binary response variable (last column) indicating if the application was positive or negative. The dataset is the "Credit Approval Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Credit+Approval) without the categorical variables and without data points that have missing values.
1. Using the support vector machine function ksvm contained in the R package kernlab, find a good classifier for this data. Show the equation of your classifier, and how well it classifies the data points in the full data set. (Don't worry about test/validation data yet; we'll cover that topic soon.)
When running the model, I wanted to understand the meaning of the C value for ksvm, because it could have an impact on the accuracy of the model. My understanding is that C tells the SVM optimizer how much we want to avoid misclassifying each training point. A lower C value produces a wider-margin hyperplane, which may misclassify some points, while a larger C value produces a narrower-margin hyperplane that tries harder to classify all training points correctly. My initial assumption was that a higher C value would be better, because we would like to predict as accurately as possible whether someone should or should not get a credit card; giving a credit card to someone who should not receive one could be costly.
When running the model, I wanted to test a wide range of C values to see how the accuracy would change. I used a for loop to test C values from 1 to 501 in steps of 100. Across all iterations, the accuracy remained the same (86.39%); the only difference I noticed was in the resulting formula. [Reference Appendix 1]
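The loop described above can be sketched roughly as follows. This is a sketch rather than my exact script: it assumes credit_card_data.txt is in the working directory and that, after read.table, the predictors are columns 1-10 and the response is column 11 (named V11 by default).

```r
library(kernlab)

# Read the headerless data: V1..V10 are predictors, V11 is the 0/1 response
data <- read.table("credit_card_data.txt", header = FALSE)

# Sweep C from 1 to 501 in steps of 100 and report training accuracy
for (C in seq(1, 501, by = 100)) {
  model <- ksvm(as.matrix(data[, 1:10]), as.factor(data[, 11]),
                type = "C-svc", kernel = "vanilladot",
                C = C, scaled = TRUE)
  pred <- predict(model, data[, 1:10])
  acc  <- sum(pred == data[, 11]) / nrow(data)
  cat("C =", C, "accuracy =", round(100 * acc, 2), "%\n")
}
```

The same loop body works for the other ranges discussed below by changing the seq() arguments.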
The next step was to try a significantly smaller and a significantly larger range of C values. A smaller range (0.1 to 5, in steps of 0.1) still reported the same accuracy (86.39%). [Reference Appendix 2]
When I tried a larger range (1 to 5000, in steps of 1000), I got a slightly lower score of 86.23%; at significantly larger C values the score drops. [Reference Appendix 3]
At this stage, I tried a C value range of 500 to 1000 in steps of 100. As soon as C reaches 600, the accuracy drops slightly to 86.23%. The highest I would set C is therefore 500. [Reference Appendix 4]
Next I focused on testing specific points to see what predictions my model makes, since, as stated in the instructions, "if C is too large or too small, they'll almost all be the same (all zero or all one) and the predictive value of the model will be poor."
After looking at various C values in the range 1 to 500, the predictions all seem to be the same. This could be because we are using the same data for training and testing.
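The prediction balance can be checked directly; a minimal sketch, assuming `model` and `data` are the fitted ksvm model and data frame from earlier:

```r
# How many 0s and 1s does the model predict on the full data set?
pred <- predict(model, data[, 1:10])
print(table(pred))                        # class counts of predictions
print(table(pred, actual = data[, 11]))  # confusion matrix vs. true labels
```

A degenerate C shows up here as one of the two prediction counts being zero (or nearly so).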
After further experimentation, I selected a C value of 100. It still produces the higher accuracy of 86.39%, has a balance of 0 and 1 predictions, and matches the assumption I made at the beginning (selecting a higher C value to predict more correctly). The formula is listed below:
Credit_card = -0.081 - 0.001*A1 - 0.001*A2 - 0.002*A3 + 0.003*A8 + 1.005*A9 - 0.003*A10 + 0.0002*A11 - 0.001*A12 - 0.001*A14 + 0.106*A15
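The coefficients in the equation above can be recovered from the fitted ksvm object. A minimal sketch, assuming `model` is the vanilladot model fitted with C = 100:

```r
# Weights: for each predictor column, sum over support vectors of
# (dual coefficient * support-vector value)
a  <- colSums(model@xmatrix[[1]] * model@coef[[1]])
# Intercept: ksvm stores the negative of the intercept in model@b
a0 <- -model@b
print(a)   # one weight per predictor (A1..A15 in the headers file)
print(a0)  # intercept term
```

The classifier then predicts 1 when a0 + sum(a * x) is positive and 0 otherwise.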
2. You are welcome, but not required, to try other (nonlinear) kernels as well; we're not covering them in this course, but they can sometimes be useful and might provide better predictions than vanilladot.
I decided to try three other kernels from the list: polydot, tanhdot, and besseldot, to see how the accuracy differed. I kept the same C value (100) that I chose as the optimum for the ksvm model with vanilladot. The polydot model maintained the same accuracy as the vanilladot kernel, the besseldot kernel reported a higher accuracy of 92.5%, and the tanhdot kernel reported a lower accuracy of 72.2%.
Of the three models, I would say the besseldot kernel is the most accurate with the C value set to 100.
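The kernel comparison can be sketched with the same loop structure as before, swapping the kernel name instead of C (again assuming the data file and column layout described earlier):

```r
library(kernlab)
data <- read.table("credit_card_data.txt", header = FALSE)

# Compare training accuracy across kernels at the chosen C = 100
for (k in c("vanilladot", "polydot", "tanhdot", "besseldot")) {
  model <- ksvm(as.matrix(data[, 1:10]), as.factor(data[, 11]),
                type = "C-svc", kernel = k, C = 100, scaled = TRUE)
  acc <- sum(predict(model, data[, 1:10]) == data[, 11]) / nrow(data)
  cat(k, ": accuracy =", round(100 * acc, 2), "%\n")
}
```

Note that nonlinear kernels like besseldot can overfit when evaluated on the training data itself, which may explain the higher reported accuracy.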
3. Using the k-nearest-neighbors classification function kknn contained in the R kknn package, suggest a good value of k, and show how well it classifies the data points in the full data set. Don't forget to scale the data (scale=TRUE in kknn).
I have had previous experience with this model during my bachelor's program at GSU. I went through the process of partitioning the data and testing different values of k. After running the model, I noticed that the accuracy was very low for almost every value of k from 1 to 50. [Please see Appendix 5] When I reviewed other students' results in the discussion post, I had to think about why this could be the case. The biggest difference is that I partitioned the data into training and validation sets, while other students took a different approach, following the guidance from the TA and the instructions, which do not require partitioning the data.
With these results, k = 5 provides the best kknn prediction at 66.7%. The results suggest the model may be underfitting, meaning it is not able to classify the points correctly.
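The partition-and-loop approach described above can be sketched as follows. The 70/30 split and the seed are assumptions for illustration; my actual split may have differed, which would change the reported accuracies.

```r
library(kknn)
data <- read.table("credit_card_data.txt", header = FALSE)

# Assumed 70/30 train/validation partition (hypothetical split)
set.seed(1)
train_idx <- sample(nrow(data), size = round(0.7 * nrow(data)))
train <- data[train_idx, ]
valid <- data[-train_idx, ]

# Test k from 1 to 50, scaling the predictors as the prompt requires
for (k in 1:50) {
  model <- kknn(as.factor(V11) ~ ., train, valid, k = k, scale = TRUE)
  acc <- sum(fitted(model) == valid$V11) / nrow(valid)
  cat("k =", k, "accuracy =", round(100 * acc, 2), "%\n")
}
```

The no-partition approach the other students used instead evaluates each point against the rest of the full data set, which is why their numbers differ from mine.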
Appendix
(Feel free to enlarge any images to view them better)
Appendix 1:
Appendix 2:
Appendix 3:
Appendix 4:
Appendix 5: