Reference_Notebook_Milestone_1_Classification+FINAL

School: Rutgers University
Course: 700
Subject: Economics
Date: Apr 30, 2024
Type: html
Pages: 37
Uploaded by DeanQuetzalPerson1017
Milestone 1

Problem Definition

The context: Why is this problem important to solve?
The objectives: What is the intended goal?
The key questions: What are the key questions that need to be answered?
The problem formulation: What is it that we are trying to solve using data science?

Data Description:

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant ultimately defaulted or was severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). Twelve input variables were recorded for each applicant.

BAD: 1 = client defaulted on the loan, 0 = loan repaid
LOAN: Amount of loan approved.
MORTDUE: Amount due on the existing mortgage.
VALUE: Current value of the property.
REASON: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, i.e., taking out a new loan to pay off other liabilities and consumer debts).
JOB: The type of job the loan applicant has, such as manager, self-employed, etc.
YOJ: Years at present job.
DEROG: Number of major derogatory reports (which indicate a serious delinquency or late payments).
DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
CLAGE: Age of the oldest credit line in months.
NINQ: Number of recent credit inquiries.
CLNO: Number of existing credit lines.
DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly income; one way lenders measure the ability to manage the monthly payments on the money borrowed).

Important Notes

This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the rubric shared for each milestone.
Unlike previous courses, it does not follow the pattern of graded questions in different sections. This notebook gives you a direction on the steps needed to reach a viable solution to the problem. Please note that this is just one way of doing it; there can be other 'creative' ways to solve the problem, and we urge you to feel free to explore them as an 'optional' exercise. In the notebook, there are markdown cells called Observations and Insights. It is good practice to provide observations and extract insights from the outputs. The naming convention for different variables can vary; please consider the code provided in this notebook as sample code. All the outputs in the notebook are just for reference and can differ if you follow a different approach. There are sections called Think About It in the notebook that will help you better understand the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they want to explore different techniques.
Import the necessary libraries

In [109]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import metrics, tree
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')

Read the dataset

In [5]:
hm = pd.read_csv("hmeq.csv")

In [6]:
# Copying data to another variable to avoid any changes to the original data
data = hm.copy()

Print the first and last 5 rows of the dataset

In [7]:
# Display first five rows
# Remove ___________ and complete the code
hm.head()

Out[7]: (columns to the right of CLAGE are cut off in the source)
   BAD   LOAN  MORTDUE     VALUE   REASON     JOB   YOJ  DEROG  DELINQ       CLAGE
0    1   1100  25860.0   39025.0  HomeImp   Other  10.5    0.0     0.0   94.366667
1    1   1300  70053.0   68400.0  HomeImp   Other   7.0    0.0     2.0  121.833333
2    1   1500  13500.0   16700.0  HomeImp   Other   4.0    0.0     0.0  149.466667
3    1   1500      NaN       NaN      NaN     NaN   NaN    NaN     NaN         NaN
4    0   1700  97800.0  112000.0  HomeImp  Office   3.0    0.0     0.0   93.333333

In [8]:
# Display last 5 rows
# Remove ___________ and complete the code
hm.tail()

Out[8]: (columns to the right of CLAGE are cut off in the source)
      BAD   LOAN  MORTDUE    VALUE   REASON    JOB   YOJ  DEROG  DELINQ       CLAGE
5955    0  88900  57264.0  90185.0  DebtCon  Other  16.0    0.0     0.0  221.808718
5956    0  89000  54576.0  92937.0  DebtCon  Other  16.0    0.0     0.0  208.692070
5957    0  89200  54045.0  92924.0  DebtCon  Other  15.0    0.0     0.0  212.279697
5958    0  89800  50370.0  91861.0  DebtCon  Other  14.0    0.0     0.0  213.892709
5959    0  89900  48811.0  88934.0  DebtCon  Other  15.0    0.0     0.0  219.601002

Understand the shape of the dataset

In [9]:
# Check the shape of the data
# Remove ___________ and complete the code
print(hm.shape)
(5960, 13)

Insights: The dataset has 5,960 rows and 13 columns.

Check the data types of the columns

In [10]:
# Check info of the data
# Remove ___________ and complete the code
hm.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   BAD      5960 non-null   int64
 1   LOAN     5960 non-null   int64
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object
 5   JOB      5681 non-null   object
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB

Insights: BAD and LOAN are int64, REASON and JOB are object, and the rest are float64.

Check for missing values

In [11]:
# Analyse missing values - Hint: use isnull() function
# Remove ___________ and complete the code
print(hm.isnull().sum())
BAD          0
LOAN         0
MORTDUE    518
VALUE      112
REASON     252
JOB        279
YOJ        515
DEROG      708
DELINQ     580
CLAGE      308
NINQ       510
CLNO       222
DEBTINC   1267
dtype: int64

In [12]:
# Check the percentage of missing values in each column.
# Hint: divide the result from the previous code by the number of rows in the dataset
# Remove ___________ and complete the code
percent_missing = hm.isnull().sum() * 100 / len(hm)
missing_value_hm = pd.DataFrame({'column_name': hm.columns,
                                 'percent_missing': percent_missing})
print(missing_value_hm)
        column_name  percent_missing
BAD             BAD         0.000000
LOAN           LOAN         0.000000
MORTDUE     MORTDUE         8.691275
VALUE         VALUE         1.879195
REASON       REASON         4.228188
JOB             JOB         4.681208
YOJ             YOJ         8.640940
DEROG         DEROG        11.879195
DELINQ       DELINQ         9.731544
CLAGE         CLAGE         5.167785
NINQ           NINQ         8.557047
CLNO           CLNO         3.724832
DEBTINC     DEBTINC        21.258389
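One common way to act on these percentages is to drop (or merely flag) columns whose missing share exceeds a chosen cut-off. A minimal sketch with a hypothetical 20% threshold on toy data (the column names and values here are made up, not the HMEQ data):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for hm (hypothetical values)
df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],          # 25% missing
    "B": [np.nan, np.nan, np.nan, 1],  # 75% missing
    "C": [1, 2, 3, 4],               # 0% missing
})

threshold = 20  # assumed cut-off in percent; tune to your tolerance
pct_missing = df.isnull().mean() * 100
keep_cols = pct_missing[pct_missing <= threshold].index.tolist()
print(keep_cols)  # -> ['C']
```

Whether 20% is the right limit is a judgment call; dropping loses information, while imputing a heavily missing column risks fabricating signal.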
Insights: REASON and JOB carry significant information. DEBTINC and DEROG have the most null values, with DEBTINC crossing the 20% threshold.

Think about it: We found the total number of missing values and the percentage of missing values; which is better to consider? What can be the limit for the % of missing values in a column before avoiding it, and what are the challenges associated with filling them versus avoiding them?

We can convert the object type columns to categories. Converting "object" to "category" reduces the space required to store the dataframe.

Convert the data types

In [13]:
cols = data.select_dtypes(['object']).columns.tolist()
# Adding the target variable to this list, as this is a classification problem and the target variable is categorical
cols.append('BAD')

In [14]:
cols
Out[14]: ['REASON', 'JOB', 'BAD']

In [15]:
# Changing the data type of object type columns to category. Hint: use the astype() function
# Remove ___________ and complete the code
hm = hm.astype({"BAD": 'category', "REASON": 'category', "JOB": 'category'})

In [16]:
# Checking the info again and the datatype of different variables
# Remove ___________ and complete the code
print(hm.dtypes)
BAD        category
LOAN          int64
MORTDUE     float64
VALUE       float64
REASON     category
JOB        category
YOJ         float64
DEROG       float64
DELINQ      float64
CLAGE       float64
NINQ        float64
CLNO        float64
DEBTINC     float64
dtype: object

Analyze Summary Statistics of the dataset

In [17]:
# Analyze the summary statistics for numerical variables
# Remove ___________ and complete the code
hm.describe().T
Out[17]: (columns to the right of the 50% quantile are cut off in the source)
         count           mean           std          min           25%         50%
LOAN     5960.0   18607.969799  11207.480417  1100.000000  11100.000000  16300.0000
MORTDUE  5442.0   73760.817200  44457.609458  2063.000000  46276.000000  65019.0000
VALUE    5848.0  101776.048741  57385.775334  8000.000000  66075.500000  89235.5000
YOJ      5445.0       8.922268      7.573982     0.000000      3.000000      7.000000
DEROG    5252.0       0.254570      0.846047     0.000000      0.000000      0.000000
DELINQ   5380.0       0.449442      1.127266     0.000000      0.000000      0.000000
CLAGE    5652.0     179.766275     85.810092     0.000000    115.116702    173.466667
NINQ     5450.0       1.186055      1.728675     0.000000      0.000000      1.000000
CLNO     5738.0      21.296096     10.138933     0.000000     15.000000     20.000000
DEBTINC  4693.0      33.779915      8.601746     0.524499     29.140031     34.818262

Insights: The means look reasonable for most variables, with the mean loan being about $18,600. VALUE has a high standard deviation, so there must be a wide range in the value of applicants' houses. The data look skewed to the right (or roughly normal), because the mean is higher than the median for almost all variables except DEBTINC. There appear to be outliers in MORTDUE, VALUE, and DEBTINC, and perhaps in LOAN as well.

In [18]:
# Check summary for categorical data - Hint: inside describe function you can use the argument include=['category']
# Remove ___________ and complete the code
hm.describe(include=['category']).T
Out[18]:
        count  unique      top  freq
BAD      5960       2        0  4771
REASON   5708       2  DebtCon  3928
JOB      5681       6    Other  2388

Insights: There are more 0s than 1s for BAD, and the most common reason is DebtCon,
with Other being the most frequent job category.

Let's look at the unique values in all the categorical variables

In [19]:
# Checking the count of unique values in each categorical column
# Remove ___________ and complete the code
cols_cat = hm.select_dtypes(['category'])
for i in cols_cat.columns:
    print('Unique values in', i, 'are :')
    print(cols_cat[i].value_counts())
    print('*' * 40)

Insights: HomeImp and DebtCon are the only two reasons for a loan, while JOB has six categories (Mgr, Office, Other, ProfExe, Sales, Self), with Other the most frequent.

Think about it: The results above gave the absolute count of unique values in each categorical column. Are absolute values a good measure? If not, what else can be used? Try implementing that.

Exploratory Data Analysis (EDA) and Visualization

Univariate Analysis

Univariate analysis is used to explore each variable in a data set separately. It looks at the range of values as well as the central tendency of the values. It can be done for both numerical and categorical variables.

1. Univariate Analysis - Numerical Data

Histograms and box plots help to visualize and describe numerical data. We use a box plot and histogram to analyze the numerical columns.

In [20]:
# While doing uni-variate analysis of numerical variables we want to study their central tendency and dispersion.
# Let us write a function that will help us create a boxplot and histogram for any input numerical variable.
# This function takes the numerical column as the input and returns the boxplot and histogram for the variable.
# Let us see if this helps us write faster and cleaner code.
def histogram_boxplot(feature, figsize=(15,10), bins=None):
    """
    Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (15,10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2,          # number of rows of the subplot grid = 2
                                           sharex=True,      # x-axis will be shared among all subplots
                                           gridspec_kw={"height_ratios": (.25, .75)},
                                           figsize=figsize)  # creating the 2 subplots
    sns.boxplot(feature, ax=ax_box2, showmeans=True, color='violet')  # boxplot; a star indicates the mean value of the column
    sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins) if bins else sns.distplot(feature, kde=False, ax=ax_hist2)  # histogram
    ax_hist2.axvline(np.mean(feature), color='green', linestyle='--')   # add mean to the histogram
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-')  # add median to the histogram

Using the above function, let's first analyze the Histogram and Boxplot for LOAN

In [21]:
# Build the histogram boxplot for LOAN
histogram_boxplot(data['LOAN'])

Insights: Skewed to the right; most of the loans are below 40,000; the median is around
17,000; the minimum loan is around 1,000; the middle range runs from about 11,000 to 16,000. There are a lot of outliers.

Note: As done above, analyze the Histogram and Boxplot for other variables

In [22]:
histogram_boxplot(data['MORTDUE'])

Insights: A lot of outliers; skewed to the right.

In [23]:
histogram_boxplot(data['VALUE'])
Insights: There seems to be less variation within the data, with a lot of outliers; the median value is …

In [24]:
histogram_boxplot(data['YOJ'])
Insights: Values are more spread out, with a few minimal outliers; skewed slightly to the right. 0 is the highest and most popular value, so maybe unemployed applicants?

In [25]:
histogram_boxplot(data['DEROG'])
Insights: 0 is the most common value; most applicants don't have any derogatory reports.

In [26]:
histogram_boxplot(data['DELINQ'])
Insights: Most applicants don't have any delinquent credit lines; there is no visible box in the boxplot.

In [27]:
histogram_boxplot(data['CLAGE'])
Insights: Skewed to the right; the outliers group together; the majority are under 400.

In [28]:
histogram_boxplot(data['NINQ'])
Insights: Few outliers; skewed to the right; the minimum/low values are the most common.

In [29]:
histogram_boxplot(data['CLNO'])
Insights: Somewhat normally distributed, with few outliers.

In [30]:
histogram_boxplot(data['DEBTINC'])
Insights: Less variation in the data / low dispersion.

2. Univariate Analysis - Categorical Data

In [31]:
# Function to create barplots that indicate the percentage for each category.
def perc_on_bar(plot, feature):
    '''
    plot: axis holding the bars
    feature: categorical feature
    the function won't work if a column is passed in the hue parameter
    '''
    total = len(feature)  # length of the column
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position of the annotation
        y = p.get_y() + p.get_height()            # y position of the annotation
        plot.annotate(percentage, (x, y), size=12)  # annotate the percentage
    plt.show()  # show the plot

Analyze Barplot for DELINQ

In [32]:
# Build barplot for DELINQ
plt.figure(figsize=(15,5))
ax = sns.countplot(data["DELINQ"], palette='winter')
perc_on_bar(ax, data["DELINQ"])

In [33]:
plt.figure(figsize=(15,5))
ax = sns.countplot(data["JOB"], palette='winter')
perc_on_bar(ax, data["JOB"])
Insights: Other is the most common category, with ProfExe and Office being close seconds.

In [34]:
plt.figure(figsize=(15,5))
ax = sns.countplot(data["REASON"], palette='winter')
perc_on_bar(ax, data["REASON"])
Insights: DebtCon is the most common reason, almost 2x HomeImp.

In [35]:
plt.figure(figsize=(15,5))
ax = sns.countplot(data["BAD"], palette='winter')
perc_on_bar(ax, data["BAD"])
Note: As done above, analyze barplots for the other categorical variables.

Insights: The majority repaid the loan.

Bivariate Analysis

Bivariate Analysis: Continuous and Categorical Variables

Analyze BAD vs LOAN

In [36]:
sns.boxplot(data["BAD"], data['LOAN'], palette="PuBu")
Out[36]: <AxesSubplot:xlabel='BAD', ylabel='LOAN'>

Insights: Those who defaulted on the loan seem to have a lower median than those who paid it off; the outliers are also lower.

In [37]:
sns.boxplot(data["BAD"], data['VALUE'], palette="PuBu")
Out[37]: <AxesSubplot:xlabel='BAD', ylabel='VALUE'>
Insights: The two groups look nearly identical in min and max; there seem to be more outliers (defaulted loans with a high property value) among defaulters. Since there isn't much difference in the medians, VALUE might not be that significant apart from the outliers.

In [38]:
sns.boxplot(data["BAD"], data['MORTDUE'], palette="PuBu")
Out[38]: <AxesSubplot:xlabel='BAD', ylabel='MORTDUE'>

Insights: More outliers among defaulted loans; the median is also lower compared to repaid loans.

In [39]:
sns.boxplot(data["BAD"], data['YOJ'], palette="PuBu")
Out[39]: <AxesSubplot:xlabel='BAD', ylabel='YOJ'>
Insights: Overall fewer years at the present job for defaulted loans, with more outliers as well.

In [40]:
sns.boxplot(data["BAD"], data['DELINQ'], palette="PuBu")
Out[40]: <AxesSubplot:xlabel='BAD', ylabel='DELINQ'>

Insights: A significant variable; those with delinquent credit lines are more likely to default on loans.

In [41]:
sns.boxplot(data["BAD"], data['CLAGE'], palette="PuBu")
Out[41]: <AxesSubplot:xlabel='BAD', ylabel='CLAGE'>
Insights: Those who repaid the loan have an older credit line on average.

In [42]:
sns.boxplot(data["BAD"], data['CLNO'], palette="PuBu")
Out[42]: <AxesSubplot:xlabel='BAD', ylabel='CLNO'>

Insights: Those with a notably higher number of credit lines are more likely to default on loans.

In [43]:
sns.boxplot(data["BAD"], data['NINQ'], palette="PuBu")
Out[43]: <AxesSubplot:xlabel='BAD', ylabel='NINQ'>
Insights: More recent credit inquiries are more common for defaulted loans.

Note: As shown above, perform bivariate analysis on different pairs of categorical and continuous variables.

Bivariate Analysis: Two Continuous Variables

In [44]:
sns.scatterplot(data["VALUE"], data['MORTDUE'], palette="PuBu")
Out[44]: <AxesSubplot:xlabel='VALUE', ylabel='MORTDUE'>

Insights: Positive, linear relationship.

In [45]:
sns.scatterplot(data["VALUE"], data['LOAN'], palette="PuBu")
Out[45]: <AxesSubplot:xlabel='VALUE', ylabel='LOAN'>

Insights: No relationship.

In [46]:
sns.scatterplot(data["VALUE"], data['MORTDUE'], palette="PuBu")
Out[46]: <AxesSubplot:xlabel='VALUE', ylabel='MORTDUE'>

In [47]:
sns.scatterplot(data["DELINQ"], data['MORTDUE'], palette="PuBu")
Out[47]: <AxesSubplot:xlabel='DELINQ', ylabel='MORTDUE'>
In [48]:
sns.scatterplot(data["VALUE"], data['DELINQ'], palette="PuBu")
Out[48]: <AxesSubplot:xlabel='VALUE', ylabel='DELINQ'>

In [49]:
sns.scatterplot(data["LOAN"], data['MORTDUE'], palette="PuBu")
Out[49]: <AxesSubplot:xlabel='LOAN', ylabel='MORTDUE'>
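The strength of the relationships eyeballed in these scatterplots can also be quantified with a correlation coefficient. A minimal sketch, reusing the first few VALUE/MORTDUE pairs from the head() output above (in the notebook you would simply call hm['VALUE'].corr(hm['MORTDUE']) on the full column, which drops NaNs pairwise):

```python
import pandas as pd

# Four VALUE/MORTDUE pairs copied from the head() output above
toy = pd.DataFrame({
    "VALUE":   [39025.0, 68400.0, 16700.0, 112000.0],
    "MORTDUE": [25860.0, 70053.0, 13500.0, 97800.0],
})

# Pearson correlation between two continuous columns
r = toy["VALUE"].corr(toy["MORTDUE"])
print(round(r, 3))  # close to 1: a strong positive, linear relationship
```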
In [50]:
sns.scatterplot(data["DELINQ"], data['CLNO'], palette="PuBu")
Out[50]: <AxesSubplot:xlabel='DELINQ', ylabel='CLNO'>

Note: As shown above, perform bivariate analysis on different pairs of continuous variables.

Insights: The groups sit in different locations; VALUE vs MORTDUE is the only significant relationship.

Bivariate Analysis: BAD vs Categorical Variables

The stacked bar chart (aka stacked bar graph) extends the standard bar chart from looking at numeric values across one categorical variable to two.

In [51]:
# Function to plot stacked bar charts for categorical columns
def stacked_plot(x):
    sns.set(palette='nipy_spectral')
    tab1 = pd.crosstab(x, data['BAD'], margins=True)
    print(tab1)
    print('-' * 120)
    tab = pd.crosstab(x, data['BAD'], normalize='index')
    tab.plot(kind='bar', stacked=True, figsize=(10,5))
    plt.legend(loc="upper left", bbox_to_anchor=(1,1))
    plt.show()

Plot stacked bar plot for BAD and REASON

In [52]:
# Plot stacked bar plot for BAD and REASON
stacked_plot(data['REASON'])
BAD         0     1   All
REASON
DebtCon  3183   745  3928
HomeImp  1384   396  1780
All      4567  1141  5708
------------------------------------------------------------------------------------------------------------------------

In [53]:
stacked_plot(data['JOB'])
BAD         0     1   All
JOB
Mgr       588   179   767
Office    823   125   948
Other    1834   554  2388
ProfExe  1064   212  1276
Sales      71    38   109
Self      135    58   193
All      4515  1166  5681
------------------------------------------------------------------------------------------------------------------------

Insights: More of the defaulted loans are for DebtCon than for HomeImp. Those with a Sales job have the highest default rate, with the self-employed a close second; Office workers are the least likely to default.

Multivariate Analysis

Analyze Correlation Heatmap for Numerical Variables

In [90]:
# Separating numerical variables
numerical_col = data.select_dtypes(include=np.number).columns.tolist()

# Build correlation matrix for numerical columns
# Remove ___________ and complete the code
corr = hm[numerical_col].corr()

# plot the heatmap
# Remove ___________ and complete the code
plt.figure(figsize=(16,12))
sns.heatmap(corr, cmap='coolwarm', vmax=1, vmin=-1, fmt=".2f",
            xticklabels=corr.columns, yticklabels=corr.columns);
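The heatmap only covers the numeric columns. For the association between a categorical predictor and BAD, a chi-square test of independence is one option; a sketch using scipy.stats (already imported in this notebook) on the JOB-vs-BAD counts printed in the crosstab above:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# JOB vs BAD contingency table, copied from the crosstab output above
tab = pd.DataFrame(
    {0: [588, 823, 1834, 1064, 71, 135],
     1: [179, 125, 554, 212, 38, 58]},
    index=["Mgr", "Office", "Other", "ProfExe", "Sales", "Self"],
)

chi2, p, dof, expected = chi2_contingency(tab)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}, dof = {dof}")
```

A very small p-value here would support the earlier observation that default rates differ across job categories (e.g., Sales vs Office).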
In [55]:
# Build pairplot for the data with hue = 'BAD'
# Remove ___________ and complete the code
g = sns.pairplot(hm, hue='BAD')
Think about it: Are there missing values and outliers in the dataset? If yes, how can you treat them? Can you think of different ways in which this can be done, and when to treat these outliers or not? Can we create new features based on missing values?

Treating Outliers

In [91]:
def treat_outliers(df, col):
    '''
    Treats outliers in a variable
    df: data frame
    col: str, name of the numerical column
    '''
    Q1 = df[col].quantile(0.25)  # 25th quantile
    Q3 = df[col].quantile(0.75)  # 75th quantile
    IQR = Q3 - Q1                # inter-quartile range
    Lower_Whisker = Q1 - 1.5 * IQR  # define lower whisker
    Upper_Whisker = Q3 + 1.5 * IQR  # define upper whisker
    # values smaller than Lower_Whisker are clipped to Lower_Whisker,
    # and values above Upper_Whisker are clipped to Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df

def treat_outliers_all(df, col_list):
    '''
    Treat outliers in all numerical variables
    df: data frame
    col_list: list of numerical columns
    '''
    for c in col_list:
        df = treat_outliers(df, c)
    return df

In [92]:
df_raw = data.copy()
numerical_col = df_raw.select_dtypes(include=np.number).columns.tolist()  # getting the list of numerical columns
df = treat_outliers_all(df_raw, numerical_col)

Adding new columns in the dataset for each column which has missing values

In [93]:
# For each column we create a binary flag: 1 if the row has a missing value in that column, else 0.
def add_binary_flag(df, col):
    '''
    df: the dataframe
    col: a column which has missing values
    It returns a dataframe which has a binary flag for missing values in column col
    '''
    new_col = str(col) + '_missing_values_flag'
    df[new_col] = df[col].isna().astype(int)
    return df

In [94]:
# list of columns that have missing values in them
missing_col = [col for col in df.columns if df[col].isnull().any()]
for colmn in missing_col:
    add_binary_flag(df, colmn)

Filling missing values in numerical columns with the median and in categorical columns with the mode

In [100]:
# Treat missing values in numerical columns with the median and in categorical columns with the mode
# Select numeric columns.
num_data = data.select_dtypes('number')
# Select categorical columns.
cat_data = data.select_dtypes(include=['object', 'category']).columns.tolist()

# Fill numeric columns with the median.
# Remove _________ and complete the code
data[num_data.columns] = num_data.fillna(num_data.median())

# Fill categorical columns with the mode.
# Remove _________ and complete the code
for column in cat_data:
    mode = data[column].mode()[0]
    data[column] = data[column].fillna(mode)

Proposed approach
1. Potential techniques - What different techniques should be explored?
2. Overall solution design - What is the potential solution design?
3. Measures of success - What are the key measures of success?

Separating the target variable from other variables

In [108]:
# Drop the dependent variable from the dataframe and create the X (independent variables) matrix
# Remove _________ and complete the code
X = data.drop(columns='BAD')

# Create dummy variables for the categorical variables - Hint: use the get_dummies() function
# Remove _________ and complete the code
X = pd.get_dummies(X, columns=['JOB', 'REASON'], drop_first=True)

# Create y (dependent variable)
y = hm.iloc[:, 0].values
print(y)
[1, 1, 1, 1, 0, ..., 0, 0, 0, 0, 0]
Length: 5960
Categories (2, int64): [0, 1]

In [ ]:
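With X and y in hand, a natural next step under the proposed approach above is a stratified train/test split and a simple baseline such as logistic regression, scored with recall, since missing a defaulter is typically the costlier error. A sketch on synthetic data (the real notebook would pass the X and y built above; the shapes, seed, and signal here are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical stand-in for the dummified X and the target y built above
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Stratified split preserves the class balance (for HMEQ, the ~20% BAD rate)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

model = LogisticRegression().fit(X_train, y_train)
rec = recall_score(y_test, model.predict(X_test))
print(f"test recall: {rec:.2f}")
```

The same split-and-score pattern applies to the DecisionTreeClassifier, BaggingClassifier, and RandomForestClassifier imported at the top, with GridSearchCV available for tuning.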