Reference_Notebook_Milestone_1_Classification+FINAL

School: Rutgers University
Course: 700
Subject: Economics
Date: Apr 30, 2024
Type: html
Pages: 37
Uploaded by DeanQuetzalPerson1017
Milestone 1

Problem Definition

The context: Why is this problem important to solve?
The objectives: What is the intended goal?
The key questions: What are the key questions that need to be answered?
The problem formulation: What is it that we are trying to solve using data science?

Data Description:

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant ultimately defaulted or was severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). Twelve input variables were recorded for each applicant.

BAD: 1 = client defaulted on the loan, 0 = loan repaid
LOAN: Amount of loan approved.
MORTDUE: Amount due on the existing mortgage.
VALUE: Current value of the property.
REASON: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, i.e., taking out a new loan to pay off other liabilities and consumer debts).
JOB: The type of job the loan applicant has, such as manager, self-employed, etc.
YOJ: Years at present job.
DEROG: Number of major derogatory reports (which indicate a serious delinquency or late payments).
DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
CLAGE: Age of the oldest credit line in months.
NINQ: Number of recent credit inquiries.
CLNO: Number of existing credit lines.
DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly income; one way lenders measure the ability to manage the monthly payments on the money borrowed).

Important Notes

This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the rubric shared for each milestone.
Unlike previous courses, it does not follow the pattern of graded questions in different sections. This notebook gives you a direction on the steps needed to reach a viable solution to the problem. Please note that this is just one way of doing it; there can be other 'creative' ways to solve the problem, and we urge you to feel free to explore them as an 'optional' exercise. In the notebook, there are markdown cells called Observations and Insights. It is good practice to provide observations and extract insights from the outputs. The naming convention for different variables can vary; please consider the code provided in this notebook as sample code. All the outputs in the notebook are just for reference and can differ if you follow a different approach. There are sections called Think About It in the notebook that will help you better understand the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they want to explore different techniques.
Import the necessary libraries

In [109]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import metrics, tree
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')

Read the dataset

In [5]:
hm = pd.read_csv("hmeq.csv")

In [6]:
# Copying data to another variable to avoid any changes to the original data
data = hm.copy()

Print the first and last 5 rows of the dataset

In [7]:
# Display first five rows
# Remove ___________ and complete the code
hm.head()

Out[7]: (columns to the right of CLAGE are cut off in the source)
   BAD   LOAN  MORTDUE     VALUE   REASON     JOB   YOJ  DEROG  DELINQ       CLAGE
0    1   1100  25860.0   39025.0  HomeImp   Other  10.5    0.0     0.0   94.366667
1    1   1300  70053.0   68400.0  HomeImp   Other   7.0    0.0     2.0  121.833333
2    1   1500  13500.0   16700.0  HomeImp   Other   4.0    0.0     0.0  149.466667
3    1   1500      NaN       NaN      NaN     NaN   NaN    NaN     NaN         NaN
4    0   1700  97800.0  112000.0  HomeImp  Office   3.0    0.0     0.0   93.333333

In [8]:
# Display last 5 rows
# Remove ___________ and complete the code
hm.tail()

Out[8]: (columns to the right of CLAGE are cut off in the source)
      BAD   LOAN  MORTDUE    VALUE   REASON    JOB   YOJ  DEROG  DELINQ       CLAGE
5955    0  88900  57264.0  90185.0  DebtCon  Other  16.0    0.0     0.0  221.808718
5956    0  89000  54576.0  92937.0  DebtCon  Other  16.0    0.0     0.0  208.692070
5957    0  89200  54045.0  92924.0  DebtCon  Other  15.0    0.0     0.0  212.279697
5958    0  89800  50370.0  91861.0  DebtCon  Other  14.0    0.0     0.0  213.892709
5959    0  89900  48811.0  88934.0  DebtCon  Other  15.0    0.0     0.0  219.601002

Understand the shape of the dataset

In [9]:
# Check the shape of the data
# Remove ___________ and complete the code
print(hm.shape)
(5960, 13)

Insights: The dataset has 5,960 rows and 13 columns.

Check the data types of the columns

In [10]:
# Check info of the data
# Remove ___________ and complete the code
hm.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   BAD      5960 non-null   int64
 1   LOAN     5960 non-null   int64
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object
 5   JOB      5681 non-null   object
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB

Insights: BAD and LOAN are int64, REASON and JOB are object, and the rest are float64.

Check for missing values

In [11]:
# Analyse missing values - Hint: use isnull() function
# Remove ___________ and complete the code
print(hm.isnull().sum())
BAD          0
LOAN         0
MORTDUE    518
VALUE      112
REASON     252
JOB        279
YOJ        515
DEROG      708
DELINQ     580
CLAGE      308
NINQ       510
CLNO       222
DEBTINC   1267
dtype: int64

In [12]:
# Check the percentage of missing values in each column.
# Hint: divide the result from the previous code by the number of rows in the dataset
# Remove ___________ and complete the code
percent_missing = hm.isnull().sum() * 100 / len(hm)
missing_value_hm = pd.DataFrame({'column_name': hm.columns,
                                 'percent_missing': percent_missing})
print(missing_value_hm)
        column_name  percent_missing
BAD             BAD         0.000000
LOAN           LOAN         0.000000
MORTDUE     MORTDUE         8.691275
VALUE         VALUE         1.879195
REASON       REASON         4.228188
JOB             JOB         4.681208
YOJ             YOJ         8.640940
DEROG         DEROG        11.879195
DELINQ       DELINQ         9.731544
CLAGE         CLAGE         5.167785
NINQ           NINQ         8.557047
CLNO           CLNO         3.724832
DEBTINC     DEBTINC        21.258389
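One common way to act on these percentages is to drop (or merely flag) columns whose missing share exceeds a chosen cut-off. A minimal sketch with a hypothetical 20% threshold on toy data (the column names and values here are made up, not the HMEQ data):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for hm (hypothetical values)
df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],          # 25% missing
    "B": [np.nan, np.nan, np.nan, 1],  # 75% missing
    "C": [1, 2, 3, 4],               # 0% missing
})

threshold = 20  # assumed cut-off in percent; tune to your tolerance
pct_missing = df.isnull().mean() * 100
keep_cols = pct_missing[pct_missing <= threshold].index.tolist()
print(keep_cols)  # -> ['C']
```

Whether 20% is the right limit is a judgment call; dropping loses information, while imputing a heavily missing column risks fabricating signal.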
Insights: REASON and JOB carry significant information. DEBTINC and DEROG have the most null values, with DEBTINC crossing the 20% threshold.

Think about it: We found the total number of missing values and the percentage of missing values; which is better to consider? What can be the limit for the % of missing values in a column before avoiding it, and what are the challenges associated with filling them versus avoiding them?

We can convert the object type columns to categories. Converting "object" to "category" reduces the space required to store the dataframe.

Convert the data types

In [13]:
cols = data.select_dtypes(['object']).columns.tolist()
# Adding the target variable to this list, as this is a classification problem and the target variable is categorical
cols.append('BAD')

In [14]:
cols
Out[14]: ['REASON', 'JOB', 'BAD']

In [15]:
# Changing the data type of object type columns to category. Hint: use the astype() function
# Remove ___________ and complete the code
hm = hm.astype({"BAD": 'category', "REASON": 'category', "JOB": 'category'})

In [16]:
# Checking the info again and the datatype of different variables
# Remove ___________ and complete the code
print(hm.dtypes)
BAD        category
LOAN          int64
MORTDUE     float64
VALUE       float64
REASON     category
JOB        category
YOJ         float64
DEROG       float64
DELINQ      float64
CLAGE       float64
NINQ        float64
CLNO        float64
DEBTINC     float64
dtype: object

Analyze Summary Statistics of the dataset

In [17]:
# Analyze the summary statistics for numerical variables
# Remove ___________ and complete the code
hm.describe().T
Out[17]: (columns to the right of the 50% quantile are cut off in the source)
         count           mean           std          min           25%         50%
LOAN     5960.0   18607.969799  11207.480417  1100.000000  11100.000000  16300.0000
MORTDUE  5442.0   73760.817200  44457.609458  2063.000000  46276.000000  65019.0000
VALUE    5848.0  101776.048741  57385.775334  8000.000000  66075.500000  89235.5000
YOJ      5445.0       8.922268      7.573982     0.000000      3.000000      7.000000
DEROG    5252.0       0.254570      0.846047     0.000000      0.000000      0.000000
DELINQ   5380.0       0.449442      1.127266     0.000000      0.000000      0.000000
CLAGE    5652.0     179.766275     85.810092     0.000000    115.116702    173.466667
NINQ     5450.0       1.186055      1.728675     0.000000      0.000000      1.000000
CLNO     5738.0      21.296096     10.138933     0.000000     15.000000     20.000000
DEBTINC  4693.0      33.779915      8.601746     0.524499     29.140031     34.818262

Insights: The means look reasonable for most variables, with the mean loan being about $18,600. VALUE has a high standard deviation, so there must be a wide range in the value of applicants' houses. The data look skewed to the right (or roughly normal), because the mean is higher than the median for almost all variables except DEBTINC. There appear to be outliers in MORTDUE, VALUE, and DEBTINC, and perhaps in LOAN as well.

In [18]:
# Check summary for categorical data - Hint: inside describe function you can use the argument include=['category']
# Remove ___________ and complete the code
hm.describe(include=['category']).T
Out[18]:
        count  unique      top  freq
BAD      5960       2        0  4771
REASON   5708       2  DebtCon  3928
JOB      5681       6    Other  2388

Insights: There are more 0s than 1s for BAD, and the most common reason is DebtCon,
with Other being the most frequent job category.

Let's look at the unique values in all the categorical variables

In [19]:
# Checking the count of unique values in each categorical column
# Remove ___________ and complete the code
cols_cat = hm.select_dtypes(['category'])
for i in cols_cat.columns:
    print('Unique values in', i, 'are :')
    print(cols_cat[i].value_counts())
    print('*' * 40)

Insights: HomeImp and DebtCon are the only two reasons for a loan, while JOB has six categories (Mgr, Office, Other, ProfExe, Sales, Self), with Other the most frequent.

Think about it: The results above gave the absolute count of unique values in each categorical column. Are absolute values a good measure? If not, what else can be used? Try implementing that.

Exploratory Data Analysis (EDA) and Visualization

Univariate Analysis

Univariate analysis is used to explore each variable in a data set separately. It looks at the range of values as well as the central tendency of the values. It can be done for both numerical and categorical variables.

1. Univariate Analysis - Numerical Data

Histograms and box plots help to visualize and describe numerical data. We use a box plot and histogram to analyze the numerical columns.

In [20]:
# While doing uni-variate analysis of numerical variables we want to study their central tendency and dispersion.
# Let us write a function that will help us create a boxplot and histogram for any input numerical variable.
# This function takes the numerical column as the input and returns the boxplot and histogram for the variable.
# Let us see if this helps us write faster and cleaner code.
def histogram_boxplot(feature, figsize=(15,10), bins=None):
    """
    Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (15,10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2,          # number of rows of the subplot grid = 2
                                           sharex=True,      # x-axis will be shared among all subplots
                                           gridspec_kw={"height_ratios": (.25, .75)},
                                           figsize=figsize)  # creating the 2 subplots
    sns.boxplot(feature, ax=ax_box2, showmeans=True, color='violet')  # boxplot; a star indicates the mean value of the column
    sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins) if bins else sns.distplot(feature, kde=False, ax=ax_hist2)  # histogram
    ax_hist2.axvline(np.mean(feature), color='green', linestyle='--')   # add mean to the histogram
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-')  # add median to the histogram

Using the above function, let's first analyze the Histogram and Boxplot for LOAN

In [21]:
# Build the histogram boxplot for LOAN
histogram_boxplot(data['LOAN'])

Insights: Skewed to the right; most of the loans are below 40,000; the median is around
17,000; the minimum loan is around 1,000; the middle range runs from about 11,000 to 16,000. There are a lot of outliers.

Note: As done above, analyze the Histogram and Boxplot for other variables

In [22]:
histogram_boxplot(data['MORTDUE'])

Insights: A lot of outliers; skewed to the right.

In [23]:
histogram_boxplot(data['VALUE'])
Insights: There seems to be less variation within the data, with a lot of outliers; the median value is …

In [24]:
histogram_boxplot(data['YOJ'])
Insights: Values are more spread out, with a few minimal outliers; skewed slightly to the right. 0 is the highest and most popular value, so maybe unemployed applicants?

In [25]:
histogram_boxplot(data['DEROG'])
Insights: 0 is the most common value; most applicants don't have any derogatory reports.

In [26]:
histogram_boxplot(data['DELINQ'])
Insights: Most applicants don't have any delinquent credit lines; there is no visible box in the boxplot.

In [27]:
histogram_boxplot(data['CLAGE'])
Insights: Skewed to the right; the outliers group together; the majority are under 400.

In [28]:
histogram_boxplot(data['NINQ'])
Insights: Few outliers; skewed to the right; the minimum/low values are the most common.

In [29]:
histogram_boxplot(data['CLNO'])
Insights: Somewhat normally distributed, with few outliers.

In [30]:
histogram_boxplot(data['DEBTINC'])
Insights: Less variation in the data / low dispersion.

2. Univariate Analysis - Categorical Data

In [31]:
# Function to create barplots that indicate the percentage for each category.
def perc_on_bar(plot, feature):
    '''
    plot: axis holding the bars
    feature: categorical feature
    the function won't work if a column is passed in the hue parameter
    '''
    total = len(feature)  # length of the column
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position of the annotation
        y = p.get_y() + p.get_height()            # y position of the annotation
        plot.annotate(percentage, (x, y), size=12)  # annotate the percentage
    plt.show()  # show the plot

Analyze Barplot for DELINQ

In [32]:
# Build barplot for DELINQ
plt.figure(figsize=(15,5))
ax = sns.countplot(data["DELINQ"], palette='winter')
perc_on_bar(ax, data["DELINQ"])

In [33]:
plt.figure(figsize=(15,5))
ax = sns.countplot(data["JOB"], palette='winter')
perc_on_bar(ax, data["JOB"])
Insights: Other is the most common category, with ProfExe and Office being close seconds.

In [34]:
plt.figure(figsize=(15,5))
ax = sns.countplot(data["REASON"], palette='winter')
perc_on_bar(ax, data["REASON"])
Insights: DebtCon is the most common reason, almost 2x HomeImp.

In [35]:
plt.figure(figsize=(15,5))
ax = sns.countplot(data["BAD"], palette='winter')
perc_on_bar(ax, data["BAD"])
Note: As done above, analyze barplots for the other categorical variables.

Insights: The majority repaid the loan.

Bivariate Analysis

Bivariate Analysis: Continuous and Categorical Variables

Analyze BAD vs LOAN

In [36]:
sns.boxplot(data["BAD"], data['LOAN'], palette="PuBu")
Out[36]: <AxesSubplot:xlabel='BAD', ylabel='LOAN'>

Insights: Those who defaulted on the loan seem to have a lower median than those who paid it off; the outliers are also lower.

In [37]:
sns.boxplot(data["BAD"], data['VALUE'], palette="PuBu")
Out[37]: <AxesSubplot:xlabel='BAD', ylabel='VALUE'>
Insights: The two groups look nearly identical in min and max; there seem to be more outliers (defaulted loans with a high property value) among defaulters. Since there isn't much difference in the medians, VALUE might not be that significant apart from the outliers.

In [38]:
sns.boxplot(data["BAD"], data['MORTDUE'], palette="PuBu")
Out[38]: <AxesSubplot:xlabel='BAD', ylabel='MORTDUE'>

Insights: More outliers among defaulted loans; the median is also lower compared to repaid loans.

In [39]:
sns.boxplot(data["BAD"], data['YOJ'], palette="PuBu")
Out[39]: <AxesSubplot:xlabel='BAD', ylabel='YOJ'>
Insights: Overall fewer years at the present job for defaulted loans, with more outliers as well.

In [40]:
sns.boxplot(data["BAD"], data['DELINQ'], palette="PuBu")
Out[40]: <AxesSubplot:xlabel='BAD', ylabel='DELINQ'>

Insights: A significant variable; those with delinquent credit lines are more likely to default on loans.

In [41]:
sns.boxplot(data["BAD"], data['CLAGE'], palette="PuBu")
Out[41]: <AxesSubplot:xlabel='BAD', ylabel='CLAGE'>
Insights: Those who repaid the loan have an older credit line on average.

In [42]:
sns.boxplot(data["BAD"], data['CLNO'], palette="PuBu")
Out[42]: <AxesSubplot:xlabel='BAD', ylabel='CLNO'>

Insights: Those with a notably higher number of credit lines are more likely to default on loans.

In [43]:
sns.boxplot(data["BAD"], data['NINQ'], palette="PuBu")
Out[43]: <AxesSubplot:xlabel='BAD', ylabel='NINQ'>
Insights: More recent credit inquiries are more common for defaulted loans.

Note: As shown above, perform bivariate analysis on different pairs of categorical and continuous variables.

Bivariate Analysis: Two Continuous Variables

In [44]:
sns.scatterplot(data["VALUE"], data['MORTDUE'], palette="PuBu")
Out[44]: <AxesSubplot:xlabel='VALUE', ylabel='MORTDUE'>

Insights: Positive, linear relationship.

In [45]:
sns.scatterplot(data["VALUE"], data['LOAN'], palette="PuBu")
Out[45]: <AxesSubplot:xlabel='VALUE', ylabel='LOAN'>

Insights: No relationship.

In [46]:
sns.scatterplot(data["VALUE"], data['MORTDUE'], palette="PuBu")
Out[46]: <AxesSubplot:xlabel='VALUE', ylabel='MORTDUE'>

In [47]:
sns.scatterplot(data["DELINQ"], data['MORTDUE'], palette="PuBu")
Out[47]: <AxesSubplot:xlabel='DELINQ', ylabel='MORTDUE'>
In [48]:
sns.scatterplot(data["VALUE"], data['DELINQ'], palette="PuBu")
Out[48]: <AxesSubplot:xlabel='VALUE', ylabel='DELINQ'>

In [49]:
sns.scatterplot(data["LOAN"], data['MORTDUE'], palette="PuBu")
Out[49]: <AxesSubplot:xlabel='LOAN', ylabel='MORTDUE'>
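The strength of the relationships eyeballed in these scatterplots can also be quantified with a correlation coefficient. A minimal sketch, reusing the first few VALUE/MORTDUE pairs from the head() output above (in the notebook you would simply call hm['VALUE'].corr(hm['MORTDUE']) on the full column, which drops NaNs pairwise):

```python
import pandas as pd

# Four VALUE/MORTDUE pairs copied from the head() output above
toy = pd.DataFrame({
    "VALUE":   [39025.0, 68400.0, 16700.0, 112000.0],
    "MORTDUE": [25860.0, 70053.0, 13500.0, 97800.0],
})

# Pearson correlation between two continuous columns
r = toy["VALUE"].corr(toy["MORTDUE"])
print(round(r, 3))  # close to 1: a strong positive, linear relationship
```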
In [50]:
sns.scatterplot(data["DELINQ"], data['CLNO'], palette="PuBu")
Out[50]: <AxesSubplot:xlabel='DELINQ', ylabel='CLNO'>

Note: As shown above, perform bivariate analysis on different pairs of continuous variables.

Insights: The groups sit in different locations; VALUE vs MORTDUE is the only significant relationship.

Bivariate Analysis: BAD vs Categorical Variables

The stacked bar chart (aka stacked bar graph) extends the standard bar chart from looking at numeric values across one categorical variable to two.

In [51]:
# Function to plot stacked bar charts for categorical columns
def stacked_plot(x):
    sns.set(palette='nipy_spectral')
    tab1 = pd.crosstab(x, data['BAD'], margins=True)
    print(tab1)
    print('-' * 120)
    tab = pd.crosstab(x, data['BAD'], normalize='index')
    tab.plot(kind='bar', stacked=True, figsize=(10,5))
    plt.legend(loc="upper left", bbox_to_anchor=(1,1))
    plt.show()

Plot stacked bar plot for BAD and REASON

In [52]:
# Plot stacked bar plot for BAD and REASON
stacked_plot(data['REASON'])
BAD         0     1   All
REASON
DebtCon  3183   745  3928
HomeImp  1384   396  1780
All      4567  1141  5708
------------------------------------------------------------------------------------------------------------------------

In [53]:
stacked_plot(data['JOB'])
BAD         0     1   All
JOB
Mgr       588   179   767
Office    823   125   948
Other    1834   554  2388
ProfExe  1064   212  1276
Sales      71    38   109
Self      135    58   193
All      4515  1166  5681
------------------------------------------------------------------------------------------------------------------------

Insights: More of the defaulted loans are for DebtCon than for HomeImp. Those with a Sales job have the highest default rate, with the self-employed a close second; Office workers are the least likely to default.

Multivariate Analysis

Analyze Correlation Heatmap for Numerical Variables

In [90]:
# Separating numerical variables
numerical_col = data.select_dtypes(include=np.number).columns.tolist()

# Build correlation matrix for numerical columns
# Remove ___________ and complete the code
corr = hm[numerical_col].corr()

# plot the heatmap
# Remove ___________ and complete the code
plt.figure(figsize=(16,12))
sns.heatmap(corr, cmap='coolwarm', vmax=1, vmin=-1, fmt=".2f",
            xticklabels=corr.columns, yticklabels=corr.columns);
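The heatmap only covers the numeric columns. For the association between a categorical predictor and BAD, a chi-square test of independence is one option; a sketch using scipy.stats (already imported in this notebook) on the JOB-vs-BAD counts printed in the crosstab above:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# JOB vs BAD contingency table, copied from the crosstab output above
tab = pd.DataFrame(
    {0: [588, 823, 1834, 1064, 71, 135],
     1: [179, 125, 554, 212, 38, 58]},
    index=["Mgr", "Office", "Other", "ProfExe", "Sales", "Self"],
)

chi2, p, dof, expected = chi2_contingency(tab)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}, dof = {dof}")
```

A very small p-value here would support the earlier observation that default rates differ across job categories (e.g., Sales vs Office).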
In [55]:
# Build pairplot for the data with hue = 'BAD'
# Remove ___________ and complete the code
g = sns.pairplot(hm, hue='BAD')
Think about it: Are there missing values and outliers in the dataset? If yes, how can you treat them? Can you think of different ways in which this can be done, and when to treat these outliers or not? Can we create new features based on missing values?

Treating Outliers

In [91]:
def treat_outliers(df, col):
    '''
    Treats outliers in a variable
    df: data frame
    col: str, name of the numerical column
    '''
    Q1 = df[col].quantile(0.25)  # 25th quantile
    Q3 = df[col].quantile(0.75)  # 75th quantile
    IQR = Q3 - Q1                # inter-quartile range
    Lower_Whisker = Q1 - 1.5 * IQR  # define lower whisker
    Upper_Whisker = Q3 + 1.5 * IQR  # define upper whisker
    # values smaller than Lower_Whisker are clipped to Lower_Whisker,
    # and values above Upper_Whisker are clipped to Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df

def treat_outliers_all(df, col_list):
    '''
    Treat outliers in all numerical variables
    df: data frame
    col_list: list of numerical columns
    '''
    for c in col_list:
        df = treat_outliers(df, c)
    return df

In [92]:
df_raw = data.copy()
numerical_col = df_raw.select_dtypes(include=np.number).columns.tolist()  # getting the list of numerical columns
df = treat_outliers_all(df_raw, numerical_col)

Adding new columns in the dataset for each column which has missing values

In [93]:
# For each column we create a binary flag: 1 if the row has a missing value in that column, else 0.
def add_binary_flag(df, col):
    '''
    df: the dataframe
    col: a column which has missing values
    It returns a dataframe which has a binary flag for missing values in column col
    '''
    new_col = str(col) + '_missing_values_flag'
    df[new_col] = df[col].isna().astype(int)
    return df

In [94]:
# list of columns that have missing values in them
missing_col = [col for col in df.columns if df[col].isnull().any()]
for colmn in missing_col:
    add_binary_flag(df, colmn)

Filling missing values in numerical columns with the median and in categorical columns with the mode

In [100]:
# Treat missing values in numerical columns with the median and in categorical columns with the mode
# Select numeric columns.
num_data = data.select_dtypes('number')
# Select categorical columns.
cat_data = data.select_dtypes(include=['object', 'category']).columns.tolist()

# Fill numeric columns with the median.
# Remove _________ and complete the code
data[num_data.columns] = num_data.fillna(num_data.median())

# Fill categorical columns with the mode.
# Remove _________ and complete the code
for column in cat_data:
    mode = data[column].mode()[0]
    data[column] = data[column].fillna(mode)

Proposed approach
1. Potential techniques - What different techniques should be explored?
2. Overall solution design - What is the potential solution design?
3. Measures of success - What are the key measures of success?

Separating the target variable from other variables

In [108]:
# Drop the dependent variable from the dataframe and create the X (independent variables) matrix
# Remove _________ and complete the code
X = data.drop(columns='BAD')

# Create dummy variables for the categorical variables - Hint: use the get_dummies() function
# Remove _________ and complete the code
X = pd.get_dummies(X, columns=['JOB', 'REASON'], drop_first=True)

# Create y (dependent variable)
y = hm.iloc[:, 0].values
print(y)
[1, 1, 1, 1, 0, ..., 0, 0, 0, 0, 0]
Length: 5960
Categories (2, int64): [0, 1]

In [ ]:
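With X and y in hand, a natural next step under the proposed approach above is a stratified train/test split and a simple baseline such as logistic regression, scored with recall, since missing a defaulter is typically the costlier error. A sketch on synthetic data (the real notebook would pass the X and y built above; the shapes, seed, and signal here are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical stand-in for the dummified X and the target y built above
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Stratified split preserves the class balance (for HMEQ, the ~20% BAD rate)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

model = LogisticRegression().fit(X_train, y_train)
rec = recall_score(y_test, model.predict(X_test))
print(f"test recall: {rec:.2f}")
```

The same split-and-score pattern applies to the DecisionTreeClassifier, BaggingClassifier, and RandomForestClassifier imported at the top, with GridSearchCV available for tuning.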