11_6 Class 22 In-Class Assignment - Colaboratory

In this notebook, you will learn how to create a decision tree for a classification prediction problem, i.e., to predict a categorical outcome. In particular, you will be given a dataset with information about used cars and the price category they ultimately sell for. The goal is to predict whether the price of a used car is in the high price category or not. Therefore, the outcome variable is High_Price, where the positive class means that the car sold for a high price, and the negative class means it did not sell for a high price.

Decision Trees: Classification

We will take the following steps in this notebook:

1. Install and import the packages
2. Read and inspect the data
3. Prepare the data for predictive modeling
4. Fit and evaluate the classification decision tree model
5. Tune the model to improve the prediction accuracy
6. Evaluate and compare models using an ROC curve and the AUC score

Step 1: Install and import packages

```python
!pip install dmba
```

```
Requirement already satisfied: dmba in /usr/local/lib/python3.10/dist-packages (0.2.4)
... (the remaining "Requirement already satisfied" lines for dmba's dependencies, such as graphviz, matplotlib, numpy, pandas, scikit-learn, and scipy, are omitted here)
```
```python
# the usual packages
import pandas as pd
from sklearn.model_selection import train_test_split

# decision tree algorithm
from sklearn.tree import DecisionTreeClassifier

# tree visualization and model evaluation
from dmba import classificationSummary, plotDecisionTree

# model evaluation and roc curve
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
import matplotlib.pylab as plt
```

Step 2: Read and inspect data

```python
df = pd.read_csv('https://drive.google.com/uc?id=1_4DNYK67qT0W8J5q7tmNwMBmyHUc3W2h')
```

How many rows and columns are there? What are the columns and data types? Are there any missing values?

```python
# Get the number of rows and columns
num_rows, num_columns = df.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")
```

```
Number of rows: 1436
Number of columns: 39
```

```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1436 entries, 0 to 1435
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Id                 1436 non-null   int64
 1   Model              1436 non-null   object
 2   Age_08_04          1436 non-null   int64
 3   Mfg_Month          1436 non-null   int64
 4   Mfg_Year           1436 non-null   int64
 5   KM                 1436 non-null   int64
 6   Fuel_Type          1436 non-null   object
 7   HP                 1436 non-null   int64
 8   Met_Color          1436 non-null   int64
 9   Color              1436 non-null   object
 10  Automatic          1436 non-null   int64
 11  CC                 1436 non-null   int64
 12  Doors              1436 non-null   int64
 13  Cylinders          1436 non-null   int64
 14  Gears              1436 non-null   int64
 15  Quarterly_Tax      1436 non-null   int64
 16  Weight             1436 non-null   int64
 17  Mfr_Guarantee      1436 non-null   int64
 18  BOVAG_Guarantee    1436 non-null   int64
 19  Guarantee_Period   1436 non-null   int64
 20  ABS                1436 non-null   int64
 21  Airbag_1           1436 non-null   int64
 22  Airbag_2           1436 non-null   int64
 23  Airco              1436 non-null   int64
 24  Automatic_airco    1436 non-null   int64
 25  Boardcomputer      1436 non-null   int64
 26  CD_Player          1436 non-null   int64
 27  Central_Lock       1436 non-null   int64
 28  Powered_Windows    1436 non-null   int64
 29  Power_Steering     1436 non-null   int64
 30  Radio              1436 non-null   int64
 31  Mistlamps          1436 non-null   int64
 32  Sport_Model        1436 non-null   int64
 33  Backseat_Divider   1436 non-null   int64
 34  Metallic_Rim       1436 non-null   int64
 35  Radio_cassette     1436 non-null   int64
 36  Parking_Assistant  1436 non-null   int64
 37  Tow_Bar            1436 non-null   int64
 38  High_Price         1436 non-null   object
dtypes: int64(35), object(4)
memory usage: 437.7+ KB
```

The outcome variable is High_Price. What is the distribution of possible values for this variable?

```python
# Check the distribution of values for the "High_Price" variable
high_price_distribution = df['High_Price'].value_counts()

# Display the distribution
print(high_price_distribution)
```

```
No     1073
Yes     363
Name: High_Price, dtype: int64
```

Step 3: Prepare data for modeling

We have cleaned the dataset for you, so there are no missing values or duplicate rows to remove (a quick way to confirm this yourself is sketched below).
We need to remove the column Id, as it is an index and does not provide any information. We do this using the drop method.

```python
# Your code here
# Step 3: Prepare data for modeling

# Remove the "Id" column from the DataFrame
df = df.drop(columns=['Id'])

# Verify that the "Id" column has been removed
print(df.head())
```

```
                                           Model  Age_08_04  Mfg_Month  \
0  TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors         23         10
1  TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors         23         10
2  TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors         24          9
3  TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors         26          7
4    TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors         30          3

   Mfg_Year     KM Fuel_Type  HP  Met_Color   Color  Automatic  ...  \
0      2002  46986    Diesel  90          1    Blue          0  ...
1      2002  72937    Diesel  90          1  Silver          0  ...
2      2002  41711    Diesel  90          1    Blue          0  ...
3      2002  48000    Diesel  90          0   Black          0  ...
4      2002  38500    Diesel  90          0   Black          0  ...

   Power_Steering  Radio  Mistlamps  Sport_Model  Backseat_Divider  \
0               1      0          0            0                 1
1               1      0          0            0                 1
2               1      0          0            0                 1
3               1      0          0            0                 1
4               1      0          1            0                 1

   Metallic_Rim  Radio_cassette  Parking_Assistant  Tow_Bar  High_Price
0             0               0                  0        0         Yes
1             0               0                  0        0         Yes
2             0               0                  0        0         Yes
3             0               0                  0        0         Yes
4             0               0                  0        0         Yes

[5 rows x 38 columns]
```

Convert the outcome variable from text to numbers.

```python
# Your code here
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder instance
label_encoder = LabelEncoder()

# Fit the LabelEncoder to the "High_Price" column and transform the values
df['High_Price'] = label_encoder.fit_transform(df['High_Price'])

# Verify the conversion
print(df['High_Price'].unique())
```

```
[1 0]
```

Convert categorical variables to dummy variables. Note that multicollinearity is not an issue in decision trees. Therefore, when we create dummy variables, it is not necessary to drop one of them; in fact, we want to keep all of them.

```python
df = pd.get_dummies(df, drop_first=False)
```
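To see what drop_first controls, here is a toy illustration on hypothetical data (not the cars dataset):

```python
# Hypothetical example: with drop_first=False, every category keeps its own dummy column
toy = pd.DataFrame({'Color': ['Blue', 'Red', 'Blue']})

print(pd.get_dummies(toy, drop_first=False))  # columns: Color_Blue, Color_Red
print(pd.get_dummies(toy, drop_first=True))   # column: Color_Red only
```

Keeping all dummies lets the tree split directly on any category, rather than having to infer a dropped category from the absence of the others.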
Create y (the response variable) and X (a matrix of predictor variables).

```python
# Your code here
# Define the response variable (y) and the predictor variables (X)
y = df['High_Price']                 # Response variable
X = df.drop(columns=['High_Price'])  # Predictor variables

# Verify the separation
print("Response Variable (y):")
print(y.head())
print("\nPredictor Variables (X):")
print(X.head())
```

```
Response Variable (y):
0    1
1    1
2    1
3    1
4    1
Name: High_Price, dtype: int64

Predictor Variables (X):
   Age_08_04  Mfg_Month  Mfg_Year     KM  HP  Met_Color  Automatic    CC  \
0         23         10      2002  46986  90          1          0  2000
1         23         10      2002  72937  90          1          0  2000
2         24          9      2002  41711  90          1          0  2000
3         26          7      2002  48000  90          0          0  2000
4         30          3      2002  38500  90          0          0  2000

   Doors  Cylinders  ...  Color_Beige  Color_Black  Color_Blue  Color_Green  \
0      3          4  ...            0            0           1            0
1      3          4  ...            0            0           0            0
2      3          4  ...            0            0           1            0
3      3          4  ...            0            1           0            0
4      3          4  ...            0            1           0            0

   Color_Grey  Color_Red  Color_Silver  Color_Violet  Color_White  \
0           0          0             0             0            0
1           0          0             1             0            0
2           0          0             0             0            0
3           0          0             0             0            0
4           0          0             0             0            0

   Color_Yellow
0             0
1             0
2             0
3             0
4             0

[5 rows x 419 columns]
```

Partition the dataset into training and test sets. Specify the test size as 0.2 and random_state as 1.

```python
# Your code here
from sklearn.model_selection import train_test_split

# Partition the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Verify the shapes of the training and test sets
print("Training set shape (X_train, y_train):", X_train.shape, y_train.shape)
print("Test set shape (X_test, y_test):", X_test.shape, y_test.shape)
```

```
Training set shape (X_train, y_train): (1148, 419) (1148,)
Test set shape (X_test, y_test): (288, 419) (288,)
```

Step 4: Classification tree model

Let's start by building a simple classification tree. A detailed description of DecisionTreeClassifier() can be found in the scikit-learn documentation. As with previous methods, we first initialize the model. We are using DecisionTreeClassifier(). We'll name the object we are going to work with classtree.

A simple model:

```python
classtree = DecisionTreeClassifier(random_state=1)
```

Next, fit the model on the training set.

```python
# Fit the classification tree model on the training set
classtree.fit(X_train, y_train)
```
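The available preview of the notebook ends here. As a rough sketch only (not the assignment's actual solution), Steps 4 through 6 might continue along these lines, using the functions imported in Step 1; the tuning parameters below (max_depth=4, min_samples_split=20) are illustrative assumptions, not values from the notebook:

```python
# Hedged sketch of the remaining steps; parameter values are illustrative assumptions.

# Step 4 (continued): confusion matrices for the full tree
classificationSummary(y_train, classtree.predict(X_train))  # training performance
classificationSummary(y_test, classtree.predict(X_test))    # test performance

# Step 5: tune the tree by limiting its growth (hypothetical settings)
smalltree = DecisionTreeClassifier(max_depth=4, min_samples_split=20, random_state=1)
smalltree.fit(X_train, y_train)
classificationSummary(y_test, smalltree.predict(X_test))

# Visualize the smaller tree
plotDecisionTree(smalltree, feature_names=X_train.columns)

# Step 6: compare the two models with ROC curves and AUC scores
for name, model in [('full tree', classtree), ('tuned tree', smalltree)]:
    proba = model.predict_proba(X_test)[:, 1]   # P(High_Price = 1)
    fpr, tpr, _ = roc_curve(y_test, proba)
    print(f'{name} AUC: {roc_auc_score(y_test, proba):.3f}')
    plt.plot(fpr, tpr, label=name)

plt.plot([0, 1], [0, 1], 'k--')   # chance diagonal
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```

A deep, fully grown tree typically fits the training set perfectly but generalizes worse than a constrained tree, which is why Step 5 limits depth before the ROC comparison in Step 6.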