decision_trees.ipynb [Part 2]
1. Pima Indian Diabetes Dataset
The Pima Indians Diabetes Data Set was developed by the United States National Institute of Diabetes and Digestive and Kidney Diseases.
Astonishingly, over 30% of Pima people develop diabetes. In contrast, the diabetes rate in the United States is 8.3% and in China it is 4.2%.
Each instance in the dataset represents a Pima woman over the age of 21 who belongs to one of two classes: a person who developed diabetes within five years, or a person who did not. There are eight attributes in addition to the column representing whether or not they developed diabetes:
- The number of times the woman was pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age
- Whether they got diabetes or not (0 = no, 1 = yes)
We are trying to predict whether they got diabetes or not based on the features.
The CSV file is at
https://raw.githubusercontent.com/yew1eb/machine-learning/master/Naive-bayes/pima-indians-diabetes.data.csv
This file does not have a header row.
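Because the file has no header row, the column names have to be supplied when it is read. The following is a minimal sketch, assuming pandas is available. The short column names and the `load_pima` helper are hypothetical, chosen here to follow the order of the attribute list above, and the two sample rows are made-up values so the sketch runs offline.

```python
import io
import pandas as pd

# Hypothetical short names, in the same order as the attribute list above.
PIMA_COLUMNS = [
    "pregnancies", "glucose", "blood_pressure", "skin_thickness",
    "insulin", "bmi", "pedigree", "age", "diabetes",
]

def load_pima(source):
    """Read the headerless CSV from a URL, a path, or a file-like object."""
    # header=None stops pandas from treating the first data row as a header.
    return pd.read_csv(source, header=None, names=PIMA_COLUMNS)

# Two made-up rows in the nine-column layout (not real records).
sample = io.StringIO(
    "1,100,70,20,80,25.0,0.3,30,0\n"
    "2,150,80,30,120,33.5,0.6,45,1\n"
)
df = load_pima(sample)
print(df.shape)
```

Passing the URL given above to `load_pima` in place of `sample` should load the full file in the same way.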
You will need to:
- load the file into a dataframe
- divide the data into training and test sets (an 80-20 split sounds good)
- train a decision tree classifier on the training data
- display the tree
- run the classifier on the test data
- compute the accuracy
- write a short paragraph describing the results
Good luck!
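The steps listed above can be sketched roughly as follows, assuming pandas and scikit-learn are installed. The helper name `train_and_evaluate`, its parameters, and the choice of `export_text` for displaying the tree are assumptions made for illustration. It is one possible way to do the exercise, not the expected solution.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

def train_and_evaluate(df, label_col, test_size=0.2, random_state=0):
    """Split the data, fit a decision tree, and score it on the test set.

    Returns (classifier, text rendering of the tree, test accuracy).
    """
    X = df.drop(columns=[label_col])
    y = df[label_col]

    # 80-20 split by default; random_state makes the split repeatable.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    clf = DecisionTreeClassifier(random_state=random_state)
    clf.fit(X_train, y_train)

    # Plain-text view of the learned splits.
    tree_text = export_text(clf, feature_names=[str(c) for c in X.columns])

    # Fraction of test rows whose predicted class matches the true class.
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    return clf, tree_text, accuracy
```

With the diabetes data in a dataframe `df` whose label column is named, say, `"diabetes"`, calling `train_and_evaluate(df, "diabetes")` and printing the returned tree text and accuracy covers every step except the written paragraph. An unrestricted tree can grow very large, so passing `max_depth` to `DecisionTreeClassifier` may make the display easier to read.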
[].....
[].....
[]....
2. The Wisconsin Cancer Dataset
The task is to predict whether a tumor is malignant or benign (the second column of the dataset) based on 30 real values.
The data file is
https://raw.githubusercontent.com/zacharski/ml-class/master/data/wdbc.data
And a writeup about the data is at:
https://raw.githubusercontent.com/zacharski/ml-class/master/data/wdbc.names
Follow the same steps as above.
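One difference from the first dataset is that the label sits in the second column rather than the last. The sketch below shows one way to separate it from the features, assuming pandas, and assuming (as the wdbc.names writeup appears to describe) that the first column is an ID number that should not be used as a feature. The `load_wdbc` helper is hypothetical, and the sample rows are made up and shortened to three feature columns so the sketch runs offline.

```python
import io
import pandas as pd

def load_wdbc(source):
    """Read the headerless WDBC file and return (features, labels)."""
    df = pd.read_csv(source, header=None)
    y = df[1]                    # second column: "M" (malignant) or "B" (benign)
    X = df.drop(columns=[0, 1])  # drop the assumed ID column and the label
    return X, y

# Made-up rows: ID, diagnosis, then only three of the real-valued columns.
sample = io.StringIO(
    "101,M,17.9,10.3,122.8\n"
    "102,B,13.5,14.3,87.4\n"
)
X, y = load_wdbc(sample)
print(X.shape, y.tolist())
```

The returned `X` and `y` can then go through the same split, fit, display, and accuracy steps as the diabetes data.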
[]......
[].....