Question

decision_trees.ipynb [Part 2]

1. Pima Indian Diabetes Dataset

The Pima Indians Diabetes Data Set was developed by the United States National Institute of Diabetes and Digestive and Kidney Diseases.

Astonishingly, over 30% of Pima people develop diabetes. In contrast, the diabetes rate in the United States is 8.3% and in China it is 4.2%.

Each instance in the dataset represents a Pima woman over the age of 21 and belongs to one of two classes: a person who developed diabetes within five years, or a person who did not. There are eight attributes in addition to the column representing whether or not they developed diabetes:

The number of times the woman was pregnant
Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Diastolic blood pressure (mm Hg)
Triceps skin fold thickness (mm)
2-Hour serum insulin (mu U/ml)
Body mass index (weight in kg/(height in m)^2)
Diabetes pedigree function
Age
Whether they got diabetes or not (0 = no, 1 = yes)

We are trying to predict whether they got diabetes or not based on the features.

The CSV file is at

https://raw.githubusercontent.com/yew1eb/machine-learning/master/Naive-bayes/pima-indians-diabetes.data.csv

This file does not have a header row.
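Because the file has no header row, the column names have to be supplied when loading it. The following is a minimal sketch, assuming pandas is available; the names in `PIMA_COLUMNS` are hypothetical labels chosen here to follow the order of the attribute list above and do not come from the file itself.

```python
import io

import pandas as pd

PIMA_URL = (
    "https://raw.githubusercontent.com/yew1eb/machine-learning/"
    "master/Naive-bayes/pima-indians-diabetes.data.csv"
)

# Hypothetical column names, in the same order as the attribute list;
# the CSV itself carries no names.
PIMA_COLUMNS = [
    "pregnancies", "glucose", "blood_pressure", "skin_thickness",
    "insulin", "bmi", "pedigree", "age", "diabetes",
]


def load_pima(source=PIMA_URL):
    """Read the header-less CSV from a URL, a path, or a file-like object."""
    # header=None stops pandas from consuming the first data row as a header.
    return pd.read_csv(source, header=None, names=PIMA_COLUMNS)
```

Calling `load_pima()` with no argument fetches the file from the URL above; passing a local path or an `io.StringIO` object is useful for trying the function without network access.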

You will need to:

load the file into a dataframe
divide the data into training and test sets (an 80-20 split sounds good)
train a decision tree classifier on the training data
display the tree
run the classifier on the test data
compute the accuracy
write a short paragraph describing the results
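The steps after loading can be sketched as a single helper, assuming scikit-learn is installed. The function name `fit_and_score_tree`, the `seed` parameter, and the choice of a stratified split with default tree hyperparameters are illustrative assumptions, not requirements of the assignment.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text


def fit_and_score_tree(df, label_col, test_size=0.2, seed=0):
    """Split df, train a decision tree, and return (model, tree text, accuracy)."""
    X = df.drop(columns=[label_col])
    y = df[label_col]

    # 80-20 split by default; stratify keeps the class ratio similar in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )

    # Default hyperparameters: the tree grows until its leaves are pure.
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_train, y_train)

    # Plain-text rendering of the learned splits, suitable for print().
    tree_text = export_text(clf, feature_names=list(X.columns))

    # Accuracy is the fraction of test rows whose predicted label is correct.
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    return clf, tree_text, accuracy
```

Printing the returned tree text shows the tree in a notebook cell; `sklearn.tree.plot_tree` is an alternative when a graphical view is preferred. A fully grown tree tends to overfit, so comparing the test accuracy against a run with a limit such as `max_depth` can give the results paragraph something concrete to discuss.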

Good luck!


2. The Wisconsin Cancer Dataset

The task is to predict whether a tumor is malignant or benign (the second column of the dataset) based on 30 real-valued features.

The data file is

https://raw.githubusercontent.com/zacharski/ml-class/master/data/wdbc.data

And a writeup about the data is at:

https://raw.githubusercontent.com/zacharski/ml-class/master/data/wdbc.names

Follow the same steps as above.
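Loading this file differs slightly from the first dataset because the label is in the second column. The sketch below assumes the file follows the usual WDBC layout (no header row, an ID in the first column, the diagnosis in the second, then 30 numeric columns); check the writeup linked above before relying on it. The column names `id`, `diagnosis`, and `feature_1` to `feature_30` are placeholders introduced here.

```python
import io

import pandas as pd

WDBC_URL = "https://raw.githubusercontent.com/zacharski/ml-class/master/data/wdbc.data"


def load_wdbc(source=WDBC_URL):
    """Read the WDBC file and return the diagnosis column plus 30 features."""
    # Assumed layout: ID, diagnosis, then 30 real-valued measurements.
    names = ["id", "diagnosis"] + [f"feature_{i}" for i in range(1, 31)]
    df = pd.read_csv(source, header=None, names=names)

    # The ID identifies a record and carries no predictive signal, so drop it.
    return df.drop(columns=["id"])
```

The returned dataframe uses `diagnosis` as the label column, so the same split, train, display, and accuracy steps apply unchanged.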

