
Question

Please help me solve this in Python using only the MRJob package.

# Homework 3 – MapReduce

## Problem Statement:
We are greatly inspired by the [Consumer Complaints](https://github.com/InsightDataScience/consumer_complaints) challenge from the popular InsightDataScience. In fact, we are going to tackle the same challenge but using MapReduce. Please read through the challenge (the most important sections for us are "Input dataset" and "Expected output").

## Requirements:
1. You must perform your computations using only Python and the MRJob package that we use in class. No external packages, e.g. pandas, are allowed.
2. Your code must be able to run as a stand-alone MRJob application.

## INPUT:
Your code will be evaluated against a sample of the original data set (in CSV format) downloaded from: [https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data](https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data)

The original data set is roughly 1GB but the sample file is only 4MB, and is available on our class resources under Data Sets > complaints_sample.csv. You can use this file for testing your code within a notebook if you prefer.

**NOTE:** This CSV file contains multiple-line records. Please pay attention to this when reading the data.

## OUTPUT:
You are required to write to the standard output in CSV format. Basically, you have to organize each of your records as a CSV row when you output from your MRJob application. The output does not have to contain the header line.

## SUBMISSION:
The final hand-in should be a single file, named `BDM_HW3_LastName.py` that takes exactly 1 argument for the input path. Output will be handled through redirection.

## SAMPLE RUN:
```
python BDM_HW3_LastName.py complaints_sample.csv > output.csv
```
Expert Solution
Step 1

Python Programming:

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability through its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language because of its comprehensive standard library.

Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released in 2000, introduced features such as list comprehensions and a cycle-detecting garbage collector that supplements reference counting.

Python 3.0, released in 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3.

The Python 2 language was officially discontinued in 2020 (originally planned for 2015), and "Python 2.7.18 is the last Python 2.7 release and therefore the last Python 2 release." No further security patches or other improvements will be released for it. With Python 2's end of life, only Python 3.6.x and later were supported at the time of writing.
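For the homework itself, the map and reduce logic can be sketched in plain standard-library Python before wrapping it in an MRJob class. The sketch below is not the official solution. It assumes the column names `Date received`, `Product`, and `Company`, dates in `YYYY-MM-DD` form, and (as I read the linked challenge) one output row per product and year containing the total complaints, the number of distinct companies, and the rounded percentage held by the most-complained-about company. The function names and the sample data are made up for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def field(row, name):
    """Return a cleaned, lower-cased field, or '' if it is missing."""
    return (row.get(name) or "").strip().lower()


def map_records(fileobj):
    """Map step: yield ((product, year), company) for each complaint.

    csv.DictReader consumes the file object itself, so a quoted field that
    spans several physical lines is still parsed as a single record.
    """
    for row in csv.DictReader(fileobj):
        product = field(row, "Product")
        year = field(row, "Date received")[:4]  # assumes YYYY-MM-DD dates
        company = field(row, "Company")
        if product and year and company:
            yield (product, year), company


def reduce_group(key, companies):
    """Reduce step: summarize all companies seen for one (product, year)."""
    counts = Counter(companies)
    total = sum(counts.values())
    top_share = round(100 * max(counts.values()) / total)
    return [key[0], key[1], total, len(counts), top_share]


def to_csv_line(fields):
    """Format one record as a CSV row, quoting fields that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue().rstrip("\r\n")


def run_local(fileobj):
    """Simulate the shuffle locally: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for key, company in map_records(fileobj):
        groups[key].append(company)
    return [to_csv_line(reduce_group(key, groups[key])) for key in sorted(groups)]


# Tiny made-up sample; the first record's narrative spans two lines.
SAMPLE = (
    "Date received,Product,Consumer complaint narrative,Company\n"
    '2019-09-24,Debt collection,"Called me\nrepeatedly",Acme Collections\n'
    "2019-10-01,Debt collection,,Other Corp\n"
    "2019-11-05,Debt collection,,Acme Collections\n"
    '2020-01-06,"Credit reporting, credit repair services",,Big Bureau\n'
)

for line in run_local(io.StringIO(SAMPLE, newline="")):
    print(line)
```

To turn this into the required stand-alone MRJob script, one possible arrangement (untested here) is to put the body of `map_records` in `mapper_raw(self, input_path, input_uri)`, which recent mrjob versions provide for handing a whole input file to the mapper, and to have the reducer yield `None, to_csv_line(...)` with `OUTPUT_PROTOCOL = RawValueProtocol` so that only the CSV row is written to standard output. Plain line-by-line `mapper` input would split the multi-line records mentioned in the NOTE, which is why the sketch parses from a file object instead.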
