Overview and Rationale
Spark is intended for processing the large data sets held in data lakes, which were discussed previously, so it is important to be able to use it effectively on such data. This assignment will give you hands-on practice using Spark to analyze a large data set.
Assignment Summary
For this assignment, you will download two of the following datasets and process them with Spark.
I am sharing some resources with you, but feel free to pick your own problem/dataset.
- https://www.data.gov/cities/
- https://data.boston.gov
- https://opendata.cityofnewyork.us
- census.gov
- https://www.cdc.gov/datastatistics/index.html
- https://www.bls.gov/data/
Write a 3-5 page report that includes a section for each data set you choose to analyze. For each data set, include:
- A description of the steps you took to perform the analysis, with screenshots
- Results of your analysis
- Your insights based on your analysis
Format & Guidelines
The paper should use the following format:
(i) Introduction
Provide a short description of the dataset you analyzed and the purpose of the analysis. Identify the questions you are attempting to answer or the insights you want to gain from the analysis.
(ii) Analysis and results
Outline your steps, with screenshots, and provide the results of your analysis. Connect the results and your analysis to the purpose described in the introduction. Be specific.
(iii) Insights
Provide your insights based on your analysis. Connect your insights to the purpose of the analysis.