- Download the CSV version of the NYPD Complaint Data Historic dataset and rename it as `crime.csv`.
- Put it on the Hadoop File System using `hdfs dfs -put crime.csv`.
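Assuming HDFS is already configured and the dataset was exported from NYC Open Data (the downloaded filename `rows.csv` below is an assumption), the setup might look like:

```bash
# Rename the exported CSV; the original filename is an assumption.
mv rows.csv crime.csv

# Copy it into the current user's HDFS home directory.
hdfs dfs -put crime.csv .
```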
For convenience, we wrote a bash script `src/submit.sh` to submit PySpark jobs.
- `cd` into the `src` folder.
- Run `bash submit.sh jobname`, where `jobname` is the filename without the suffix. For example, to submit `all_days.py` to PySpark, just run `bash submit.sh all_days`.
- Enjoy the results!
The script will:

- Clear the corresponding output folder in HDFS if a previous output folder exists.
- Submit the PySpark job.
- Get the merged output file and save it to `results/jobname.out`.
- Print the first 100 rows of the output file.
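A minimal sketch of how `submit.sh` might implement these steps; the HDFS output-folder naming (`jobname.out`) and the bare `spark-submit` invocation are assumptions, not the script's exact contents:

```bash
#!/bin/bash
jobname=$1

# 1. Clear the corresponding output folder in HDFS if it exists.
hdfs dfs -rm -r -f "${jobname}.out"

# 2. Submit the PySpark job (run from the src folder).
spark-submit "${jobname}.py"

# 3. Merge the HDFS part files into a single local output file.
hdfs dfs -getmerge "${jobname}.out" "../results/${jobname}.out"

# 4. Print the first 100 rows of the output file.
head -n 100 "../results/${jobname}.out"
```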
For the scripts checking types and validity (named `check_*.py`), each row of the output is tab-separated with the format:

```
base_type	semantic_type	label
```
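To take a quick look at the labeled rows, the check outputs can be inspected from the shell (paths assume you are in the repository root, per the layout above):

```bash
# Print the first few tab-separated rows of each check job's output.
head -n 5 results/check_*.out
```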
We wrote a bash script `src/count_labels.sh` to count the numbers of NULL/VALID/INVALID instances for all columns.
- `cd` into the `src` folder.
- Run `bash count_labels.sh`.
- Enjoy the results!
The script will:

- Count the numbers of NULL/VALID/INVALID instances for all columns from all `results/check_*.out` files.
- Save the counts to `results/count_labels.out`.
- Print the output file.
For each job, the output format will be:

```
jobname
number of NULL
number of VALID
number of INVALID
```
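As an illustration, the counting could be done as below; this is a sketch under the assumption that the label is the third tab-separated field, not the actual contents of `count_labels.sh`:

```bash
#!/bin/bash
# For each check job's output, print the job name followed by the
# counts of NULL, VALID, and INVALID labels (third tab-separated field).
for f in ../results/check_*.out; do
    basename "$f" .out
    awk -F'\t' '{ n[$3]++ }
        END { print n["NULL"]+0; print n["VALID"]+0; print n["INVALID"]+0 }' "$f"
done | tee ../results/count_labels.out
```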
The following packages were used:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import date
from project_env import plots, split_years
```
Run these files from a folder that also contains the PySpark output files in a subfolder called `results`. To run the visualizations of yearly data in `Yearly_Crime.ipynb`, the Python file `project_env.py` must also be in the same folder as the notebook.
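Under these assumptions, the folder containing the notebook should look roughly like this (the specific `.out` files depend on which jobs were run):

```
Yearly_Crime.ipynb
project_env.py
results/
    all_days.out
    check_*.out
    count_labels.out
```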