- Download the CSV version of the NYPD Complaint Data Historic dataset and rename it as `crime.csv`.
- Put it on the Hadoop File System using `hdfs dfs -put crime.csv`.
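Assuming HDFS is already configured and the dataset was exported from NYC Open Data (the downloaded filename `rows.csv` below is an assumption), the setup might look like:

```bash
# Rename the exported CSV; the original filename is an assumption.
mv rows.csv crime.csv

# Copy it into the current user's HDFS home directory.
hdfs dfs -put crime.csv .
```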
For convenience, we wrote a bash script `src/submit.sh` to submit PySpark jobs.
- `cd` into the `src` folder.
- Run `bash submit.sh jobname`, where `jobname` is the filename without the suffix. For example, to submit `all_days.py` to PySpark, just run `bash submit.sh all_days`.
- Enjoy the results!
The script will:

- Clear the corresponding output folder in HDFS if a previous output folder exists.
- Submit the PySpark job.
- Get the merged output file and save it to `results/jobname.out`.
- Print the first 100 rows of the output file.
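A minimal sketch of how `submit.sh` might implement these steps; the HDFS output-folder naming (`jobname.out`) and the bare `spark-submit` invocation are assumptions, not the script's exact contents:

```bash
#!/bin/bash
jobname=$1

# 1. Clear the corresponding output folder in HDFS if it exists.
hdfs dfs -rm -r -f "${jobname}.out"

# 2. Submit the PySpark job (run from the src folder).
spark-submit "${jobname}.py"

# 3. Merge the HDFS part files into a single local output file.
hdfs dfs -getmerge "${jobname}.out" "../results/${jobname}.out"

# 4. Print the first 100 rows of the output file.
head -n 100 "../results/${jobname}.out"
```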
For the scripts checking types and validity (named `check_*.py`), each row of the output is tab-separated with the format:

```
base_type	semantic_type	label
```
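To take a quick look at the labeled rows, the check outputs can be inspected from the shell (paths assume you are in the repository root, per the layout above):

```bash
# Print the first few tab-separated rows of each check job's output.
head -n 5 results/check_*.out
```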
We wrote a bash script `src/count_labels.sh` to count the numbers of NULL/VALID/INVALID instances for all columns.
- `cd` into the `src` folder.
- Run `bash count_labels.sh`.
- Enjoy the results!
The script will:

- Count the numbers of NULL/VALID/INVALID instances for all columns from all `results/check_*.out` files.
- Save the counts to `results/count_labels.out`.
- Print the output file.
For each job, the output format will be:

```
jobname
number of NULL
number of VALID
number of INVALID
```
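As an illustration, the counting could be done as below; this is a sketch under the assumption that the label is the third tab-separated field, not the actual contents of `count_labels.sh`:

```bash
#!/bin/bash
# For each check job's output, print the job name followed by the
# counts of NULL, VALID, and INVALID labels (third tab-separated field).
for f in ../results/check_*.out; do
    basename "$f" .out
    awk -F'\t' '{ n[$3]++ }
        END { print n["NULL"]+0; print n["VALID"]+0; print n["INVALID"]+0 }' "$f"
done | tee ../results/count_labels.out
```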
The following packages were used:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import date
from project_env import plots, split_years
```
Run these files from a folder that also contains the PySpark output files in a subfolder called `results`. To run the visualizations of yearly data in `Yearly_Crime.ipynb`, the Python file `project_env.py` must also be in the same folder as the notebook.
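Under these assumptions, the folder containing the notebook should look roughly like this (the specific `.out` files depend on which jobs were run):

```
Yearly_Crime.ipynb
project_env.py
results/
    all_days.out
    check_*.out
    count_labels.out
```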