DVRPC Project

Scenario

The scenario that we have is that the Office of Safe Streets (OSS) has requested support in developing a reproducible data process to consume crash data and population for a quick high-level understanding of crashes per capita by municipality across the region.

Setup

# Create venv :
python3 -m venv .venv

# Activate venv:
. .venv/bin/activate

# Sync dependencies from `requirements.txt`:
pip install -r requirements.txt

# Run application and follow prompts
python3 app.py

Application

This application has 3 main components, a simple command line interface (CLI) with prompts to customize the report, a simple database to sqlize CSV data, and a report generator that queries the database and generates a csv report from the custom inputs determined in the CLI. The CLI has two prompts, MIN YEAR and MAX YEAR. These prompts determine the range of years that will be outputted into the report CSV. For example, if 2015-2020 is selected, the report will provide per capita data for every municipality between 2015 and 2020. If the user selects 2020-2020, it will only output data for 2020. Allowing for a range of years can give context on crash trends in municipalities, while looking at a single year can be a quick way to compare municipalities for the year of interest. note: At the moment there is no validation for entered years. With a little bit more time working on this I would add it, but for now just enter the years within the displayed range (ascending).

Data Source

This application draws from two different data sources that were provided in the assignment: Crash Data, Population Data I explored other potential datasets but each had their own shortcomings. One dataset provided detailed population counts for each municipality by year, but it did not contain information for 2020 and beyond, which are key years for analysis and future proofing. It also did not divide philadelphia into planning districts as it is in the crash dataset.

Database

Each time the project is exectuted, the application first downloads the Crash and Population data as CSVs then dumps them into a local sqlite3 database. Two tables are created: mcd_crash and mcd_population. This allows for simpler querying and expanded data manipulation options of crash and population data. note: I'm aware that the tables can be directly queried from the DVRPC api, however, this was more a demonstration of my ability to create a simple database

Report

To create a high level understanding of crashes per capita in each municipality, I chose what I believe to be the 3 most relevant fields to understanding the data at a glance. Drawing from the original crash data, I wanted to look at the Number of people Killed or Severely injured, seperated as three field types: total, pedestrians, and cyclists. Pedestrian and Cyclists are the most vulnerable road users (as are motorcyclists but that data is not provided), and the overall total provides the clearest indication of crash frequency. For these three field types, the per capita (per 100k) by year, and a per capita (per 100k) 5yr average is calculated. I chose to include a 5yr average as some munipalities have very small populations, and a small number of crashes can in these municipalities results in large outlier per capita values. note: With the current set of population and crash data, 5yr averages can only be calculated for 2019 & 2020. The population data only provides estimates for 2015 and onwards, so we cannot calculate a 5yr average for 2018 and lower.

Population estimates

The population data only provides estimates for 2015, 2019, 2020, and 5yr increments up to 2050. To calculate accurate per capita values for each year, we need to fill in the gaps in the missing years. For the purpose of this project, I estimate the population by calculation the linear change by year between 2015-2019, and 2020-2025. While not 100% accurate as population change isn't entirely linear, it provides an estimate that will have a neglible effect on the per capita total that is calculated. With more time on this project, I would like to have created a dataset that has a combination of real an future estimated populations by municipality using census data.

Math

For each municipality we make the following calculation from its population and the crash total for each field type, rounded to one decimal point Per capita: (crash count / population) * 100,000 Per capita 5yr average: (sum of crash counts in 5 year range / sum of population in 5 year range) * 100,000

Fields

Column	Description
County	County
MCD_NAME	Municipal Civil Designation
MCD_ID	2010 Census geography GEOID
YEAR	Year
POPULATION ESTIMATE	Population Estimate
TOTAL_KSI_RATE	People killed or severely injured per capita (100k)
PEDESTRIAN_KSI_RATE	Pedestrians killed or severely injured per capita (100k)
BICYCLIST_KSI_RATE	Bicyclists killed or severely injured per capita (100k)
TOTAL_KSI_RATE_5YR	People killed or severely injured over previous 5 years per capita
PEDESTRIAN_KSI_RATE_5YR	Pedestrians killed or severely injured over previous 5 years per capita
BICYCLIST_KSI_RATE_5YR	Bicyclists killed or severely injured over previous 5 years per capita

Future proofing

This application uses a population data source that has estimates up to 2050. Currently, the application is coded to handle data up to 2025 (current data provided only goes up to 2020) but can easily be adjusted for further years. Because this application redownloads and remakes its small database every time it's run, it will automatically include new years that are added to the crash dataset. Additionally, the CLI will adjust it's maximum allowed year depending on the highest year available in the crash dataset.

Report Insights

Nationwide trends tell us that fatal crashes across all modes have been on the rise since roughly 2010, and have significantly increased since the onset of the pandemic. This trend is similarly visible in Philadelphia, which you can view in another application I have made here - github - backend gihub. For this report, I would like to see if this national trend is roughly consistent across municipalities within this 9 county region.

There are a couple of challenges with using this report to analyze these trends. For one, you cannot average per capita values between municipalites as their populations are different. A municipality with a population of 100,000 might have a fatal crash rate of 10, whereas a municipality with a population of 1000 might have a fatal crash rate of 500. Their averaged crash rate would be 14.9, not 255. Therefore, you would need the underlying crash totals (which we reference in the data source section). There are a couple solutions to these problems that we can use. Rather than looking up each crash total, we can average per capita rates with some simple math. Since we provide population in the data set, this formula between two munipalities and their crash rates will give us a correct average crash rate (crash_rate_1 * population_1 + crash_rate_2 * population_2) / (population_1 + population_2)
To average all municipalities for a year, it will look something like this in excel =SUMPRODUCT(E2:E365,F2:F365) / SUM(E2:E365) This will allow us to do some simple calculations in excel without needing to reference another dataset

YEAR	AVG TOTAL_KSI_RATE	AVG PED_KSI_RATE	AVERAGE BIKE_KSI_RATE
2020	2506.844272	34.02237723	11.20683612
2019	3385.516475	49.33153406	13.2183977
2018	3437.825177	49.90189789	13.42778075
2017	3437.352007	50.00159346	15.50974737
2016	3463.074035	52.98107281	16.82023057
2015	3384.209629	49.54314625	18.03439286

Using the above formulas, we produce this table which shows us the average per capita rates across municipalities per year. There are a few trends visible here that I had not been suspecting. Between total ksi and pedestrian ksi rates, they stayed roughly the same between 2015 and 2019, and then drastically dropped in 2020, likely due to decreased transportation activity in the midst of the pandemic. More surprising to me is the steady rate in descreased bike ksi rate from 2015-2020, dropping by nearly 40% over 6 years. Another useful application of this table is that it allows for a simple analysis of individual municipalities compared to the baseline. Here are a few interesting outliers with populations greater than 10k (exclude extreme outliers with small sample sizes)

COUNTY	MCD_NAME	MCD_ID	YEAR	POPULATION_ESTIMATE	TOTAL_KSI_RATE	PEDESTRIAN_KSI_RATE	BICYCLIST_KSI_RATE
Camden County	Bellmawr Borough	3400704750	2019	11359	21727.3	44	35.2
Gloucester County	Deptford Township	3401517710	2018	30392	14839.4	36.2	19.7
Philadelphia County	Central Philadelphia	4210160103	2015	125180	2782.4	204.5	95.1
Camden County	Camden City	3400710000	2019	73562	1117.4	118.3	35.3

As shown here, Bellmar Borough and Depford township have exceedingly hight Total KSI Rates, 640% and 430% above the baseline respectively. Central Philadelphia and camden city both have excessively high pedestrian and bicyclist ksi rates, however, they also have a higher share of pedestrians and cyclists than other municipalities.

Future work, conclusions

One significantly lacking data point missing from this application is Vehicle Miles Travelled (VMT). As shown for Philadelphia and Camden City, it might be misleading to compare them to a baseline average across municipalities as they have a much higher rate of pedestrians and cyclists than most municipalities. If I were to work more on this project, I would create an additional category for all 3 field types, per vehicle miles travelled. This would require having data for estimated vehicle miles travelled across all modes in every municipality, but in many cases would be a far better indicator of what municipalities have excessively dangerous roads. Additionally, I think it's important to have seperate statistics for motorists, and motorcyclists. Having a combination of motorists, motorcyclists, pedestrians, and cyclists would provide a full understanding of the rates of deaths and severe injuries across munipalities. Lastly, it would be really interesting to see the numbers from 2021 through 2023. The last year of data being a pandemic year makes it difficult to see long term trends, and in my own dashboard of Philadelphia these most recent years have shown the highest rates of fatalities and severe injuries, especially among vulnerable road users.

One other note, in the dataset provided the following fields are listed

FIELD NAME	Description
TOTAL KILLED	Total number of people killed
TOTAL INJURED	Total number of people injured
TOTAL PERSONS	Total number of people killed or seriously injured

The sum of total killed and total injured has a completely different value than total persons. I used total persons for the report, which has a far higher value, as I assumed that total injured might have to do with lower severity injuries. I tested doing reports on total persons, and using the sum, and found similar trends (but vastly different rates). I just thought I would make a note of it as it seems a little misleading in the original dataset.

Thank you for reading this long report, it was a super interesting challenge and maybe I did a little more than I needed to, but I had fun doing it. I chose not to excessively comment the code, as the functions are fairly self descriptive. The exec function in db.py and report.py outlines the order in which functions are executed. There are some fairly simple class summaries and a step summary in cli.py. Please come with any specific questions about the code and my development process. Looking forward to discussing my work with you soon!

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.env		.env
.gitignore		.gitignore
README.md		README.md
app.py		app.py
cli.py		cli.py
columns.py		columns.py
consts.py		consts.py
db.py		db.py
report.py		report.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DVRPC Project

Scenario

Setup

Application

Data Source

Database

Report

Population estimates

Math

Fields

Future proofing

Report Insights

Future work, conclusions

About

Uh oh!

Releases

Packages

Uh oh!

Languages

ckirby98/DVRPC-Project

Folders and files

Latest commit

History

Repository files navigation

DVRPC Project

Scenario

Setup

Application

Data Source

Database

Report

Population estimates

Math

Fields

Future proofing

Report Insights

Future work, conclusions

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages