A Python-based data extraction and analysis pipeline for processing food establishment inspection scores from Lexington-Fayette County Health Department reports.
This project extracts food safety inspection data from PDF reports, cleans and transforms the data, and enriches it with detailed violation code descriptions to enable analysis of food establishment safety compliance. The system is designed to accumulate historical data by appending new scrapes to existing data, allowing you to track establishment performance over time.
- PDF Data Extraction: Automatically extracts inspection scores and violation codes from PDF reports
- Historical Tracking: Appends new scrapes to existing data with a scrape date timestamp
- Data Cleaning: Transforms raw extracted data into clean, structured CSV format
- Violation Enrichment: Joins inspection records with detailed violation code descriptions
- Multi-page Processing: Handles large PDF reports with multiple pages and tables
- Trend Analysis: Track establishments over time to identify repeat violations and compliance patterns
```
pip install pandas tabula-py pdfplumber camelot-py
```

Note: tabula-py requires Java to be installed and available on your PATH.
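Because tabula-py shells out to a Java process, a quick preflight check can save a confusing stack trace later. This is a minimal sketch using only the standard library; the function name is illustrative, not part of the pipeline:

```python
import shutil

def java_available():
    """Return True if a `java` executable is on the PATH (required by tabula-py)."""
    return shutil.which("java") is not None

if not java_available():
    print("Warning: Java not found on PATH; tabula-py extraction will fail.")
```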
LexKYFoodScores/
├── download_pdf.py # Downloads latest PDF from LFCHD website
├── run_pipeline.py # Orchestrator script (runs all steps)
├── LexFoodScoresExtract.py # Step 1: Extracts data from PDF reports
├── transform_food_scores.py # Step 2: Cleans and transforms raw data
├── JoinScoresViolations.py # Step 3: Joins scores with violation descriptions
├── CodeViolations.csv # Reference table of violation codes
├── PDFs/ # Downloaded PDFs (historical archive)
├── food_scores.csv # Raw extracted data (intermediate)
├── food_scores_cleaned.csv # Cleaned data with proper headers (intermediate)
└── joined_scores_violations.csv # Final enriched dataset
Option 1: Windows Batch File (Easiest)
Double-click `run_pipeline.bat` or run it from a command prompt:

```
run_pipeline.bat
```

Option 2: Python Command

```
python run_pipeline.py
```

Both methods will:
- Download the latest PDF from the LFCHD website
- Check the MD5 hash - skip processing if the PDF is unchanged (perfect for daily runs!)
- Store PDFs in the `PDFs/` directory with timestamps for historical tracking
- Run all three processing steps
- Generate the final `joined_scores_violations.csv` file
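The MD5 skip described above can be sketched as follows. The helper names and the hash-file location are illustrative assumptions, not the pipeline's actual internals:

```python
import hashlib
from pathlib import Path

def md5_of(path):
    """Compute the MD5 digest of a file, reading in chunks to bound memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def pdf_is_new(pdf_path, hash_file=Path("last_hash.txt")):
    """Return True (and record the hash) if the PDF differs from the last run."""
    digest = md5_of(pdf_path)
    if hash_file.exists() and hash_file.read_text().strip() == digest:
        return False  # unchanged PDF -> skip processing
    hash_file.write_text(digest)
    return True
```

The first run on a given PDF returns True and stores its hash; subsequent runs on the same file return False, which is what makes daily scheduled runs cheap.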
Alternative: Manual PDF Download
You can also download the PDF separately first:
```
python download_pdf.py
python run_pipeline.py --scores-pdf "PDFs/Food-Retail_Inspections-06.2024-06.2025.pdf"
```

Options:

- `--download`: Force download of the latest PDF even if you specify `--scores-pdf`
- `--scores-pdf PATH`: Path to the inspection scores PDF (if not provided, downloads the latest)
- `--scrape-date YYYY-MM-DD`: Date of scrape (defaults to today)
- `--scores-csv PATH`: Output path for raw data (default: `food_scores.csv`)
- `--cleaned-csv PATH`: Output path for cleaned data (default: `food_scores_cleaned.csv`)
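The options above map onto a straightforward `argparse` setup. This is a hypothetical sketch of how such a parser could be built, not the orchestrator's actual code:

```python
import argparse

def build_parser():
    """Build a CLI parser mirroring the documented pipeline options."""
    p = argparse.ArgumentParser(description="Run the LexKYFoodScores pipeline.")
    p.add_argument("--download", action="store_true",
                   help="Force download of the latest PDF")
    p.add_argument("--scores-pdf", metavar="PATH",
                   help="Inspection scores PDF (downloads latest if omitted)")
    p.add_argument("--scrape-date", metavar="YYYY-MM-DD",
                   help="Date of scrape (defaults to today)")
    p.add_argument("--scores-csv", metavar="PATH", default="food_scores.csv",
                   help="Output path for raw extracted data")
    p.add_argument("--cleaned-csv", metavar="PATH", default="food_scores_cleaned.csv",
                   help="Output path for cleaned data")
    return p

# Example: args = build_parser().parse_args(["--scores-pdf", "PDFs/report.pdf"])
```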
If you prefer to run each step individually:
Extract inspection scores and violation codes from the health department PDF:
```
python LexFoodScoresExtract.py \
  --scores-pdf "Food-Retail_Inspections-06.2024-06.2025.pdf" \
  --scores-csv food_scores.csv \
  --scrape-date 2025-10-06
```

Key Features:

- Appends to existing data: New scrapes are added to `food_scores.csv` rather than replacing it
- Automatic scrape date: If you don't provide `--scrape-date`, it defaults to today's date
- Historical tracking: Each row is tagged with when it was scraped, allowing you to track which inspection records appeared in which PDF reports over time
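The append-and-default-date behavior could look like the sketch below; the function name and exact column handling are assumptions, not the extractor's actual internals:

```python
from datetime import date
from pathlib import Path

import pandas as pd

def append_scrape(new_rows: pd.DataFrame, csv_path="food_scores.csv",
                  scrape_date=None) -> pd.DataFrame:
    """Tag rows with a scrape date and append them to any existing CSV."""
    new_rows = new_rows.copy()
    # Default to today's date when --scrape-date is not supplied
    new_rows["ScrapeDate"] = scrape_date or date.today().isoformat()
    if Path(csv_path).exists():
        existing = pd.read_csv(csv_path)
        combined = pd.concat([existing, new_rows], ignore_index=True)
    else:
        combined = new_rows
    combined.to_csv(csv_path, index=False)
    return combined
```

Appending rather than overwriting is what lets the same establishment show up under multiple `ScrapeDate` values, which the analysis examples below rely on.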
Example with auto-date:
```
python LexFoodScoresExtract.py \
  --scores-pdf "Food-Retail_Inspections-06.2024-06.2025.pdf" \
  --scores-csv food_scores.csv
```

Transform the raw extracted data into a clean format with proper headers:
```
python transform_food_scores.py \
  --input food_scores.csv \
  --output food_scores_cleaned.csv
```

This step:
- Renames columns to meaningful headers (Permit #, Establishment Name, Address, Date, ScrapeDate, etc.)
- Filters out non-data rows
- Parses inspection dates and scrape dates
- Splits multiple violations into separate rows
- Preserves scrape date metadata for historical tracking
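Splitting multiple violations into separate rows is the classic pandas `explode` pattern. A minimal sketch, assuming violations arrive as a comma-delimited string (the raw delimiter may differ):

```python
import pandas as pd

def split_violations(df: pd.DataFrame, col="Violations") -> pd.DataFrame:
    """Give each violation code its own row, duplicating the other columns."""
    out = df.copy()
    out[col] = out[col].str.split(",")   # one list of codes per inspection
    out = out.explode(col)               # one row per code
    out[col] = out[col].str.strip()      # drop whitespace left by the split
    return out.reset_index(drop=True)
```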
Enrich the cleaned data with detailed violation descriptions:
```
python JoinScoresViolations.py
```

This produces `joined_scores_violations.csv` with complete inspection records including:
- Establishment details (name, address, permit)
- Inspection date, type, and score
- Scrape date (when this data was captured)
- Violation codes and their full descriptions
- Violation categories
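The enrichment step is essentially a left merge against the `CodeViolations.csv` reference table. A sketch, with column names taken from the field descriptions in this README (the script's actual join logic may differ):

```python
import pandas as pd

def join_violations(scores: pd.DataFrame, codes: pd.DataFrame) -> pd.DataFrame:
    """Left-join so inspections without a matching code are still kept."""
    return scores.merge(codes, how="left",
                        left_on="Violations", right_on="Violation Code")
```

A left join (rather than inner) preserves inspection rows whose violation code is missing from the reference table, so no records are silently dropped.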
- Permit #: Establishment permit number
- Establishment Name: Name of the food establishment
- Address: Physical address
- Date: Inspection date (when the inspection occurred)
- Inspection Type: Type of inspection conducted
- Food or Retail: Classification
- Score: Inspection score
- Violations: Violation code
- ScrapeDate: Date when this data was scraped from the PDF report
- Page: Page number in the source PDF
- Table: Table number on the page
- SourceFile: Name of the source PDF file
- Violation Code: Code from reference table
- Category: Violation category (e.g., Supervision, Employee Health)
- Violation Explanation: Detailed description of the violation
Once you have the final joined dataset, you can analyze:
- Historical performance: Track how individual establishments' scores change over time
- Repeat offenders: Identify establishments that consistently appear in reports with violations
- Violation trends: See which violations are most common across all establishments
- Disappearing establishments: Detect establishments that stopped appearing in reports (closed or improved?)
- Seasonal patterns: Analyze if certain times of year have more violations
- New vs. routine inspections: Compare scores between different inspection types over time
```python
import pandas as pd

df = pd.read_csv('joined_scores_violations.csv')

# Find establishments that appeared in multiple scrapes
repeat_establishments = df.groupby('Permit #')['ScrapeDate'].nunique()
establishments_with_multiple_scrapes = repeat_establishments[repeat_establishments > 1]

# Track score changes over time for a specific establishment
permit = '12345'
score_history = df[df['Permit #'] == permit][['Date', 'ScrapeDate', 'Score']].drop_duplicates()

# Find establishments that disappeared (were in early scrapes but not recent ones)
early_scrapes = df[df['ScrapeDate'] < '2025-06-01']['Permit #'].unique()
recent_scrapes = df[df['ScrapeDate'] >= '2025-06-01']['Permit #'].unique()
disappeared = set(early_scrapes) - set(recent_scrapes)
```

Set up automatic daily checks for new inspection data:
- Open Task Scheduler
  - Press `Win + R`, type `taskschd.msc`, press Enter
- Create Basic Task
  - Click "Create Basic Task" in the right panel
  - Name: `Food Inspection Data Update`
  - Description: `Daily check for new food inspection data`
- Set Trigger
  - Choose "Daily"
  - Set start time (e.g., 6:00 AM)
  - Recur every: 1 day
- Set Action
  - Choose "Start a program"
  - Program/script: `C:\PythonCode\LexKYFoodScores\run_pipeline.bat`
  - Start in: `C:\PythonCode\LexKYFoodScores`
  - (Adjust paths to match your installation)
- Finish
  - Check "Open Properties dialog" to review settings
Why Daily?
- MD5 check ensures no duplicate processing
- Script exits quickly if no new data (< 5 seconds)
- You always have the latest data soon after it's published
You can also run manually whenever you want fresh data - the MD5 check prevents redundant processing.
Data is sourced from the Lexington-Fayette County Health Department food establishment inspection reports.
This project is provided as-is for data analysis and transparency purposes.
Contributions are welcome! Please feel free to submit issues or pull requests.