A Python-based data extraction and analysis pipeline for processing food establishment inspection scores from Lexington-Fayette County Health Department reports.
This project extracts food safety inspection data from PDF reports, cleans and transforms the data, and enriches it with detailed violation code descriptions to enable analysis of food establishment safety compliance. The system is designed to accumulate historical data by appending new scrapes to existing data, allowing you to track establishment performance over time.
- PDF Data Extraction: Automatically extracts inspection scores and violation codes from PDF reports
- Historical Tracking: Appends new scrapes to existing data with a scrape date timestamp
- Data Cleaning: Transforms raw extracted data into clean, structured CSV format
- Violation Enrichment: Joins inspection records with detailed violation code descriptions
- Multi-page Processing: Handles large PDF reports with multiple pages and tables
- Trend Analysis: Track establishments over time to identify repeat violations and compliance patterns
```
pip install pandas tabula-py pdfplumber camelot-py
```

Note: tabula-py requires Java to be installed and available on your PATH.
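Because tabula-py shells out to a Java process, a quick preflight check can save a confusing stack trace later. This is a minimal sketch using only the standard library; the function name is illustrative, not part of the pipeline:

```python
import shutil

def java_available():
    """Return True if a `java` executable is on the PATH (required by tabula-py)."""
    return shutil.which("java") is not None

if not java_available():
    print("Warning: Java not found on PATH; tabula-py extraction will fail.")
```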
LexKYFoodScores/
├── download_pdf.py # Downloads latest PDF from LFCHD website
├── run_pipeline.py # Orchestrator script (runs all steps)
├── LexFoodScoresExtract.py # Step 1: Extracts data from PDF reports
├── transform_food_scores.py # Step 2: Cleans and transforms raw data
├── JoinScoresViolations.py # Step 3: Joins scores with violation descriptions
├── CodeViolations.csv # Reference table of violation codes
├── PDFs/ # Downloaded PDFs (historical archive)
├── food_scores.csv # Raw extracted data (intermediate)
├── food_scores_cleaned.csv # Cleaned data with proper headers (intermediate)
└── joined_scores_violations.csv # Final enriched dataset
Option 1: Windows Batch File (Easiest)
Double-click `run_pipeline.bat` or run it from a command prompt:

```
run_pipeline.bat
```

Option 2: Python Command

```
python run_pipeline.py
```

Both methods will:
- Download the latest PDF from the LFCHD website
- Check the MD5 hash - skip processing if the PDF is unchanged (perfect for daily runs!)
- Store PDFs in the `PDFs/` directory with timestamps for historical tracking
- Run all three processing steps
- Generate the final `joined_scores_violations.csv` file
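The MD5 skip described above can be sketched as follows. The helper names and the hash-file location are illustrative assumptions, not the pipeline's actual internals:

```python
import hashlib
from pathlib import Path

def md5_of(path):
    """Compute the MD5 digest of a file, reading in chunks to bound memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def pdf_is_new(pdf_path, hash_file=Path("last_hash.txt")):
    """Return True (and record the hash) if the PDF differs from the last run."""
    digest = md5_of(pdf_path)
    if hash_file.exists() and hash_file.read_text().strip() == digest:
        return False  # unchanged PDF -> skip processing
    hash_file.write_text(digest)
    return True
```

The first run on a given PDF returns True and stores its hash; subsequent runs on the same file return False, which is what makes daily scheduled runs cheap.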
Alternative: Manual PDF Download
You can also download the PDF separately first:
```
python download_pdf.py
python run_pipeline.py --scores-pdf "PDFs/Food-Retail_Inspections-06.2024-06.2025.pdf"
```

Options:

- `--download`: Force download of the latest PDF even if you specify `--scores-pdf`
- `--scores-pdf PATH`: Path to the inspection scores PDF (if not provided, downloads the latest)
- `--scrape-date YYYY-MM-DD`: Date of scrape (defaults to today)
- `--scores-csv PATH`: Output path for raw data (default: `food_scores.csv`)
- `--cleaned-csv PATH`: Output path for cleaned data (default: `food_scores_cleaned.csv`)
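The options above map onto a straightforward `argparse` setup. This is a hypothetical sketch of how such a parser could be built, not the orchestrator's actual code:

```python
import argparse

def build_parser():
    """Build a CLI parser mirroring the documented pipeline options."""
    p = argparse.ArgumentParser(description="Run the LexKYFoodScores pipeline.")
    p.add_argument("--download", action="store_true",
                   help="Force download of the latest PDF")
    p.add_argument("--scores-pdf", metavar="PATH",
                   help="Inspection scores PDF (downloads latest if omitted)")
    p.add_argument("--scrape-date", metavar="YYYY-MM-DD",
                   help="Date of scrape (defaults to today)")
    p.add_argument("--scores-csv", metavar="PATH", default="food_scores.csv",
                   help="Output path for raw extracted data")
    p.add_argument("--cleaned-csv", metavar="PATH", default="food_scores_cleaned.csv",
                   help="Output path for cleaned data")
    return p

# Example: args = build_parser().parse_args(["--scores-pdf", "PDFs/report.pdf"])
```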
If you prefer to run each step individually:
Extract inspection scores and violation codes from the health department PDF:
```
python LexFoodScoresExtract.py \
  --scores-pdf "Food-Retail_Inspections-06.2024-06.2025.pdf" \
  --scores-csv food_scores.csv \
  --scrape-date 2025-10-06
```

Key Features:

- Appends to existing data: New scrapes are added to `food_scores.csv` rather than replacing it
- Automatic scrape date: If you don't provide `--scrape-date`, it defaults to today's date
- Historical tracking: Each row is tagged with when it was scraped, allowing you to track which inspection records appeared in which PDF reports over time
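The append-and-default-date behavior could look like the sketch below; the function name and exact column handling are assumptions, not the extractor's actual internals:

```python
from datetime import date
from pathlib import Path

import pandas as pd

def append_scrape(new_rows: pd.DataFrame, csv_path="food_scores.csv",
                  scrape_date=None) -> pd.DataFrame:
    """Tag rows with a scrape date and append them to any existing CSV."""
    new_rows = new_rows.copy()
    # Default to today's date when --scrape-date is not supplied
    new_rows["ScrapeDate"] = scrape_date or date.today().isoformat()
    if Path(csv_path).exists():
        existing = pd.read_csv(csv_path)
        combined = pd.concat([existing, new_rows], ignore_index=True)
    else:
        combined = new_rows
    combined.to_csv(csv_path, index=False)
    return combined
```

Appending rather than overwriting is what lets the same establishment show up under multiple `ScrapeDate` values, which the analysis examples below rely on.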
Example with auto-date:
```
python LexFoodScoresExtract.py \
  --scores-pdf "Food-Retail_Inspections-06.2024-06.2025.pdf" \
  --scores-csv food_scores.csv
```

Transform the raw extracted data into a clean format with proper headers:
```
python transform_food_scores.py \
  --input food_scores.csv \
  --output food_scores_cleaned.csv
```

This step:
- Renames columns to meaningful headers (Permit #, Establishment Name, Address, Date, ScrapeDate, etc.)
- Filters out non-data rows
- Parses inspection dates and scrape dates
- Splits multiple violations into separate rows
- Preserves scrape date metadata for historical tracking
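Splitting multiple violations into separate rows is the classic pandas `explode` pattern. A minimal sketch, assuming violations arrive as a comma-delimited string (the raw delimiter may differ):

```python
import pandas as pd

def split_violations(df: pd.DataFrame, col="Violations") -> pd.DataFrame:
    """Give each violation code its own row, duplicating the other columns."""
    out = df.copy()
    out[col] = out[col].str.split(",")   # one list of codes per inspection
    out = out.explode(col)               # one row per code
    out[col] = out[col].str.strip()      # drop whitespace left by the split
    return out.reset_index(drop=True)
```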
Enrich the cleaned data with detailed violation descriptions:
```
python JoinScoresViolations.py
```

This produces `joined_scores_violations.csv` with complete inspection records including:
- Establishment details (name, address, permit)
- Inspection date, type, and score
- Scrape date (when this data was captured)
- Violation codes and their full descriptions
- Violation categories
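The enrichment step is essentially a left merge against the `CodeViolations.csv` reference table. A sketch, with column names taken from the field descriptions in this README (the script's actual join logic may differ):

```python
import pandas as pd

def join_violations(scores: pd.DataFrame, codes: pd.DataFrame) -> pd.DataFrame:
    """Left-join so inspections without a matching code are still kept."""
    return scores.merge(codes, how="left",
                        left_on="Violations", right_on="Violation Code")
```

A left join (rather than inner) preserves inspection rows whose violation code is missing from the reference table, so no records are silently dropped.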
- Permit #: Establishment permit number
- Establishment Name: Name of the food establishment
- Address: Physical address
- Date: Inspection date (when the inspection occurred)
- Inspection Type: Type of inspection conducted
- Food or Retail: Classification
- Score: Inspection score
- Violations: Violation code
- ScrapeDate: Date when this data was scraped from the PDF report
- Page: Page number in the source PDF
- Table: Table number on the page
- SourceFile: Name of the source PDF file
- Violation Code: Code from reference table
- Category: Violation category (e.g., Supervision, Employee Health)
- Violation Explanation: Detailed description of the violation
Once you have the final joined dataset, you can analyze:
- Historical performance: Track how individual establishments' scores change over time
- Repeat offenders: Identify establishments that consistently appear in reports with violations
- Violation trends: See which violations are most common across all establishments
- Disappearing establishments: Detect establishments that stopped appearing in reports (closed or improved?)
- Seasonal patterns: Analyze if certain times of year have more violations
- New vs. routine inspections: Compare scores between different inspection types over time
```python
import pandas as pd

df = pd.read_csv('joined_scores_violations.csv')

# Find establishments that appeared in multiple scrapes
repeat_establishments = df.groupby('Permit #')['ScrapeDate'].nunique()
establishments_with_multiple_scrapes = repeat_establishments[repeat_establishments > 1]

# Track score changes over time for a specific establishment
permit = '12345'
score_history = df[df['Permit #'] == permit][['Date', 'ScrapeDate', 'Score']].drop_duplicates()

# Find establishments that disappeared (were in early scrapes but not recent ones)
early_scrapes = df[df['ScrapeDate'] < '2025-06-01']['Permit #'].unique()
recent_scrapes = df[df['ScrapeDate'] >= '2025-06-01']['Permit #'].unique()
disappeared = set(early_scrapes) - set(recent_scrapes)
```

Set up automatic daily checks for new inspection data:
- Open Task Scheduler
  - Press `Win + R`, type `taskschd.msc`, press Enter
- Create Basic Task
  - Click "Create Basic Task" in the right panel
  - Name: `Food Inspection Data Update`
  - Description: `Daily check for new food inspection data`
- Set Trigger
  - Choose "Daily"
  - Set start time (e.g., 6:00 AM)
  - Recur every: 1 day
- Set Action
  - Choose "Start a program"
  - Program/script: `C:\PythonCode\LexKYFoodScores\run_pipeline.bat`
  - Start in: `C:\PythonCode\LexKYFoodScores`
  - (Adjust paths to match your installation)
- Finish
  - Check "Open Properties dialog" to review settings
Why Daily?
- MD5 check ensures no duplicate processing
- Script exits quickly if no new data (< 5 seconds)
- You always have the latest data soon after it's published
You can also run manually whenever you want fresh data - the MD5 check prevents redundant processing.
Data is sourced from the Lexington-Fayette County Health Department food establishment inspection reports.
This project is provided as-is for data analysis and transparency purposes.
Contributions are welcome! Please feel free to submit issues or pull requests.