SchemaGen is a web application for managing JSON schemas and extracting structured data from PDF documents.
- Schema Management: Create, view, update, and delete JSON schemas
- Dataset Management: Organize files into logical datasets
- File Upload: Upload PDF files for processing
- AI Schema Generation: Generate schemas using AI from natural language conversations
- Data Extraction: Extract structured data from PDFs according to schemas
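For example, a schema declares the fields to pull out of each document. The schema below is purely illustrative (it does not ship with SchemaGen):

```python
# Illustrative JSON schema, written as a Python dict, for extracting
# invoice fields from PDFs; the field names here are hypothetical.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "format": "date"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}
```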
- Python 3.8+
- Node.js 16+
- Optional: Ollama for local LLM support
- Clone the repository
- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install frontend dependencies:

  ```bash
  cd frontend && npm install
  ```

- Build the frontend:

  ```bash
  cd frontend && npm run build
  ```
Create a `.env` file in the root directory with the following settings:
```env
# Storage configuration
STORAGE_TYPE=local  # 'local' or 's3'
LOCAL_STORAGE_PATH=.data

# S3 configuration (only needed if STORAGE_TYPE=s3)
S3_BUCKET_NAME=your-bucket-name
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_REGION=us-west-1

# AI configuration
USE_LOCAL_MODEL=true  # Set to 'true' to use Ollama
OLLAMA_MODEL=deepseek-r1:14b  # Only used if USE_LOCAL_MODEL=true
OLLAMA_API_URL=http://localhost:11434/api/chat  # Ollama API URL

# DeepSeek API (only used if USE_LOCAL_MODEL=false)
DEEPSEEK_API_KEY=your-api-key

# Database configuration
DATABASE_URL=sqlite:///schemas.db
```
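If `USE_LOCAL_MODEL=true`, the app sends chat requests to the Ollama endpoint configured above. As a quick sanity check that Ollama is reachable (this snippet is illustrative and not part of the app; it assumes Ollama is running and the model has been pulled):

```python
# Sanity-check the Ollama chat endpoint configured in .env above.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:14b",
        "messages": [{"role": "user", "content": "Reply with OK"}],
        "stream": False,  # ask for a single JSON response, not a stream
    },
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```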
Run the Flask application:
```bash
python app.py
```
The application will be available at http://localhost:5000
- Start the Flask backend:

  ```bash
  python app.py
  ```

- In a separate terminal, start the Vite development server:

  ```bash
  cd frontend && npm run dev
  ```
The frontend will be available at http://localhost:5173
The data extraction pipeline takes PDF files, converts them to markdown, and then uses AI to extract structured data according to a schema.
- Upload PDF files to a dataset
- Create or select a schema
- Associate the schema with the dataset
- Navigate to the dataset view
- Click "Extract Data" to start the extraction process
The extraction can also be performed using the command-line tool:
```bash
./extract_data.py <dataset_name> [--source <source>]
```
Examples:
```bash
# Extract data from a local dataset
./extract_data.py financial_reports

# Extract data from an S3 dataset
./extract_data.py quarterly_reports --source s3
```
`POST /api/extract/<source>/<dataset_name>`: Extracts data from the specified dataset
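For example, to trigger extraction over HTTP for a dataset in local storage (assuming the app is running on port 5000; the exact response shape depends on the app):

```python
# Trigger extraction for a locally stored dataset via the endpoint above.
# "financial_reports" is the example dataset name used elsewhere in this README.
import requests

resp = requests.post("http://localhost:5000/api/extract/local/financial_reports")
resp.raise_for_status()
print(resp.json())  # extraction status/results; shape depends on the app
```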
- Dataset Selection: The process starts by selecting a dataset with PDF files
- Schema Association: A schema must be associated with the dataset
- PDF to Markdown Conversion: PDFs are converted to markdown format for easier processing
- Data Extraction: An AI model extracts structured data according to the schema
- JSON Output: The extracted data is saved as JSON files in a new directory
The output files are stored in a directory named `<dataset_name>-extracted` within the storage location:

- `.data/<dataset_name>/`: Original PDF files
- `.data/<dataset_name>-md/`: Intermediate markdown files
- `.data/<dataset_name>-extracted/`: Final JSON extraction results
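Putting the steps and directory layout together, the core of the pipeline can be pictured as the sketch below; `pdf_to_markdown` and `extract_with_llm` are hypothetical placeholders for the app's real conversion and extraction code:

```python
# Sketch of the extraction pipeline described above; pdf_to_markdown()
# and extract_with_llm() are hypothetical stand-ins, not functions from
# this repo.
import json
from pathlib import Path

def run_pipeline(dataset: str, schema: dict, root: Path = Path(".data")) -> None:
    md_dir = root / f"{dataset}-md"
    out_dir = root / f"{dataset}-extracted"
    md_dir.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in (root / dataset).glob("*.pdf"):
        markdown = pdf_to_markdown(pdf)              # PDF -> markdown
        (md_dir / f"{pdf.stem}.md").write_text(markdown)
        record = extract_with_llm(markdown, schema)  # schema-guided extraction
        (out_dir / f"{pdf.stem}.json").write_text(json.dumps(record, indent=2))
```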
The following environment variables can be set to configure the application:
- `OLLAMA_MODEL`: Model name for local Ollama (default: `deepseek-r1:14b`)
- `USE_LOCAL_MODEL`: Set to 'true' to use the local model, 'false' for the API (default: true)
- `OLLAMA_API_URL`: URL for the Ollama API (default: `http://localhost:11434/api/chat`)
- `DATABASE_URL`: Database connection URL (default: `sqlite:///schemas.db`)
- `DEEPSEEK_API_KEY`: API key for the DeepSeek cloud API (required if using the API)
- `DEEPSEEK_API_URL`: URL for the DeepSeek API (default: `https://api.deepseek.com/v1/chat/completions`)
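For reference, a minimal sketch of how these settings might be read at startup (names and defaults come from the list above; the repo's actual config code may differ):

```python
# Read the configuration above with os.getenv; defaults mirror the list.
import os

USE_LOCAL_MODEL = os.getenv("USE_LOCAL_MODEL", "true").lower() == "true"
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "deepseek-r1:14b")
OLLAMA_API_URL = os.getenv("OLLAMA_API_URL", "http://localhost:11434/api/chat")
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///schemas.db")
DEEPSEEK_API_KEY = os.getenv("DEEPSEEK_API_KEY", "")
DEEPSEEK_API_URL = os.getenv(
    "DEEPSEEK_API_URL", "https://api.deepseek.com/v1/chat/completions"
)
```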
Run the frontend tests:

```bash
cd frontend && npm test
```

For coverage reports:

```bash
cd frontend && npm run test:coverage
```

Run the backend tests:

```bash
pytest
```

For coverage reports:

```bash
pytest --cov=. --cov-report=term --cov-report=html
```
Note: When running tests with coverage, specify the coverage options directly on the command line rather than in `pytest.ini`. This avoids issues with pytest interpreting coverage options as command-line arguments.
This project uses GitHub Actions for continuous integration:
- JavaScript Workflow: Runs linting and tests for the frontend code.
- Python Workflow: Runs linting and tests for the backend code.
You can test GitHub Actions workflows locally using act:
```bash
# Install act (on Ubuntu)
curl https://raw.githubusercontent.com/nektos/act/master/install.sh | sudo bash

# Run all workflows in dry-run mode
act -n

# Run a specific workflow in dry-run mode
act -W .github/workflows/javascript.yml -n
act -W .github/workflows/python.yml -n

# Run a workflow for real
act -W .github/workflows/javascript.yml
```

When using act you might encounter:
- Node.js dependency conflicts: Use the `--legacy-peer-deps` flag:

  ```bash
  cd frontend && npm ci --legacy-peer-deps
  ```

- Python package compatibility issues: Some packages may not be available for specific Python versions. Adjust your `requirements.txt` as needed:

  ```text
  # In requirements.txt
  pandas==2.0.3  # For Python 3.8 compatibility
  ```

- Coverage options duplication: Specify coverage options directly on the command line rather than in `pytest.ini`:

  ```bash
  pytest --cov=. --cov-report=term --cov-report=html
  ```
We use several tools to maintain code quality:
```bash
# Run all linting tools (black, flake8, isort, pylint)
python scripts/lint.py

# Run in check-only mode (no changes)
python scripts/lint.py --check

# Individual tools
black .
isort --profile black .
flake8 .
pylint .
mypy .
```
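For reference, a runner like scripts/lint.py can be as simple as the following sketch (the actual script in this repo may differ):

```python
# Hypothetical sketch of a lint runner like scripts/lint.py; runs each
# tool and exits non-zero if any of them fails.
import subprocess
import sys

check_only = "--check" in sys.argv  # mirrors the --check flag shown above
commands = [
    ["black", "."] + (["--check"] if check_only else []),
    ["isort", "--profile", "black", "."] + (["--check-only"] if check_only else []),
    ["flake8", "."],  # flake8 and pylint only report; they never modify files
    ["pylint", "."],
]
exit_codes = [subprocess.run(cmd).returncode for cmd in commands]
sys.exit(1 if any(exit_codes) else 0)
```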
The frontend uses ESLint for TypeScript/React code quality:

```bash
cd frontend

# Run ESLint
npm run lint
```

The following checks are run on each pull request:
- Python tests (pytest)
- JavaScript tests (Jest)
- Python linting (black, flake8, isort, pylint)
- JavaScript linting (ESLint)
The GitHub Actions workflow configuration can be found in the `.github/workflows` directory.