Synthetic data generation app using Local LLMs (Ollama, llama.cpp, etc.) with a Gradio web interface. Generates datasets and documents in multiple formats for testing and development.
- CSV & Excel Files: Generate structured tabular data
- Schema Detection: LLM-powered column header generation based on subject matter
- Statistical Distributions: Log-normal distributions for salaries, power-law for transactions
- Temporal Consistency: Date ranges with seasonal patterns
- Geographic Consistency: Validated country-city relationships
- Correlation Preservation: Related fields maintain relationships
- Multiple Formats: Word (.docx), PDF (.pdf), Text (.txt), Markdown (.md)
- Document Types: Whitepapers, articles, reports, proposals, design documents
- Formatting: Automatic styling and structure
- Iterative Generation: Handles long-form content with multiple LLM calls
- Product Catalog Generator: Fast-path for product data
- Domain Constraints: Category-specific validation rules (electronics, automotive, etc.)
- Data Quality Options: Control correlations and error patterns
- Preview & Validation: Real-time data preview before download
- Batch Processing: CLI support for automated generation
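As an illustration of the distribution and correlation features above, here is a minimal sketch of log-normal salary sampling with a correlated bonus field. The parameter values are illustrative only and are not taken from this project's code:

```python
import random

random.seed(42)

def sample_salary(mu=11.0, sigma=0.4):
    # Log-normal: right-skewed, never negative. exp(11) ~= $60k median;
    # sigma controls the spread of the distribution.
    return round(random.lognormvariate(mu, sigma), 2)

rows = []
for _ in range(5):
    salary = sample_salary()
    # Correlation preservation: bonus is derived from salary, so the
    # two fields stay logically related in every row.
    bonus = round(salary * random.uniform(0.05, 0.15), 2)
    rows.append({"salary": salary, "bonus": bonus})

print(all(r["bonus"] < r["salary"] for r in rows))  # → True
```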
- Python: 3.8 or higher
- Local LLM Server: REQUIRED. You must have a local LLM server running (e.g., Ollama or llama.cpp).
- Memory: 4GB RAM minimum (8GB+ recommended for 7B models)
- Storage: ~500MB for dependencies
```
git clone https://github.com/benwalkerai/Portfolio-GenAISyntheticDataCreator.git
cd Portfolio-GenAISyntheticDataCreator
```

Copy the example environment file:

```
cp .env.example .env
```

Edit .env to point to your local LLM server:
```
LLM_API_BASE=http://localhost:11434/v1   # For Ollama
# LLM_API_BASE=http://localhost:8080/v1  # For llama-server
LLM_API_KEY=ollama
LLM_MODEL=llama3.1:8b                    # Match your loaded model
```

This project supports uv for fast dependency management.
```
# Install dependencies and run
uv run main.py
```

Alternatively, with pip:

```
pip install -r requirements.txt
python main.py
```

The application will launch at http://localhost:7860.
You can run the application in a container. Note that to access a local LLM running on your host machine from inside the container, you may need to use host.docker.internal as the host in your .env file (e.g., http://host.docker.internal:11434/v1).
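For a containerized setup against a default Ollama install on the host, the .env might look like this (values assumed, matching the example configuration above):

```
LLM_API_BASE=http://host.docker.internal:11434/v1
LLM_API_KEY=ollama
LLM_MODEL=llama3.1:8b
```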
```
# Build and start
docker-compose up --build
```

- Select File Format: Choose from Excel, CSV, Word, PDF, Text, or Markdown
- Enter Subject: Describe your data (e.g., "Employee salary records")
- Set Parameters:
  - Data Files: Specify rows and columns
  - Documents: Specify number of pages and document type
- Configure LLM:
  - Verify the model name and URL in the settings accordion if needed
- Advanced Options (optional):
  - Enable temporal coherence for time-series data
  - Add correlations between related fields
  - Include realistic error patterns
- Generate: Click generate and wait for completion
- Preview & Download: Review data and download the file
You can generate data without the UI using create_data.py.
Basic Usage:

```
uv run create_data.py [OPTIONS]
```

Common Examples:

Generate a large customer CSV:

```
uv run create_data.py --csv --subject "Customer CRM records" --rows 5000 --columns 25
```

Generate a generic Excel product catalog:

```
uv run create_data.py --xlsx --subject "Office Supplies" --rows 200 --columns 10
```

Generate a 10-page whitepaper in Word format:

```
uv run create_data.py --docx --subject "Future of AI" --pages 10 --doc-type whitepaper
```

Options Reference:
| Category | Flag | Description |
|---|---|---|
| Format | --csv, --xlsx | Output format for tabular data |
| | --docx, --pdf, --txt, --md | Output format for documents |
| Data | --rows INT | Number of rows (default: 100) |
| | --columns INT | Number of columns (default: 10) |
| Docs | --pages INT | Number of pages (default: 3) |
| | --doc-type TEXT | generic, whitepaper, article, report, proposal, design |
| General | --subject "TEXT" | Required. Topic to guide generation |
| | -d, --dest PATH | Output directory (default: current dir) |
| Realism | --no-correlations | Disable logical data relationships |
| | --missingness FLOAT | Rate of missing values (0.0 - 0.3) |
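The batch-processing feature can be scripted around these flags. A hypothetical Python driver that only assembles the commands (pass each list to subprocess.run to execute them; the subjects and values here are illustrative):

```python
import shlex

SUBJECTS = ["Employee salary records", "Customer CRM records"]

def build_command(subject, rows=1000, columns=12, dest="out"):
    # Flags taken from the options reference above.
    return [
        "uv", "run", "create_data.py",
        "--csv", "--subject", subject,
        "--rows", str(rows), "--columns", str(columns),
        "-d", dest,
    ]

cmds = [build_command(s) for s in SUBJECTS]
print(shlex.join(cmds[0]))  # shell-quoted form of the first command
```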
```
synthetic-data-generator/
├── config/
│   ├── settings.py               # Application configuration
│   └── logging_config.py         # Logging setup
├── generators/
│   ├── data_generator.py         # Main orchestrator
│   ├── document_generator.py     # Document generation
│   ├── excel_generator.py        # Excel data orchestrator
│   ├── constants.py              # Static data & reference tables
│   ├── llm_utils.py              # LLM interaction layer
│   ├── value_generators.py       # Single-value generators (names, IDs, dates)
│   ├── schema_templates.py       # Fallback schemas
│   ├── validators.py             # Data quality validation
│   ├── employee_generator.py     # Employee dataset generation
│   ├── product_generator.py      # Product catalog generation
│   └── sales_generator.py        # Sales transaction generation
├── ui/
│   └── interface.py              # Gradio web interface
├── utils/
│   ├── helpers.py                # Utility functions
│   └── product_constraints.py    # Product business rules
├── tests/
│   ├── test_document_cleaning.py
│   └── test_markdown_conversion.py
├── main.py                       # Web UI entry point
├── create_data.py                # CLI entry point
└── requirements.txt              # Python dependencies
```
Use the .env file to control your LLM connection. The application uses the standard OpenAI API format, which is compatible with most local servers.
- LLM_API_BASE: The endpoint URL (usually must include /v1).
- LLM_MODEL: The exact model name string.
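Because the server speaks the standard OpenAI API format, a request can be assembled with nothing but the stdlib. This is a sketch of that convention using the .env names above, not code copied from this project:

```python
import json
import os
from urllib import request

# Settings names match the .env file described above.
base = os.environ.get("LLM_API_BASE", "http://localhost:11434/v1")
model = os.environ.get("LLM_MODEL", "llama3.1:8b")
url = base.rstrip("/") + "/chat/completions"

# Standard OpenAI chat-completions payload shape.
payload = {
    "model": model,
    "messages": [
        {"role": "user",
         "content": "Suggest 8 column headers for: Employee salary records"},
    ],
}

req = request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + os.environ.get("LLM_API_KEY", "ollama"),
    },
)
# With a local server running, request.urlopen(req) returns the completion.
print(url)
```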
Contributions are welcome!
- Additional file formats (JSON, Parquet, SQL)
- More document types (presentations, spreadsheets)
- Enhanced validation rules
This project is licensed under the MIT License.
- Built with Gradio for the web interface
- Powered by Local LLMs (Ollama, llama.cpp)
- Uses OpenAI API for standardized communication
Version: 1.0.0
Author: Ben Walker
