A comprehensive web-based platform for generating high-quality synthetic datasets to fine-tune Large Language Models (LLMs). Built with a FastAPI backend and a React frontend, with AI agent integration via an MCP server.
- Multi-Format Support: Process PDF, DOCX, TXT, HTML, and YouTube content
- Multiple LLM Providers: Support for Llama, OpenAI, and other providers
- Quality Curation: AI-powered quality assessment and filtering
- Export Formats: JSONL, Alpaca, ChatML, OpenAI fine-tuning formats
- AI Agent Integration: MCP server for programmatic access
- Docker Ready: Containerized deployment with Docker Compose
- Real-time Monitoring: Track job progress and system status
- Modern UI: Clean, responsive React interface
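To make the export formats above concrete, here is a sketch of the record shapes involved. The field names follow the public Alpaca and OpenAI chat fine-tuning conventions; the tool's exact output may differ.

```python
import json

# Alpaca-style record: instruction / input / output triple.
alpaca_record = {
    "instruction": "Summarize the refund policy.",
    "input": "",
    "output": "Refunds are issued within 30 days of purchase.",
}

# Chat-style record (ChatML / OpenAI fine-tuning): a list of role-tagged messages.
chat_record = {
    "messages": [
        {"role": "user", "content": "Summarize the refund policy."},
        {"role": "assistant", "content": "Refunds are issued within 30 days of purchase."},
    ]
}

# JSONL export is simply one JSON object per line.
jsonl_line = json.dumps(alpaca_record)
print(jsonl_line)
```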
The easiest way to get started is using the provided run script:
```bash
# Clone the repository
git clone https://github.com/stateset/stateset-data-studio.git
cd stateset-data-studio

# Make sure you have installed all dependencies first
python run.py
```

This will start:

- Backend API on http://localhost:8000
- Frontend UI on http://localhost:3000
- MCP Server on port 8000
- Python 3.10+
- Node.js 16+
- Docker (optional, for containerized deployment)
- Hugging Face account with API access
- LLM API access (Llama, OpenAI, etc.)
```bash
# Clone and setup
git clone https://github.com/stateset/stateset-data-studio.git
cd stateset-data-studio

# Copy environment configuration
cp .env.example .env

# Edit .env with your API keys
nano .env

# Run the application
python run.py
```

To set up the backend manually:

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r backend/requirements.txt
```

To set up the frontend:

```bash
cd frontend
npm install
npm start
```

StateSet Data Studio includes an MCP (Model Context Protocol) server for AI agent integration:
```bash
# Start MCP server
python run_mcp_server.py
```

```python
# See examples/agent_example.py for a complete integration
from mcp_client import MCPClient

client = MCPClient("http://localhost:8000")
# Create project, upload documents, generate data
```

Through the web interface:

- Access the web interface at http://localhost:3000
- Configure your LLM provider settings
- Verify vLLM server connection
- Click "New Project"
- Enter project details
- Start data generation workflow
- Ingest: Upload documents or provide URLs
- Generate: Create QA pairs or Chain of Thought examples
- Curate: Filter content based on quality thresholds
- Export: Save in your preferred format
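The Curate step above comes down to score-and-threshold filtering. A minimal sketch (the scoring function here is a stand-in heuristic; in the app the score comes from an LLM-based quality assessment):

```python
from typing import Callable

def curate(examples: list[dict], score: Callable[[dict], float],
           threshold: float = 0.7) -> list[dict]:
    """Keep only examples whose quality score meets the threshold."""
    return [ex for ex in examples if score(ex) >= threshold]

# Stand-in scorer: penalize very short answers.
def length_score(ex: dict) -> float:
    return min(len(ex["output"]) / 100.0, 1.0)

examples = [
    {"instruction": "Explain DNS.", "output": "DNS maps names to IPs. " * 5},
    {"instruction": "Explain DNS.", "output": "idk"},
]
kept = curate(examples, length_score, threshold=0.7)
print(len(kept))  # the short, low-quality answer is filtered out
```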
```bash
# Build and run with Docker Compose
docker-compose up -d

# Or build manually
docker build -t synthetic-data-studio .
docker run -p 8000:8000 -p 3000:3000 synthetic-data-studio
```

Configure the application via environment variables in `.env`:

```env
# Required
HF_TOKEN=your_hugging_face_token
LLAMA_API_KEY=your_llama_api_key
OPENAI_API_KEY=your_openai_api_key

# Optional
HOST=0.0.0.0
PORT=8000
DATABASE_URL=sqlite:///./synthetic_data.db
```

Supported input formats:

- Documents: PDF, DOCX, TXT, HTML
- Media: YouTube URLs (video transcripts)
- Archives: ZIP files containing documents
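A common way to route these input types is dispatch on file extension. A minimal sketch — the loader names and registry here are hypothetical, not this project's actual module layout:

```python
from pathlib import Path

# Hypothetical loader registry: maps file extensions to a loader name.
LOADERS = {
    ".pdf": "pdf_loader",
    ".docx": "docx_loader",
    ".txt": "text_loader",
    ".html": "html_loader",
    ".zip": "archive_loader",  # archives are unpacked, then re-dispatched
}

def pick_loader(path: str) -> str:
    """Return the loader name for a file, based on its extension."""
    ext = Path(path).suffix.lower()
    try:
        return LOADERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported input format: {ext or path}")

print(pick_loader("manual.PDF"))  # extension matching is case-insensitive
```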
Once running, access the API documentation at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- OpenAPI Schema: http://localhost:8000/openapi.json
```bash
# Backend tests
python -m pytest tests/ -v

# Frontend tests
cd frontend && npm test

# Integration tests
python run_api_tests.py
```

We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
See our Security Policy for information about reporting vulnerabilities.
- Built with FastAPI and React
- LLM integration powered by vLLM
- UI components from Tailwind CSS
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: MCP Server Docs
StateSet Data Studio - Transform documents into high-quality synthetic datasets
Features • Quick Start • Contributing • License