A comprehensive web-based platform for generating high-quality synthetic datasets to fine-tune Large Language Models (LLMs). Built with a FastAPI backend and a React frontend, with AI agent integration via an MCP server.
- Multi-Format Support: Process PDF, DOCX, TXT, HTML, and YouTube content
- Multiple LLM Providers: Support for Llama, OpenAI, and other providers
- Quality Curation: AI-powered quality assessment and filtering
- Export Formats: JSONL, Alpaca, ChatML, OpenAI fine-tuning formats
- AI Agent Integration: MCP server for programmatic access
- Docker Ready: Containerized deployment with Docker Compose
- Real-time Monitoring: Track job progress and system status
- Modern UI: Clean, responsive React interface
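To make the export formats above concrete, here is a sketch of the record shapes involved. The field names follow the public Alpaca and OpenAI chat fine-tuning conventions; the tool's exact output may differ.

```python
import json

# Alpaca-style record: instruction / input / output triple.
alpaca_record = {
    "instruction": "Summarize the refund policy.",
    "input": "",
    "output": "Refunds are issued within 30 days of purchase.",
}

# Chat-style record (ChatML / OpenAI fine-tuning): a list of role-tagged messages.
chat_record = {
    "messages": [
        {"role": "user", "content": "Summarize the refund policy."},
        {"role": "assistant", "content": "Refunds are issued within 30 days of purchase."},
    ]
}

# JSONL export is simply one JSON object per line.
jsonl_line = json.dumps(alpaca_record)
print(jsonl_line)
```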
The easiest way to get started is using the provided run script:
```bash
# Clone the repository
git clone https://github.com/stateset/stateset-data-studio.git
cd stateset-data-studio

# Make sure you have installed all dependencies first
python run.py
```

This will start:

- Backend API on http://localhost:8000
- Frontend UI on http://localhost:3000
- MCP Server on port 8000
- Python 3.10+
- Node.js 16+
- Docker (optional, for containerized deployment)
- Hugging Face account with API access
- LLM API access (Llama, OpenAI, etc.)
```bash
# Clone and setup
git clone https://github.com/stateset/stateset-data-studio.git
cd stateset-data-studio

# Copy environment configuration
cp .env.example .env

# Edit .env with your API keys
nano .env

# Run the application
python run.py
```

To set up the backend manually:

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r backend/requirements.txt
```

To set up the frontend:

```bash
cd frontend
npm install
npm start
```

StateSet Data Studio includes an MCP (Model Context Protocol) server for AI agent integration:
```bash
# Start MCP server
python run_mcp_server.py
```

```python
# See examples/agent_example.py for a complete integration
from mcp_client import MCPClient

client = MCPClient("http://localhost:8000")
# Create project, upload documents, generate data
```

Through the web interface:

- Access the web interface at http://localhost:3000
- Configure your LLM provider settings
- Verify vLLM server connection
- Click "New Project"
- Enter project details
- Start data generation workflow
- Ingest: Upload documents or provide URLs
- Generate: Create QA pairs or Chain of Thought examples
- Curate: Filter content based on quality thresholds
- Export: Save in your preferred format
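The Curate step above comes down to score-and-threshold filtering. A minimal sketch (the scoring function here is a stand-in heuristic; in the app the score comes from an LLM-based quality assessment):

```python
from typing import Callable

def curate(examples: list[dict], score: Callable[[dict], float],
           threshold: float = 0.7) -> list[dict]:
    """Keep only examples whose quality score meets the threshold."""
    return [ex for ex in examples if score(ex) >= threshold]

# Stand-in scorer: penalize very short answers.
def length_score(ex: dict) -> float:
    return min(len(ex["output"]) / 100.0, 1.0)

examples = [
    {"instruction": "Explain DNS.", "output": "DNS maps names to IPs. " * 5},
    {"instruction": "Explain DNS.", "output": "idk"},
]
kept = curate(examples, length_score, threshold=0.7)
print(len(kept))  # the short, low-quality answer is filtered out
```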
```bash
# Build and run with Docker Compose
docker-compose up -d

# Or build manually
docker build -t synthetic-data-studio .
docker run -p 8000:8000 -p 3000:3000 synthetic-data-studio
```

Configure the application via environment variables in `.env`:

```env
# Required
HF_TOKEN=your_hugging_face_token
LLAMA_API_KEY=your_llama_api_key
OPENAI_API_KEY=your_openai_api_key

# Optional
HOST=0.0.0.0
PORT=8000
DATABASE_URL=sqlite:///./synthetic_data.db
```

Supported input formats:

- Documents: PDF, DOCX, TXT, HTML
- Media: YouTube URLs (video transcripts)
- Archives: ZIP files containing documents
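A common way to route these input types is dispatch on file extension. A minimal sketch — the loader names and registry here are hypothetical, not this project's actual module layout:

```python
from pathlib import Path

# Hypothetical loader registry: maps file extensions to a loader name.
LOADERS = {
    ".pdf": "pdf_loader",
    ".docx": "docx_loader",
    ".txt": "text_loader",
    ".html": "html_loader",
    ".zip": "archive_loader",  # archives are unpacked, then re-dispatched
}

def pick_loader(path: str) -> str:
    """Return the loader name for a file, based on its extension."""
    ext = Path(path).suffix.lower()
    try:
        return LOADERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported input format: {ext or path}")

print(pick_loader("manual.PDF"))  # extension matching is case-insensitive
```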
Once running, access the API documentation at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- OpenAPI Schema: http://localhost:8000/openapi.json
```bash
# Backend tests
python -m pytest tests/ -v

# Frontend tests
cd frontend && npm test

# Integration tests
python run_api_tests.py
```

We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
See our Security Policy for information about reporting vulnerabilities.
- Built with FastAPI and React
- LLM integration powered by vLLM
- UI components from Tailwind CSS
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: MCP Server Docs
StateSet Data Studio - Transform documents into high-quality synthetic datasets
Features • Quick Start • Contributing • License