Skip to content

stateset/stateset-data-studio

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

2 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

StateSet Data Studio

License: MIT Python 3.10+ Node.js 16+ Docker CI

A comprehensive web-based platform for generating high-quality synthetic datasets to fine-tune Large Language Models (LLMs). Built with FastAPI backend and React frontend, featuring AI agent integration via MCP server.

✨ Features

  • 🎯 Multi-Format Support: Process PDF, DOCX, TXT, HTML, and YouTube content
  • πŸ€– Multiple LLM Providers: Support for Llama, OpenAI, and other providers
  • πŸ“Š Quality Curation: AI-powered quality assessment and filtering
  • πŸ”„ Export Formats: JSONL, Alpaca, ChatML, OpenAI fine-tuning formats
  • πŸ€– AI Agent Integration: MCP server for programmatic access
  • 🐳 Docker Ready: Containerized deployment with Docker Compose
  • πŸ“ˆ Real-time Monitoring: Track job progress and system status
  • 🎨 Modern UI: Clean, responsive React interface

πŸš€ Quick Start

The easiest way to get started is using the provided run script:

# Clone the repository
git clone https://github.com/stateset/stateset-data-studio.git
cd synthetic-data-studio

# Make sure you have installed all dependencies first
python run.py

This will start:

πŸ“‹ Prerequisites

  • Python 3.10+
  • Node.js 16+
  • Docker (optional, for containerized deployment)
  • Hugging Face account with API access
  • LLM API access (Llama, OpenAI, etc.)

πŸ› οΈ Installation

Option 1: Quick Setup (Recommended)

# Clone and setup
git clone https://github.com/stateset/stateset-data-studio.git
cd synthetic-data-studio

# Copy environment configuration
cp .env.example .env

# Edit .env with your API keys
nano .env

# Run the application
python run.py

Option 2: Manual Setup

Backend Setup

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r backend/requirements.txt

Frontend Setup

cd frontend
npm install
npm start

πŸ€– AI Agent Integration

StateSet Data Studio includes an MCP (Model Control Protocol) server for AI agent integration:

# Start MCP server
python run_mcp_server.py

Example Agent Usage

# See examples/agent_example.py for complete integration
from mcp_client import MCPClient

client = MCPClient("http://localhost:8000")
# Create project, upload documents, generate data

πŸ“– Usage Guide

1. System Configuration

  1. Access the web interface at http://localhost:3000
  2. Configure your LLM provider settings
  3. Verify vLLM server connection

2. Create a Project

  1. Click "New Project"
  2. Enter project details
  3. Start data generation workflow

3. Data Generation Workflow

  1. Ingest: Upload documents or provide URLs
  2. Generate: Create QA pairs or Chain of Thought examples
  3. Curate: Filter content based on quality thresholds
  4. Export: Save in your preferred format

🐳 Docker Deployment

# Build and run with Docker Compose
docker-compose up -d

# Or build manually
docker build -t synthetic-data-studio .
docker run -p 8000:8000 -p 3000:3000 synthetic-data-studio

πŸ”§ Configuration

Environment Variables

# Required
HF_TOKEN=your_hugging_face_token
LLAMA_API_KEY=your_llama_api_key
OPENAI_API_KEY=your_openai_api_key

# Optional
HOST=0.0.0.0
PORT=8000
DATABASE_URL=sqlite:///./synthetic_data.db

Supported File Formats

  • Documents: PDF, DOCX, TXT, HTML
  • Media: YouTube URLs (video transcripts)
  • Archives: ZIP files containing documents

πŸ“Š API Documentation

Once running, access the API documentation at:

πŸ§ͺ Testing

# Backend tests
python -m pytest tests/ -v

# Frontend tests
cd frontend && npm test

# Integration tests
python run_api_tests.py

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ”’ Security

See our Security Policy for information about reporting vulnerabilities.

πŸ™ Acknowledgments

πŸ“ž Support


StateSet Data Studio - Transform documents into high-quality synthetic datasets

Features β€’ Quick Start β€’ Contributing β€’ License

About

StateSet Data Studio

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Sponsor this project

Packages

 
 
 

Contributors