SysGen is a CLI tool that creates synthetic question-answer datasets from documents using the Gemini API. It splits documents into overlapping chunks, generates questions from each chunk, and produces detailed answers for use as machine learning training data.
- Smart Document Chunking: Automatically splits large documents into manageable chunks with overlap
- Comprehensive Question Generation: Extracts ALL possible questions from content using advanced AI prompting
- High-Quality Answer Generation: Creates detailed 4-5 sentence answers with supporting evidence
- Multiple Output Formats: Supports Alpaca, ChatML, and Conversation formats
- Semantic Duplicate Detection: Automatically removes duplicate questions using sentence embeddings
- Token-Aware Processing: Uses tiktoken for accurate token counting and chunking
- Batch Processing: Process multiple markdown/text files in a single run
- Quality Validation: Ensures answer length and content quality standards
pip install sysgen

Before running sysgen, set the API key in your terminal:
# Windows
set GEMINI_API_KEY=your_gemini_api_key_here
# Linux/Mac
export GEMINI_API_KEY=your_gemini_api_key_here

sysgen --input-folder md --output dataset.json --format alpaca

sysgen --input-folder documents --output training_data.json --format chatml --similarity-threshold 0.85

- --input-folder: Folder containing markdown/text files (default: md)
- --output: Output JSON file (default: output.json)
- --format: Output format: alpaca, chatml, or conversation (default: alpaca)
- --similarity-threshold: Similarity threshold for duplicate detection, 0.0-1.0 (default: 0.85)
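For scripted or repeated runs, the same invocation can be assembled from Python. The sketch below uses only the flags and defaults listed above. The helper names `build_sysgen_command` and `run_sysgen` are hypothetical and not part of SysGen, and the sketch assumes the `sysgen` executable is on your PATH.

```python
import os
import subprocess
from typing import List


def build_sysgen_command(
    input_folder: str = "md",
    output: str = "output.json",
    fmt: str = "alpaca",
    similarity_threshold: float = 0.85,
) -> List[str]:
    """Assemble the argument list for one sysgen run (hypothetical helper)."""
    if fmt not in ("alpaca", "chatml", "conversation"):
        raise ValueError(f"unsupported format: {fmt}")
    if not 0.0 <= similarity_threshold <= 1.0:
        raise ValueError("similarity threshold must be between 0.0 and 1.0")
    return [
        "sysgen",
        "--input-folder", input_folder,
        "--output", output,
        "--format", fmt,
        "--similarity-threshold", str(similarity_threshold),
    ]


def run_sysgen(**options) -> None:
    """Run sysgen, failing early if the API key is not set."""
    if not os.environ.get("GEMINI_API_KEY"):
        raise RuntimeError("GEMINI_API_KEY is not set")
    # check=True raises CalledProcessError if sysgen exits with an error
    subprocess.run(build_sysgen_command(**options), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects
    print(" ".join(build_sysgen_command("documents", "training_data.json", "chatml")))
```

Passing the arguments as a list avoids shell quoting problems with folder names that contain spaces.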
Alpaca format:
{
  "instruction": "What is the main concept discussed in this section?",
  "input": "",
  "output": "The main concept discussed is the implementation of neural networks...",
  "source_document": "document.md"
}

ChatML format:
{
  "messages": [
    {"role": "user", "content": "What is the main concept discussed in this section?"},
    {"role": "assistant", "content": "The main concept discussed is the implementation of neural networks..."}
  ],
  "source_document": "document.md"
}

Conversation format:
{
  "conversations": [
    {"from": "human", "value": "What is the main concept discussed in this section?"},
    {"from": "gpt", "value": "The main concept discussed is the implementation of neural networks..."}
  ],
  "source_document": "document.md"
}

- Document Chunking: Splits documents into 3000-token chunks with 200-token overlap
- Question Extraction: Uses advanced AI prompting to extract ALL possible questions from each chunk
- Answer Generation: Creates comprehensive 4-5 sentence answers with supporting evidence
- Quality Filtering: Validates answer length (3-6 sentences) and content quality
- Duplicate Detection: Uses sentence embeddings to identify semantically similar questions
- Format Conversion: Converts to specified output format (Alpaca/ChatML/Conversation)
- Batch Processing: Processes multiple files and combines results
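The format conversion step can be illustrated with a short sketch. It maps one question/answer pair onto the three record shapes shown in the output examples earlier in this document. The function names and the `FORMATTERS` table are hypothetical; this is an illustration of the record layouts, not SysGen's actual code.

```python
from typing import Callable, Dict


def to_alpaca(question: str, answer: str, source: str) -> dict:
    """Alpaca record: instruction / input / output."""
    return {
        "instruction": question,
        "input": "",
        "output": answer,
        "source_document": source,
    }


def to_chatml(question: str, answer: str, source: str) -> dict:
    """ChatML-style record: a list of role/content messages."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
        "source_document": source,
    }


def to_conversation(question: str, answer: str, source: str) -> dict:
    """Conversation record: a list of from/value turns."""
    return {
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": answer},
        ],
        "source_document": source,
    }


# Hypothetical lookup table keyed by the values accepted by --format
FORMATTERS: Dict[str, Callable[[str, str, str], dict]] = {
    "alpaca": to_alpaca,
    "chatml": to_chatml,
    "conversation": to_conversation,
}


def convert(question: str, answer: str, source: str, fmt: str = "alpaca") -> dict:
    """Convert one Q/A pair into the requested output format."""
    if fmt not in FORMATTERS:
        raise ValueError(f"unsupported format: {fmt}")
    return FORMATTERS[fmt](question, answer, source)
```

Keeping one small function per format makes it easy to check each record against the examples and to add another format later.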
- Token-Aware: Uses tiktoken for accurate token counting
- Sentence Preservation: Keeps sentences intact during chunking
- Overlap Management: Maintains context between chunks with configurable overlap
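A minimal sketch of sentence-preserving chunking with overlap might look like the following. It is not SysGen's implementation: the regex sentence splitter, the function names, and the `cl100k_base` encoding are all assumptions made for illustration. The token counter is passed in as a parameter so the chunking logic can be tried without tiktoken installed.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tiktoken_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Token counter backed by tiktoken; the encoding name is an assumption."""
    import tiktoken  # imported lazily so the rest of the sketch runs without it

    encoding = tiktoken.get_encoding(encoding_name)
    return lambda text: len(encoding.encode(text))


def chunk_text(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 3000,
    overlap_tokens: int = 200,
) -> List[str]:
    """Group whole sentences into chunks, carrying trailing sentences forward."""
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            # Rebuild the overlap from the end of the finished chunk,
            # taking whole sentences until the overlap budget is used up.
            overlap: List[str] = []
            overlap_count = 0
            for prev in reversed(current):
                c = count_tokens(prev)
                if overlap_count + c > overlap_tokens:
                    break
                overlap.insert(0, prev)
                overlap_count += c
            current, current_tokens = overlap, overlap_count
        # A sentence longer than max_tokens is kept whole rather than split
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Token totals here are sums of per-sentence counts, which can differ slightly from the count of the joined chunk. With tiktoken installed, the call would be `chunk_text(text, tiktoken_counter())`.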
- Multi-Level Questions: Generates factual, conceptual, analytical, and application questions
- Exhaustive Extraction: Extracts ALL possible questions from content
- Quality Standards: Ensures questions are clear, specific, and answerable
- Embedding-Based: Uses sentence-transformers for semantic similarity
- Configurable Threshold: Adjust sensitivity with similarity_threshold parameter
- Quality Preservation: Keeps highest quality version from duplicate groups
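The duplicate detection idea can be sketched as a greedy filter over question embeddings. This is an illustration under stated assumptions, not SysGen's code: the function names, the `all-MiniLM-L6-v2` model, the optional `scores` argument used as a stand-in for answer quality, and treating similarity at or above the threshold as a duplicate are all hypothetical choices. Cosine similarity is written in plain Python so the core logic runs without extra packages.

```python
import math
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def embed(questions: List[str]) -> List[List[float]]:
    """Embed questions with sentence-transformers (model name is an assumption)."""
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(questions).tolist()


def deduplicate(
    questions: List[str],
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.85,
    scores: Optional[Sequence[float]] = None,
) -> List[str]:
    """Drop questions whose embedding is too similar to one already kept."""
    order = list(range(len(questions)))
    if scores is not None:
        # Visit higher-scored items first so they win within a duplicate group
        order.sort(key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(cosine_similarity(embeddings[i], embeddings[j]) < threshold for j in kept):
            kept.append(i)
    # Return survivors in their original order
    return [questions[i] for i in sorted(kept)]
```

With sentence-transformers installed, the call would be `deduplicate(questions, embed(questions), threshold=0.85)`. Raising the threshold keeps more near-duplicates, and lowering it removes more.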
- google-genai: Gemini API client for question and answer generation
- sentence-transformers: Semantic similarity detection for duplicate removal
- scikit-learn: Cosine similarity calculations
- tiktoken: Token counting for document chunking
- torch: PyTorch backend for sentence transformers
- transformers: Hugging Face transformers library
- numpy: Numerical operations
- scipy: Scientific computing utilities
We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.
- Fork the Repository: Start by forking the project on GitHub
- Clone the Repository: Clone it to your local machine
- Create a Branch: Create a new branch for your changes
- Make Changes: Implement your improvements or bug fixes
- Test Your Changes: Ensure the tool works correctly with your modifications
- Submit a Pull Request: Open a PR describing your changes
This project is licensed under the MIT License. See LICENSE for details.
- Author: Adhishtanaka
- Email: kulasoooriyaa@gmail.com