feat(issue-5): Implement Training Dataset backend with vector embeddings #73
base: master
Conversation
- Add TrainingSample types with full TypeScript interfaces (ITrainingSample, IAnswerTemplate, etc.)
- Add TrainingSample Mongoose schema with vector embedding support and optimized indices
- Add Zod validation schemas for request validation (createTrainingSampleSchema, vectorSearchSchema)
- Add Google Generative AI integration for vector embeddings with batch processing
- Add TrainingSample service layer with CRUD operations and semantic vector search
- Add TrainingSample controller with 6 HTTP endpoints (POST, GET, PUT, DELETE, /search, /stats)
- Add TrainingSample routes with comprehensive API endpoint definitions
- Add TrainingSample utilities for file parsing (CSV, JSON, Markdown, Text)
- Add TrainingDataset model for managing uploaded training files
- Add TrainingDataset service for dataset processing and sample creation
- Add TrainingDataset controller with file upload and management endpoints
- Add TrainingDataset routes for file upload and dataset management

All code follows production standards:

- Full TypeScript strict typing
- Real Google Generative AI integration (not mock)
- Proper Mongoose indexing for performance (compound indices)
- Comprehensive Zod validation
- Error handling and logging
- Support for multiple file formats (CSV, JSON, MD, TXT)
- Vector semantic search with cosine similarity
- Batch embedding generation for large datasets

Fixes NexGenStudioDev#5
Summary of Changes

Hello @morningstarxcdcode, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a robust backend system for managing AI training data, encompassing both individual training samples and bulk datasets. It establishes data models with vector embedding capabilities, integrates with Google Generative AI for semantic processing, and provides a full suite of API endpoints for CRUD operations, file uploads, and advanced vector-based search, ensuring data validation and performance.
Code Review
This pull request introduces a significant new feature for handling training datasets, including file uploads, processing, and vector-based semantic search. The code is well-structured and covers many aspects of the feature. However, I've identified several critical and high-severity issues related to performance, scalability, and correctness that should be addressed before merging. Key concerns include a naive in-memory vector search implementation, incorrect batch processing logic, a critical data loss bug in the delete functionality, and inefficient file parsing. Addressing these points will be crucial for creating a robust and production-ready system.
```typescript
const trainingSample = await TrainingSample.create({
  userId: new Types.ObjectId(userId),
  ...sample,
  embedding,
})
```
When creating a TrainingSample from a dataset, the datasetId is not being associated with the sample. This breaks the relationship between the dataset and its samples and will cause critical issues, such as being unable to delete samples associated with a specific dataset. You need to pass the datasetId when creating the TrainingSample.
Suggested change:

```typescript
const trainingSample = await TrainingSample.create({
  userId: new Types.ObjectId(userId),
  ...sample,
  embedding,
  datasetId: dataset._id,
})
```
```typescript
await TrainingSample.deleteMany({
  sourceType: 'dataset',
  userId: new Types.ObjectId(userId),
})
```
There is a critical bug in the logic for deleting associated samples. The current implementation deletes all training samples for the user that have sourceType: 'dataset', not just the ones belonging to the specific dataset being deleted. This will lead to unintended data loss. The deletion should be scoped to the datasetId of the dataset being deleted.
Note: This fix depends on associating the datasetId with each TrainingSample during creation.
Suggested change:

```typescript
await TrainingSample.deleteMany({
  datasetId: dataset._id,
  userId: new Types.ObjectId(userId),
})
```
```typescript
// Get all matching samples (note: for large datasets, consider MongoDB Atlas Vector Search)
const samples = await TrainingSample.find(filterQuery).exec()

// Calculate similarity scores
const scoredSamples = samples
  .map(sample => ({
    sample,
    score: EmbeddingUtils.cosineSimilarity(queryEmbedding, sample.embedding),
  }))
  .sort((a, b) => b.score - a.score)
  .slice(0, searchRequest.topK || 5)
```
The current implementation of vector search is not scalable. It fetches all documents matching the filters from the database into memory and then calculates the cosine similarity for each one. For a large number of samples, this will lead to high memory consumption and slow performance, eventually crashing the service.
A proper vector search should be performed at the database level using a vector index. As noted in the comments in TrainingSample.model.ts, MongoDB Atlas Vector Search is a suitable solution for this. You should refactor this to use an aggregation pipeline with the $vectorSearch stage to perform the search efficiently in the database.
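As a rough illustration only (not the PR's implementation), a database-level search with Atlas Vector Search could look like the sketch below. It assumes a vector index has been created on the embedding field (the index name embedding_index is hypothetical) and that any filter fields such as userId, type, and tags are declared as filter fields in that index.

```typescript
// Sketch: let MongoDB Atlas rank by similarity instead of scoring in memory.
// Index name and numCandidates are illustrative assumptions.
const topK = searchRequest.topK || 5

const scoredSamples = await TrainingSample.aggregate([
  {
    $vectorSearch: {
      index: 'embedding_index',      // assumed Atlas vector index name
      path: 'embedding',             // field holding the stored embedding vectors
      queryVector: queryEmbedding,   // embedding of the search query
      numCandidates: topK * 20,      // recall vs. latency trade-off
      limit: topK,
      filter: filterQuery,           // userId/type/tags filters (must be indexed as filter fields)
    },
  },
  {
    // Expose the similarity score alongside each document
    $addFields: { score: { $meta: 'vectorSearchScore' } },
  },
])
```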
```typescript
for (const sample of validSamples) {
  try {
    const embeddingText = `${sample.question} ${sample.answerTemplate.answer}`
    const { embedding } = await EmbeddingUtils.generateEmbedding(embeddingText)

    const trainingSample = await TrainingSample.create({
      userId: new Types.ObjectId(userId),
      ...sample,
      embedding,
    })

    createdSamples.push(trainingSample)
  } catch (error) {
    console.error('Error creating sample:', sample.question, error)
    // Continue with next sample
  }
}
```
The process of generating embeddings and creating training samples is done sequentially in a for...of loop. For large datasets, this will be very slow as each operation waits for the previous one to complete. The PR description mentions "Batch embedding generation for large datasets", but this implementation doesn't leverage batching.
To improve performance, you should batch these operations:
- Use EmbeddingUtils.generateEmbeddingBatch (after fixing it to be a true batch call) to generate embeddings for multiple samples at once.
- Use TrainingSample.insertMany() to insert multiple samples into the database in a single operation.
This will significantly reduce the processing time for large files.
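A rough sketch of what that batched flow could look like, assuming generateEmbeddingBatch has been reworked into a true batch call that returns embeddings in input order; the chunk size of 50 is an arbitrary illustration, not a recommendation from the PR.

```typescript
const BATCH_SIZE = 50 // illustrative chunk size; tune to API and memory limits

for (let i = 0; i < validSamples.length; i += BATCH_SIZE) {
  const chunk = validSamples.slice(i, i + BATCH_SIZE)

  // One embedding API round-trip per chunk instead of one per sample
  const texts = chunk.map(s => `${s.question} ${s.answerTemplate.answer}`)
  const embeddings = await EmbeddingUtils.generateEmbeddingBatch(texts)

  // One database write per chunk instead of one create() per sample
  const docs = chunk.map((sample, idx) => ({
    userId: new Types.ObjectId(userId),
    ...sample,
    embedding: embeddings[idx].embedding,
    datasetId: dataset._id, // see the earlier comment about associating the dataset
  }))
  const inserted = await TrainingSample.insertMany(docs, { ordered: false })
  createdSamples.push(...inserted)
}
```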
```typescript
async generateEmbeddingBatch(texts: string[]): Promise<IEmbeddingResponse[]> {
  const results = await Promise.all(texts.map(text => this.generateEmbedding(text)))
  return results
}
```
This generateEmbeddingBatch method is not a true batch operation. It iterates over the texts and sends a separate API request for each one using generateEmbedding. This is inefficient, will be slow for many texts, and may quickly hit API rate limits. The Google Generative AI SDK provides a true batching method (batchEmbedContents) for this purpose. You should use it to send multiple texts in a single API call.
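A minimal sketch of a true batch call with the @google/generative-ai SDK is shown below. The model name and the simplified return type (raw vectors rather than the PR's IEmbeddingResponse shape) are assumptions based on the SDK's documented batchEmbedContents method and may need adjusting to the project's embedding configuration.

```typescript
import { GoogleGenerativeAI } from '@google/generative-ai'

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY ?? '')
// Model name is an assumption; use whichever embedding model the project already configures.
const embeddingModel = genAI.getGenerativeModel({ model: 'text-embedding-004' })

async function generateEmbeddingBatch(texts: string[]): Promise<number[][]> {
  // Single API call carrying all texts, instead of one request per text
  const { embeddings } = await embeddingModel.batchEmbedContents({
    requests: texts.map(text => ({
      content: { role: 'user', parts: [{ text }] },
    })),
  })
  // Embeddings come back in the same order as the requests
  return embeddings.map(e => e.values)
}
```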
```typescript
export interface ITrainingDataset extends Document {
  userId: Types.ObjectId
  fileName: string
  filePath: string
  fileMimeType: string
  fileSizeInBytes: number
  totalSamples: number
  processedSamples: number
  status: 'pending' | 'processing' | 'completed' | 'failed'
  errorMessage?: string
  createdAt: Date
  updatedAt: Date
}
```
This ITrainingDataset interface is a duplicate and is inconsistent with the primary interface defined in TrainingDataset.model.ts. For example, it's missing the name, description, and isActive fields, and it differs in several others. Maintaining two different interfaces for the same entity is a source of bugs and confusion.
You should remove this interface and instead export and use the ITrainingDataset interface from LocalMind-Backend/src/api/v1/TrainingSample/TrainingDataset.model.ts as the single source of truth across the application.
```typescript
}

try {
  const [question, answer, type, tags, language] = line.split(',').map(s => s.trim())
```
The CSV parsing logic uses line.split(','). This is a naive approach that will fail if any of the fields (like question or answer) contain a comma, which is a common scenario. This will lead to incorrect parsing and data corruption. You should use a robust, stream-based CSV parsing library like csv-parse to handle CSV files correctly, including quoted fields and other edge cases.
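For illustration, a streaming parser built on csv-parse could look roughly like the sketch below. It assumes the CSV has a header row; the CsvRow shape is hypothetical and would need to be mapped onto the PR's ParsedSample type in TrainingSample.utils.ts.

```typescript
import fs from 'fs'
import { parse } from 'csv-parse'

// Row shape for this sketch only; the real mapping lives in TrainingSample.utils.ts.
interface CsvRow {
  question: string
  answer: string
  type: string
  tags: string
  language: string
}

// Streams the CSV and correctly handles quoted fields, embedded commas, and empty lines.
function parseCSVFile(filePath: string): Promise<CsvRow[]> {
  return new Promise((resolve, reject) => {
    const rows: CsvRow[] = []
    fs.createReadStream(filePath)
      .pipe(parse({ columns: true, trim: true, skip_empty_lines: true }))
      .on('data', (row: CsvRow) => rows.push(row))
      .on('end', () => resolve(rows))
      .on('error', reject)
  })
}
```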
```typescript
 * Expected format: Array of objects with question, answer, type, tags, language
 */
static async parseJSON(filePath: string): Promise<ParsedSample[]> {
  const content = fs.readFileSync(filePath, 'utf-8')
```
The parseJSON method uses fs.readFileSync to read the entire file into memory before parsing. The file size limit is 100MB, so this approach can lead to high memory usage and block the Node.js event loop, impacting the server's performance and ability to handle other requests. For large files, you should use a streaming JSON parser (e.g., JSONStream or clarinet) to process the file without loading it all into memory at once. The same issue applies to parseMarkdown and parseText in this file.
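As one possible shape, a streaming variant of parseJSON using the JSONStream package mentioned above could look like this sketch; it assumes the file is a top-level JSON array of samples, and the ParsedSample element type comes from the PR's utilities.

```typescript
import fs from 'fs'
// JSONStream is one of the streaming parsers suggested above; depending on the TS setup,
// a require() call or @types package may be needed.
import JSONStream from 'JSONStream'

// Parses a top-level JSON array element by element without loading the whole file into memory.
function parseJSONStream(filePath: string): Promise<ParsedSample[]> {
  return new Promise((resolve, reject) => {
    const samples: ParsedSample[] = []
    fs.createReadStream(filePath, { encoding: 'utf-8' })
      .pipe(JSONStream.parse('*')) // '*' emits each element of the root array
      .on('data', (item: ParsedSample) => samples.push(item))
      .on('end', () => resolve(samples))
      .on('error', reject)
  })
}
```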
```typescript
const { skip = 0, limit = 20 } = req.query

const result = await TrainingDatasetService.getDatasets(
  userId,
  parseInt(skip as string) || 0,
  parseInt(limit as string) || 20
)

res.status(200).json({
  success: true,
  data: result.datasets,
  pagination: {
    skip: parseInt(skip as string) || 0,
    limit: parseInt(limit as string) || 20,
    total: result.total,
  },
})
```
The pagination parameters skip and limit are parsed from the query string multiple times. This is inefficient and makes the code verbose. It's better to parse them once at the beginning of the method, store them in variables, and then reuse those variables. Also, parseInt should always be used with a radix (e.g., 10 for decimal) to avoid unexpected behavior.
```typescript
const skip = parseInt(req.query.skip as string, 10) || 0
const limit = parseInt(req.query.limit as string, 10) || 20

const result = await TrainingDatasetService.getDatasets(userId, skip, limit)

res.status(200).json({
  success: true,
  data: result.datasets,
  pagination: {
    skip,
    limit,
    total: result.total,
  },
})
```
Pull request overview
This PR implements a comprehensive Training Dataset backend system with vector embeddings for semantic search. It adds full CRUD operations for training samples, file upload/parsing capabilities for multiple formats (CSV, JSON, Markdown, Text), and Google Generative AI integration for generating vector embeddings.
Key Changes:
- Added TypeScript models, schemas, and validators for TrainingSample and TrainingDataset entities with vector embedding support
- Integrated Google Generative AI API for embedding generation with batch processing capabilities
- Implemented semantic vector search using cosine similarity for intelligent sample retrieval
- Added file parsing utilities supporting CSV, JSON, Markdown, and Text formats for bulk dataset imports
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 22 comments.
Summary per file:
| File | Description |
|---|---|
| TrainingSample.types.ts | Defines TypeScript interfaces for training samples, datasets, embeddings, and search requests |
| TrainingSample.validator.ts | Provides Zod schemas for validating training sample creation, updates, and vector search requests |
| TrainingSample.model.ts | Mongoose schema for training samples with embedded vectors, compound indices, and relationships |
| TrainingSample.embedding.ts | Google Generative AI integration for embedding generation with cosine similarity utility |
| TrainingSample.utils.ts | File parsing utilities for CSV, JSON, Markdown, and Text formats with validation |
| TrainingSample.service.ts | Business logic for CRUD operations, vector search, and statistics aggregation |
| TrainingSample.controller.ts | HTTP endpoint handlers for training sample operations with validation and error handling |
| TrainingSample.routes.ts | Express route definitions for training sample API endpoints |
| TrainingDataset.model.ts | Mongoose schema for managing uploaded training dataset files and processing status |
| TrainingDataset.service.ts | Service layer for dataset processing, parsing, and sample creation from files |
| TrainingDataset.controller.ts | Controller for dataset file uploads using multer with format validation |
| TrainingDataset.routes.ts | Express routes for dataset upload and management endpoints |
```typescript
 * Simple format: pairs of questions and answers separated by newlines
 */
static async parseText(filePath: string): Promise<ParsedSample[]> {
  const content = fs.readFileSync(filePath, 'utf-8')
```
Copilot AI (Jan 4, 2026):
Synchronous file operation (readFileSync) blocks the event loop. Replace with the asynchronous version (fs.promises.readFile or async/await with fs.readFile) to maintain non-blocking I/O and better performance in a Node.js application.
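For reference, the non-blocking version is a one-line change; the helper name below is illustrative and the parsing logic that follows stays the same.

```typescript
import { promises as fs } from 'fs'

// Reads the uploaded file without blocking the event loop.
async function readUploadedFile(filePath: string): Promise<string> {
  return fs.readFile(filePath, 'utf-8')
}
```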
```typescript
    createdSamples.push(trainingSample)
  } catch (error) {
    console.error('Error creating sample:', sample.question, error)
    // Continue with next sample
  }
}
```
Copilot AI (Jan 4, 2026):
Error handling logs but doesn't propagate all errors. Errors during sample creation are caught, logged to console, and processing continues, but these errors aren't tracked or reported in the final response. Users won't know if some samples failed to process. Consider collecting failed samples and including them in the errorMessage or a separate field to provide transparency about partial failures.
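One possible shape for this, reusing the loop shown earlier and the errorMessage field already on the dataset model; the exact reporting format is an assumption, not part of the PR.

```typescript
// Collect per-sample failures instead of only logging them to the console.
const failedSamples: { question: string; error: string }[] = []

for (const sample of validSamples) {
  try {
    const embeddingText = `${sample.question} ${sample.answerTemplate.answer}`
    const { embedding } = await EmbeddingUtils.generateEmbedding(embeddingText)
    const trainingSample = await TrainingSample.create({
      userId: new Types.ObjectId(userId),
      ...sample,
      embedding,
    })
    createdSamples.push(trainingSample)
  } catch (error) {
    failedSamples.push({
      question: sample.question,
      error: error instanceof Error ? error.message : String(error),
    })
  }
}

// Surface partial failures to the caller via the dataset record.
if (failedSamples.length > 0) {
  dataset.errorMessage = `${failedSamples.length} sample(s) failed during processing`
  await dataset.save()
}
```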
```typescript
  isActive = true,
  language,
  skip = 0,
  limit = 20,
} = req.query

const filters = {
  type: type ? (Array.isArray(type) ? type : [type]) : undefined,
  tags: tags ? (Array.isArray(tags) ? tags : [tags]) : undefined,
  sourceType: sourceType as 'manual' | 'dataset' | undefined,
  isActive: isActive === 'true' ? true : isActive === 'false' ? false : undefined,
  language: language as string | undefined,
```
Copilot AI (Jan 4, 2026):
The filters object construction has inconsistent handling. The 'type' and 'tags' arrays are set to 'undefined' when empty, but this doesn't filter them out of the filters object - they're still passed to the service with 'undefined' values. This could cause unexpected behavior in the service layer. Consider using object spread with conditional properties or explicitly checking and omitting undefined values.
Suggested change:

```typescript
  isActive,
  language,
  skip = 0,
  limit = 20,
} = req.query

const typeArray = type ? (Array.isArray(type) ? type : [type]) : undefined
const tagsArray = tags ? (Array.isArray(tags) ? tags : [tags]) : undefined
const isActiveBool =
  isActive === 'true' ? true : isActive === 'false' ? false : undefined

const filters = {
  ...(typeArray && { type: typeArray }),
  ...(tagsArray && { tags: tagsArray }),
  ...(sourceType && { sourceType: sourceType as 'manual' | 'dataset' }),
  ...(isActiveBool !== undefined && { isActive: isActiveBool }),
  ...(language && { language: language as string }),
```
(An additional review comment on LocalMind-Backend/src/api/v1/TrainingSample/TrainingDataset.controller.ts was marked as resolved.)
@morningstarxcdcode bro, can you make these changes?