
Conversation

@morningstarxcdcode

  • Add TrainingSample types with full TypeScript interfaces (ITrainingSample, IAnswerTemplate, etc.)
  • Add TrainingSample Mongoose schema with vector embedding support and optimized indices
  • Add Zod validation schemas for request validation (createTrainingSampleSchema, vectorSearchSchema)
  • Add Google Generative AI integration for vector embeddings with batch processing
  • Add TrainingSample service layer with CRUD operations and semantic vector search
  • Add TrainingSample controller with 6 HTTP endpoints (POST, GET, PUT, DELETE, /search, /stats)
  • Add TrainingSample routes with comprehensive API endpoint definitions
  • Add TrainingSample utilities for file parsing (CSV, JSON, Markdown, Text)
  • Add TrainingDataset model for managing uploaded training files
  • Add TrainingDataset service for dataset processing and sample creation
  • Add TrainingDataset controller with file upload and management endpoints
  • Add TrainingDataset routes for file upload and dataset management

All code follows production standards:

  • Full TypeScript strict typing
  • Real Google Generative AI integration (not mock)
  • Proper Mongoose indexing for performance (compound indices)
  • Comprehensive Zod validation
  • Error handling and logging
  • Support for multiple file formats (CSV, JSON, MD, TXT)
  • Vector semantic search with cosine similarity (see the sketch below)
  • Batch embedding generation for large datasets
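
For reference, a minimal sketch of the cosine-similarity comparison used for semantic search (a standalone illustration only; the utility in TrainingSample.embedding.ts may differ in naming and details):

    // Cosine similarity between two embedding vectors:
    // 1 means same direction, 0 means orthogonal, -1 means opposite.
    function cosineSimilarity(a: number[], b: number[]): number {
      if (a.length !== b.length) {
        throw new Error('Embedding vectors must have the same dimension')
      }
      let dot = 0
      let normA = 0
      let normB = 0
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
      }
      const denominator = Math.sqrt(normA) * Math.sqrt(normB)
      return denominator === 0 ? 0 : dot / denominator
    }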

Fixes #5

Copilot AI review requested due to automatic review settings (January 4, 2026, 20:38)
@gemini-code-assist

Summary of Changes

Hello @morningstarxcdcode, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust backend system for managing AI training data, encompassing both individual training samples and bulk datasets. It establishes data models with vector embedding capabilities, integrates with Google Generative AI for semantic processing, and provides a full suite of API endpoints for CRUD operations, file uploads, and advanced vector-based search, ensuring data validation and performance.

Highlights

  • New TrainingSample Types and Interfaces: Added comprehensive TypeScript interfaces for TrainingSample, AnswerTemplate, and related structures to ensure strong typing across the module.
  • TrainingSample Mongoose Schema with Vector Embeddings: Implemented a Mongoose schema for TrainingSample, including support for vector embeddings and optimized indices for efficient data retrieval and semantic search.
  • Zod Validation Schemas: Integrated Zod for robust request validation, defining schemas for creating, updating, and performing vector searches on training samples.
  • Google Generative AI Integration: Established integration with Google Generative AI for generating vector embeddings, including support for batch processing of embedding generation.
  • TrainingSample Service Layer: Developed a service layer for TrainingSample, providing core CRUD operations and advanced semantic vector search capabilities.
  • TrainingSample Controller and API Endpoints: Created a controller with 6 dedicated HTTP endpoints for managing training samples, covering creation, retrieval, updates, deletion, semantic search, and statistics.
  • TrainingSample Routes: Defined comprehensive API endpoint routes for all TrainingSample operations, ensuring clear and structured access to the functionality.
  • TrainingSample File Parsing Utilities: Added utilities for parsing training data from various file formats, including CSV, JSON, Markdown, and plain Text, facilitating bulk data ingestion.
  • TrainingDataset Model: Introduced a new Mongoose model, TrainingDataset, for managing and tracking uploaded training files and their processing status.
  • TrainingDataset Service for Processing: Implemented a service for TrainingDataset to handle the processing of uploaded files, including parsing, validating, and creating individual training samples from the dataset.
  • TrainingDataset Controller and Endpoints: Developed a controller for TrainingDataset with endpoints for file upload and comprehensive dataset management functionalities.
  • TrainingDataset Routes: Established API routes for TrainingDataset, enabling file uploads and management of training datasets.



@gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature for handling training datasets, including file uploads, processing, and vector-based semantic search. The code is well-structured and covers many aspects of the feature. However, I've identified several critical and high-severity issues related to performance, scalability, and correctness that should be addressed before merging. Key concerns include a naive in-memory vector search implementation, incorrect batch processing logic, a critical data loss bug in the delete functionality, and inefficient file parsing. Addressing these points will be crucial for creating a robust and production-ready system.

Comment on lines +67 to +71
    const trainingSample = await TrainingSample.create({
      userId: new Types.ObjectId(userId),
      ...sample,
      embedding,
    })

critical

When creating a TrainingSample from a dataset, the datasetId is not being associated with the sample. This breaks the relationship between the dataset and its samples and will cause critical issues, such as being unable to delete samples associated with a specific dataset. You need to pass the datasetId when creating the TrainingSample.

Suggested change

Before:

    const trainingSample = await TrainingSample.create({
      userId: new Types.ObjectId(userId),
      ...sample,
      embedding,
    })

After:

    const trainingSample = await TrainingSample.create({
      userId: new Types.ObjectId(userId),
      ...sample,
      embedding,
      datasetId: dataset._id,
    })

Comment on lines +157 to +160
    await TrainingSample.deleteMany({
      sourceType: 'dataset',
      userId: new Types.ObjectId(userId),
    })

critical

There is a critical bug in the logic for deleting associated samples. The current implementation deletes all training samples for the user that have sourceType: 'dataset', not just the ones belonging to the specific dataset being deleted. This will lead to unintended data loss. The deletion should be scoped to the datasetId of the dataset being deleted.

Note: This fix depends on associating the datasetId with each TrainingSample during creation.

Suggested change

Before:

    await TrainingSample.deleteMany({
      sourceType: 'dataset',
      userId: new Types.ObjectId(userId),
    })

After:

    await TrainingSample.deleteMany({
      datasetId: dataset._id,
      userId: new Types.ObjectId(userId),
    })

Comment on lines +161 to +171
    // Get all matching samples (note: for large datasets, consider MongoDB Atlas Vector Search)
    const samples = await TrainingSample.find(filterQuery).exec()

    // Calculate similarity scores
    const scoredSamples = samples
      .map(sample => ({
        sample,
        score: EmbeddingUtils.cosineSimilarity(queryEmbedding, sample.embedding),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, searchRequest.topK || 5)

critical

The current implementation of vector search is not scalable. It fetches all documents matching the filters from the database into memory and then calculates the cosine similarity for each one. For a large number of samples, this will lead to high memory consumption and slow performance, eventually crashing the service.
A proper vector search should be performed at the database level using a vector index. As noted in the comments in TrainingSample.model.ts, MongoDB Atlas Vector Search is a suitable solution for this. You should refactor this to use an aggregation pipeline with the $vectorSearch stage to perform the search efficiently in the database.
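
A minimal sketch of what that refactor could look like, reusing the variables from the snippet quoted above and assuming MongoDB Atlas with a vector index on the embedding field (the index name, numCandidates value, and projected fields are illustrative, not taken from this PR):

    const results = await TrainingSample.aggregate([
      {
        $vectorSearch: {
          index: 'training_sample_embedding_index', // hypothetical Atlas vector index name
          path: 'embedding',
          queryVector: queryEmbedding,
          numCandidates: 200,                       // ANN candidate pool; tune for recall vs. latency
          limit: searchRequest.topK || 5,
          // Pre-filter inside the index; userId must be declared as a filter field in the index definition
          filter: { userId: { $eq: new Types.ObjectId(userId) } },
        },
      },
      {
        $project: {
          question: 1,
          answerTemplate: 1,
          score: { $meta: 'vectorSearchScore' },    // similarity score computed by Atlas
        },
      },
    ])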

Comment on lines +62 to +78
    for (const sample of validSamples) {
      try {
        const embeddingText = `${sample.question} ${sample.answerTemplate.answer}`
        const { embedding } = await EmbeddingUtils.generateEmbedding(embeddingText)

        const trainingSample = await TrainingSample.create({
          userId: new Types.ObjectId(userId),
          ...sample,
          embedding,
        })

        createdSamples.push(trainingSample)
      } catch (error) {
        console.error('Error creating sample:', sample.question, error)
        // Continue with next sample
      }
    }

high

The process of generating embeddings and creating training samples is done sequentially in a for...of loop. For large datasets, this will be very slow as each operation waits for the previous one to complete. The PR description mentions "Batch embedding generation for large datasets", but this implementation doesn't leverage batching.

To improve performance, you should batch these operations:

  1. Use EmbeddingUtils.generateEmbeddingBatch (after fixing it to be a true batch call) to generate embeddings for multiple samples at once.
  2. Use TrainingSample.insertMany() to insert multiple samples into the database in a single operation.

This will significantly reduce the processing time for large files.
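
A rough sketch of that batched shape, reusing the names from the snippets quoted above (generateEmbeddingBatch is assumed to be reworked into a true batch call first, and the datasetId field follows the earlier comment about associating samples with their dataset):

    // 1. One batched embedding request for all valid samples
    const texts = validSamples.map(s => `${s.question} ${s.answerTemplate.answer}`)
    const embeddings = await EmbeddingUtils.generateEmbeddingBatch(texts)

    // 2. One insertMany round trip instead of N create() calls
    const docs = validSamples.map((sample, i) => ({
      userId: new Types.ObjectId(userId),
      ...sample,
      embedding: embeddings[i].embedding,
      datasetId: dataset._id,
    }))
    const createdSamples = await TrainingSample.insertMany(docs, { ordered: false })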

Comment on lines +37 to +39
    async generateEmbeddingBatch(texts: string[]): Promise<IEmbeddingResponse[]> {
      const results = await Promise.all(texts.map(text => this.generateEmbedding(text)))
      return results

high

This generateEmbeddingBatch method is not a true batch operation. It iterates over the texts and sends a separate API request for each one using generateEmbedding. This is inefficient, will be slow for many texts, and may quickly hit API rate limits. The Google Generative AI SDK provides a true batching method (batchEmbedContents) for this purpose. You should use it to send multiple texts in a single API call.
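
A minimal sketch of a true batch call, assuming the @google/generative-ai SDK and the text-embedding-004 model (the model name, env var, and exact request shape should be verified against the SDK version pinned in this repo):

    import { GoogleGenerativeAI } from '@google/generative-ai'

    const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY as string) // env var name is an assumption
    const embeddingModel = genAI.getGenerativeModel({ model: 'text-embedding-004' })

    async function generateEmbeddingBatch(texts: string[]): Promise<number[][]> {
      // Single API call covering all texts, instead of one request per text
      const { embeddings } = await embeddingModel.batchEmbedContents({
        requests: texts.map(text => ({
          model: 'models/text-embedding-004', // some SDK versions require the model on each request
          content: { role: 'user', parts: [{ text }] },
        })),
      })
      return embeddings.map(e => e.values)
    }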

Comment on lines +40 to +52
    export interface ITrainingDataset extends Document {
      userId: Types.ObjectId
      fileName: string
      filePath: string
      fileMimeType: string
      fileSizeInBytes: number
      totalSamples: number
      processedSamples: number
      status: 'pending' | 'processing' | 'completed' | 'failed'
      errorMessage?: string
      createdAt: Date
      updatedAt: Date
    }

high

This ITrainingDataset interface is a duplicate and is inconsistent with the primary interface defined in TrainingDataset.model.ts. For example, it's missing name, description, and isActive fields, and has other different fields. Maintaining two different interfaces for the same entity is a source of bugs and confusion.

You should remove this interface and instead export and use the ITrainingDataset interface from LocalMind-Backend/src/api/v1/TrainingSample/TrainingDataset.model.ts as the single source of truth across the application.
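
Concretely, that can be as small as exporting the interface once from the model file and importing the type everywhere else (paths and the abbreviated field list below are illustrative):

    // TrainingDataset.model.ts keeps the single definition and exports it
    import { Document, Types } from 'mongoose'

    export interface ITrainingDataset extends Document {
      userId: Types.ObjectId
      fileName: string
      status: 'pending' | 'processing' | 'completed' | 'failed'
      // ...remaining fields as defined in the model...
    }

    // Any consumer imports the type instead of redeclaring it
    import type { ITrainingDataset } from './TrainingDataset.model'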

    }

    try {
      const [question, answer, type, tags, language] = line.split(',').map(s => s.trim())

high

The CSV parsing logic uses line.split(','). This is a naive approach that will fail if any of the fields (like question or answer) contain a comma, which is a common scenario. This will lead to incorrect parsing and data corruption. You should use a robust, stream-based CSV parsing library like csv-parse to handle CSV files correctly, including quoted fields and other edge cases.
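
A rough sketch with csv-parse's streaming API (the column names and the ParsedSample shape below are assumptions based on the fields destructured above, not the PR's actual types):

    import fs from 'node:fs'
    import { parse } from 'csv-parse'

    // Illustrative shape; the PR's ParsedSample type may differ
    interface ParsedSample {
      question: string
      answer: string
      type?: string
      tags?: string[]
      language?: string
    }

    async function parseCSV(filePath: string): Promise<ParsedSample[]> {
      const samples: ParsedSample[] = []
      const parser = fs
        .createReadStream(filePath)
        .pipe(parse({ columns: true, trim: true, skip_empty_lines: true }))

      // The parser handles quoted fields, embedded commas, and escaped quotes
      for await (const record of parser) {
        samples.push({
          question: record.question,
          answer: record.answer,
          type: record.type,
          tags: record.tags ? record.tags.split(';') : [],
          language: record.language,
        })
      }
      return samples
    }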

     * Expected format: Array of objects with question, answer, type, tags, language
     */
    static async parseJSON(filePath: string): Promise<ParsedSample[]> {
      const content = fs.readFileSync(filePath, 'utf-8')

high

The parseJSON method uses fs.readFileSync to read the entire file into memory before parsing. The file size limit is 100MB, so this approach can lead to high memory usage and block the Node.js event loop, impacting the server's performance and ability to handle other requests. For large files, you should use a streaming JSON parser (e.g., JSONStream or clarinet) to process the file without loading it all into memory at once. The same issue applies to parseMarkdown and parseText in this file.
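
A minimal sketch of the streaming approach with JSONStream, assuming the uploaded file is a top-level JSON array (the package and typings would need to be added; the ParsedSample shape is illustrative):

    import fs from 'node:fs'
    import JSONStream from 'JSONStream' // may need @types/jsonstream and esModuleInterop

    interface ParsedSample {
      question: string
      answer: string
    }

    function parseJSONStream(filePath: string): Promise<ParsedSample[]> {
      return new Promise((resolve, reject) => {
        const samples: ParsedSample[] = []
        fs.createReadStream(filePath, { encoding: 'utf-8' })
          .pipe(JSONStream.parse('*')) // emits each element of the top-level array
          .on('data', (item: ParsedSample) => samples.push(item))
          .on('end', () => resolve(samples))
          .on('error', reject)
      })
    }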

Comment on lines +125 to +141
    const { skip = 0, limit = 20 } = req.query

    const result = await TrainingDatasetService.getDatasets(
      userId,
      parseInt(skip as string) || 0,
      parseInt(limit as string) || 20
    )

    res.status(200).json({
      success: true,
      data: result.datasets,
      pagination: {
        skip: parseInt(skip as string) || 0,
        limit: parseInt(limit as string) || 20,
        total: result.total,
      },
    })

medium

The pagination parameters skip and limit are parsed from the query string multiple times. This is inefficient and makes the code verbose. It's better to parse them once at the beginning of the method, store them in variables, and then reuse those variables. Also, parseInt should always be used with a radix (e.g., 10 for decimal) to avoid unexpected behavior.

      const skip = parseInt(req.query.skip as string, 10) || 0
      const limit = parseInt(req.query.limit as string, 10) || 20

      const result = await TrainingDatasetService.getDatasets(userId, skip, limit)

      res.status(200).json({
        success: true,
        data: result.datasets,
        pagination: {
          skip,
          limit,
          total: result.total,
        },
      })


Copilot AI left a comment


Pull request overview

This PR implements a comprehensive Training Dataset backend system with vector embeddings for semantic search. It adds full CRUD operations for training samples, file upload/parsing capabilities for multiple formats (CSV, JSON, Markdown, Text), and Google Generative AI integration for generating vector embeddings.

Key Changes:

  • Added TypeScript models, schemas, and validators for TrainingSample and TrainingDataset entities with vector embedding support
  • Integrated Google Generative AI API for embedding generation with batch processing capabilities
  • Implemented semantic vector search using cosine similarity for intelligent sample retrieval
  • Added file parsing utilities supporting CSV, JSON, Markdown, and Text formats for bulk dataset imports

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 22 comments.

Summary per file:

  • TrainingSample.types.ts: Defines TypeScript interfaces for training samples, datasets, embeddings, and search requests
  • TrainingSample.validator.ts: Provides Zod schemas for validating training sample creation, updates, and vector search requests
  • TrainingSample.model.ts: Mongoose schema for training samples with embedded vectors, compound indices, and relationships
  • TrainingSample.embedding.ts: Google Generative AI integration for embedding generation with cosine similarity utility
  • TrainingSample.utils.ts: File parsing utilities for CSV, JSON, Markdown, and Text formats with validation
  • TrainingSample.service.ts: Business logic for CRUD operations, vector search, and statistics aggregation
  • TrainingSample.controller.ts: HTTP endpoint handlers for training sample operations with validation and error handling
  • TrainingSample.routes.ts: Express route definitions for training sample API endpoints
  • TrainingDataset.model.ts: Mongoose schema for managing uploaded training dataset files and processing status
  • TrainingDataset.service.ts: Service layer for dataset processing, parsing, and sample creation from files
  • TrainingDataset.controller.ts: Controller for dataset file uploads using multer with format validation
  • TrainingDataset.routes.ts: Express routes for dataset upload and management endpoints


     * Simple format: pairs of questions and answers separated by newlines
     */
    static async parseText(filePath: string): Promise<ParsedSample[]> {
      const content = fs.readFileSync(filePath, 'utf-8')

Copilot AI Jan 4, 2026


Synchronous file operation (readFileSync) blocks the event loop. Replace with the asynchronous version (fs.promises.readFile or async/await with fs.readFile) to maintain non-blocking I/O and better performance in a Node.js application.
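
A minimal sketch of the non-blocking variant using Node's built-in fs/promises (the class name and simplified return type are illustrative; the real method maps lines to ParsedSample objects):

    import { readFile } from 'node:fs/promises'

    class TrainingSampleFileUtils {
      // Same shape as the synchronous version, but the read no longer blocks the event loop
      static async parseText(filePath: string): Promise<string[]> {
        const content = await readFile(filePath, 'utf-8')
        return content.split('\n').filter(line => line.trim().length > 0)
      }
    }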

Comment on lines +73 to +78
        createdSamples.push(trainingSample)
      } catch (error) {
        console.error('Error creating sample:', sample.question, error)
        // Continue with next sample
      }
    }

Copilot AI Jan 4, 2026


Error handling logs but doesn't propagate all errors. Errors during sample creation are caught, logged to console, and processing continues, but these errors aren't tracked or reported in the final response. Users won't know if some samples failed to process. Consider collecting failed samples and including them in the errorMessage or a separate field to provide transparency about partial failures.
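
One possible shape for surfacing partial failures (the field names and summary format here are illustrative, not part of the PR):

    // Collect per-sample failures instead of only logging them
    interface SampleFailure {
      question: string
      reason: string
    }

    const failures: SampleFailure[] = []

    function recordFailure(question: string, error: unknown): void {
      failures.push({
        question,
        reason: error instanceof Error ? error.message : String(error),
      })
    }

    // After processing, fold the failures into the dataset's errorMessage (or a dedicated field)
    function summarizeFailures(): string | undefined {
      return failures.length > 0
        ? `${failures.length} sample(s) failed to process: ${failures.map(f => f.question).join('; ')}`
        : undefined
    }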

Comment on lines +57 to +68
      isActive = true,
      language,
      skip = 0,
      limit = 20,
    } = req.query

    const filters = {
      type: type ? (Array.isArray(type) ? type : [type]) : undefined,
      tags: tags ? (Array.isArray(tags) ? tags : [tags]) : undefined,
      sourceType: sourceType as 'manual' | 'dataset' | undefined,
      isActive: isActive === 'true' ? true : isActive === 'false' ? false : undefined,
      language: language as string | undefined,

Copilot AI Jan 4, 2026


The filters object construction has inconsistent handling. The 'type' and 'tags' arrays are set to 'undefined' when empty, but this doesn't filter them out of the filters object - they're still passed to the service with 'undefined' values. This could cause unexpected behavior in the service layer. Consider using object spread with conditional properties or explicitly checking and omitting undefined values.

Suggested change

Before:

      isActive = true,
      language,
      skip = 0,
      limit = 20,
    } = req.query

    const filters = {
      type: type ? (Array.isArray(type) ? type : [type]) : undefined,
      tags: tags ? (Array.isArray(tags) ? tags : [tags]) : undefined,
      sourceType: sourceType as 'manual' | 'dataset' | undefined,
      isActive: isActive === 'true' ? true : isActive === 'false' ? false : undefined,
      language: language as string | undefined,

After:

      isActive,
      language,
      skip = 0,
      limit = 20,
    } = req.query

    const typeArray = type ? (Array.isArray(type) ? type : [type]) : undefined
    const tagsArray = tags ? (Array.isArray(tags) ? tags : [tags]) : undefined
    const isActiveBool =
      isActive === 'true' ? true : isActive === 'false' ? false : undefined

    const filters = {
      ...(typeArray && { type: typeArray }),
      ...(tagsArray && { tags: tagsArray }),
      ...(sourceType && { sourceType: sourceType as 'manual' | 'dataset' }),
      ...(isActiveBool !== undefined && { isActive: isActiveBool }),
      ...(language && { language: language as string }),

@abhishek-nexgen-dev

@morningstarxcdcode bro, can you make some changes?


Development

Successfully merging this pull request may close these issues.

Backend: Schema & APIs for AI Training Dataset (MongoDB + Vector + Upload)
