feat(issue-5): Implement Training Dataset backend with vector embeddings #73
base: master
Conversation
- Add TrainingSample types with full TypeScript interfaces (ITrainingSample, IAnswerTemplate, etc.)
- Add TrainingSample Mongoose schema with vector embedding support and optimized indices
- Add Zod validation schemas for request validation (createTrainingSampleSchema, vectorSearchSchema)
- Add Google Generative AI integration for vector embeddings with batch processing
- Add TrainingSample service layer with CRUD operations and semantic vector search
- Add TrainingSample controller with 6 HTTP endpoints (POST, GET, PUT, DELETE, /search, /stats)
- Add TrainingSample routes with comprehensive API endpoint definitions
- Add TrainingSample utilities for file parsing (CSV, JSON, Markdown, Text)
- Add TrainingDataset model for managing uploaded training files
- Add TrainingDataset service for dataset processing and sample creation
- Add TrainingDataset controller with file upload and management endpoints
- Add TrainingDataset routes for file upload and dataset management

All code follows production standards:

- Full TypeScript strict typing
- Real Google Generative AI integration (not mock)
- Proper Mongoose indexing for performance (compound indices)
- Comprehensive Zod validation
- Error handling and logging
- Support for multiple file formats (CSV, JSON, MD, TXT)
- Vector semantic search with cosine similarity
- Batch embedding generation for large datasets

Fixes NexGenStudioDev#5
Summary of Changes

Hello @morningstarxcdcode, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a robust backend system for managing AI training data, encompassing both individual training samples and bulk datasets. It establishes data models with vector embedding capabilities, integrates with Google Generative AI for semantic processing, and provides a full suite of API endpoints for CRUD operations, file uploads, and advanced vector-based search, ensuring data validation and performance.
Code Review
This pull request introduces a significant new feature for handling training datasets, including file uploads, processing, and vector-based semantic search. The code is well-structured and covers many aspects of the feature. However, I've identified several critical and high-severity issues related to performance, scalability, and correctness that should be addressed before merging. Key concerns include a naive in-memory vector search implementation, incorrect batch processing logic, a critical data loss bug in the delete functionality, and inefficient file parsing. Addressing these points will be crucial for creating a robust and production-ready system.
```typescript
const trainingSample = await TrainingSample.create({
  userId: new Types.ObjectId(userId),
  ...sample,
  embedding,
})
```
When creating a TrainingSample from a dataset, the datasetId is not being associated with the sample. This breaks the relationship between the dataset and its samples and will cause critical issues, such as being unable to delete samples associated with a specific dataset. You need to pass the datasetId when creating the TrainingSample.
Suggested change:

```typescript
const trainingSample = await TrainingSample.create({
  userId: new Types.ObjectId(userId),
  ...sample,
  embedding,
  datasetId: dataset._id,
})
```
```typescript
await TrainingSample.deleteMany({
  sourceType: 'dataset',
  userId: new Types.ObjectId(userId),
})
```
There is a critical bug in the logic for deleting associated samples. The current implementation deletes all training samples for the user that have sourceType: 'dataset', not just the ones belonging to the specific dataset being deleted. This will lead to unintended data loss. The deletion should be scoped to the datasetId of the dataset being deleted.
Note: This fix depends on associating the datasetId with each TrainingSample during creation.
Suggested change:

```typescript
await TrainingSample.deleteMany({
  datasetId: dataset._id,
  userId: new Types.ObjectId(userId),
})
```
```typescript
// Get all matching samples (note: for large datasets, consider MongoDB Atlas Vector Search)
const samples = await TrainingSample.find(filterQuery).exec()

// Calculate similarity scores
const scoredSamples = samples
  .map(sample => ({
    sample,
    score: EmbeddingUtils.cosineSimilarity(queryEmbedding, sample.embedding),
  }))
  .sort((a, b) => b.score - a.score)
  .slice(0, searchRequest.topK || 5)
```
The current implementation of vector search is not scalable. It fetches all documents matching the filters from the database into memory and then calculates the cosine similarity for each one. For a large number of samples, this will lead to high memory consumption and slow performance, eventually crashing the service.
A proper vector search should be performed at the database level using a vector index. As noted in the comments in TrainingSample.model.ts, MongoDB Atlas Vector Search is a suitable solution for this. You should refactor this to use an aggregation pipeline with the $vectorSearch stage to perform the search efficiently in the database.
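As a rough illustration only (not the PR's implementation), a database-level search with Atlas Vector Search could look like the sketch below. It assumes a vector index has been created on the embedding field (the index name embedding_index is hypothetical) and that any filter fields such as userId, type, and tags are declared as filter fields in that index.

```typescript
// Sketch: let MongoDB Atlas rank by similarity instead of scoring in memory.
// Index name and numCandidates are illustrative assumptions.
const topK = searchRequest.topK || 5

const scoredSamples = await TrainingSample.aggregate([
  {
    $vectorSearch: {
      index: 'embedding_index',      // assumed Atlas vector index name
      path: 'embedding',             // field holding the stored embedding vectors
      queryVector: queryEmbedding,   // embedding of the search query
      numCandidates: topK * 20,      // recall vs. latency trade-off
      limit: topK,
      filter: filterQuery,           // userId/type/tags filters (must be indexed as filter fields)
    },
  },
  {
    // Expose the similarity score alongside each document
    $addFields: { score: { $meta: 'vectorSearchScore' } },
  },
])
```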
```typescript
for (const sample of validSamples) {
  try {
    const embeddingText = `${sample.question} ${sample.answerTemplate.answer}`
    const { embedding } = await EmbeddingUtils.generateEmbedding(embeddingText)

    const trainingSample = await TrainingSample.create({
      userId: new Types.ObjectId(userId),
      ...sample,
      embedding,
    })

    createdSamples.push(trainingSample)
  } catch (error) {
    console.error('Error creating sample:', sample.question, error)
    // Continue with next sample
  }
}
```
The process of generating embeddings and creating training samples is done sequentially in a for...of loop. For large datasets, this will be very slow as each operation waits for the previous one to complete. The PR description mentions "Batch embedding generation for large datasets", but this implementation doesn't leverage batching.
To improve performance, you should batch these operations:
- Use EmbeddingUtils.generateEmbeddingBatch (after fixing it to be a true batch call) to generate embeddings for multiple samples at once.
- Use TrainingSample.insertMany() to insert multiple samples into the database in a single operation.
This will significantly reduce the processing time for large files.
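A rough sketch of what that batched flow could look like, assuming generateEmbeddingBatch has been reworked into a true batch call that returns embeddings in input order; the chunk size of 50 is an arbitrary illustration, not a recommendation from the PR.

```typescript
const BATCH_SIZE = 50 // illustrative chunk size; tune to API and memory limits

for (let i = 0; i < validSamples.length; i += BATCH_SIZE) {
  const chunk = validSamples.slice(i, i + BATCH_SIZE)

  // One embedding API round-trip per chunk instead of one per sample
  const texts = chunk.map(s => `${s.question} ${s.answerTemplate.answer}`)
  const embeddings = await EmbeddingUtils.generateEmbeddingBatch(texts)

  // One database write per chunk instead of one create() per sample
  const docs = chunk.map((sample, idx) => ({
    userId: new Types.ObjectId(userId),
    ...sample,
    embedding: embeddings[idx].embedding,
    datasetId: dataset._id, // see the earlier comment about associating the dataset
  }))
  const inserted = await TrainingSample.insertMany(docs, { ordered: false })
  createdSamples.push(...inserted)
}
```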
```typescript
async generateEmbeddingBatch(texts: string[]): Promise<IEmbeddingResponse[]> {
  const results = await Promise.all(texts.map(text => this.generateEmbedding(text)))
  return results
}
```
This generateEmbeddingBatch method is not a true batch operation. It iterates over the texts and sends a separate API request for each one using generateEmbedding. This is inefficient, will be slow for many texts, and may quickly hit API rate limits. The Google Generative AI SDK provides a true batching method (batchEmbedContents) for this purpose. You should use it to send multiple texts in a single API call.
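A minimal sketch of a true batch call with the @google/generative-ai SDK is shown below. The model name and the simplified return type (raw vectors rather than the PR's IEmbeddingResponse shape) are assumptions based on the SDK's documented batchEmbedContents method and may need adjusting to the project's embedding configuration.

```typescript
import { GoogleGenerativeAI } from '@google/generative-ai'

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY ?? '')
// Model name is an assumption; use whichever embedding model the project already configures.
const embeddingModel = genAI.getGenerativeModel({ model: 'text-embedding-004' })

async function generateEmbeddingBatch(texts: string[]): Promise<number[][]> {
  // Single API call carrying all texts, instead of one request per text
  const { embeddings } = await embeddingModel.batchEmbedContents({
    requests: texts.map(text => ({
      content: { role: 'user', parts: [{ text }] },
    })),
  })
  // Embeddings come back in the same order as the requests
  return embeddings.map(e => e.values)
}
```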
```typescript
export interface ITrainingDataset extends Document {
  userId: Types.ObjectId
  fileName: string
  filePath: string
  fileMimeType: string
  fileSizeInBytes: number
  totalSamples: number
  processedSamples: number
  status: 'pending' | 'processing' | 'completed' | 'failed'
  errorMessage?: string
  createdAt: Date
  updatedAt: Date
}
```
This ITrainingDataset interface is a duplicate and is inconsistent with the primary interface defined in TrainingDataset.model.ts. For example, it's missing the name, description, and isActive fields, and it differs in several others. Maintaining two different interfaces for the same entity is a source of bugs and confusion.
You should remove this interface and instead export and use the ITrainingDataset interface from LocalMind-Backend/src/api/v1/TrainingSample/TrainingDataset.model.ts as the single source of truth across the application.
```typescript
}

try {
  const [question, answer, type, tags, language] = line.split(',').map(s => s.trim())
```
The CSV parsing logic uses line.split(','). This is a naive approach that will fail if any of the fields (like question or answer) contain a comma, which is a common scenario. This will lead to incorrect parsing and data corruption. You should use a robust, stream-based CSV parsing library like csv-parse to handle CSV files correctly, including quoted fields and other edge cases.
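For illustration, a streaming parser built on csv-parse could look roughly like the sketch below. It assumes the CSV has a header row; the CsvRow shape is hypothetical and would need to be mapped onto the PR's ParsedSample type in TrainingSample.utils.ts.

```typescript
import fs from 'fs'
import { parse } from 'csv-parse'

// Row shape for this sketch only; the real mapping lives in TrainingSample.utils.ts.
interface CsvRow {
  question: string
  answer: string
  type: string
  tags: string
  language: string
}

// Streams the CSV and correctly handles quoted fields, embedded commas, and empty lines.
function parseCSVFile(filePath: string): Promise<CsvRow[]> {
  return new Promise((resolve, reject) => {
    const rows: CsvRow[] = []
    fs.createReadStream(filePath)
      .pipe(parse({ columns: true, trim: true, skip_empty_lines: true }))
      .on('data', (row: CsvRow) => rows.push(row))
      .on('end', () => resolve(rows))
      .on('error', reject)
  })
}
```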
```typescript
 * Expected format: Array of objects with question, answer, type, tags, language
 */
static async parseJSON(filePath: string): Promise<ParsedSample[]> {
  const content = fs.readFileSync(filePath, 'utf-8')
```
The parseJSON method uses fs.readFileSync to read the entire file into memory before parsing. The file size limit is 100MB, so this approach can lead to high memory usage and block the Node.js event loop, impacting the server's performance and ability to handle other requests. For large files, you should use a streaming JSON parser (e.g., JSONStream or clarinet) to process the file without loading it all into memory at once. The same issue applies to parseMarkdown and parseText in this file.
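As one possible shape, a streaming variant of parseJSON using the JSONStream package mentioned above could look like this sketch; it assumes the file is a top-level JSON array of samples, and the ParsedSample element type comes from the PR's utilities.

```typescript
import fs from 'fs'
// JSONStream is one of the streaming parsers suggested above; depending on the TS setup,
// a require() call or @types package may be needed.
import JSONStream from 'JSONStream'

// Parses a top-level JSON array element by element without loading the whole file into memory.
function parseJSONStream(filePath: string): Promise<ParsedSample[]> {
  return new Promise((resolve, reject) => {
    const samples: ParsedSample[] = []
    fs.createReadStream(filePath, { encoding: 'utf-8' })
      .pipe(JSONStream.parse('*')) // '*' emits each element of the root array
      .on('data', (item: ParsedSample) => samples.push(item))
      .on('end', () => resolve(samples))
      .on('error', reject)
  })
}
```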
```typescript
const { skip = 0, limit = 20 } = req.query

const result = await TrainingDatasetService.getDatasets(
  userId,
  parseInt(skip as string) || 0,
  parseInt(limit as string) || 20
)

res.status(200).json({
  success: true,
  data: result.datasets,
  pagination: {
    skip: parseInt(skip as string) || 0,
    limit: parseInt(limit as string) || 20,
    total: result.total,
  },
})
```
The pagination parameters skip and limit are parsed from the query string multiple times. This is inefficient and makes the code verbose. It's better to parse them once at the beginning of the method, store them in variables, and then reuse those variables. Also, parseInt should always be used with a radix (e.g., 10 for decimal) to avoid unexpected behavior.
```typescript
const skip = parseInt(req.query.skip as string, 10) || 0
const limit = parseInt(req.query.limit as string, 10) || 20

const result = await TrainingDatasetService.getDatasets(userId, skip, limit)

res.status(200).json({
  success: true,
  data: result.datasets,
  pagination: {
    skip,
    limit,
    total: result.total,
  },
})
```
Pull request overview
This PR implements a comprehensive Training Dataset backend system with vector embeddings for semantic search. It adds full CRUD operations for training samples, file upload/parsing capabilities for multiple formats (CSV, JSON, Markdown, Text), and Google Generative AI integration for generating vector embeddings.
Key Changes:
- Added TypeScript models, schemas, and validators for TrainingSample and TrainingDataset entities with vector embedding support
- Integrated Google Generative AI API for embedding generation with batch processing capabilities
- Implemented semantic vector search using cosine similarity for intelligent sample retrieval
- Added file parsing utilities supporting CSV, JSON, Markdown, and Text formats for bulk dataset imports
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 22 comments.
Summary per file:
| File | Description |
|---|---|
| TrainingSample.types.ts | Defines TypeScript interfaces for training samples, datasets, embeddings, and search requests |
| TrainingSample.validator.ts | Provides Zod schemas for validating training sample creation, updates, and vector search requests |
| TrainingSample.model.ts | Mongoose schema for training samples with embedded vectors, compound indices, and relationships |
| TrainingSample.embedding.ts | Google Generative AI integration for embedding generation with cosine similarity utility |
| TrainingSample.utils.ts | File parsing utilities for CSV, JSON, Markdown, and Text formats with validation |
| TrainingSample.service.ts | Business logic for CRUD operations, vector search, and statistics aggregation |
| TrainingSample.controller.ts | HTTP endpoint handlers for training sample operations with validation and error handling |
| TrainingSample.routes.ts | Express route definitions for training sample API endpoints |
| TrainingDataset.model.ts | Mongoose schema for managing uploaded training dataset files and processing status |
| TrainingDataset.service.ts | Service layer for dataset processing, parsing, and sample creation from files |
| TrainingDataset.controller.ts | Controller for dataset file uploads using multer with format validation |
| TrainingDataset.routes.ts | Express routes for dataset upload and management endpoints |
```typescript
 * Simple format: pairs of questions and answers separated by newlines
 */
static async parseText(filePath: string): Promise<ParsedSample[]> {
  const content = fs.readFileSync(filePath, 'utf-8')
```
Copilot AI (Jan 4, 2026):
Synchronous file operation (readFileSync) blocks the event loop. Replace with the asynchronous version (fs.promises.readFile or async/await with fs.readFile) to maintain non-blocking I/O and better performance in a Node.js application.
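For reference, the non-blocking version is a one-line change; the helper name below is illustrative and the parsing logic that follows stays the same.

```typescript
import { promises as fs } from 'fs'

// Reads the uploaded file without blocking the event loop.
async function readUploadedFile(filePath: string): Promise<string> {
  return fs.readFile(filePath, 'utf-8')
}
```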
```typescript
    createdSamples.push(trainingSample)
  } catch (error) {
    console.error('Error creating sample:', sample.question, error)
    // Continue with next sample
  }
}
```
Copilot AI (Jan 4, 2026):
Error handling logs but doesn't propagate all errors. Errors during sample creation are caught, logged to console, and processing continues, but these errors aren't tracked or reported in the final response. Users won't know if some samples failed to process. Consider collecting failed samples and including them in the errorMessage or a separate field to provide transparency about partial failures.
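One possible shape for this, reusing the loop shown earlier and the errorMessage field already on the dataset model; the exact reporting format is an assumption, not part of the PR.

```typescript
// Collect per-sample failures instead of only logging them to the console.
const failedSamples: { question: string; error: string }[] = []

for (const sample of validSamples) {
  try {
    const embeddingText = `${sample.question} ${sample.answerTemplate.answer}`
    const { embedding } = await EmbeddingUtils.generateEmbedding(embeddingText)
    const trainingSample = await TrainingSample.create({
      userId: new Types.ObjectId(userId),
      ...sample,
      embedding,
    })
    createdSamples.push(trainingSample)
  } catch (error) {
    failedSamples.push({
      question: sample.question,
      error: error instanceof Error ? error.message : String(error),
    })
  }
}

// Surface partial failures to the caller via the dataset record.
if (failedSamples.length > 0) {
  dataset.errorMessage = `${failedSamples.length} sample(s) failed during processing`
  await dataset.save()
}
```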
```typescript
  isActive = true,
  language,
  skip = 0,
  limit = 20,
} = req.query

const filters = {
  type: type ? (Array.isArray(type) ? type : [type]) : undefined,
  tags: tags ? (Array.isArray(tags) ? tags : [tags]) : undefined,
  sourceType: sourceType as 'manual' | 'dataset' | undefined,
  isActive: isActive === 'true' ? true : isActive === 'false' ? false : undefined,
  language: language as string | undefined,
```
Copilot AI (Jan 4, 2026):
The filters object construction has inconsistent handling. The 'type' and 'tags' arrays are set to 'undefined' when empty, but this doesn't filter them out of the filters object - they're still passed to the service with 'undefined' values. This could cause unexpected behavior in the service layer. Consider using object spread with conditional properties or explicitly checking and omitting undefined values.
Suggested change:

```typescript
  isActive,
  language,
  skip = 0,
  limit = 20,
} = req.query

const typeArray = type ? (Array.isArray(type) ? type : [type]) : undefined
const tagsArray = tags ? (Array.isArray(tags) ? tags : [tags]) : undefined
const isActiveBool =
  isActive === 'true' ? true : isActive === 'false' ? false : undefined

const filters = {
  ...(typeArray && { type: typeArray }),
  ...(tagsArray && { tags: tagsArray }),
  ...(sourceType && { sourceType: sourceType as 'manual' | 'dataset' }),
  ...(isActiveBool !== undefined && { isActive: isActiveBool }),
  ...(language && { language: language as string }),
```
(An additional review comment on LocalMind-Backend/src/api/v1/TrainingSample/TrainingDataset.controller.ts was marked as resolved.)
@morningstarxcdcode bro, can you make these changes?