Conversation

@yash12991 commented Jan 5, 2026

  • Add support for PDF, TXT, JSON, and TSV file formats in addition to CSV
  • Implement FileLoaderUtils class for handling multiple file formats
  • Add file upload functionality using multer middleware
  • Create comprehensive validation for uploaded files (type, size, format)
  • Update DataSet controller with automatic file format detection
  • Add new endpoint GET /api/v1/dataset/formats to list supported formats
  • Change upload endpoint from GET to POST with multipart/form-data
  • Add TypeScript types and interfaces for better type safety
  • Include test sample files for each supported format
  • Add comprehensive documentation in README.md
  • Implement automatic file cleanup after processing
Closes #91: feat: Add multi-format file upload support for RAG (PDF, TXT, JSON, TSV)

Dependencies added:

  • pdf-parse: For PDF document parsing
  • multer: For handling file uploads
  • @types/multer: TypeScript types for multer

Breaking changes:

  • Upload endpoint changed from GET /upload to POST /upload
  • Now requires a file upload via multipart/form-data instead of a hardcoded path (a client-side example sketch follows below)
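For reviewers trying the new flow locally, here is a minimal client-side sketch using Node 18+ (global fetch, FormData, and Blob). The base URL is an assumption; the `file` field name and the two routes come from this PR, and `sample-qa.json` refers to one of the sample files added here.

    import { readFile } from 'node:fs/promises'

    // Assumed host/port; adjust to your LocalMind-Backend instance.
    const BASE_URL = 'http://localhost:3000/api/v1/dataset'

    async function listFormats(): Promise<void> {
      // New endpoint added in this PR.
      const res = await fetch(`${BASE_URL}/formats`)
      console.log(await res.json())
    }

    async function uploadDataset(filePath: string): Promise<void> {
      const form = new FormData()
      // The field name must be 'file' to match upload.single('file') on the server.
      const fileName = filePath.split('/').pop() ?? 'dataset'
      form.append('file', new Blob([await readFile(filePath)]), fileName)

      const res = await fetch(`${BASE_URL}/upload`, { method: 'POST', body: form })
      console.log(res.status, await res.json())
    }

    listFormats().then(() => uploadDataset('./sample-qa.json'))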

@gemini-code-assist

Summary of Changes

Hello @yash12991, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the RAG system's data ingestion capabilities by introducing support for multiple file formats. Previously limited to CSV, the system can now process PDF, TXT, JSON, and TSV files, broadening its applicability for various knowledge bases. The changes involve a complete overhaul of the file upload and processing pipeline, incorporating robust validation, dynamic format detection, and automatic cleanup, all while maintaining a clear and extensible architecture.

Highlights

  • Multi-Format File Upload: The RAG system now supports uploading and processing datasets in PDF, TXT, JSON, and TSV formats, in addition to the existing CSV support. This significantly expands the types of data that can be used for knowledge retrieval.
  • Modular File Handling: A new FileLoaderUtils class has been implemented to centralize file format detection and loading logic, making the system extensible for future formats.
  • Robust File Upload Mechanism: The pull request integrates multer middleware for handling multipart/form-data file uploads, ensuring secure and efficient file reception.
  • Comprehensive File Validation: Uploaded files undergo thorough validation for type, size (max 10MB, not empty), and format, preventing invalid or malicious uploads.
  • Dynamic Format Detection: The DataSetController now automatically detects the file format based on its extension and MIME type, streamlining the processing workflow (a rough detection sketch follows this list).
  • New API Endpoint for Supported Formats: A new GET /api/v1/dataset/formats endpoint has been added, allowing clients to query the currently supported file formats and their descriptions.
  • Automatic File Cleanup: Uploaded files are automatically removed from the server after successful processing or in case of an error, ensuring system hygiene and preventing accumulation of temporary files.
  • Breaking Change: Upload Endpoint: The dataset upload endpoint has changed from GET /upload to POST /upload and now requires file submission via multipart/form-data instead of a hardcoded path.
  • Enhanced Type Safety: New TypeScript types and interfaces (FileFormat, UploadedFileMetadata, etc.) have been introduced to improve code clarity and maintainability.
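The detection code itself is not quoted in this conversation, so the following is only a rough sketch of extension-based detection. It assumes the FileFormat enum values are the lowercase extensions (as suggested by the formats endpoint response), and the detectFormat helper name is hypothetical.

    import * as path from 'path'

    // Assumed shape of the FileFormat enum referenced in this PR.
    enum FileFormat {
      CSV = 'csv',
      XLSX = 'xlsx',
      TSV = 'tsv',
      JSON = 'json',
      PDF = 'pdf',
      TXT = 'txt',
    }

    // Hypothetical helper: map a filename's extension to a FileFormat, or null if unsupported.
    function detectFormat(originalname: string): FileFormat | null {
      const ext = path.extname(originalname).toLowerCase().replace('.', '')
      return (Object.values(FileFormat) as string[]).includes(ext)
        ? (ext as FileFormat)
        : null
    }

    // detectFormat('sample-qa.tsv') === FileFormat.TSV; detectFormat('notes.docx') === null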


@gemini-code-assist bot left a comment

Code Review

This pull request introduces significant and valuable functionality by adding support for multiple file formats (PDF, TXT, JSON, TSV) for dataset uploads. The implementation is well-structured, with new utility classes for file loading, multer configuration for uploads, and validation, which greatly enhances the RAG system's flexibility.

My review focuses on improving error handling, security, and performance. I've suggested changes to:

  • Prevent potential information leaks in error responses.
  • Use asynchronous file operations to avoid blocking the event loop.
  • Refine the API response for supported formats to ensure it accurately reflects the current implementation.
  • Improve error wrapping to preserve stack traces for easier debugging.
  • Consolidate validation logic to avoid redundancy and ensure consistent error handling.

Overall, this is a great addition to the project. Addressing these points will make the new functionality more robust and secure.

Comment on lines +33 to +35
if (fs.existsSync(filePath)) {
fs.unlinkSync(filePath)
}

Severity: medium

Using fs.unlinkSync is synchronous and blocks the Node.js event loop. In a server environment, this can degrade performance, especially under load. It's recommended to use the asynchronous version, fs.promises.unlink, to avoid blocking. This comment also applies to the other uses of unlinkSync in this file (lines 49, 60, and 71).

Suggested change
if (fs.existsSync(filePath)) {
fs.unlinkSync(filePath)
}
if (fs.existsSync(filePath)) {
await fs.promises.unlink(filePath)
}

Comment on lines +73 to +78
SendResponse.error(
res,
'Failed to upload and process dataset',
500,
error.message
)

Severity: medium

Exposing raw error messages to the client can be a security risk, as it might leak sensitive information about the application's internals (e.g., file paths, library issues). It's better to log the detailed error on the server for debugging and send a generic error message to the client.

Suggested change
SendResponse.error(
res,
'Failed to upload and process dataset',
500,
error.message
)
console.error('Failed to upload and process dataset:', error);
SendResponse.error(
res,
'Failed to upload and process dataset',
500
)

Comment on lines +90 to +101
const formats = Object.values(FileFormat)
SendResponse.success(res, 'Supported file formats', {
formats,
description: {
csv: 'Comma-separated values file',
xlsx: 'Excel spreadsheet (not yet fully supported)',
tsv: 'Tab-separated values file',
json: 'JSON file with Q&A pairs',
pdf: 'PDF document',
txt: 'Plain text file',
},
})

Severity: medium

The getSupportedFormats endpoint currently lists 'xlsx' as a supported format, but the implementation in DataSet.fileLoader.ts throws an error because it's not yet implemented. This can be misleading for API consumers. It's better to remove 'xlsx' from the list of supported formats until it is fully functional.

      const formats = Object.values(FileFormat).filter(
        (f) => f !== FileFormat.XLSX
      )
      SendResponse.success(res, 'Supported file formats', {
        formats,
        description: {
          csv: 'Comma-separated values file',
          tsv: 'Tab-separated values file',
          json: 'JSON file with Q&A pairs',
          pdf: 'PDF document',
          txt: 'Plain text file',
        },
      })

Comment on lines +193 to +195
} catch (error) {
throw new Error(`Failed to load file: ${error}`)
}

Severity: medium

Wrapping the caught error object directly in a new Error constructor will stringify it (often to [object Object]), losing valuable information like the original error's stack trace and type. To preserve this information for better debugging, you should re-throw the original error or create a new error that includes the original error's message.

Suggested change
} catch (error) {
throw new Error(`Failed to load file: ${error}`)
}
} catch (error: any) {
throw new Error(`Failed to load file: ${error.message}`)
}

Comment on lines +35 to +94
public validateFileUpload = (
req: Request,
res: Response,
next: NextFunction
): void => {
try {
// Check if file exists in request
if (!req.file) {
SendResponse.error(res, 'No file uploaded', 400)
return
}

const file = req.file

// Validate file size
if (file.size > this.MAX_FILE_SIZE) {
SendResponse.error(
res,
`File size exceeds maximum limit of ${this.MAX_FILE_SIZE / (1024 * 1024)}MB`,
400
)
return
}

// Validate file size (minimum)
if (file.size === 0) {
SendResponse.error(res, 'Uploaded file is empty', 400)
return
}

// Validate file extension
const fileExtension = file.originalname
.toLowerCase()
.substring(file.originalname.lastIndexOf('.'))

if (!this.ALLOWED_EXTENSIONS.includes(fileExtension)) {
SendResponse.error(
res,
`Invalid file extension. Allowed extensions: ${this.ALLOWED_EXTENSIONS.join(', ')}`,
400
)
return
}

// Validate MIME type
if (!this.ALLOWED_MIME_TYPES.includes(file.mimetype)) {
SendResponse.error(
res,
`Invalid file type. Allowed types: ${this.ALLOWED_MIME_TYPES.join(', ')}`,
400
)
return
}

// All validations passed
next()
} catch (error: any) {
SendResponse.error(res, 'File validation failed', 400, error.message)
}
}

Severity: medium

There appears to be redundant validation logic here. The multer configuration in DataSet.multer.ts already handles file size and MIME type validation through its limits and fileFilter options. When multer rejects a file, it raises an error that will likely bypass this middleware, preventing these custom error messages from being sent and leading to inconsistent error handling.

It's recommended to centralize this validation. A good approach is to let multer perform the checks and then use a dedicated error-handling middleware for multer errors to format and send user-friendly responses. This would remove the need for the redundant checks in this validateFileUpload function.
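If the team takes that approach, a minimal Express error-handling middleware for multer errors might look like the sketch below. The SendResponse import path is an assumption, and the MulterError codes should be double-checked against the multer documentation.

    import { NextFunction, Request, Response } from 'express'
    import multer from 'multer'
    import { SendResponse } from '../utils/SendResponse' // assumed import path

    // Mounted after the dataset routes, e.g. app.use(handleMulterError)
    export function handleMulterError(
      err: unknown,
      req: Request,
      res: Response,
      next: NextFunction
    ): void {
      if (err instanceof multer.MulterError) {
        // Translate multer's own rejections into the API's usual error shape.
        const message =
          err.code === 'LIMIT_FILE_SIZE'
            ? 'File size exceeds maximum limit of 10MB'
            : `File upload rejected: ${err.code}`
        SendResponse.error(res, message, 400)
        return
      }
      next(err)
    }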

Copilot AI left a comment

Pull request overview

This PR adds comprehensive multi-format file upload support to the LocalMind RAG system, expanding beyond CSV to include PDF, TXT, JSON, and TSV formats. The implementation introduces a robust file handling architecture with validation, automatic format detection, and clean separation of concerns.

Key Changes:

  • Implemented FileLoaderUtils class with format-specific document loaders for CSV, PDF, TXT, JSON, and TSV files
  • Added multer middleware for handling multipart/form-data file uploads with size limits and type validation
  • Introduced comprehensive TypeScript type definitions for file formats and processing results

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 14 comments.

Summary per file:

  • DataSet.type.ts: Defines TypeScript enums and interfaces for file formats, Q&A pairs, validation errors, and upload metadata
  • DataSet.fileLoader.ts: Implements format detection and file loading logic for all supported formats using LangChain loaders
  • DataSet.multer.ts: Configures multer with disk storage, file filtering, and 10MB size limits for secure file uploads
  • DataSet.validator.ts: Provides middleware for validating file uploads by extension, MIME type, and size constraints
  • DataSet.controller.ts: Updates upload handler to support multiple formats with automatic detection and adds formats endpoint
  • DataSet.routes.ts: Changes upload endpoint from GET to POST with multer and validation middleware chain
  • README.md: Comprehensive documentation covering usage, formats, validation rules, and testing examples
  • sample-qa.*: Test sample files demonstrating expected format for each supported file type
  • package.json: Adds pdf-parse, multer, and @types/multer dependencies
Files not reviewed (1)
  • LocalMind-Backend/package-lock.json: Language not supported


Comment on lines +8 to +18
/**
* @route POST /api/v1/dataset/upload
* @desc Upload and process a dataset file (CSV, PDF, TXT, JSON, TSV)
* @access Public (add authentication if needed)
*/
router.post(
'/upload',
upload.single('file'), // 'file' is the field name for the uploaded file
DataSetValidator.validateFileUpload,
DataSetController.uploadDataSet
)
Copilot AI Jan 9, 2026

The route is marked as "Public" in the comment, but file upload endpoints typically require authentication to prevent abuse. Consider adding authentication middleware to protect this endpoint from unauthorized uploads, DoS attacks, or malicious file uploads.
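If the project does not have an auth layer yet, even a simple shared-key middleware in the route chain would help; the header name and environment variable below are assumptions, not part of this PR.

    import { NextFunction, Request, Response } from 'express'

    // Hypothetical guard: require a shared API key before accepting uploads.
    export function requireApiKey(req: Request, res: Response, next: NextFunction): void {
      const provided = req.header('x-api-key')
      if (!provided || provided !== process.env.DATASET_UPLOAD_API_KEY) {
        res.status(401).json({ success: false, message: 'Unauthorized' })
        return
      }
      next()
    }

    // router.post('/upload', requireApiKey, upload.single('file'), DataSetValidator.validateFileUpload, DataSetController.uploadDataSet)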

Comment on lines +65 to +87
// Validate file extension
const fileExtension = file.originalname
.toLowerCase()
.substring(file.originalname.lastIndexOf('.'))

if (!this.ALLOWED_EXTENSIONS.includes(fileExtension)) {
SendResponse.error(
res,
`Invalid file extension. Allowed extensions: ${this.ALLOWED_EXTENSIONS.join(', ')}`,
400
)
return
}

// Validate MIME type
if (!this.ALLOWED_MIME_TYPES.includes(file.mimetype)) {
SendResponse.error(
res,
`Invalid file type. Allowed types: ${this.ALLOWED_MIME_TYPES.join(', ')}`,
400
)
return
}
Copilot AI Jan 9, 2026

The file validation only checks MIME type and extension, but doesn't validate file content. Malicious users could upload files with correct extensions but harmful content. Consider adding content validation, especially for PDF and JSON files, to ensure they are well-formed and don't contain malicious payloads.
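As an illustration of the idea (not code from this PR), JSON uploads could be checked for well-formedness and PDFs for the %PDF- magic bytes before the heavier loading step:

    import * as fs from 'fs'

    // Hypothetical check: reject .json uploads whose content is not parseable JSON.
    async function isWellFormedJson(filePath: string): Promise<boolean> {
      try {
        JSON.parse(await fs.promises.readFile(filePath, 'utf-8'))
        return true
      } catch {
        return false
      }
    }

    // Hypothetical check: a real PDF starts with the '%PDF-' magic bytes.
    async function looksLikePdf(filePath: string): Promise<boolean> {
      const handle = await fs.promises.open(filePath, 'r')
      try {
        const header = Buffer.alloc(5)
        await handle.read(header, 0, 5, 0)
        return header.toString('utf-8') === '%PDF-'
      } finally {
        await handle.close()
      }
    }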

Comment on lines +16 to +21
filename: (req, file, cb) => {
// Generate unique filename with timestamp
const uniqueSuffix = Date.now() + '-' + Math.round(Math.random() * 1e9)
const ext = path.extname(file.originalname)
const basename = path.basename(file.originalname, ext)
cb(null, `${basename}-${uniqueSuffix}${ext}`)
Copilot AI Jan 9, 2026

The filename generation uses Math.random() which is not cryptographically secure. For file uploads that may contain sensitive data, consider using crypto.randomBytes() or crypto.randomUUID() to generate more secure unique identifiers that are harder to predict.
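A possible variant of the stored-filename logic using crypto.randomUUID() (built into node:crypto since Node 14.17) is sketched below; the helper name is hypothetical.

    import * as crypto from 'crypto'
    import * as path from 'path'

    // Sketch: build an unpredictable stored filename for an uploaded file.
    function secureStoredFilename(originalname: string): string {
      const ext = path.extname(originalname)
      const basename = path.basename(originalname, ext)
      return `${basename}-${Date.now()}-${crypto.randomUUID()}${ext}`
    }

    // In the multer config: filename: (req, file, cb) => cb(null, secureStoredFilename(file.originalname))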

Comment on lines +95 to +100
xlsx: 'Excel spreadsheet (not yet fully supported)',
tsv: 'Tab-separated values file',
json: 'JSON file with Q&A pairs',
pdf: 'PDF document',
txt: 'Plain text file',
},
Copilot AI Jan 9, 2026

The description states "Excel spreadsheet (not yet fully supported)" which is misleading. The XLSX format is included in the enum and allowed MIME types, but the implementation explicitly throws an error. Either remove XLSX from the supported formats list until it's implemented, or provide partial support with clear documentation about limitations.

Comment on lines +69 to +72
// Clean up uploaded file on error
if (req.file && fs.existsSync(req.file.path)) {
fs.unlinkSync(req.file.path)
}
Copilot AI Jan 9, 2026

If the file cleanup fails (e.g., due to permissions or file locks), the error is silently ignored. Consider logging cleanup failures so administrators can identify and resolve issues with orphaned files that couldn't be deleted.

import * as fs from 'fs'

// Ensure uploads directory exists
const uploadsDir = path.join(process.cwd(), 'uploads', 'datasets')
Copilot AI Jan 9, 2026

The uploads directory path is created using process.cwd() which can be fragile if the working directory changes at runtime. Consider using __dirname or a configuration-based approach to ensure the uploads directory is always relative to the application root, regardless of where the process is started from.

Suggested change
const uploadsDir = path.join(process.cwd(), 'uploads', 'datasets')
const uploadsDir = path.resolve(
__dirname,
'..',
'..',
'..',
'..',
'..',
'uploads',
'datasets'
)

Comment on lines +33 to +72
if (fs.existsSync(filePath)) {
fs.unlinkSync(filePath)
}
SendResponse.error(
res,
'Unsupported file format. Supported formats: CSV, PDF, TXT, JSON, TSV',
400
)
return
}

// Load documents from file
const documents = await FileLoaderUtils.loadFile(filePath, fileFormat)

if (!documents || documents.length === 0) {
if (fs.existsSync(filePath)) {
fs.unlinkSync(filePath)
}
SendResponse.error(res, 'No data found in the uploaded file', 400)
return
}

const loader = new CSVLoader(filePath)
const documents = await loader.load()
// Process the dataset
const processedData = await DataSetService.Prepate_DataSet(documents)

const Prepare_dataSet = await DataSetService.Prepate_DataSet(documents)
// Clean up uploaded file after processing
if (fs.existsSync(filePath)) {
fs.unlinkSync(filePath)
}

SendResponse.success(
res,
'Dataset uploaded and processed successfully',
JSON.parse(Prepare_dataSet)
`Dataset uploaded and processed successfully (Format: ${fileFormat.toUpperCase()})`,
JSON.parse(processedData)
)
} catch (error: any) {
SendResponse.error(res, 'Failed to upload dataset', 500, error)
// Clean up uploaded file on error
if (req.file && fs.existsSync(req.file.path)) {
fs.unlinkSync(req.file.path)
}
Copilot AI Jan 9, 2026

The file cleanup logic is duplicated in multiple places (lines 33-35, 48-50, 59-61, 70-72). Consider extracting this into a helper function like cleanupFile(filePath: string) to reduce code duplication and ensure consistent cleanup behavior across all code paths.
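A possible shape for such a helper (a sketch, not code from the PR), which also logs failures instead of silently ignoring them:

    import * as fs from 'fs'

    // Hypothetical helper: best-effort removal of an uploaded file, with logging on failure.
    export async function cleanupFile(filePath: string): Promise<void> {
      try {
        if (fs.existsSync(filePath)) {
          await fs.promises.unlink(filePath)
        }
      } catch (err) {
        console.error(`Failed to clean up uploaded file ${filePath}:`, err)
      }
    }

    // Usage in the controller: await cleanupFile(req.file.path)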

Comment on lines +6 to +10
const uploadsDir = path.join(process.cwd(), 'uploads', 'datasets')
if (!fs.existsSync(uploadsDir)) {
fs.mkdirSync(uploadsDir, { recursive: true })
}

Copilot AI Jan 9, 2026

The uploaded files are stored in uploads/datasets directory but there's no mechanism to clean up old files. If files fail during processing or users upload many files, this could lead to disk space issues. Consider implementing a cleanup job or using temporary storage with TTL (time-to-live) to automatically remove old uploaded files.

Suggested change
const uploadsDir = path.join(process.cwd(), 'uploads', 'datasets')
if (!fs.existsSync(uploadsDir)) {
fs.mkdirSync(uploadsDir, { recursive: true })
}
const uploadsDir = path.join(process.cwd(), 'uploads', 'datasets')
// TTL (time-to-live) for uploaded files (e.g., 24 hours)
const UPLOAD_TTL_MS = 24 * 60 * 60 * 1000
// Interval at which to run cleanup (e.g., every hour)
const CLEANUP_INTERVAL_MS = 60 * 60 * 1000
async function cleanupOldUploads() {
try {
// If the directory does not exist, nothing to clean up
if (!fs.existsSync(uploadsDir)) {
return
}
const now = Date.now()
const files = await fs.promises.readdir(uploadsDir)
await Promise.all(
files.map(async (file) => {
const filePath = path.join(uploadsDir, file)
try {
const stat = await fs.promises.stat(filePath)
if (stat.isFile() && now - stat.mtimeMs > UPLOAD_TTL_MS) {
await fs.promises.unlink(filePath)
}
} catch {
// Ignore errors for individual files to avoid impacting the main process
}
})
)
} catch {
// Swallow errors from the cleanup job to avoid crashing the application
}
}
if (!fs.existsSync(uploadsDir)) {
fs.mkdirSync(uploadsDir, { recursive: true })
}
// Run an initial cleanup on startup
cleanupOldUploads()
// Schedule periodic cleanup of old uploaded files
const cleanupInterval = setInterval(cleanupOldUploads, CLEANUP_INTERVAL_MS)
// In Node.js environments, unref prevents the timer from keeping the event loop alive
if (typeof (cleanupInterval as any).unref === 'function') {
;(cleanupInterval as any).unref()
}

export interface ValidationError {
row?: number
fieldName: string
error_message: string
Copilot AI Jan 9, 2026

The field name error_message uses snake_case which is inconsistent with the TypeScript convention of using camelCase for property names. Consider renaming to errorMessage for consistency with other fields like fieldName.

Suggested change
error_message: string
errorMessage: string

Comment on lines +201 to +202
public getFileMetadata(filePath: string): UploadedFileMetadata {
const stats = fs.statSync(filePath)
Copilot AI Jan 9, 2026

Using synchronous file system operations (fs.readFileSync, fs.statSync) can block the Node.js event loop. Consider using async alternatives (fs.promises.readFile, fs.promises.stat) with await for better performance in the async methods loadTXT, loadJSON, and getFileMetadata.

Suggested change
public getFileMetadata(filePath: string): UploadedFileMetadata {
const stats = fs.statSync(filePath)
public async getFileMetadata(filePath: string): Promise<UploadedFileMetadata> {
const stats = await fs.promises.stat(filePath)
