-
Notifications
You must be signed in to change notification settings - Fork 44
feat: Add multi-format file upload support for RAG (PDF, TXT, JSON, TSV) #90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -1,27 +1,106 @@ | ||||||||||||||||||||||||||
| import { Request, Response } from 'express' | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| import { CSVLoader } from '@langchain/community/document_loaders/fs/csv' | ||||||||||||||||||||||||||
| import path from 'path' | ||||||||||||||||||||||||||
| import * as fs from 'fs' | ||||||||||||||||||||||||||
| import { SendResponse } from '../../../../utils/SendResponse.utils' | ||||||||||||||||||||||||||
| import DataSetService from './DataSet.service' | ||||||||||||||||||||||||||
| import FileLoaderUtils from './DataSet.fileLoader' | ||||||||||||||||||||||||||
| import { FileFormat } from './DataSet.type' | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| class DataSetController { | ||||||||||||||||||||||||||
| /** | ||||||||||||||||||||||||||
| * Upload and process dataset with support for multiple file formats | ||||||||||||||||||||||||||
| * Supports: CSV, PDF, TXT, JSON, TSV | ||||||||||||||||||||||||||
| */ | ||||||||||||||||||||||||||
| public async uploadDataSet(req: Request, res: Response): Promise<void> { | ||||||||||||||||||||||||||
| try { | ||||||||||||||||||||||||||
| const filePath = path.join(path.resolve(), 'src', 'data', 'Sample.csv') | ||||||||||||||||||||||||||
| // Check if file was uploaded | ||||||||||||||||||||||||||
| if (!req.file) { | ||||||||||||||||||||||||||
| SendResponse.error(res, 'No file uploaded', 400) | ||||||||||||||||||||||||||
| return | ||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| const uploadedFile = req.file | ||||||||||||||||||||||||||
| const filePath = uploadedFile.path | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| // Detect file format | ||||||||||||||||||||||||||
| const fileFormat = FileLoaderUtils.detectFileFormat( | ||||||||||||||||||||||||||
| uploadedFile.originalname, | ||||||||||||||||||||||||||
| uploadedFile.mimetype | ||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| if (!fileFormat) { | ||||||||||||||||||||||||||
| // Clean up uploaded file | ||||||||||||||||||||||||||
| if (fs.existsSync(filePath)) { | ||||||||||||||||||||||||||
| fs.unlinkSync(filePath) | ||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||
| SendResponse.error( | ||||||||||||||||||||||||||
| res, | ||||||||||||||||||||||||||
| 'Unsupported file format. Supported formats: CSV, PDF, TXT, JSON, TSV', | ||||||||||||||||||||||||||
| 400 | ||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||
| return | ||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| // Load documents from file | ||||||||||||||||||||||||||
| const documents = await FileLoaderUtils.loadFile(filePath, fileFormat) | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| if (!documents || documents.length === 0) { | ||||||||||||||||||||||||||
| if (fs.existsSync(filePath)) { | ||||||||||||||||||||||||||
| fs.unlinkSync(filePath) | ||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||
| SendResponse.error(res, 'No data found in the uploaded file', 400) | ||||||||||||||||||||||||||
| return | ||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| const loader = new CSVLoader(filePath) | ||||||||||||||||||||||||||
| const documents = await loader.load() | ||||||||||||||||||||||||||
| // Process the dataset | ||||||||||||||||||||||||||
| const processedData = await DataSetService.Prepate_DataSet(documents) | ||||||||||||||||||||||||||
|
||||||||||||||||||||||||||
| const processedData = await DataSetService.Prepate_DataSet(documents) | |
| const processedData = await (DataSetService as any)['Prepate_DataSet'](documents) |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using synchronous file system operations (fs.existsSync, fs.unlinkSync, fs.readFileSync) can block the Node.js event loop. Consider using async alternatives (fs.promises.access, fs.promises.unlink, fs.promises.readFile) with try-catch blocks for better performance, especially since this is already in an async function.
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the file cleanup fails (e.g., due to permissions or file locks), the error is silently ignored. Consider logging cleanup failures so administrators can identify and resolve issues with orphaned files that couldn't be deleted.
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file cleanup logic is duplicated in multiple places (lines 33-35, 48-50, 59-61, 70-72). Consider extracting this into a helper function like cleanupFile(filePath: string) to reduce code duplication and ensure consistent cleanup behavior across all code paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exposing raw error messages to the client can be a security risk, as it might leak sensitive information about the application's internals (e.g., file paths, library issues). It's better to log the detailed error on the server for debugging and send a generic error message to the client.
| SendResponse.error( | |
| res, | |
| 'Failed to upload and process dataset', | |
| 500, | |
| error.message | |
| ) | |
| console.error('Failed to upload and process dataset:', error); | |
| SendResponse.error( | |
| res, | |
| 'Failed to upload and process dataset', | |
| 500 | |
| ) |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description states "Excel spreadsheet (not yet fully supported)" which is misleading. The XLSX format is included in the enum and allowed MIME types, but the implementation explicitly throws an error. Either remove XLSX from the supported formats list until it's implemented, or provide partial support with clear documentation about limitations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The getSupportedFormats endpoint currently lists 'xlsx' as a supported format, but the implementation in DataSet.fileLoader.ts throws an error because it's not yet implemented. This can be misleading for API consumers. It's better to remove 'xlsx' from the list of supported formats until it is fully functional.
const formats = Object.values(FileFormat).filter(
(f) => f !== FileFormat.XLSX
)
SendResponse.success(res, 'Supported file formats', {
formats,
description: {
csv: 'Comma-separated values file',
tsv: 'Tab-separated values file',
json: 'JSON file with Q&A pairs',
pdf: 'PDF document',
txt: 'Plain text file',
},
})| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,237 @@ | ||||||||||||||
| import { CSVLoader } from '@langchain/community/document_loaders/fs/csv' | ||||||||||||||
| import { PDFLoader } from '@langchain/community/document_loaders/fs/pdf' | ||||||||||||||
| import { Document } from '@langchain/core/documents' | ||||||||||||||
| import { FileFormat, UploadedFileMetadata } from './DataSet.type' | ||||||||||||||
| import * as fs from 'fs' | ||||||||||||||
| import * as path from 'path' | ||||||||||||||
|
|
||||||||||||||
| /** | ||||||||||||||
| * File loader class to handle different file formats | ||||||||||||||
| */ | ||||||||||||||
| class FileLoaderUtils { | ||||||||||||||
| /** | ||||||||||||||
| * Detect file format from file extension or MIME type | ||||||||||||||
| */ | ||||||||||||||
| public detectFileFormat(filename: string, mimeType?: string): FileFormat | null { | ||||||||||||||
| const extension = path.extname(filename).toLowerCase().slice(1) | ||||||||||||||
|
|
||||||||||||||
| const formatMap: Record<string, FileFormat> = { | ||||||||||||||
| csv: FileFormat.CSV, | ||||||||||||||
| xlsx: FileFormat.XLSX, | ||||||||||||||
| xls: FileFormat.XLSX, | ||||||||||||||
| tsv: FileFormat.TSV, | ||||||||||||||
| json: FileFormat.JSON, | ||||||||||||||
| pdf: FileFormat.PDF, | ||||||||||||||
| txt: FileFormat.TXT, | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| // Try by extension first | ||||||||||||||
| if (formatMap[extension]) { | ||||||||||||||
| return formatMap[extension] | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| // Try by MIME type | ||||||||||||||
| if (mimeType) { | ||||||||||||||
| const mimeFormatMap: Record<string, FileFormat> = { | ||||||||||||||
| 'text/csv': FileFormat.CSV, | ||||||||||||||
| 'application/vnd.ms-excel': FileFormat.XLSX, | ||||||||||||||
| 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': | ||||||||||||||
| FileFormat.XLSX, | ||||||||||||||
| 'text/tab-separated-values': FileFormat.TSV, | ||||||||||||||
| 'application/json': FileFormat.JSON, | ||||||||||||||
| 'application/pdf': FileFormat.PDF, | ||||||||||||||
| 'text/plain': FileFormat.TXT, | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| if (mimeFormatMap[mimeType]) { | ||||||||||||||
| return mimeFormatMap[mimeType] | ||||||||||||||
| } | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| return null | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| /** | ||||||||||||||
| * Load documents from a CSV file | ||||||||||||||
| */ | ||||||||||||||
| private async loadCSV(filePath: string): Promise<Document[]> { | ||||||||||||||
| const loader = new CSVLoader(filePath) | ||||||||||||||
| return await loader.load() | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| /** | ||||||||||||||
| * Load documents from a PDF file | ||||||||||||||
| */ | ||||||||||||||
| private async loadPDF(filePath: string): Promise<Document[]> { | ||||||||||||||
| const loader = new PDFLoader(filePath, { | ||||||||||||||
| splitPages: false, // Load entire PDF as one document | ||||||||||||||
| }) | ||||||||||||||
| return await loader.load() | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| /** | ||||||||||||||
| * Load documents from a TXT file | ||||||||||||||
| */ | ||||||||||||||
| private async loadTXT(filePath: string): Promise<Document[]> { | ||||||||||||||
| try { | ||||||||||||||
| const fileContent = fs.readFileSync(filePath, 'utf-8') | ||||||||||||||
|
|
||||||||||||||
| // Split by double newlines to separate Q&A pairs or paragraphs | ||||||||||||||
| const sections = fileContent | ||||||||||||||
| .split(/\n\n+/) | ||||||||||||||
| .filter((section) => section.trim().length > 0) | ||||||||||||||
|
|
||||||||||||||
| if (sections.length === 0) { | ||||||||||||||
| // If no double newlines, treat entire content as one document | ||||||||||||||
| return [ | ||||||||||||||
| new Document({ | ||||||||||||||
| pageContent: fileContent, | ||||||||||||||
| metadata: { source: filePath }, | ||||||||||||||
| }), | ||||||||||||||
| ] | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| // Create a document for each section | ||||||||||||||
| const documents = sections.map((section, index) => { | ||||||||||||||
| return new Document({ | ||||||||||||||
| pageContent: section.trim(), | ||||||||||||||
| metadata: { source: filePath, section: index }, | ||||||||||||||
| }) | ||||||||||||||
| }) | ||||||||||||||
|
|
||||||||||||||
| return documents | ||||||||||||||
| } catch (error) { | ||||||||||||||
| throw new Error(`Failed to load TXT file: ${error}`) | ||||||||||||||
|
||||||||||||||
| throw new Error(`Failed to load TXT file: ${error}`) | |
| throw new Error(`Failed to load TXT file: ${(error as Error).message}`) |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error handling at lines 139 and 194 wraps errors in a way that loses the original error stack trace. Consider preserving the original error information by using throw new Error(\Failed to parse JSON file: ${(error as Error).message}`, { cause: error })` or similar pattern to maintain debugging context.
| throw new Error(`Failed to load file: ${error}`) | |
| if (error instanceof Error) { | |
| throw new Error(`Failed to load file: ${error.message}`, { cause: error }) | |
| } | |
| throw new Error(`Failed to load file: ${String(error)}`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrapping the caught error object directly in a new Error constructor will stringify it (often to [object Object]), losing valuable information like the original error's stack trace and type. To preserve this information for better debugging, you should re-throw the original error or create a new error that includes the original error's message.
| } catch (error) { | |
| throw new Error(`Failed to load file: ${error}`) | |
| } | |
| } catch (error: any) { | |
| throw new Error(`Failed to load file: ${error.message}`) | |
| } |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using synchronous file system operations (fs.readFileSync, fs.statSync) can block the Node.js event loop. Consider using async alternatives (fs.promises.readFile, fs.promises.stat) with await for better performance in the async methods loadTXT, loadJSON, and getFileMetadata.
| public getFileMetadata(filePath: string): UploadedFileMetadata { | |
| const stats = fs.statSync(filePath) | |
| public async getFileMetadata(filePath: string): Promise<UploadedFileMetadata> { | |
| const stats = await fs.promises.stat(filePath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using
fs.unlinkSyncis synchronous and blocks the Node.js event loop. In a server environment, this can degrade performance, especially under load. It's recommended to use the asynchronous version,fs.promises.unlink, to avoid blocking. This comment also applies to the other uses ofunlinkSyncin this file (lines 49, 60, and 71).