Add streaming import/export and file-ingestion paths to avoid whole-dataset buffering in the Node client #3

@cferrys

Description

Summary

The Node client appears to expose bulk-oriented document operations (Documents.import(), Documents.export(), Documents.addPDF(), Documents.addCSV()) while emphasizing high performance and zero external runtime dependencies. The highest-impact gap is the lack of an explicit streaming/backpressure-aware ingestion path: large imports and parsed file payloads are likely materialized fully in memory before being sent over HTTP, which will become the main bottleneck long before hlquery or RocksDB does.
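As a rough illustration of the bounded-memory alternative, the sketch below chunks a large document array into fixed-size batches so that each request body stays small. `chunkDocuments` and `sendBatch` are hypothetical names for illustration, not part of the current client API:

```javascript
// Hypothetical helper: yield bounded batches from a large document array
// so each HTTP request body stays small, instead of serializing the whole
// dataset into one payload. Not part of the current hlquery client API.
function* chunkDocuments(documents, batchSize = 1000) {
  for (let i = 0; i < documents.length; i += batchSize) {
    yield documents.slice(i, i + batchSize);
  }
}

// Send batches sequentially so at most one batch is in flight at a time.
// `sendBatch` stands in for whatever request method the client exposes.
async function importInBatches(documents, sendBatch, batchSize = 1000) {
  let sent = 0;
  for (const batch of chunkDocuments(documents, batchSize)) {
    await sendBatch(batch);
    sent += batch.length;
  }
  return sent;
}
```

Even this simple batching caps per-request memory; a true streaming path (below) goes further by never materializing the full array at all.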

Context

The recent README expansion documents a broad document pipeline surface area in README.md: bulk import/export, local PDF parsing, local CSV parsing, and helper methods that normalize extracted content before indexing. That points directly at lib/Documents.js, lib/Request.js, and utils/Validator.js as the core execution path for potentially large payloads.

This matters because hlquery is positioned as a high-performance search engine/database wrapper around RocksDB. If the Node client buffers entire CSVs, PDFs, export responses, or bulk document arrays in process memory, the client becomes the throughput and reliability ceiling for indexing jobs. A single large import can cause excessive heap growth, long GC pauses, request retries with oversized bodies, and poor behavior under concurrent ingestion workloads. The first-commit state and the README’s “tests (for future use)” note also suggest this area may have grown faster than its performance-validation coverage.

Proposed Implementation

  1. Add a streaming request mode in lib/Request.js that accepts Readable bodies, supports chunked transfer, and preserves backpressure instead of serializing everything up front.
  2. Extend lib/Documents.js with explicit streaming APIs, for example importStream(), exportStream(), and chunked helpers for large arrays so callers can choose bounded-memory ingestion.
  3. Refactor addCSV() to parse rows incrementally and flush documents in configurable batches rather than building one large in-memory payload.
  4. Refactor addPDF() to support a bounded-size ingestion path for extracted text and metadata, with clear limits and failure modes when a file exceeds configured thresholds.
  5. Add config knobs for batch size, max in-flight bytes, request timeout, and retry policy so ingestion can be tuned for different deployment profiles.
  6. Add tests and benchmarks that cover large imports/exports, memory usage under load, and concurrent ingestion to prevent regressions.

Impact

This directly improves the most important property of a client for a high-performance search system: sustained ingestion throughput without client-side instability. Bounded-memory streaming will reduce heap pressure, improve reliability for large indexing jobs, make the Node client usable for real production backfills, and align the client’s behavior with hlquery’s performance-oriented positioning.

Metadata

Labels

documentation (Improvements or additions to documentation), enhancement (New feature or request)
