TextBlaster is a Rust-based tool designed for efficient, distributed processing of large text datasets stored in Parquet format. It leverages a producer-worker architecture using RabbitMQ for task queuing, enabling scalable text cleaning, filtering, and transformation pipelines.
This tool is particularly useful for preprocessing large NLP datasets, such as web scrapes or Common Crawl data, by applying quality filters (like C4 and Gopher heuristics) in parallel.
- Distributed Processing: Uses a producer/worker model with RabbitMQ for scalable task distribution.
- Parquet I/O: Reads input data from Parquet files and writes processed results back to Parquet.
- Configurable Pipeline: Define a sequence of processing steps (filters, transformations) to apply to each text document.
- Implemented Filters: Includes common NLP dataset quality filters:
  - `C4QualityFilter`: Basic quality checks (sentence count, word count per sentence, word length, punctuation).
  - `GopherRepetitionFilter`: Detects and filters documents with excessive repetition of lines, paragraphs, or n-grams.
  - `GopherQualityFilter`: Applies heuristics from the Gopher paper (word count, average word length, symbol ratios, stop words, etc.).
- Asynchronous: Built with `tokio` and `async-trait` for efficient I/O and concurrent processing.
- Command-Line Interface: Easy-to-use CLI powered by `clap` for configuring and running the producer and workers.
The system consists of two main components: a producer and one or more worker instances, communicating via RabbitMQ.
```mermaid
graph LR
    A[Input Parquet File] --> P(Producer);
    P -->|"Task (JSON Doc)"| Q1(RabbitMQ Task Queue);
    Q1 -->|"Task (JSON Doc)"| W1(Worker 1);
    Q1 -->|"Task (JSON Doc)"| W2(Worker 2);
    Q1 -->|"Task (JSON Doc)"| WN(Worker N);
    W1 -->|"Result (JSON Doc)"| Q2(RabbitMQ Results Queue);
    W2 -->|"Result (JSON Doc)"| Q2;
    WN -->|"Result (JSON Doc)"| Q2;
    Q2 -->|"Result (JSON Doc)"| P;
    P --> O[Output Parquet File];
```
- Producer:
  - Reads `TextDocument` records from the input Parquet file.
  - Serializes each document into JSON.
  - Publishes each JSON document as a task message to the RabbitMQ `task_queue`.
  - Waits for results from the `results_queue`.
  - Consumes result messages (processed JSON documents).
  - Deserializes results and writes them in batches to the output Parquet file.
- Worker(s):
  - Connects to RabbitMQ and consumes task messages from the `task_queue`.
  - Deserializes the JSON task message back into a `TextDocument`.
  - Processes the document through a pre-defined pipeline of `ProcessingStep`s (e.g., filters); a simplified sketch of this loop follows the list.
  - If the document passes all steps, serializes the processed `TextDocument` into JSON.
  - Publishes the JSON result message to the RabbitMQ `results_queue`.
  - Acknowledges the original task message. If processing fails or a filter removes the document, the task is still acknowledged, but no result is sent.
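Conceptually, one iteration of the worker's consume loop looks roughly like the sketch below. This is an illustrative sketch only, assuming the `lapin` AMQP client, `serde_json`, and a stand-in `run_pipeline` function; the actual logic lives in `src/bin/worker.rs` and may differ in crate choice and details.

```rust
use futures_util::StreamExt;
use lapin::{
    options::{BasicAckOptions, BasicConsumeOptions, BasicPublishOptions},
    types::FieldTable,
    BasicProperties, Connection, ConnectionProperties,
};
use std::collections::HashMap;

// Minimal stand-in for the TextDocument struct shown later in this README.
#[derive(serde::Serialize, serde::Deserialize)]
struct TextDocument {
    id: String,
    source: String,
    content: String,
    metadata: HashMap<String, String>,
}

// Stand-in for the real pipeline executor; an Err means "filtered or failed".
fn run_pipeline(doc: TextDocument) -> Result<TextDocument, String> {
    Ok(doc)
}

async fn run_worker(amqp_addr: &str) -> Result<(), Box<dyn std::error::Error>> {
    let conn = Connection::connect(amqp_addr, ConnectionProperties::default()).await?;
    let channel = conn.create_channel().await?;

    let mut consumer = channel
        .basic_consume(
            "task_queue",
            "worker",
            BasicConsumeOptions::default(),
            FieldTable::default(),
        )
        .await?;

    while let Some(delivery) = consumer.next().await {
        let delivery = delivery?;

        // 1. Deserialize the JSON task back into a TextDocument.
        let doc: TextDocument = serde_json::from_slice(&delivery.data)?;

        // 2. Run the document through the processing pipeline.
        if let Ok(processed) = run_pipeline(doc) {
            // 3. Publish the processed document to the results queue.
            let payload = serde_json::to_vec(&processed)?;
            channel
                .basic_publish(
                    "",
                    "results_queue",
                    BasicPublishOptions::default(),
                    &payload,
                    BasicProperties::default(),
                )
                .await?;
        }
        // If the pipeline filtered the document or failed, no result is sent.

        // 4. Acknowledge the original task either way.
        delivery.ack(BasicAckOptions::default()).await?;
    }
    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    run_worker("amqp://guest:guest@localhost:5672/%2f").await
}
```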
- Rust: Version 1.65 or later (due to `edition = "2021"` and dependencies). Install via rustup.
- Cargo: Included with the Rust installation.
- RabbitMQ Server: A running RabbitMQ instance. You can run one easily using Docker:

  ```bash
  docker run -d --hostname my-rabbit --name some-rabbit -p 5672:5672 -p 15672:15672 rabbitmq:3-management
  ```

  The default connection string `amqp://guest:guest@localhost:5672/%2f` should work with this setup.
- Clone the repository:

  ```bash
  git clone https://github.com/kris927b/TextBlaster.git
  cd TextBlaster
  ```

- Build the binaries:

  ```bash
  cargo build --release
  ```

  This will create optimized executables at `target/release/producer` and `target/release/worker`.
Configuration is primarily handled via command-line arguments for both the producer and worker.
- Producer: Specify input/output paths, column names, and RabbitMQ details.
- Worker: Specify RabbitMQ details and queue names.
Pipeline Configuration: The sequence of processing steps applied by the worker is defined in the `config/pipeline_config.yaml` file. This allows for dynamic configuration of the pipeline without recompiling the worker binary.

The `pipeline_config.yaml` file specifies a list of processing steps to be executed in order. Each step is defined by its `type` and any necessary `parameters`.

Example `pipeline_config.yaml`:
```yaml
steps:
  - type: C4QualityFilter
    parameters:
      min_sentences: 3
      min_words_per_sentence: 5
      max_word_length: 100
      must_end_with_punct: true
  - type: GopherRepetitionFilter
    parameters:
      # ... gopher repetition parameters ...
  - type: LanguageDetectionFilter
    parameters:
      languages: ["en"]
```

To customize the pipeline, modify the `config/pipeline_config.yaml` file to add, remove, or reorder steps, or to change filter parameters.
Input Parquet Config: The `ParquetInputConfig` struct (`src/config.rs`) defines how the producer reads the input Parquet file (path, column names, batch size). These values are passed via CLI arguments to the producer.
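Purely as an illustration of the shape described above (field names here are assumptions, not the actual definition in `src/config.rs`):

```rust
// Illustrative shape of the producer's input configuration; the actual
// ParquetInputConfig may use different field names and types.
pub struct ParquetInputConfig {
    pub path: String,              // Path to the input Parquet file
    pub text_column: String,       // Column holding the document text
    pub id_column: Option<String>, // Optional column holding document IDs
    pub batch_size: usize,         // Number of rows read per record batch
}
```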
Ensure your RabbitMQ server is running before starting the producer or workers.
- Start Worker(s): Open one or more terminal windows and run the worker executable. You can run multiple workers concurrently on the same or different machines to parallelize processing.

  ```bash
  ./target/release/worker \
    --amqp-addr "amqp://guest:guest@localhost:5672/%2f" \
    --task-queue "task_queue" \
    --results-queue "results_queue" \
    --prefetch-count 10 # Optional: Adjust based on task processing time
  ```

  - `--amqp-addr` (or `-a`): RabbitMQ connection string.
  - `--task-queue` (or `-t`): Name of the queue to consume tasks from (default: `task_queue`).
  - `--results-queue` (or `-r`): Name of the queue to publish results to (default: `results_queue`).
  - `--prefetch-count`: How many messages a worker receives at once.
- Start the Producer: In a separate terminal, run the producer executable to start reading the input file and dispatching tasks.

  ```bash
  ./target/release/producer \
    --input-file "/path/to/your/input_data.parquet" \
    --output-file "/path/to/your/processed_output.parquet" \
    --text-column "text" \
    --id-column "document_id" `# Optional: Specify if you have an ID column` \
    --amqp-addr "amqp://guest:guest@localhost:5672/%2f" \
    --task-queue "task_queue" \
    --results-queue "results_queue"
  ```

  - `--input-file` (or `-i`): Path to the input Parquet file.
  - `--output-file` (or `-o`): Path where the processed Parquet file will be written (default: `output_processed.parquet`).
  - `--text-column`: Name of the column containing the main text content (default: `text`).
  - `--id-column`: (Optional) Name of the column to use as the document ID. If not provided, IDs are generated based on file path and row number.
  - `--amqp-addr` (or `-a`): RabbitMQ connection string.
  - `--task-queue` (or `-q`): Name of the queue to publish tasks to (default: `task_queue`).
  - `--results-queue` (or `-r`): Name of the queue to consume results from (default: `results_queue`).
The producer will read the input, send tasks, wait for all corresponding results, write the output Parquet file, and then exit. Workers will continue running until stopped manually (e.g., with Ctrl+C).
Both the producer and worker binaries expose an optional Prometheus metrics HTTP endpoint. This allows you to monitor the application's performance and state using Prometheus and visualization tools like Grafana.
To enable the metrics endpoint, use the `--metrics-port` command-line argument when starting the producer or worker:

```bash
# Run producer with metrics on port 8080
./target/release/producer --input-file ... --metrics-port 8080

# Run worker with metrics on port 8081
./target/release/worker --metrics-port 8081
```

Replace `8080` and `8081` with your desired ports.
Once enabled, you can access the metrics by sending an HTTP GET request to the /metrics path on the specified port (e.g., http://localhost:8080/metrics or http://your_server_ip:8081/metrics).
The exposed metrics include (but are not limited to):
Producer Metrics:
- `producer_tasks_published_total`: Total number of tasks successfully published to the queue.
- `producer_task_publish_errors_total`: Total errors during task publishing (serialization, broker issues).
- `producer_results_received_total`: Total number of outcomes received from the results queue.
- `producer_results_success_total`: Total number of successfully processed documents (from outcomes).
- `producer_results_filtered_total`: Total number of documents filtered by workers (from outcomes).
- `producer_result_deserialization_errors_total`: Total errors deserializing outcome messages.
- `producer_active_tasks_in_flight`: Number of tasks published but not yet resolved by workers.
- `producer_task_publishing_duration_seconds`: Histogram of task publishing latencies.
Worker Metrics:
- `worker_tasks_processed_total`: Total number of tasks successfully processed by the worker pipeline.
- `worker_tasks_filtered_total`: Total number of tasks filtered out by the worker pipeline.
- `worker_tasks_failed_total`: Total number of tasks that resulted in a pipeline error.
- `worker_task_deserialization_errors_total`: Total errors deserializing incoming task messages.
- `worker_outcome_publish_errors_total`: Total errors publishing outcome messages back to the producer.
- `worker_task_processing_duration_seconds`: Histogram of task processing durations.
- `worker_active_processing_tasks`: Number of tasks currently being processed concurrently by the worker.
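As an illustration of how a counter such as `worker_tasks_processed_total` could be defined and rendered in the Prometheus text format, here is a hedged sketch assuming the `prometheus` crate; the actual binaries may register and serve their metrics differently behind the `--metrics-port` endpoint:

```rust
// Illustrative sketch: defining and exporting one of the counters listed
// above with the `prometheus` crate.
use prometheus::{Encoder, IntCounter, Registry, TextEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let registry = Registry::new();
    let tasks_processed = IntCounter::new(
        "worker_tasks_processed_total",
        "Total number of tasks successfully processed by the worker pipeline",
    )?;
    registry.register(Box::new(tasks_processed.clone()))?;

    // Somewhere in the processing loop, after a task succeeds:
    tasks_processed.inc();

    // Roughly what an HTTP handler for /metrics would return:
    let mut buffer = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buffer)?;
    println!("{}", String::from_utf8(buffer)?);
    Ok(())
}
```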
The core processing logic resides in the pipeline executor (`src/executor.rs`), which runs a series of steps implementing the `ProcessingStep` trait. Each step takes a `TextDocument` and returns a `Result<TextDocument>`. If any step returns an `Err`, processing for that document stops. A specific `PipelineError::DocumentFiltered` error is used to indicate that a document was intentionally filtered out.
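In simplified form, that contract could be written as follows. This is a hedged, synchronous sketch: the actual trait in `src/executor.rs` is likely async (via `async-trait`), its exact signature and the fields of `PipelineError::DocumentFiltered` may differ, and `MinLengthFilter` is a hypothetical example rather than one of the shipped filters.

```rust
use std::collections::HashMap;

pub struct TextDocument {
    pub id: String,
    pub source: String,
    pub content: String,
    pub metadata: HashMap<String, String>,
}

#[derive(Debug)]
pub enum PipelineError {
    // Signals that a step intentionally dropped the document.
    DocumentFiltered { step: String, reason: String },
    // Any other processing failure.
    Other(String),
}

pub trait ProcessingStep {
    fn name(&self) -> &'static str;
    fn process(&self, doc: TextDocument) -> Result<TextDocument, PipelineError>;
}

// The executor runs each step in order and stops at the first error.
pub fn run_pipeline(
    steps: &[Box<dyn ProcessingStep>],
    mut doc: TextDocument,
) -> Result<TextDocument, PipelineError> {
    for step in steps {
        doc = step.process(doc)?;
    }
    Ok(doc)
}

// Hypothetical example step: drops documents shorter than a minimum length.
struct MinLengthFilter {
    min_chars: usize,
}

impl ProcessingStep for MinLengthFilter {
    fn name(&self) -> &'static str {
        "MinLengthFilter"
    }

    fn process(&self, doc: TextDocument) -> Result<TextDocument, PipelineError> {
        if doc.content.chars().count() < self.min_chars {
            Err(PipelineError::DocumentFiltered {
                step: self.name().to_string(),
                reason: format!("fewer than {} characters", self.min_chars),
            })
        } else {
            Ok(doc)
        }
    }
}

fn main() {
    let steps: Vec<Box<dyn ProcessingStep>> =
        vec![Box::new(MinLengthFilter { min_chars: 200 })];
    let doc = TextDocument {
        id: "doc-1".into(),
        source: "example".into(),
        content: "too short".into(),
        metadata: HashMap::new(),
    };
    println!("{:?}", run_pipeline(&steps, doc).err());
}
```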
The current pipeline executed by the worker (in `src/bin/worker.rs`) includes:

- `C4QualityFilter` (`src/pipeline/filters/c4_filters.rs`): Checks for minimum sentence count, minimum words per sentence, maximum word length, and sentence-ending punctuation.
- `GopherRepetitionFilter` (`src/pipeline/filters/gopher_rep.rs`): Filters documents based on thresholds for duplicate lines/paragraphs (by count and character fraction) and duplicate n-grams.
- `GopherQualityFilter` (`src/pipeline/filters/gopher_quality.rs`): Filters documents based on word count, average word length, symbol ratios, bullet/ellipsis line ratios, alphabetic ratio, and stop-word presence.
- `LanguageDetectionFilter` (`src/pipeline/filters/language_filter.rs`): Filters documents based on language. Uses the `whatlang` package for detection.

Refer to the source files in `src/pipeline/filters/` for detailed implementation and parameter explanations.
The central data structure processed by the pipeline is `TextDocument` (`src/data_model.rs`):

```rust
pub struct TextDocument {
    pub id: String,                        // Unique identifier
    pub source: String,                    // Original source (e.g., input file path)
    pub content: String,                   // The main text content
    pub metadata: HashMap<String, String>, // Key-value store for metadata or intermediate results
}
```

When writing to the output Parquet file, the `metadata` field is serialized to a JSON string.
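For example, flattening a metadata map to the JSON string stored in the output column could look like this (illustrative only, with made-up keys; the actual writer code may differ):

```rust
use std::collections::HashMap;

fn main() -> Result<(), serde_json::Error> {
    let mut metadata: HashMap<String, String> = HashMap::new();
    // Hypothetical metadata keys, for illustration only.
    metadata.insert("language".to_string(), "en".to_string());
    metadata.insert("word_count".to_string(), "412".to_string());

    // This JSON string is what would end up in the output Parquet column.
    let metadata_json = serde_json::to_string(&metadata)?;
    println!("{metadata_json}");
    Ok(())
}
```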
The roadmap for TextBlaster is divided into six phases, each with subtasks to complete. Each subtask is intended to be a self-contained PR, and once a phase is done, a new major version can be released. Each phase has a theme that explains how its subtasks are grouped.
Theme: Solidify the core application by enhancing stability, standardizing practices, and improving the development loop. This phase is about building a reliable and maintainable foundation.
- Comprehensive Unit & Integration Testing: Achieve >80% test coverage by writing unit tests for all filter logic and integration tests for the full producer-worker workflow. Use
testcontainersto spin up a live RabbitMQ instance for end-to-end verification. - CI/CD Pipeline Enforcement: Enhance the GitHub Actions workflow to run
cargo fmt --checkandcargo clippy -- -D warningson all pull requests. Addcargo auditto check for security vulnerabilities and automate release drafts upon tagging. - Done - Standardized Logging and Tracing: Fully integrate the
tracingcrate, replacing allprintln!calls. Structure logs as JSON and adddoc_idto tracing spans to correlate all messages for a specific document. Allow log level configuration via CLI. - Done - Refined Error Handling & Messaging: Improve the clarity of all user-facing error messages. Ensure that when a task fails in a worker, the propagated error clearly identifies the worker ID, the failing step, and the root cause.
- Comprehensive Documentation: Create a
docs/folder with Markdown guides on architecture, configuration, and tutorials. AddCONTRIBUTING.mdfor new developers and ensure all public functions and structs have detailed doc comments (///). - Configuration Validation: Implement a
--validate-configflag and an automatic startup check to validate thepipeline_config.yaml. This check will catch syntax errors, schema violations, and logical issues (e.g., a filter running before its dependency). - Done
Theme: Improve the user experience and expand the core capabilities of the processing pipeline, making it more flexible and powerful for common tasks.
- Improved CLI with Subcommands & Dry-Run Mode: Refactor the CLI using `clap` to support subcommands (`producer run`, `worker run`, `validate-config`). Add a `--dry-run` flag to the producer to process a small sample and print results to the console without writing files.
- Introduce Transformation Steps: Extend the `ProcessingStep` trait to allow in-place modification of document content, not just filtering. Implement a `TextNormalizationStep` (e.g., lowercase, NFC normalization, whitespace removal) as the first example.
- Abstract I/O Layer for Multiple Data Formats: Refactor I/O logic behind `DocumentReader` and `DocumentWriter` traits. Implement support for reading and writing additional formats like JSON Lines (.jsonl) and CSV, selectable via CLI arguments.
- Support for Rich, Nested Metadata: Upgrade the `TextDocument` metadata from `HashMap<String, String>` to `HashMap<String, serde_json::Value>` to allow for complex, nested metadata structures from various data sources.
- Implement Graceful Shutdown: Ensure workers can gracefully complete in-flight tasks and acknowledge their messages before shutting down when receiving a termination signal (SIGTERM), minimizing data loss.
- Local Development Environment: Create a `docker-compose.yml` file and a `.env` file template to simplify the setup of a complete local development environment (RabbitMQ, producer, worker) with a single command.
Theme: Identify and eliminate performance bottlenecks, optimize resource usage, and introduce stateful capabilities to prepare the tool for massive-scale datasets.
- Comprehensive Performance Profiling: Use tools like `flamegraph`, `dhat`, and `tokio-console` to profile CPU and memory hotspots in both the producer and worker. Instrument each pipeline step with `tracing` spans to precisely measure its performance.
- Optimize I/O with Asynchronous Operations: Refactor the Parquet reader and writer to use `tokio::fs` and the asynchronous capabilities of the `arrow` and `parquet` crates, eliminating blocking I/O.
- Optimize Message Payloads: Investigate and implement strategies to reduce network overhead. Options include optional payload compression (e.g., Zstd) or a "pointer-based" mode where data is placed in shared storage and the message queue only carries a pointer.
- Implement Job Progress Tracking & Resumability: Implement a mechanism for the producer to save its state (e.g., last processed row ID). Add a `--resume` flag to allow it to recover from a crash and continue processing without starting over.
- Introduce a Stateful Processing Step: Integrate a key-value store like Redis or RocksDB and create a `StatefulProcessingStep` trait. This will enable stateful filters like document deduplication.
- Worker Concurrency & Throughput Tuning: Expose configuration to tune the number of concurrent processing tasks within a single worker, in addition to the RabbitMQ `prefetch_count`, to maximize resource utilization.
Theme: Evolve TextBlaster from a generic tool to a specialized platform for NLP data preparation by adding sophisticated, high-value processing steps.
- Implement Stateful Document Deduplication: Create a `DeduplicationFilter` using MinHash or SimHash algorithms. This step will leverage the stateful processing foundation to identify and filter near-duplicate documents across the entire dataset.
- Add a PII Redaction & Filtering Step: Implement a `PiiFilter` to detect and either remove documents containing Personally Identifiable Information or redact it (e.g., replacing emails with `[REDACTED]`).
- Implement Advanced NLP-Based Filters: Add a suite of more intelligent filters, potentially using `ort` for ONNX model inference. Examples include a `PerplexityFilter` for nonsensical text, a `ToxicityFilter`, and a `QualityScoreFilter`.
- Support for Conditional Pipeline Execution: Enhance the pipeline executor to support conditional logic in the configuration file, allowing a step to run only if a specific metadata condition is met (e.g., `if: metadata.language == "en"`).
- Implement Data Sampling and Splitting: Add a `SamplingStep` for random or stratified sampling. Enhance the producer/writer to split output into multiple files (e.g., train/val/test) based on flags added during processing.
- Dead Letter Queue (DLQ) Management: Configure RabbitMQ to route messages that repeatedly fail to a Dead Letter Queue. Provide a utility command to inspect, re-queue, or discard messages from the DLQ.
Theme: Position TextBlaster within the broader MLOps and data engineering ecosystem, enabling deployment in modern, cloud-native environments.
- Enable Cloud Storage Integration (S3, GCS, etc.): Use a library like `object_store` to enable the producer and writer to read from and write to cloud storage buckets (e.g., `s3://my-bucket/data.parquet`) natively.
- Containerize the Application: Create official `Dockerfile`s for the `producer`, `worker`, and other binaries. Publish the images to a public container registry like GitHub Container Registry or Docker Hub.
- Develop Kubernetes Deployment Configurations: Create a Helm chart or Kustomize configuration to simplify the deployment, scaling, and management of a TextBlaster cluster on Kubernetes.
- Abstract the Message Broker for Alternative Queues: Refactor the RabbitMQ logic behind a `MessageBroker` trait and add a new implementation for another system like Apache Kafka or Redis Streams, making the queuing system configurable.
- Decouple Result Aggregation: Create a new `aggregator` binary responsible for consuming processed results and writing the final output files. This decouples the producer from the writing bottleneck, improving overall system throughput.
- Develop a Python Client Library: Create a thin Python wrapper that allows users to configure, launch, and monitor TextBlaster jobs from Python scripts or Jupyter notebooks, significantly lowering the barrier to entry for data scientists.
Theme: Finalize the user experience with advanced monitoring, reporting, and ease-of-use features that make the tool a pleasure to operate at scale.
- Create a Web UI for Monitoring: Develop a simple, optional web dashboard (e.g., using Axum) that consumes the Prometheus metrics endpoint to visualize key metrics like processing rates, queue depths, error rates, and worker status in real-time.
- Enhance Operational Feedback & Reporting: Improve the CLI progress bars (`indicatif`) to show live counts of succeeded, filtered, and errored tasks. At the end of a run, generate a `run_summary.json` file with final metrics and per-step filter counts.
- Create a 'Cookbook' of Pipeline Recipes: Add a `recipes/` directory to the repository with documented, ready-to-use `pipeline_config.yaml` files for common use cases (e.g., "FineWeb-style cleaning," "PII Redaction," "Deduplication").
- Explore Dynamic Plugin Loading: Investigate a plugin system (e.g., loading from `.so`/`.dll` files) that would allow users to compile and use their own custom Rust-based filters without needing to recompile the main TextBlaster binaries.
- Implement Document-Level Aggregations: Design a new "aggregator" mode or binary that can perform a second pass over the processed data to compute global statistics like TF-IDF scores or a complete dataset vocabulary.
- Launch a Project Website: Create a polished GitHub Pages or standalone website with comprehensive tutorials, the cookbook, benchmark results, and full API documentation to serve as the central hub for the user community.
Contributions are welcome! Please feel free to submit pull requests or open issues for bugs, feature requests, or improvements.
For contribution guidelines, please see CONTRIBUTING.md.
This project is licensed under:
- Apache License, Version 2.0, (LICENSE)