Description
Problem
DsKit currently does not provide a high-level text data cleaning pipeline.
Although the full version installs NLTK, there is no built-in NLTK-based preprocessing utility for text/NLP workflows. Users therefore have to reimplement their own cleaning logic outside the library each time.
Proposed Solution
Introduce a flexible, high-level text cleaning function named apply_nltk that enables NLP/text preprocessing directly within DsKit.
The function:
- Uses NLTK internally
- Offers fine-grained control to users (case handling, stopwords, token processing, etc.)
- Is configurable and reusable across NLP pipelines
- Reduces boilerplate code for common text-cleaning tasks

This would significantly improve DsKit’s usability for NLP and text-heavy datasets.
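To make the idea concrete, here is a minimal sketch of what such a configurable cleaner could look like. It is not the implementation from the Pull Request: the name apply_nltk_sketch and every parameter (lowercase, remove_punctuation, remove_stopwords, stopword_set, stem) are hypothetical and chosen only for illustration. It assumes NLTK is installed and uses wordpunct_tokenize and PorterStemmer, which need no extra data downloads. The English stopword list is loaded only when no custom set is passed, and that path requires the NLTK "stopwords" corpus to be downloaded first.

```python
from typing import Iterable, List, Optional, Set

from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize


def apply_nltk_sketch(
    texts: Iterable[str],
    lowercase: bool = True,
    remove_punctuation: bool = True,
    remove_stopwords: bool = True,
    stopword_set: Optional[Set[str]] = None,
    stem: bool = False,
) -> List[str]:
    """Hypothetical text cleaner; all parameter names are illustrative."""
    if remove_stopwords and stopword_set is None:
        # Needs the corpus: run nltk.download("stopwords") once beforehand.
        from nltk.corpus import stopwords
        stopword_set = set(stopwords.words("english"))

    stemmer = PorterStemmer() if stem else None
    cleaned = []
    for text in texts:
        if lowercase:
            text = text.lower()
        # Regex-based tokenizer: splits runs of word characters from punctuation.
        tokens = wordpunct_tokenize(text)
        if remove_punctuation:
            # Keep only tokens that contain at least one letter or digit.
            tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
        if remove_stopwords:
            tokens = [t for t in tokens if t.lower() not in stopword_set]
        if stemmer is not None:
            tokens = [stemmer.stem(t) for t in tokens]
        cleaned.append(" ".join(tokens))
    return cleaned


if __name__ == "__main__":
    docs = ["The cats are running!", "Hello, World!"]
    # A custom stopword set avoids the corpus download in this demo.
    print(apply_nltk_sketch(docs, stopword_set={"the", "are"}, stem=True))
```

Each step is behind its own flag, so a caller can switch off, say, stopword removal without touching the rest of the pipeline. Whether the real function should return joined strings or token lists is a design question for the review.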
Current Progress
✅ Feature apply_nltk has already been implemented
✅ A Pull Request is open
🔄 Open to feedback and ready to revise:
- Code structure
- API design
- Contribution-guideline compliance
- Complexity or performance concerns
Why This Matters
- Makes DsKit more NLP-friendly out of the box
- Encourages standardized text preprocessing
- Reduces repetitive user-side implementations
- Aligns with DsKit’s goal of simplifying data preparation workflows
Related
Pull Request: #2 (comment)