New Feature: Apply NLTK #6

@ItsRikan

Problem

DsKit currently does not provide a high-level text data cleaning pipeline.
Although the full version installs NLTK, there is no built-in NLTK-based pre-processing utility for text/NLP workflows. This forces users to repeatedly implement custom cleaning logic outside the library.

Proposed Solution

Introduce a flexible, high-level text cleaning function named apply_nltk that enables NLP/text preprocessing directly within DsKit.

The function:

  • Uses NLTK internally

  • Offers fine-grained control to users (case handling, stopwords, token processing, etc.)

  • Is configurable and reusable across NLP pipelines

  • Reduces boilerplate code for common text-cleaning tasks

This would significantly improve DsKit’s usability for NLP and text-heavy datasets.
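To make the discussion concrete, here is a rough sketch of what such a function could look like. It is not the code from the Pull Request: the signature, the parameter names (`lowercase`, `stopwords`, `stem`, `min_len`) and the helper `_tokenize` are all assumptions made for illustration. It uses only NLTK's `RegexpTokenizer` and `PorterStemmer`, which need no downloaded corpora, and falls back to a plain regex when NLTK is not installed (stemming is skipped in that case).

```python
import re

try:
    # NLTK is treated as optional; these two classes need no downloaded data.
    from nltk.stem import PorterStemmer
    from nltk.tokenize import RegexpTokenizer
    _tokenize = RegexpTokenizer(r"\w+").tokenize
except ImportError:
    # Fallback so the sketch still runs where NLTK is not installed.
    PorterStemmer = None
    _tokenize = re.compile(r"\w+").findall


def apply_nltk(texts, lowercase=True, stopwords=None, stem=False, min_len=1):
    """Hypothetical signature; the real apply_nltk in the PR may differ."""
    stop = {w.lower() for w in (stopwords or ())}
    # Stemming is silently skipped when NLTK is unavailable.
    stemmer = PorterStemmer() if (stem and PorterStemmer is not None) else None
    cleaned = []
    for text in texts:
        if lowercase:
            text = text.lower()
        # Keep word tokens that are long enough and not in the stopword set.
        tokens = [t for t in _tokenize(text)
                  if len(t) >= min_len and t.lower() not in stop]
        if stemmer is not None:
            tokens = [stemmer.stem(t) for t in tokens]
        cleaned.append(" ".join(tokens))
    return cleaned


sample = ["The quick brown fox, jumping over the lazy dog!"]
result = apply_nltk(sample, stopwords={"the", "over"})
print(result)
```

Stopwords are passed in by the caller here to keep the sketch free of corpus downloads. With the NLTK stopwords corpus installed, `nltk.corpus.stopwords.words("english")` could be supplied instead.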

Current Progress

✅ Feature apply_nltk has already been implemented

✅ A Pull Request is open

🔄 Open to feedback and ready to revise:

  • Code structure

  • API design

  • Contribution-guideline compliance

  • Complexity or performance concerns

Why This Matters

  • Makes DsKit more NLP-friendly out of the box

  • Encourages standardized text preprocessing

  • Reduces repetitive user-side implementations

  • Aligns with DsKit’s goal of simplifying data preparation workflows

Related

Pull Request: #2 (comment)
