Feature Enhancement: Vocabulary Generation #8

@ItsRikan

Problem

DsKit lacks a simple utility to generate a vocabulary from text columns, which is a common preprocessing step in NLP workflows.

Proposed Solution

  • Introduce generate_vocabulary(df: pd.DataFrame, text_col: str, case: Optional[Literal['lower', 'upper']] = None)
  • Returns a list of unique words present in the specified text column.
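
For discussion, here is a minimal sketch of how a function with the proposed signature could behave. It is not the implementation from the linked PR. Two things are assumptions on my part: words are split on whitespace (punctuation is left attached), and the result is sorted so the output is deterministic. The annotation uses Optional[...] so that the None default type-checks.

```python
from typing import List, Literal, Optional

import pandas as pd


def generate_vocabulary(
    df: pd.DataFrame,
    text_col: str,
    case: Optional[Literal["lower", "upper"]] = None,
) -> List[str]:
    """Return the unique words found in df[text_col] (illustrative sketch)."""
    if case not in (None, "lower", "upper"):
        raise ValueError("case must be 'lower', 'upper', or None")

    # Drop missing values and coerce the remaining entries to str.
    texts = df[text_col].dropna().astype(str)

    # Normalise case before splitting so that "Cat" and "cat" collapse.
    if case == "lower":
        texts = texts.str.lower()
    elif case == "upper":
        texts = texts.str.upper()

    # Assumption: words are whitespace-separated; punctuation is kept as-is.
    vocab = set()
    for text in texts:
        vocab.update(text.split())

    # Assumption: sort so that repeated calls return the same order.
    return sorted(vocab)


df = pd.DataFrame({"review": ["The cat sat", "the Cat ran", None]})
print(generate_vocabulary(df, "review", case="lower"))
# ['cat', 'ran', 'sat', 'the']
```

With case=None the original casing is preserved, so "The" and "the" would count as separate words.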

Progress

✅ Both features are already implemented

🔄 Open to feedback and ready to refactor or split PRs if needed

Use-Case

  • Useful when building a Bag-of-Words representation for an ML model
  • Gives a quick overview of the words present in a specific text field
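
To make the Bag-of-Words use case concrete, here is a rough sketch of turning a vocabulary list into a count vector. The helper name bag_of_words and the whitespace tokenisation are hypothetical, chosen only for illustration, and are not part of DsKit.

```python
from collections import Counter
from typing import List


def bag_of_words(text: str, vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary word occurs in text (hypothetical helper)."""
    # Assumption: whitespace tokenisation.
    counts = Counter(text.split())
    # Counter returns 0 for words that never occur in the text.
    return [counts[word] for word in vocabulary]


# A vocabulary such as the one generate_vocabulary would return.
vocabulary = ["cat", "ran", "sat", "the"]
vector = bag_of_words("the cat sat the", vocabulary)
print(vector)  # [1, 0, 1, 2]
```

Each position in the vector corresponds to one vocabulary word, which is why a stable vocabulary order matters.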

Pull Request

#2 (comment)
