Open
Description
Problem
DsKit lacks a simple utility to generate a vocabulary from text columns, which is a common preprocessing step in NLP workflows.
Proposed Solution
- Introduce generate_vocabulary(df: pd.DataFrame, text_col: str, case: Optional[Literal['lower', 'upper']] = None)
- Returns a list of unique words present in the specified text column.
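For discussion, here is a minimal sketch of how the proposed function could behave. It is not the code from the PR. It assumes whitespace tokenization, skips missing values, and returns the words sorted so the output is deterministic; none of those choices are stated in this issue.

```python
from typing import List, Literal, Optional

import pandas as pd


def generate_vocabulary(
    df: pd.DataFrame,
    text_col: str,
    case: Optional[Literal["lower", "upper"]] = None,
) -> List[str]:
    """Return the unique whitespace-separated words found in df[text_col]."""
    if text_col not in df.columns:
        raise KeyError(f"Column {text_col!r} not found in DataFrame")
    if case not in (None, "lower", "upper"):
        raise ValueError("case must be 'lower', 'upper', or None")

    # Assumption: missing values are ignored and everything else is read as text.
    texts = df[text_col].dropna().astype(str)
    if case == "lower":
        texts = texts.str.lower()
    elif case == "upper":
        texts = texts.str.upper()

    # Assumption: words are split on whitespace only; punctuation is kept.
    vocab = set()
    for text in texts:
        vocab.update(text.split())

    # Sorting is a choice made here for reproducible output.
    return sorted(vocab)


df = pd.DataFrame({"review": ["Good movie", "good plot", None]})
print(generate_vocabulary(df, "review", case="lower"))  # ['good', 'movie', 'plot']
```

With case=None the original casing is preserved, so "Good" and "good" would count as two separate words.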
Progress
✅ Both features are already implemented
🔄 Open to feedback and ready to refactor or split PRs if needed
Use-Case
- Useful when building a Bag-of-Words style ML model
- Provides insight into the contents of a specific text field
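To illustrate the Bag-of-Words use case, the sketch below turns a text into a count vector over a fixed vocabulary. The bag_of_words helper is hypothetical and not part of DsKit, and the hard-coded vocabulary stands in for what generate_vocabulary might return with case='lower'.

```python
from collections import Counter
from typing import List


def bag_of_words(text: str, vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary word occurs in text (hypothetical helper)."""
    # Lowercase and split on whitespace to match a lowercased vocabulary.
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that never appear in the text.
    return [counts[word] for word in vocabulary]


# Stand-in for the output of generate_vocabulary(df, text_col, case="lower").
vocabulary = ["good", "movie", "plot"]
print(bag_of_words("Good movie good plot", vocabulary))  # [2, 1, 1]
```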
Pull Request