Skip to content

dontbeidle/stopwords

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

10 Commits
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

KAA Stopwords β€” Version 1

The first curated stopword list for the Karakalpak language (Latin script)


Overview

This repository contains a list of Karakalpak stopwords, collected primarily from formal written sources such as educational and reference materials. It is designed for Natural Language Processing (NLP) and text preprocessing tasks such as:

Cleaning text datasets/corpora for Karakalpak
Preprocessing for training NLP models or chatbots
Word frequency and statistical analysis
Improving search indexing and IR systems
machine translation


About

  • Language: Karakalpak (Latin script)
  • Size: 402 unique words (one per line, UTF-8 encoded)
  • Source: Extracted and compiled from formal written content, later normalized into a single list.

Usage Examples

with open('kaa_stopwords_v1.txt', encoding='utf-8') as f:
    stopwords = set(f.read().splitlines())

text = "Men qaraqalpaq tilinde NLP jasayman"
filtered = [word for word in text.split() if word.lower() not in stopwords]
print(filtered)

Contributing

This is Version 1 of the Karakalpak stopword list. Future releases may include:

Expanded coverage
More informal / naturalized vocabulary
Refinements and corrections

License

Released under the MIT License – free to use, modify, and share with attribution.

About

πŸ“ | Stopwords in Karakalpak

Topics

Resources

License

Stars

Watchers

Forks

Contributors 2

  •  
  •