BongoTokenizer

Data preparation

Wikipedia Bangla Articles:
- Downloaded data bangla corpus from wiki-dump https://dumps.wikimedia.org/bnwiki/latest/
- This data stored as csv format and stored in google drive
- Each raw, we have 4 column:
  - Id: Incremental integer number
  - Title: Title of that wiki article
  - Text: Whole article as text format
  - URL: Url to access that artcle
- In this project, only Text column is being used
Bangla Novels Data For Colloqual Texts
- Downloaded 20+ novel data from https://archive.org/details/booksbylanguage_bengali
- Open source kaggle data from https://www.kaggle.com/datasets/aagalib/complete-works-of-rabindranath-tagore
- This data is already text format but there are many blank lines and spaces which has been removed in data pre-process stage
- Stored data in Google Drive

After collecting data, all the combined in a text file and process for training. Comprehensive Bangla language tokenizer

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
BongoTokenizer.ipynb		BongoTokenizer.ipynb
BongoTokenizerBPE.ipynb		BongoTokenizerBPE.ipynb
BongoTokenizer_DataProcessing.ipynb		BongoTokenizer_DataProcessing.ipynb
BongoTokenizer_SentencePiece.ipynb		BongoTokenizer_SentencePiece.ipynb
README.md		README.md
testing_with_different_tokenizers.ipynb		testing_with_different_tokenizers.ipynb