Note: All the code files are in the Codes folder.
Code for scraping data from various sites, converting PDFs to text, and extracting text from existing datasets. The list of sources is given in the Excel sheet.
Note: the deduplication codes require the dataset to be split into chunks of plain-text files. Deduplication is also more efficient with a large number of these text files, so it is recommended to curate your data article-wise (one article per .txt file).
- hash.py
- similarity_check.py
- copy_after.py
- minhash2_6.py -> from this code onwards, we start inter-folder deduplication
- rem_filter2.py
- remove_7.py
- finalremove.py
It finds all the text files in the given folder and generates their hash values using the SimHash algorithm. These hash values are written to CSV files.
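A minimal sketch of this step, assuming the `simhash` PyPI package and illustrative folder/output names (the actual hash.py may differ):

```python
import csv
import glob
import os

from simhash import Simhash  # pip install simhash

INPUT_DIR = "articles"     # hypothetical: folder of per-article .txt files
OUTPUT_CSV = "hashes.csv"  # hypothetical output path

rows = []
for path in glob.glob(os.path.join(INPUT_DIR, "*.txt")):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # 64-bit SimHash fingerprint of the article text
    rows.append((path, Simhash(text).value))

with open(OUTPUT_CSV, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["file_path", "simhash"])
    writer.writerows(rows)
```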
It does an intra-folder comparison of the hash values, i.e. it drops exact duplicates (retaining one instance) from each individual CSV created by the previous hash.py step.
It copies the files remaining in the previous step to a specified folder for further processing.
It calculates min-hashes for the sim-hashes calculated in previous steps and loads these min-hashes into a model which calculates the nearest neighbours for each document. The file paths of the nearest neighbours (for each document) are written to one or more CSV files, depending on the number of near duplicates.
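minhash2_6.py builds its MinHashes on top of the SimHash values from the earlier steps; as a hedged illustration of the same nearest-neighbour idea with the datasketch library, MinHash signatures built from document tokens can be indexed in an LSH structure and queried for candidate neighbours (folder names and the threshold below are assumptions):

```python
import csv
import glob
import os

from datasketch import MinHash, MinHashLSH  # pip install datasketch

NUM_PERM = 128
THRESHOLD = 0.8   # assumed Jaccard threshold, not necessarily the script's setting

def minhash_of(text):
    m = MinHash(num_perm=NUM_PERM)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

# Load the copied dataset (hypothetical folder produced after the copy_after.py step).
signatures = {}
for path in glob.glob(os.path.join("copied_articles", "*.txt")):
    with open(path, encoding="utf-8") as f:
        signatures[path] = minhash_of(f.read())

# Index every signature, then query it back to get candidate near-duplicates.
lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
for path, sig in signatures.items():
    lsh.insert(path, sig)

with open("neighbours.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for path, sig in signatures.items():
        neighbours = [p for p in lsh.query(sig) if p != path]
        if neighbours:
            writer.writerow([path] + neighbours)
```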
The output of the previous code may contain false positives (files that are not near duplicates are also listed as near duplicates; this is a known limitation of the datasketch library we used and is clearly stated in its MinHash LSH documentation). This code removes those false positives from the previous code's output.
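The exact filtering logic is in rem_filter2.py; one common way to drop such false positives, shown here purely as an illustration, is to re-check each candidate pair with an exact Jaccard similarity over word shingles (the shingle size and cut-off are assumed values):

```python
def shingles(text, k=5):
    """Set of k-word shingles for exact Jaccard comparison."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    inter = len(a & b)
    union = len(a | b)
    return inter / union if union else 0.0

def is_true_near_duplicate(text_a, text_b, threshold=0.8):
    # threshold is an assumed value; tune it for your corpus
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```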
This code takes the CSVs output in the previous step (which contain the paths of documents and their neighbours) and creates a final list (in a .txt file) of file paths to be removed from the dataset created after the copy_after.py step.
It takes the list created in the previous step and removes those files from the dataset created after the copy_after.py step. The deduplication process is complete after this code; you can then compress your dataset into a CSV or JSON file (place each article in a row).
Note: The codes below assume that all your data is spread across CSVs and that each row in any CSV contains a single news article or similar text.
The file "build.json" has a list of vulgar words for Telugu and other languages; feel free to update it if you find any words missing.
- info_iden_csv.py
- pre_process.py
- info_iden_csv_new_dates.py
- antieng.py
- detect_promotions.py
- finalr.py
- drop_links.py
- ignore_case_bw
It lists the indices of rows in which vulgar words, dates, contacts, and personal information are present.
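A simplified sketch of how info_iden_csv.py-style flagging can work, assuming a `content` text column, a flat word list in build.json, and illustrative regex patterns (the real script's checks are more thorough):

```python
import json
import re

import pandas as pd

# Patterns for dates, phone numbers and e-mail addresses (illustrative, not exhaustive).
DATE_RE = re.compile(r"\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b")
PHONE_RE = re.compile(r"\b\d{10}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

with open("build.json", encoding="utf-8") as f:
    vulgar_words = set(json.load(f))  # assumes build.json holds a flat list of words

def flag_rows(csv_path, text_column="content"):
    """Return the indices of rows containing vulgar words, dates or contact info."""
    df = pd.read_csv(csv_path)
    flagged = []
    for idx, text in df[text_column].astype(str).items():
        has_vulgar = any(word in text for word in vulgar_words)
        has_pii = DATE_RE.search(text) or PHONE_RE.search(text) or EMAIL_RE.search(text)
        if has_vulgar or has_pii:
            flagged.append(idx)
    return flagged
```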
Using the list from the previous step (info_iden_csv.py), it separates the dataset CSVs into two different folders (one with good data and the other with bad data).
Since the previous steps also flag rows containing dates, this code takes the list of rows with vulgar words, dates, contacts, and personal information, removes the date-only entries, and creates a new list covering only vulgar words, contacts, and personal information. You can then use either remove_phone_nos.py or pre_process.py to split the dataset CSVs into two different folders using the new list. (Note: use this code if you want to keep dates in your dataset.)
It removes rows containing English words that may not be related to the original text, based on a configurable threshold (default = 7, meaning rows with more than 7 English words are separated into another CSV).
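As an illustration of how antieng.py could work, English words can be counted with a simple ASCII-letter regex and rows above the threshold routed to a separate CSV (column and file names are assumptions):

```python
import re

import pandas as pd

ENGLISH_WORD_RE = re.compile(r"[A-Za-z]{2,}")  # crude: runs of ASCII letters

def split_by_english_count(csv_path, text_column="content", threshold=7):
    df = pd.read_csv(csv_path)
    counts = df[text_column].astype(str).apply(lambda t: len(ENGLISH_WORD_RE.findall(t)))
    keep = df[counts <= threshold]
    dropped = df[counts > threshold]   # rows with too many English words
    keep.to_csv("kept.csv", index=False)
    dropped.to_csv("too_much_english.csv", index=False)
```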
It detects tags, promotions, and ads, and notes them down in a log .txt file.
It removes the detected promotions and replaces links with the <|hyperlink|> token.
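A minimal sketch of the link replacement, assuming a simple URL regex (the real finalr.py also strips the detected promotional text):

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # illustrative URL pattern

def replace_links(text):
    """Replace every detected link with the <|hyperlink|> token."""
    return URL_RE.sub("<|hyperlink|>", text)

print(replace_links("Read more at https://example.com/article today"))
# -> "Read more at <|hyperlink|> today"
```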
It checks all rows again for any remaining links (which may have been missed by finalr.py) and drops those rows.
Sometimes, inappropriate words appear in mixed case, combining lowercase and uppercase letters (specifically in English). This code is used as a final cleanup step to create new CSV files by removing rows containing such words.
Note: The following codes assume that all your data, whether used to train the tokenizer or to test its fertility scores, is stored in CSV files, with each row in any CSV containing a single news article or similar text.
- remove_emotes.py
- tokenizer.py
- fertility_score.py
- tokenize_data.py
- add_eos_token.py
- remove_unk.py
- tokens_2048.py
- convert_to_parquet.py
- merge_datasets.py
This code removes characters from CSV files that are not in Telugu or English, ensuring these characters do not appear in the tokenizer's vocabulary.
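A sketch of this character filter (remove_emotes.py), assuming that the Telugu Unicode block (U+0C00–U+0C7F) plus basic ASCII letters, digits, and punctuation is what should be kept (column and file names are illustrative):

```python
import re

import pandas as pd

# Keep Telugu (U+0C00-U+0C7F), ASCII letters/digits, whitespace and common punctuation.
ALLOWED_RE = re.compile(r"[^\u0C00-\u0C7FA-Za-z0-9\s.,!?;:'()\-]")

def clean_text(text):
    return ALLOWED_RE.sub("", text)

df = pd.read_csv("articles.csv")              # hypothetical input
df["content"] = df["content"].astype(str).apply(clean_text)
df.to_csv("articles_clean.csv", index=False)
```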
This code is used for training the tokenizer and expects a folder containing CSV files of data. Refer to frequently used cmds.txt for the terminal command to run the code.
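The repository's exact training setup lives in tokenizer.py and frequently used cmds.txt; as a hedged illustration only, a SentencePiece BPE tokenizer could be trained over the CSV text like this (the library choice, vocab size, and paths are assumptions):

```python
import glob

import pandas as pd
import sentencepiece as spm  # pip install sentencepiece

# Dump the text column of every CSV into one plain-text corpus file.
with open("corpus.txt", "w", encoding="utf-8") as out:
    for csv_path in glob.glob("data/*.csv"):      # hypothetical data folder
        for text in pd.read_csv(csv_path)["content"].astype(str):
            out.write(text.replace("\n", " ") + "\n")

spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="telugu_tokenizer",   # writes telugu_tokenizer.model / .vocab
    vocab_size=32000,                  # assumed value, not the repo's setting
    model_type="bpe",
    character_coverage=1.0,            # keep all Telugu characters
)
```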
This code is used to calculate the fertility scores of the trained tokenizer.
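Fertility is commonly measured as the average number of tokens produced per whitespace-separated word; a small sketch using the SentencePiece model from the previous illustration (the actual fertility_score.py may use a different tokenizer API and test set):

```python
import pandas as pd
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="telugu_tokenizer.model")  # hypothetical path

def fertility(texts):
    """Average tokens per whitespace-separated word over a list of texts."""
    total_tokens, total_words = 0, 0
    for text in texts:
        total_tokens += len(sp.encode(text))
        total_words += len(text.split())
    return total_tokens / max(total_words, 1)

test_df = pd.read_csv("fertility_test.csv")       # hypothetical held-out CSV
print("fertility:", fertility(test_df["content"].astype(str)))
```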
This code applies the tokenizer, trained in previous steps, to the text data. It expects CSV files with id and content columns and outputs CSV files with id, content, and tokens columns, where the tokens column contains the tokenizer encodings of the content column.
This code adds the EOS token to each row in the tokens column.
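A minimal sketch for add_eos_token.py, assuming the tokens column stores Python-style lists of ids and an assumed EOS id:

```python
import ast

import pandas as pd

EOS_ID = 2  # assumed EOS token id; use your tokenizer's actual value

df = pd.read_csv("tokenized.csv")                      # hypothetical input
df["tokens"] = df["tokens"].apply(lambda s: ast.literal_eval(s) + [EOS_ID])
df.to_csv("tokenized_with_eos.csv", index=False)
```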
The `remove_unk.py` script is designed to handle unknown tokens (`unk tokens`) that may be present in the `tokens` column of CSV files. Unknown tokens can reduce the quality of the model's output if they are included in pre-training. This script removes the designated unknown token (`5` in our case) from the `tokens` column of the CSV files generated by the previous preprocessing steps.
How it works:
- The script processes the CSV files and identifies the `tokens` column.
- It scans through the `tokens` column, removing all occurrences of the unknown token (`5`).
- The output is a cleaned version of the `tokens` column, free of the specified unknown token.

Example:
- A row in the `tokens` column before processing: `[24, 535, 466, 35, 5, 454, 5656]`
- The same row after processing: `[24, 535, 466, 35, 454, 5656]`
This code splits the entries in the tokens column into segments of exactly 2048 tokens each and saves them in output CSV files with a single column. In each CSV, each row contains a list of exactly 2048 tokens.
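One way to sketch the chunking in tokens_2048.py is to flatten all token lists into a single stream and slice it into full 2048-token windows (leftover tokens that do not fill a window are dropped here; the actual script may handle boundaries differently):

```python
import ast

import pandas as pd

SEQ_LEN = 2048

def chunk_tokens(csv_path, out_path):
    df = pd.read_csv(csv_path)
    # Flatten every row's token list into one long stream.
    stream = []
    for row in df["tokens"]:
        stream.extend(ast.literal_eval(row))
    # Slice the stream into windows of exactly SEQ_LEN tokens.
    chunks = [stream[i:i + SEQ_LEN] for i in range(0, len(stream) - SEQ_LEN + 1, SEQ_LEN)]
    pd.DataFrame({"tokens": chunks}).to_csv(out_path, index=False)

chunk_tokens("tokenized_with_eos.csv", "tokens_2048.csv")
```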
This code converts the CSV files obtained from the previous step into Parquet files.
This code combines all the Parquet files from the previous step into a single Parquet file.
- tt_splits.py
- model_config.py
- model_pretrain.py
This code is used for making train-test splits.
This script sets up and initializes a Mistral-based causal language model with custom configurations, including vocabulary size, hidden layers, attention heads, etc.
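A hedged sketch of model_config.py using Hugging Face transformers; the hyperparameter values below are placeholders, not the repository's configuration:

```python
from transformers import MistralConfig, MistralForCausalLM

# Placeholder hyperparameters for illustration only.
config = MistralConfig(
    vocab_size=32000,
    hidden_size=2048,
    intermediate_size=7168,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=8,
    max_position_embeddings=2048,
)

model = MistralForCausalLM(config)
print(f"parameters: {model.num_parameters() / 1e6:.1f}M")
```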
This script handles the pretraining of the large language model (LLM).
- model_run.py
- manual_ppl_check.py
This script is used for text generation using the trained language model. It takes input prompts, processes them through the model, and generates coherent and contextually relevant text outputs.
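A minimal generation sketch in the spirit of model_run.py, using transformers (checkpoint path, prompt, and decoding settings are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "checkpoints/telugu-llm"   # hypothetical path to the pretrained checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
model.eval()

prompt = "తెలుగు భాష"   # example Telugu prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```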
This script calculates the perplexity score of the language model, providing a measure of how well the model predicts a given dataset. Lower perplexity indicates better predictive performance.
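Perplexity is the exponential of the average per-token negative log-likelihood; a manual sketch in the spirit of manual_ppl_check.py over a small list of evaluation texts (checkpoint path and batching are simplified assumptions):

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "checkpoints/telugu-llm"   # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
model.eval()

def perplexity(texts):
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        if ids.size(1) < 2:
            continue  # need at least one predicted position
        with torch.no_grad():
            # labels=input_ids makes the model return the mean cross-entropy loss
            loss = model(ids, labels=ids).loss
        n = ids.size(1) - 1                    # number of predicted (shifted) positions
        total_nll += loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)

print("perplexity:", perplexity(["ఇది ఒక ఉదాహరణ వాక్యం."]))
```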