Note: All the code files are in the Codes folder.
Code for scraping data from various sites, converting PDFs to text, and extracting text from existing datasets. The list of sources is given in the Excel sheet.
Note: the deduplication codes require the dataset to be split into chunks of plain-text files. Deduplication is also more efficient with a large number of these text files, so it is recommended to curate your data article-wise (one article per .txt file).
- hash.py
- similarity_check.py
- copy_after.py
- minhash2_6.py -> from this code onwards, we start inter-folder deduplication
- rem_filter2.py
- remove_7.py
- finalremove.py
It finds all the text files in the given folder and generates their hash values using the SimHash algorithm. These hash values are written to CSV files.
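A minimal sketch of this step, assuming the `simhash` PyPI package and illustrative folder/output names (the actual hash.py may differ):

```python
import csv
import glob
import os

from simhash import Simhash  # pip install simhash

INPUT_DIR = "articles"     # hypothetical: folder of per-article .txt files
OUTPUT_CSV = "hashes.csv"  # hypothetical output path

rows = []
for path in glob.glob(os.path.join(INPUT_DIR, "*.txt")):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # 64-bit SimHash fingerprint of the article text
    rows.append((path, Simhash(text).value))

with open(OUTPUT_CSV, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["file_path", "simhash"])
    writer.writerows(rows)
```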
It does an intra-folder comparison of the hash values, i.e. it drops exact duplicates (retaining one instance) from each individual CSV created by the previous hash.py step.
It copies the files remaining in the previous step to a specified folder for further processing.
It calculates min-hashes for the sim-hashes calculated in previous steps and loads these min-hashes into a model which calculates the nearest neighbours for each document. The file paths of the nearest neighbours (for each document) are written to one or more CSV files, depending on the number of near duplicates.
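minhash2_6.py builds its MinHashes on top of the SimHash values from the earlier steps; as a hedged illustration of the same nearest-neighbour idea with the datasketch library, MinHash signatures built from document tokens can be indexed in an LSH structure and queried for candidate neighbours (folder names and the threshold below are assumptions):

```python
import csv
import glob
import os

from datasketch import MinHash, MinHashLSH  # pip install datasketch

NUM_PERM = 128
THRESHOLD = 0.8   # assumed Jaccard threshold, not necessarily the script's setting

def minhash_of(text):
    m = MinHash(num_perm=NUM_PERM)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

# Load the copied dataset (hypothetical folder produced after the copy_after.py step).
signatures = {}
for path in glob.glob(os.path.join("copied_articles", "*.txt")):
    with open(path, encoding="utf-8") as f:
        signatures[path] = minhash_of(f.read())

# Index every signature, then query it back to get candidate near-duplicates.
lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
for path, sig in signatures.items():
    lsh.insert(path, sig)

with open("neighbours.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for path, sig in signatures.items():
        neighbours = [p for p in lsh.query(sig) if p != path]
        if neighbours:
            writer.writerow([path] + neighbours)
```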
The output of the previous code may contain false positives (files that are not near duplicates are also listed as near duplicates; this is a known limitation of the datasketch library we used and is clearly stated in its MinHash LSH documentation). This code removes those false positives from the previous code's output.
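The exact filtering logic is in rem_filter2.py; one common way to drop such false positives, shown here purely as an illustration, is to re-check each candidate pair with an exact Jaccard similarity over word shingles (the shingle size and cut-off are assumed values):

```python
def shingles(text, k=5):
    """Set of k-word shingles for exact Jaccard comparison."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    inter = len(a & b)
    union = len(a | b)
    return inter / union if union else 0.0

def is_true_near_duplicate(text_a, text_b, threshold=0.8):
    # threshold is an assumed value; tune it for your corpus
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```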
This code takes the CSVs output in the previous step (which contain the paths of documents and their neighbours) and creates a final list (in a .txt file) of file paths to be removed from the dataset created after the copy_after.py step.
It takes the list created in the previous step and removes those files from the dataset created after the copy_after.py step. The deduplication process is complete after this code; you can then compress your dataset into a CSV or JSON file (place each article in a row).
Note: The codes below assume that all your data is spread across CSVs and that each row in any CSV contains a single news article or similar text.
The file "build.json" has a list of vulgar words for Telugu and other languages; feel free to update it if you find any words missing.
- info_iden_csv.py
- pre_process.py
- info_iden_csv_new_dates.py
- antieng.py
- detect_promotions.py
- finalr.py
- drop_links.py
- ignore_case_bw
It lists the indices of rows in which vulgar words, dates, contacts, and personal information are present.
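A simplified sketch of how info_iden_csv.py-style flagging can work, assuming a `content` text column, a flat word list in build.json, and illustrative regex patterns (the real script's checks are more thorough):

```python
import json
import re

import pandas as pd

# Patterns for dates, phone numbers and e-mail addresses (illustrative, not exhaustive).
DATE_RE = re.compile(r"\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b")
PHONE_RE = re.compile(r"\b\d{10}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

with open("build.json", encoding="utf-8") as f:
    vulgar_words = set(json.load(f))  # assumes build.json holds a flat list of words

def flag_rows(csv_path, text_column="content"):
    """Return the indices of rows containing vulgar words, dates or contact info."""
    df = pd.read_csv(csv_path)
    flagged = []
    for idx, text in df[text_column].astype(str).items():
        has_vulgar = any(word in text for word in vulgar_words)
        has_pii = DATE_RE.search(text) or PHONE_RE.search(text) or EMAIL_RE.search(text)
        if has_vulgar or has_pii:
            flagged.append(idx)
    return flagged
```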
Using the list from the previous step (info_iden_csv.py), it separates the dataset CSVs into two different folders (one with good data and the other with bad data).
Since the previous steps also flag rows containing dates, this code takes the list of rows with vulgar words, dates, contacts, and personal information, removes the date-only entries, and creates a new list covering only vulgar words, contacts, and personal information. You can then use either remove_phone_nos.py or pre_process.py to split the dataset CSVs into two different folders using the new list. (Note: use this code if you want to keep dates in your dataset.)
It removes rows containing English words that may not be related to the original text, based on a configurable threshold (default = 7, meaning rows with more than 7 English words are separated into another CSV).
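As an illustration of how antieng.py could work, English words can be counted with a simple ASCII-letter regex and rows above the threshold routed to a separate CSV (column and file names are assumptions):

```python
import re

import pandas as pd

ENGLISH_WORD_RE = re.compile(r"[A-Za-z]{2,}")  # crude: runs of ASCII letters

def split_by_english_count(csv_path, text_column="content", threshold=7):
    df = pd.read_csv(csv_path)
    counts = df[text_column].astype(str).apply(lambda t: len(ENGLISH_WORD_RE.findall(t)))
    keep = df[counts <= threshold]
    dropped = df[counts > threshold]   # rows with too many English words
    keep.to_csv("kept.csv", index=False)
    dropped.to_csv("too_much_english.csv", index=False)
```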
It detects tags, promotions, and ads, and notes them down in a log .txt file.
It removes the detected promotions and replaces links with the <|hyperlink|> token.
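A minimal sketch of the link replacement, assuming a simple URL regex (the real finalr.py also strips the detected promotional text):

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # illustrative URL pattern

def replace_links(text):
    """Replace every detected link with the <|hyperlink|> token."""
    return URL_RE.sub("<|hyperlink|>", text)

print(replace_links("Read more at https://example.com/article today"))
# -> "Read more at <|hyperlink|> today"
```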
It checks all rows again for any remaining links (which may have been missed by finalr.py) and drops those rows.
Sometimes, inappropriate words appear in mixed case, combining lowercase and uppercase letters (specifically in English). This code is used as a final cleanup step to create new CSV files by removing rows containing such words.
Note: The following codes assume that all your data, whether used to train the tokenizer or to test its fertility scores, is stored in CSV files, with each row in any CSV containing a single news article or similar text.
- remove_emotes.py
- tokenizer.py
- fertility_score.py
- tokenize_data.py
- add_eos_token.py
- remove_unk.py
- tokens_2048.py
- convert_to_parquet.py
- merge_datasets.py
This code removes characters from CSV files that are not in Telugu or English, ensuring these characters do not appear in the tokenizer's vocabulary.
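A sketch of this character filter (remove_emotes.py), assuming that the Telugu Unicode block (U+0C00–U+0C7F) plus basic ASCII letters, digits, and punctuation is what should be kept (column and file names are illustrative):

```python
import re

import pandas as pd

# Keep Telugu (U+0C00-U+0C7F), ASCII letters/digits, whitespace and common punctuation.
ALLOWED_RE = re.compile(r"[^\u0C00-\u0C7FA-Za-z0-9\s.,!?;:'()\-]")

def clean_text(text):
    return ALLOWED_RE.sub("", text)

df = pd.read_csv("articles.csv")              # hypothetical input
df["content"] = df["content"].astype(str).apply(clean_text)
df.to_csv("articles_clean.csv", index=False)
```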
This code is used for training the tokenizer and expects a folder containing CSV files of data. Refer to frequently used cmds.txt for the terminal command to run the code.
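The repository's exact training setup lives in tokenizer.py and frequently used cmds.txt; as a hedged illustration only, a SentencePiece BPE tokenizer could be trained over the CSV text like this (the library choice, vocab size, and paths are assumptions):

```python
import glob

import pandas as pd
import sentencepiece as spm  # pip install sentencepiece

# Dump the text column of every CSV into one plain-text corpus file.
with open("corpus.txt", "w", encoding="utf-8") as out:
    for csv_path in glob.glob("data/*.csv"):      # hypothetical data folder
        for text in pd.read_csv(csv_path)["content"].astype(str):
            out.write(text.replace("\n", " ") + "\n")

spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="telugu_tokenizer",   # writes telugu_tokenizer.model / .vocab
    vocab_size=32000,                  # assumed value, not the repo's setting
    model_type="bpe",
    character_coverage=1.0,            # keep all Telugu characters
)
```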
This code is used to calculate the fertility scores of the trained tokenizer.
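Fertility is commonly measured as the average number of tokens produced per whitespace-separated word; a small sketch using the SentencePiece model from the previous illustration (the actual fertility_score.py may use a different tokenizer API and test set):

```python
import pandas as pd
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="telugu_tokenizer.model")  # hypothetical path

def fertility(texts):
    """Average tokens per whitespace-separated word over a list of texts."""
    total_tokens, total_words = 0, 0
    for text in texts:
        total_tokens += len(sp.encode(text))
        total_words += len(text.split())
    return total_tokens / max(total_words, 1)

test_df = pd.read_csv("fertility_test.csv")       # hypothetical held-out CSV
print("fertility:", fertility(test_df["content"].astype(str)))
```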
This code applies the tokenizer, trained in previous steps, to the text data. It expects CSV files with id and content columns and outputs CSV files with id, content, and tokens columns, where the tokens column contains the tokenizer encodings of the content column.
This code adds the EOS token to each row in the tokens column.
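A minimal sketch for add_eos_token.py, assuming the tokens column stores Python-style lists of ids and an assumed EOS id:

```python
import ast

import pandas as pd

EOS_ID = 2  # assumed EOS token id; use your tokenizer's actual value

df = pd.read_csv("tokenized.csv")                      # hypothetical input
df["tokens"] = df["tokens"].apply(lambda s: ast.literal_eval(s) + [EOS_ID])
df.to_csv("tokenized_with_eos.csv", index=False)
```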
The `remove_unk.py` script is designed to handle unknown tokens (`unk tokens`) that may be present in the `tokens` column of CSV files. Unknown tokens can reduce the quality of the model's output if they are included in pre-training. This script removes the designated unknown token (`5` in our case) from the `tokens` column of the CSV files generated by the previous preprocessing steps.
How it works:
- The script processes the CSV files and identifies the `tokens` column.
- It scans through the `tokens` column, removing all occurrences of the unknown token (`5`).
- The output is a cleaned version of the `tokens` column, free of the specified unknown token.

Example:
- A row in the `tokens` column before processing: `[24, 535, 466, 35, 5, 454, 5656]`
- The same row after processing: `[24, 535, 466, 35, 454, 5656]`
This code splits the entries in the tokens column into segments of exactly 2048 tokens each and saves them in output CSV files with a single column. In each CSV, each row contains a list of exactly 2048 tokens.
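One way to sketch the chunking in tokens_2048.py is to flatten all token lists into a single stream and slice it into full 2048-token windows (leftover tokens that do not fill a window are dropped here; the actual script may handle boundaries differently):

```python
import ast

import pandas as pd

SEQ_LEN = 2048

def chunk_tokens(csv_path, out_path):
    df = pd.read_csv(csv_path)
    # Flatten every row's token list into one long stream.
    stream = []
    for row in df["tokens"]:
        stream.extend(ast.literal_eval(row))
    # Slice the stream into windows of exactly SEQ_LEN tokens.
    chunks = [stream[i:i + SEQ_LEN] for i in range(0, len(stream) - SEQ_LEN + 1, SEQ_LEN)]
    pd.DataFrame({"tokens": chunks}).to_csv(out_path, index=False)

chunk_tokens("tokenized_with_eos.csv", "tokens_2048.csv")
```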
This code converts the CSV files obtained from the previous step into Parquet files.
This code combines all the Parquet files from the previous step into a single Parquet file.
- tt_splits.py
- model_config.py
- model_pretrain.py
This code is used for making train-test splits.
This script sets up and initializes a Mistral-based causal language model with custom configurations, including vocabulary size, hidden layers, attention heads, etc.
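A hedged sketch of model_config.py using Hugging Face transformers; the hyperparameter values below are placeholders, not the repository's configuration:

```python
from transformers import MistralConfig, MistralForCausalLM

# Placeholder hyperparameters for illustration only.
config = MistralConfig(
    vocab_size=32000,
    hidden_size=2048,
    intermediate_size=7168,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=8,
    max_position_embeddings=2048,
)

model = MistralForCausalLM(config)
print(f"parameters: {model.num_parameters() / 1e6:.1f}M")
```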
This script handles the pretraining of the large language model (LLM).
- model_run.py
- manual_ppl_check.py
This script is used for text generation using the trained language model. It takes input prompts, processes them through the model, and generates coherent and contextually relevant text outputs.
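A minimal generation sketch in the spirit of model_run.py, using transformers (checkpoint path, prompt, and decoding settings are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "checkpoints/telugu-llm"   # hypothetical path to the pretrained checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
model.eval()

prompt = "తెలుగు భాష"   # example Telugu prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```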
This script calculates the perplexity score of the language model, providing a measure of how well the model predicts a given dataset. Lower perplexity indicates better predictive performance.
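Perplexity is the exponential of the average per-token negative log-likelihood; a manual sketch in the spirit of manual_ppl_check.py over a small list of evaluation texts (checkpoint path and batching are simplified assumptions):

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "checkpoints/telugu-llm"   # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
model.eval()

def perplexity(texts):
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        if ids.size(1) < 2:
            continue  # need at least one predicted position
        with torch.no_grad():
            # labels=input_ids makes the model return the mean cross-entropy loss
            loss = model(ids, labels=ids).loss
        n = ids.size(1) - 1                    # number of predicted (shifted) positions
        total_nll += loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)

print("perplexity:", perplexity(["ఇది ఒక ఉదాహరణ వాక్యం."]))
```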