This project lets you clean text data, rank the letters of the alphabet by number of occurrences, calculate Kendall's Tau distance and Extended Kendall's Tau distance between the resulting rankings, and run statistics and statistical tests on those results.
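As a rough illustration of what "lists ranked by occurrences of letters" can look like, here is a minimal sketch. The function name `rank_letters` is hypothetical and not taken from the project, and the sketch assumes only the ASCII letters a-z are counted; the project's own cleaning rules may differ, for example in how accented characters are treated.

```python
import re
from collections import Counter


def rank_letters(text):
    """Return the letters found in text, most frequent first.

    Hypothetical helper for illustration only. Ties are broken
    alphabetically so the result is deterministic.
    """
    # Assumption: only ASCII letters a-z are counted; everything else is dropped.
    letters = re.findall(r"[a-z]", text.lower())
    counts = Counter(letters)
    return sorted(counts, key=lambda ch: (-counts[ch], ch))


print(rank_letters("Banana bread"))  # ['a', 'b', 'n', 'd', 'e', 'r']
```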
The project consists of several scripts, each handling different aspects of the data processing and analysis pipeline:
- Data Cleaning and Summary Statistics: Cleans provided `.txt` files and converts them into summarized statistics.
- Matrix Calculation: Generates matrices for Kendall Tau and Extended Kendall Tau calculations based on summarized data.
- Statistical Analysis: Calculates and aggregates statistics from the generated matrices.
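To make the matrix step more concrete, the sketch below computes the plain Kendall Tau distance (the number of item pairs that two rankings order differently) and fills a symmetric matrix with it. The function names and the toy rankings are illustrative assumptions, not the project's code or data, and the Extended Kendall Tau variant is not covered because its definition is not given here.

```python
from itertools import combinations


def kendall_tau_distance(rank_a, rank_b):
    """Count the item pairs that the two rankings place in opposite order."""
    if sorted(rank_a) != sorted(rank_b):
        raise ValueError("both rankings must contain exactly the same items")
    pos_b = {item: i for i, item in enumerate(rank_b)}
    # combinations() yields (x, y) with x ranked before y in rank_a;
    # the pair is discordant when rank_b puts y before x.
    return sum(1 for x, y in combinations(rank_a, 2) if pos_b[x] > pos_b[y])


def distance_matrix(rankings):
    """Build a symmetric matrix of pairwise distances from a dict of rankings."""
    names = list(rankings)
    size = len(names)
    matrix = [[0] * size for _ in range(size)]
    for i, j in combinations(range(size), 2):
        d = kendall_tau_distance(rankings[names[i]], rankings[names[j]])
        matrix[i][j] = matrix[j][i] = d
    return names, matrix


# Made-up toy rankings, used only to show the shape of the output.
toy = {"A": list("abcd"), "B": list("abdc"), "C": list("dcba")}
names, matrix = distance_matrix(toy)
print(names)   # ['A', 'B', 'C']
print(matrix)  # [[0, 1, 6], [1, 0, 5], [6, 5, 0]]
```

The pairwise loop is quadratic in the number of ranked items, which is negligible for alphabet-sized rankings.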
This project requires the following Python libraries:
- `itertools`: For efficient looping.
- `re`: For regular expression matching.
- `ast`: For safely evaluating strings containing Python expressions.
- `collections`: For high-performance container datatypes.
- `numpy`: For scientific computing with Python.
- `openpyxl`: For reading and writing Excel files.
- `scipy`: For scientific and technical computing.
- `os`: For interacting with the operating system.
- `pandas`: For data manipulation and analysis.
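You can install the third-party dependencies (`numpy`, `openpyxl`, `scipy`, and `pandas`) with pip, for example `pip install numpy openpyxl scipy pandas`. The remaining modules (`itertools`, `re`, `ast`, `collections`, and `os`) are part of the Python standard library and need no installation.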
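If you are unsure which of these modules are already available in your environment, a small check like the following can help. This is a convenience sketch, not part of the project; the helper name `missing_modules` is hypothetical.

```python
from importlib.util import find_spec

# Module names taken from the dependency list of this README.
REQUIRED = [
    "itertools", "re", "ast", "collections", "os",
    "numpy", "openpyxl", "scipy", "pandas",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required modules are available.")
```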
- Data Preparation: Place your `.txt` files in designated folders named after their respective categories (e.g., 'ES', 'PT', 'IN').
- Execution Steps:
  - Run the `process_folder` function for each folder containing `.txt` files to clean data and calculate preliminary statistics.
  - Use the `aggregate` function to compile and summarize statistics across all folders.
  - Execute the `matrix` function to fill Excel matrices for KT and EKT data analysis.
  - Copy data from outputkt.xlsx and outputekt.xlsx to the previously formatted outputkt_form.xlsx and outputekt_form.xlsx.
  - Perform statistical calculations using the `calculate_statistics` function for in-depth analysis.
The main.py script orchestrates the project's workflow. Adjust the folder names and paths as necessary before execution.
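The sketch below shows one way a per-folder loop of this kind could be organized. It does not reproduce `main.py`: the signatures of the project's own functions are not documented here, so `count_letters_in_folder` and `summarize` are hypothetical stand-ins, and the folder names are just the examples mentioned above.

```python
import re
from collections import Counter
from pathlib import Path


def count_letters_in_folder(folder):
    """Hypothetical stand-in for the cleaning step: tally a-z over all .txt files."""
    totals = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        # Assumption: only ASCII letters are counted.
        totals.update(re.findall(r"[a-z]", text))
    return totals


def summarize(root, folder_names):
    """Map each existing category folder under root to its letter counts."""
    summary = {}
    for name in folder_names:
        folder = Path(root) / name
        if folder.is_dir():  # silently skip categories with no folder
            summary[name] = count_letters_in_folder(folder)
    return summary


if __name__ == "__main__":
    # Adjust the root path and folder names to match your own layout.
    for name, counts in summarize(".", ["ES", "PT", "IN"]).items():
        print(name, counts.most_common(5))
```

Keeping the folder list as an explicit argument makes it easy to adjust names and paths in one place before a run.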
The original data used in the submitted paper is provided in folders. Additional spreadsheets in Excel format can be found in the project's root directory.
Contributions to improve the project are welcome. Please follow the project's coding standards and submit pull requests for any enhancements.
This project is released into the public domain and is free of licenses. It can be used, modified, and distributed without any restrictions. For more details, please refer to the Creative Commons CC0 declaration.