Run ./download.sh
Trust me, install and use uv; it's the best Python version and package manager.
uv sync
uv venv
source .venv/bin/activate
When using VS Code, make sure you select the .venv Python interpreter.
To add new packages use:
uv add torch
This will install the package and add it to the pyproject.toml, which keeps us on the same package versions.
From the project suggestion: Enron E-mail Classification. The Enron e-mail data set contains about 500,000 e-mails from about 150 users. The data set is available here: https://www.cs.cmu.edu/~enron/. Can you classify the text of an e-mail message to decide who sent it?
Download link: https://www.cs.cmu.edu/~enron/enron_mail_20150507.tar.gz
Classification on the Enron email dataset: try to determine who sent an email based on its contents.
-
Write basic scripts (Python recommended; see the sketch after this list) to:
- Iterate through email files
- Extract sender (author) and email body
- Handle basic parsing issues (headers, quoted text – decide on a consistent approach)
- Handle email-specific elements (signatures, disclaimers, forwarded content, reply chains)
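A minimal parsing sketch, assuming the archive is extracted to ./maildir and using Python's standard email module; strip_quoted() is just a placeholder for whatever consistent quoting approach we settle on:

```python
import os
from email import policy
from email.parser import BytesParser

MAILDIR = "maildir"  # assumed root of the extracted enron_mail_20150507 archive

def iter_emails(root=MAILDIR):
    """Yield (sender, body_text) for every message file under root."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as f:
                msg = BytesParser(policy=policy.default).parse(f)
            sender = str(msg.get("From", ""))
            body = msg.get_body(preferencelist=("plain",))
            text = body.get_content() if body else ""
            yield sender, text

def strip_quoted(text):
    """Naive placeholder rule: drop lines that look like quoted/forwarded content."""
    keep = [ln for ln in text.splitlines()
            if not ln.lstrip().startswith((">", "-----Original Message-----"))]
    return "\n".join(keep)
```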
-
Gather preliminary statistics (see the sketch after this list):
- Number of unique senders.
- Distribution of emails per sender (identify potential class imbalance).
- Distribution of receivers.
- Average email length and distribution of email lengths.
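A rough pass over the corpus to get these numbers, reusing iter_emails() from the sketch above (the receiver distribution would need the To: header as well):

```python
from collections import Counter

senders = Counter()
lengths = []
for sender, text in iter_emails():
    senders[sender] += 1
    lengths.append(len(text.split()))

print("unique senders:", len(senders))
print("most prolific:", senders.most_common(10))           # class imbalance check
print("avg length (words):", sum(lengths) / len(lengths))
```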
-
Scope of the project: compare the performance of [Method A], [Method B], and [Method C] for classifying Enron emails by author.
-
Author Selection: Decide on the number of authors to include (e.g., top 10-20 most prolific senders to ensure enough data per class and manageability).
- Ensure a minimum of 50 emails per author (justify why)
- Consider temporal distribution (what is the timescale of the emails we are taking? Maybe just one year of emails)
- Breadth of roles that we want to include, or just have it be random? Justify the choice
-
Specify that classification will happen from the body + subject of the email, with recipient emails and sender info excluded.
-
High-Level Literature Review (Very Brief): Do some googling to find 2-3 papers or methods for "authorship attribution" or "email classification" to use in the "Related Work" section of the report.
- Rubric wants us to compare 10+ methods. Let's compare a couple in depth and then hand-wave 6 or 7 of them with some grouped concerns, or make a table.
- Need to include 10 methods and their strengths and weaknesses for this task:
- Traditional Stylometric Methods: Mendenhall's characteristic curves, Yule's K statistic, sentence length analysis, function word analysis
- N-gram Based Approaches: Character n-grams, word n-grams, POS n-grams, skip-grams
- Linguistic Feature Methods: Syntactic patterns, vocabulary richness, readability indices, discourse markers
- Modern ML Approaches: SVM-based authorship, Random Forest, ensemble methods
- Deep Learning Methods: CNN for authorship, RNN/LSTM for sequential patterns, Transformer-based approaches, attention mechanisms
- Email-Specific Research: Email authorship challenges, informal text handling, multi-modal approaches
-
Data pre-processing (see the pipeline sketch after this list)
- Lowercasing.
- Punctuation removal (might remove some signal for identifying authors, but will make it easier to compare methods).
- Number handling.
- Tokenization (if needed for methods).
- Stop-word removal. (reduces the size of the dataset)
- Stemming or Lemmatization ("running" -> "run").
- Email-specific preprocessing:
- Signature detection and handling (remove vs keep as feature)
- Quote/reply text identification (remove forwarded content)
- Disclaimer removal (corporate boilerplate text)
- The goal of most of this data processing is to compare the models' abilities to classify authors from the content of the messages, not from how the messages are sent, although we can reconsider this.
- Evaluate the impact of each preprocessing step and justify it for the task and goal.
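A rough pipeline covering most of the generic steps above, assuming NLTK is installed (`uv add nltk`) and its stopwords corpus is downloaded; the email-specific steps (signatures, quotes, disclaimers) would plug in before this:

```python
import re
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEM = PorterStemmer()

def preprocess(text):
    text = text.lower()                                  # lowercasing
    text = re.sub(r"[^\w\s]", " ", text)                 # punctuation removal
    text = re.sub(r"\d+", "NUM", text)                   # number handling
    tokens = [t for t in text.split() if t not in STOP]  # stop-word removal
    return [STEM.stem(t) for t in tokens]                # stemming: "running" -> "run"
```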
-
Do data splitting (see the sketch after this list)
- Train/Val/Test splits (70%/15%/15%)
- Implement stratified splitting based on the author to ensure each set has a representative sample from each class.
- Save the splits in files
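A stratified 70/15/15 split sketch with scikit-learn; `texts` and `authors` are assumed to be parallel lists built from the parsing step:

```python
from sklearn.model_selection import train_test_split

# First split off 30%, then halve it into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, authors, test_size=0.30, stratify=authors, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```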
-
3 different methods to compare (see the baseline sketch after this list)
- Method 1 (Traditional Baseline): TF-IDF + Naive Bayes or Logistic Regression.
- Theoretical Justification: Why Naive Bayes suits high-dimensional sparse text data, independence assumption validity, computational efficiency
- Method 2 (Traditional More Complex): TF-IDF + SVM or Random Forest.
- Theoretical Justification: SVM's ability to handle high-dimensional spaces, non-linear kernel benefits, margin maximization relevance
- Method 3 (Deep Learning - Sequential): Word Embeddings + LSTM/GRU or 1D CNN.
- Theoretical Justification: Sequential pattern recognition for writing style, attention mechanisms, bi-directional context benefits
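A minimal sketch of Method 1 as a scikit-learn pipeline; the hyperparameters here are placeholders, and Method 2 can reuse the same shape by swapping in LinearSVC or RandomForestClassifier:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

method1 = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))),
    ("clf", MultinomialNB()),
])
method1.fit(X_train, y_train)
print("val accuracy:", method1.score(X_val, y_val))
```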
-
Running models and evaluating them (see the evaluation sketch after this list)
- Write code to implement the models
- Train models on training sets
- Use validation set to tune how models are trained (record what you noticed during training and what you changed)
- Training monitoring: loss curves, validation metrics over time
- Record training times
- Perform final evaluation on the models
- Record inference times
- Build graphs to compare models:
- Precision, Recall, F1 (macro and micro)
- Top-k accuracy
- Confusion matrix
- Loss over time
- How do computational costs compare? (compare in iterations and cost per iteration)
- Which model performed best overall?
- On specific authors?
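An evaluation sketch using scikit-learn metrics, assuming a fitted pipeline like `method1` above:

```python
from sklearn.metrics import (classification_report, confusion_matrix,
                             f1_score, top_k_accuracy_score)

y_pred = method1.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))        # per-class P/R/F1
print("macro F1:", f1_score(y_test, y_pred, average="macro"))
print("micro F1:", f1_score(y_test, y_pred, average="micro"))
print(confusion_matrix(y_test, y_pred))

proba = method1.predict_proba(X_test)                         # NB supports this
print("top-3 accuracy:",
      top_k_accuracy_score(y_test, proba, k=3, labels=method1.classes_))
```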
-
Write the Report (10-15 pages + summary/appendices):
- Title page
- Main Body:
- Introduction/Motivation: Problem, research question, relation to previous work.
- Problem significance: forensic applications, privacy concerns
- Clear research hypotheses and contribution statement
- Related Work: Briefly describe relevant existing approaches.
- COMPREHENSIVE: Cover 15+ methods with critical analysis of strengths/weaknesses
- Theoretical comparison between approaches and their relationships
- Evolution of authorship attribution methods
- Email-specific challenges in literature
- Data: Dataset details, pre-processing steps, challenges, data splits.
- Email-specific challenges: quoted text, signatures, informal language
- Class imbalance analysis and impact on methods
- Data quality issues and ethical considerations
- Methodology: Detailed explanation of each algorithm used, why chosen, implementation specifics, citations.
- Strong theoretical foundation for each method selection
- Implementation details and hyperparameter choices
- Alternative approaches considered and rejected
- Evaluation and Discussion: Present all results clearly (tables, charts). Compare models with each other and (if possible) published results. Discuss why models perform as they do, analyze errors, and compare computational demands.
- Statistical significance of performance differences
- Systematic failure case analysis with theoretical interpretation
- Computational cost analysis: training and inference time trade-offs
- Comparison with literature benchmarks where available
- Practical implications and deployment considerations
- Conclusions and Future Works: Summarize findings, objectives met, shortcomings, potential future investigations.
- Clear statement of objective achievement
- Limitations acknowledgment and future research directions
- Broader implications for authorship attribution field
- Executive summary (~1 page): Write this last. Briefly outline the problem, methods, and overall findings.
- Key findings and practical implications
- Computational cost vs. accuracy trade-offs
- Appendix: Contributions of each member (signed), optional code/extra results.
- Individual contributions (signed)
- Extended results: additional confusion matrices, visualizations
- Statistical test details and hyperparameter grids
-
Create the Video Presentation (~5 minutes):
- Brief overview of the problem, motivation, dataset (and challenges).
- Details of models evaluated (why selected, key considerations).
- Results highlights (overall results, interesting findings).
- Visuals: Simple, clear slides with key information, charts, and diagrams.
- Concise Summary: Focus on the most important aspects of your project and findings.