Author: Tejas Khadloya
Project: Employee Sentiment Analysis
Dataset: test.xlsx
Date: July 14, 2025
This report presents an end-to-end analysis of employee sentiment and engagement based on an internal communications dataset. The workflow covers sentiment labeling, exploratory data analysis, score aggregation, employee ranking, flight-risk identification, and predictive modeling, with methods and observations documented at each step.
- Data was loaded from `test.xlsx`, containing columns: Subject, Body, Date, and From.
- Inspected for missing values and correct data types; no significant omissions found.
- Data structure was reviewed to ensure a reliable foundation for all following analysis steps.
- VADER sentiment analyzer was used as the primary model for labeling messages as Positive, Negative, or Neutral.
- Thresholds:
- Compound score ≥ 0.05 → Positive
- Compound score ≤ -0.05 → Negative
- Otherwise → Neutral
- Justification: The chosen thresholds match VADER's defaults and were spot-checked against a sample of messages.
- TextBlob was used in parallel for additional validation.
- Model agreement statistics: 66.91% of messages received matching labels from both models.
- Differences and reasoning were fully documented.
- The final sentiment label for each message is the VADER label, which keeps labeling rule-based and reproducible.
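The threshold rule and the agreement statistic above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: `label_from_compound` and `agreement_rate` are hypothetical helper names, the compound score is assumed to have been computed by VADER already, and the second model's labels are taken as given because the report does not state how TextBlob output was mapped to labels.

```python
def label_from_compound(compound):
    """Map a VADER compound score to a label using the thresholds listed above."""
    if compound >= 0.05:
        return "Positive"
    if compound <= -0.05:
        return "Negative"
    return "Neutral"


def agreement_rate(labels_a, labels_b):
    """Percentage of messages on which two label lists agree."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Made-up compound scores, for illustration only.
vader_labels = [label_from_compound(c) for c in (0.62, 0.0, -0.40, 0.03)]
other_labels = ["Positive", "Negative", "Negative", "Neutral"]
print(vader_labels)
print(agreement_rate(vader_labels, other_labels))
```

Both thresholds are inclusive, so a score of exactly 0.05 or -0.05 is not Neutral.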
- Examined data structure, types, and missing values.
- Counted and visualized sentiment label distribution.
- Visualized monthly sentiment trends.
- Performed boxplot analysis of message length across sentiment categories.
- Listed top 10 most active senders.
- Detected outlier months with unusually high messaging or sentiment activity.
Interpretations were provided after every visualization, highlighting their business relevance and guiding next steps.
- Each message is scored by its sentiment label:
- Positive: +1
- Negative: -1
- Neutral: 0
- An employee's monthly score is the sum of their message scores within that calendar month; the total resets at the start of each new month.
- Results are structured for downstream ranking and risk analysis.
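The scoring rule above amounts to a grouped sum. The sketch below assumes each message is a `(sender, sent_date, label)` tuple; that layout, the function name `monthly_scores`, and the sample addresses are illustrative assumptions rather than the notebook's actual structures.

```python
from collections import defaultdict
from datetime import date

# Per-message scores as defined above.
SCORE = {"Positive": 1, "Negative": -1, "Neutral": 0}


def monthly_scores(messages):
    """Sum message scores per (sender, 'YYYY-MM') pair.

    Keying on the calendar month means each month starts from zero.
    """
    totals = defaultdict(int)
    for sender, sent_date, label in messages:
        month = sent_date.strftime("%Y-%m")
        totals[(sender, month)] += SCORE[label]
    return dict(totals)


# Made-up messages, for illustration only.
sample = [
    ("ann@example.com", date(2010, 1, 3), "Positive"),
    ("ann@example.com", date(2010, 1, 9), "Negative"),
    ("ann@example.com", date(2010, 1, 20), "Positive"),
    ("ann@example.com", date(2010, 2, 1), "Neutral"),
    ("bob@example.com", date(2010, 1, 5), "Negative"),
]
print(monthly_scores(sample))
```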
- For each month:
- Top 3 Positive Employees: Highest scores, ranked by score (descending) then alphabetically.
- Top 3 Negative Employees: Lowest scores, ranked by score (ascending) then alphabetically.
- All-month rankings are exported to Excel for transparency.
- Bar charts visually highlight exceptional positive contributors and distributed or concentrated negative sentiment.
- Narrative explanations accompany all outputs, making the findings actionable.
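The tie-breaking rule described above (score first, then alphabetical order) can be expressed with a compound sort key. This is a sketch under the assumption that one month's scores are held in a dict mapping employee to score; `top_three` and the sample names are hypothetical.

```python
def top_three(scores):
    """Return (top positive, top negative) employees for a single month.

    scores: dict mapping employee identifier -> monthly sentiment score.
    Positive ranking: score descending, then name ascending.
    Negative ranking: score ascending, then name ascending.
    """
    items = list(scores.items())
    positive = sorted(items, key=lambda kv: (-kv[1], kv[0]))[:3]
    negative = sorted(items, key=lambda kv: (kv[1], kv[0]))[:3]
    return positive, negative


# Made-up scores, for illustration only.
month = {"dana": 5, "alex": 5, "bo": 9, "cy": -1, "eli": -1, "flo": 0}
best, worst = top_three(month)
print(best)
print(worst)
```

Negating the score in the first key gives descending order on score while keeping ascending order on the name, which a single `reverse=True` would not.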
- Flight risk is defined as any employee sending 4+ negative emails within any rolling 30-day window, per project instructions.
- Each flagged employee appears only once, even if multiple risk windows exist.
- Results are printed and saved for HR/leadership review.
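One way to implement the rolling-window rule is to sort each employee's negative-message dates and check whether any four consecutive dates span fewer than 30 days. The sketch below makes that boundary an explicit assumption (the report does not say whether the window is inclusive), and `flag_flight_risks` and the tuple layout are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def flag_flight_risks(messages, threshold=4, window_days=30):
    """Return a sorted list of senders with `threshold`+ negative messages
    inside some `window_days`-day window.

    messages: iterable of (sender, sent_date, label) tuples.
    Assumption: a window is satisfied when the k-th negative message falls
    fewer than `window_days` days after the first of the k.
    """
    negatives = defaultdict(list)
    for sender, sent_date, label in messages:
        if label == "Negative":
            negatives[sender].append(sent_date)

    flagged = set()  # a set, so each employee appears only once
    for sender, dates in negatives.items():
        dates.sort()
        for i in range(threshold - 1, len(dates)):
            if dates[i] - dates[i - threshold + 1] < timedelta(days=window_days):
                flagged.add(sender)
                break
    return sorted(flagged)


# Made-up messages, for illustration only.
sample_msgs = [
    ("ann@example.com", date(2010, 1, 1), "Negative"),
    ("ann@example.com", date(2010, 1, 5), "Negative"),
    ("ann@example.com", date(2010, 1, 20), "Negative"),
    ("ann@example.com", date(2010, 1, 29), "Negative"),
    ("bob@example.com", date(2010, 1, 1), "Negative"),
    ("bob@example.com", date(2010, 2, 1), "Negative"),
    ("bob@example.com", date(2010, 3, 1), "Negative"),
    ("bob@example.com", date(2010, 4, 1), "Negative"),
]
print(flag_flight_risks(sample_msgs))
```

Because the window rolls over dates rather than calendar months, four negative messages that straddle a month boundary still count.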
- Linear regression predicts each employee's monthly sentiment score using these features:
- Message count
- Total length
- Average length
- Total words
- Average words
- The dataset is split 80/20 for train/test evaluation.
- Model metrics:
- R² Score: 0.72 (the model explains about 72% of the variance in monthly scores)
- Mean Absolute Error (MAE): 1.40 (on average, predictions differ from the actual monthly score by 1.4 points)
- Actual vs. predicted plots and interpretation illustrate model fit and highlight occasional outliers.
- The coefficient table shows message count is the strongest predictor of the monthly sentiment score; other features have weak or marginal effects.
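For readers unfamiliar with the two metrics, the sketch below computes them from their standard definitions in plain Python. It is illustrative only: the report does not say which library produced the figures above, and the function names `mae` and `r_squared` and the sample values are hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up monthly scores, for illustration only.
y_true = [1, 2, 3, 4]
y_pred = [1, 2, 3, 6]
print(mae(y_true, y_pred))
print(r_squared(y_true, y_pred))
```

An R² of 1 means predictions match exactly, while 0 means the model does no better than always predicting the mean score.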
- Majority of messages: Positive (1,528)
- Neutral messages: 511
- Negative messages: 152
Interpretation: Communication climate is generally positive, with some evidence of disengagement/complaint.
- Employee sentiment is consistently positive month-to-month.
- Neutral/negative messages did not display persistent spikes, but rare months with high activity were flagged for context-specific review.

Interpretation: No overwhelming periods of negativity; most changes seemed minor and periodic.
- Positive and negative messages tend to be longer than neutral ones.
- Negative outliers are unusually verbose, potentially signifying detailed complaints.

Interpretation: Brevity in neutral communications may indicate disengagement; verbosity in negativity could reflect greater issue detail.
- Most active senders are central to communication patterns and potentially to company morale.

Interpretation: Engagement monitoring should prioritize these individuals and watch for shifts in their communication style or frequency.
| Rank | Employee Email | Sentiment Score | Category |
|---|---|---|---|
| 1 | kayne.coulter@enron.com | 13 | Positive |
| 2 | eric.bass@enron.com | 9 | Positive |
| 3 | lydia.delgado@enron.com | 9 | Positive |
| 1 | bobette.riner@ipgdirect.com | 1 | Negative |
| 2 | johnny.palmer@enron.com | 1 | Negative |
| 3 | rhonda.denton@enron.com | 1 | Negative |
Interpretation: January 2010 saw strong positive contributions from the top three, especially one clear standout. Negative scores were tied and low—negative sentiment was distributed, not concentrated.
- Identified (based on 4+ negative emails within any 30-day rolling window):
Interpretation: These individuals warrant proactive engagement and potential HR support.
- R² Score: 0.72
- MAE: 1.40
Feature Importances (Coefficients):
| Feature | Coefficient | Interpretation |
|---|---|---|
| message_count | 0.57 | Most predictive; more monthly activity = higher sentiment |
| total_words | 0.01 | Minor positive influence |
| avg_length | 0.00 | Small positive effect |
| avg_words | -0.00 | Very weak negative, not predictive beyond message_count |
| total_length | -0.00 | Negligible |
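Interpretation: Frequent, regular communication is a strong signal of a higher monthly sentiment score, while writing longer or wordier emails is not, by itself, a predictor of improved attitude or engagement. Because the monthly score is a sum of per-message labels and most messages are labeled Positive, part of this relationship is expected by construction.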
- Bar chart: Sentiment label distribution.
- Stacked bar chart: Monthly sentiment trends.
- Boxplot: Message length per sentiment.
- Grouped bar chart: Top 3 positive/negative employees each month.
- Scatter plot: Actual vs. predicted sentiment scores (model fit).
- Excel: Monthly employee rankings and flight risk employee lists.
- Proactive HR Engagement: Intervene early with employees flagged as flight risks.
- Recognize/Support Top Performers: Publicly acknowledge consistently positive employees; support those with ongoing negative sentiment.
- Ongoing Monitoring: Repeat sentiment and ranking analysis on a regular schedule.
- Model Improvement: Consider using richer text-based features or additional metadata for even better predictive accuracy.
- All steps are sequentially titled, human-commented, and include rationale and technical logic.
- Every major section’s output is clearly spaced and interpreted, ensuring readability and stakeholder utility.
- The pipeline allows any stakeholder to follow the narrative from data to findings to recommended action without reference to the code alone.
- `LLM-Final-Assessment.ipynb`: Step-by-step code and detailed narrative, ready for review or rerun.
- `visualization/` folder: All key charts, plots, and Excel outputs for visualization and monthly rankings.
- This `README.md`: Summary of findings, outcome tables, and recommendations.
- Open and run the notebook/cell file (`LLM-Final-Assessment.ipynb`).
- Review the in-line outputs and observations after each section.
- Examine the visualizations and exported Excel/CSV files for reference and leadership review.
- Use the interpretation comments to shape next steps or HR interventions.
End of Report & README