Fix: Implemented Label Encoding and Unified ML Models using Pipelines #139

itsvishwasj · 2025-10-26T06:56:55Z

🛠️ Pull Request Template

🏷️ PR Type

🐞 Bug Fix
🛠️ Improvement / Refactor

🔗 Related Issue

Closes #68 (Training pipeline breaks without Label Encoding and integrated TF-IDF)

📝 Rationale / Motivation

This PR fully resolves Issue #68, which reported errors and instability in the classical machine learning model training workflow.

String Label Error: Fixed runtime errors caused by passing non-numerical string labels (e.g., "half-true") directly to scikit-learn classifiers. The labels are now converted to integers with LabelEncoder before training.
Preprocessing Duplication: Fixed inconsistent and redundant TfidfVectorizer steps by bundling the vectorizer and the classifier into a single Pipeline object.
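
For reviewers unfamiliar with the label-encoding step, here is a minimal sketch of what LabelEncoder does. The sample labels are hypothetical and chosen only for illustration; they are not read from the project's dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of string labels (illustration only, not the PR's data).
labels = ["half-true", "false", "true", "false", "pants-fire"]

encoder = LabelEncoder()
# fit_transform sorts the unique labels and maps each one to its index.
y = encoder.fit_transform(labels)

print(list(encoder.classes_))                 # the sorted unique labels
print(y.tolist())                             # integer code per sample
print(encoder.inverse_transform(y).tolist())  # round trip back to strings
```

Keeping the fitted encoder around matters: inverse_transform is what turns integer predictions back into readable label names.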

The changes significantly improve stability, data consistency, and code clarity.

✨ Description of Changes

Core Fixes Applied:
- Label Encoding: Implemented sklearn.preprocessing.LabelEncoder to convert the 6 unique string labels to integers across all affected files.
- Pipeline Integration: Refactored training logic to use sklearn.pipeline.Pipeline, combining TfidfVectorizer with the classifier (NB, LR, RF) in a single object.
Files Modified:
- scripts/fake_news_logreg_rf.py: Implemented Pipelines for LogisticRegression, RandomForestClassifier, and MultinomialNB.
- module/liar-data-analysis.py: Implemented Label Encoding and Pipelines for analysis examples.
- module/fake-news-detection-using-nb.ipynb: Applied Label Encoding and the MultinomialNB Pipeline logic.
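
The pattern described above can be sketched roughly as follows. This is not the code from the modified files: the toy texts, the build_pipeline helper, and the hyperparameters (max_iter, n_estimators, random_state) are assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy statements and labels (hypothetical, for illustration only).
texts = [
    "taxes were cut last year",
    "the moon is made of cheese",
    "unemployment fell slightly this quarter",
    "aliens run the city council",
    "the budget grew by some amount",
    "crime dropped in a few districts",
]
labels = ["true", "false", "half-true", "false", "half-true", "true"]

encoder = LabelEncoder()
y = encoder.fit_transform(labels)


def build_pipeline(classifier):
    """Bundle TF-IDF vectorization and a classifier into one estimator."""
    return Pipeline([("tfidf", TfidfVectorizer()), ("clf", classifier)])


# Hyperparameters below are assumptions, not values taken from the PR.
models = {
    "NB": build_pipeline(MultinomialNB()),
    "LR": build_pipeline(LogisticRegression(max_iter=1000)),
    "RF": build_pipeline(RandomForestClassifier(n_estimators=50, random_state=42)),
}

for name, model in models.items():
    # Raw text goes in; the pipeline vectorizes it before the classifier sees it.
    model.fit(texts, y)
    predicted = encoder.inverse_transform(model.predict(["taxes were cut last year"]))
    print(name, predicted[0])
```

Because the vectorizer lives inside each pipeline, the same vocabulary and IDF weights learned during fit are reused at predict time, which is what removes the duplicated preprocessing.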

🧪 Testing Instructions

Pull this branch and ensure your dependencies are installed.
Run the main comparison script:
```
python scripts/fake_news_logreg_rf.py
```
Expected Results:
- The script must execute without crashing (no more errors about string labels).
- Accuracy scores and classification reports for all three models (NB, LR, RF) should be printed to the console.
- New or updated result files (.md, confusion matrix PNGs) should be generated in the results/ directory.
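
As a rough guide to what the console output involves, the sketch below computes an accuracy score and a classification report for one pipeline, mapping the integer classes back to label names. The train/test data and variable names are hypothetical, and the actual script's output format may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical train/test split (illustration only).
train_texts = [
    "taxes were cut last year",
    "the moon is made of cheese",
    "unemployment fell slightly this quarter",
    "aliens run the city council",
]
train_labels = ["true", "false", "half-true", "false"]
test_texts = ["taxes were cut", "cheese aliens run the moon"]
test_labels = ["true", "false"]

encoder = LabelEncoder()
y_train = encoder.fit_transform(train_labels)
# transform (not fit_transform) so test labels reuse the training mapping.
y_test = encoder.transform(test_labels)

model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
model.fit(train_texts, y_train)
y_pred = model.predict(test_texts)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(
    y_test,
    y_pred,
    labels=list(range(len(encoder.classes_))),  # report every known class
    target_names=list(encoder.classes_),        # show strings, not integers
    zero_division=0,                            # avoid warnings for absent classes
)
print(f"Accuracy: {accuracy:.3f}")
print(report)
```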

👀 Impact Assessment

Stability: High impact, resolving the core ML runtime errors and inconsistencies reported in #68 (Training pipeline breaks without Label Encoding and integrated TF-IDF).
Maintainability: Improves code cleanliness by eliminating redundant vectorization steps.

⚡ Checklist

Code follows project’s coding style and guidelines
Changes are tested locally
Automated tests added/updated (N/A)
Documentation updated (N/A)
User-facing changes are documented (N/A)
Related issue linked
No new warnings/errors introduced

⚠️ Breaking Changes

None. This is an internal fix that preserves the input/output of the scripts.

🎯 Priority / Impact Level

Priority: High (Critical bug fix)
Impact: Medium (Internal code structure)

vercel · 2025-10-26T06:57:00Z

@itsvishwasj is attempting to deploy a commit to the deepika's projects Team on Vercel.

A member of the Team first needs to authorize it.

Fix: Implemented Label Encoding and unified TF-IDF using Pipelines fo…

1ed65e4

…r classical ML models (closes #<issue-number>)

itsvishwasj marked this pull request as draft October 26, 2025 07:02
