Cleanup and standardization by kingrichard2005 · Pull Request #1 · kingrichard2005/cs578-project3

kingrichard2005 · 2026-01-23T16:48:58Z

This pull request introduces a comprehensive cleanup and standardization of the project, focusing on code quality, developer tooling, and data persistence. The main goals are to enforce consistent formatting and linting, migrate legacy pickle files to JSON for term rankings, and provide clear developer documentation and workflows. Automated CI and pre-commit hooks are now integrated to maintain code standards, and new tests and fixtures ensure basic module functionality.

Developer Tooling and CI Integration

Added pre-commit configuration (.pre-commit-config.yaml) for black, flake8, and mypy, and documented developer setup in CONTRIBUTING.md to enforce code style, formatting, and type checking locally. [1] [2]
Introduced GitHub Actions workflows for CI (.github/workflows/ci.yml) and pre-commit checks (.github/workflows/pre-commit-ci.yml) to automate linting, formatting, type checks, and tests on push/PR. [1] [2]

Persistence Standardization

Migrated term-ranking persistence from pickle .p files to JSON, updated documentation in README.md, and provided a conversion script scripts/convert_pickles_to_json.py to assist with legacy file migration. [1] [2] [3]

Documentation and Instructions

Added a comprehensive AI agent instruction file (.github/copilot-instructions.md) detailing project architecture, critical patterns, workflows, and common pitfalls for new contributors.

Testing and Fixtures

Added smoke and core tests for both classifiers (tests/test_knn_core.py, tests/test_naivebayes_smoke.py, tests/test_smoke.py) and a sample XML fixture (tests/fixtures/sample_records.txt) to validate parsing and basic functionality. [1] [2] [3] [4]

Configuration Updates

Added and updated configuration files for type checking (mypy.ini), flake8 linting (.flake8), and Python packaging (setup.cfg, pyproject.toml). [1] [2] [3] [4]

- Created requirements.txt with necessary dependencies. - Added a script to convert legacy .p pickle files to JSON format. - Introduced sample_records.txt fixture for testing. - Implemented unit tests for kNN functionality including vector similarity and Hamming distance. - Added smoke tests for Naive Bayes module and basic import checks.

… improvements

…efactor kNN and Naive Bayes code for better error handling and import efficiency

Copilot

Pull request overview

This PR introduces comprehensive cleanup and standardization to the project, modernizing Python 2 code to Python 3, establishing automated code quality checks, and migrating from pickle to JSON for data persistence.

Changes:

Added pre-commit configuration and GitHub Actions CI workflows for automated code quality enforcement (formatting, linting, type checking)
Migrated term-ranking persistence from pickle .p files to JSON format with backward compatibility and a conversion script
Added developer documentation, AI agent instructions, smoke tests, and test fixtures

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
tests/test_smoke.py	Basic import validation test for module loading
tests/test_naivebayes_smoke.py	Smoke tests for Naive Bayes parsing and prediction flow
tests/test_knn_core.py	Core functionality tests for kNN similarity and sampling functions
tests/fixtures/sample_records.txt	Sample XML fixture for testing record parsing
setup.cfg	Python package metadata configuration
scripts/convert_pickles_to_json.py	Utility script to convert legacy pickle files to JSON
pyproject.toml	Build system configuration for setuptools
project3/naiveBayes.py	Modernized Naive Bayes classifier with Python 3 syntax and type hints
project3/init.py	Package initialization for project3 module
mypy.ini	Type checking configuration with strict mode
kNN/kNN.py	Modernized kNN classifier with JSON persistence support
kNN/init.py	Package initialization for kNN module
README.md	Updated documentation reflecting JSON persistence standard
CONTRIBUTING.md	New developer setup and workflow documentation
.pre-commit-config.yaml	Pre-commit hooks configuration for code quality tools
.github/workflows/pre-commit-ci.yml	GitHub Actions workflow for pre-commit checks
.github/workflows/ci.yml	Comprehensive CI pipeline with linting, testing, and type checking
.github/copilot-instructions.md	AI agent guidance document for project architecture and patterns
.flake8	Flake8 linting configuration

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

kNN/kNN.py

project3/naiveBayes.py

.github/copilot-instructions.md

Copilot

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 8 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

kNN/kNN.py

project3/naiveBayes.py

kNN/kNN.py

Copilot

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 8 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-28T03:25:15Z

project3/naiveBayes.py

+                            termClassOccurrenceLookup
+                            if termClassOccurrenceLookup is not None
+                            else globals().get("termClassOccurrenceLookup", {})


Using globals() as a fallback for optional parameters is fragile and makes the code harder to reason about. If termClassOccurrenceLookup is None, explicitly raise an error or require it as a mandatory parameter rather than searching in global scope.

Copilot · 2026-01-28T03:25:15Z

project3/naiveBayes.py

+                            totalTermsInTrainingDocPerClass
+                            if totalTermsInTrainingDocPerClass is not None
+                            else globals().get("totalTermsInTrainingDocPerClass", {})


Same issue as above—using globals() as a fallback creates implicit dependencies on global state. Make this parameter required or document why global fallback is acceptable.

project3/naiveBayes.py

Copilot · 2026-01-28T03:25:16Z

project3/naiveBayes.py

+                        (
+                            termClassOccurrenceLookup
+                            if termClassOccurrenceLookup is not None
+                            else globals().get("termClassOccurrenceLookup", {})
+                        ),
+                        (
+                            totalTermsInTrainingDocPerClass
+                            if totalTermsInTrainingDocPerClass is not None
+                            else globals().get("totalTermsInTrainingDocPerClass", {})
+                        ),


Fourth occurrence of the globals() fallback pattern. Consider refactoring both generateLabeledRecordPredictions and generateUnlabeledRecordPredictions to require these parameters.

Suggested change

(

termClassOccurrenceLookup

if termClassOccurrenceLookup is not None

else globals().get("termClassOccurrenceLookup", {})

),

(

totalTermsInTrainingDocPerClass

if totalTermsInTrainingDocPerClass is not None

else globals().get("totalTermsInTrainingDocPerClass", {})

),

termClassOccurrenceLookup,

totalTermsInTrainingDocPerClass,

Copilot · 2026-01-28T03:25:16Z

kNN/kNN.py

+            if not top_n_KNN:
+                # no neighbors available, skip


This early-exit changes the loop behavior compared to the original code which didn't guard against empty top_n_KNN. Consider adding a log message or ensuring this edge case is tested, as skipping records silently could mask data issues.

Copilot · 2026-01-28T03:25:16Z

kNN/kNN.py

+            labelScores: Dict[str, float] = {}
            for label in classificationLabels:
+                # initialize per-label accumulator before iterating neighbors
+                labelScores[label] = 0.0


The original code initialized labelScores inside the inner loop (line 772 in the old code), resetting it for each neighbor. The refactored version initializes it once per record and accumulates across neighbors, which fundamentally changes the algorithm logic and will produce incorrect results.

Copilot · 2026-01-28T03:25:17Z

kNN/kNN.py

+        n_records = len(recordTuples)
+        if n_records == 0 or K <= 0:
+            return 0
+        kChunk = math.ceil(n_records / K)


math.ceil returns a float in Python; for list slicing this should be cast to int: int(math.ceil(n_records / K)) or use integer division with ceiling: (n_records + K - 1) // K.

Suggested change

kChunk = math.ceil(n_records / K)

kChunk = (n_records + K - 1) // K

Copilot · 2026-01-28T03:25:17Z

.github/workflows/ci.yml

+
+on:
+  push:
+    branches: [ main, master, cleanup_and_standardization ]


The branch cleanup_and_standardization appears to be a feature branch and should not be in the CI trigger list for pushes. Remove it to avoid running CI on every push to this temporary branch.

Suggested change

branches: [ main, master, cleanup_and_standardization ]

branches: [ main, master ]

kingrichard2005 added 4 commits December 9, 2025 20:52

Refactor kNN and Naive Bayes code for type hinting and error handling…

f7cf675

… improvements

added initial copilot-instructions

322d607

Enhance CI configuration, improve Python dependency management, and r…

cdd25d0

…efactor kNN and Naive Bayes code for better error handling and import efficiency

kingrichard2005 requested a review from Copilot January 23, 2026 16:48

Copilot AI reviewed Jan 23, 2026

View reviewed changes

kingrichard2005 added 2 commits January 26, 2026 19:22

updates from collab pr feedback

0702b00

fixed format errors from ci pipeline feedback

87fb7be

kingrichard2005 requested a review from Copilot January 27, 2026 03:27

Copilot AI reviewed Jan 27, 2026

View reviewed changes

more updates from collab pr feedback

675b9a9

kingrichard2005 requested a review from Copilot January 28, 2026 03:24

Copilot AI reviewed Jan 28, 2026

View reviewed changes

	kChunk = math.ceil(n_records / K)
	kChunk = (n_records + K - 1) // K

	branches: [ main, master, cleanup_and_standardization ]
	branches: [ main, master ]

Conversation

kingrichard2005 commented Jan 23, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants