Conversation
- Created requirements.txt with necessary dependencies. - Added a script to convert legacy .p pickle files to JSON format. - Introduced sample_records.txt fixture for testing. - Implemented unit tests for kNN functionality including vector similarity and Hamming distance. - Added smoke tests for Naive Bayes module and basic import checks.
…efactor kNN and Naive Bayes code for better error handling and import efficiency
There was a problem hiding this comment.
Pull request overview
This PR introduces comprehensive cleanup and standardization to the project, modernizing Python 2 code to Python 3, establishing automated code quality checks, and migrating from pickle to JSON for data persistence.
Changes:
- Added pre-commit configuration and GitHub Actions CI workflows for automated code quality enforcement (formatting, linting, type checking)
- Migrated term-ranking persistence from pickle
.pfiles to JSON format with backward compatibility and a conversion script - Added developer documentation, AI agent instructions, smoke tests, and test fixtures
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_smoke.py | Basic import validation test for module loading |
| tests/test_naivebayes_smoke.py | Smoke tests for Naive Bayes parsing and prediction flow |
| tests/test_knn_core.py | Core functionality tests for kNN similarity and sampling functions |
| tests/fixtures/sample_records.txt | Sample XML fixture for testing record parsing |
| setup.cfg | Python package metadata configuration |
| scripts/convert_pickles_to_json.py | Utility script to convert legacy pickle files to JSON |
| pyproject.toml | Build system configuration for setuptools |
| project3/naiveBayes.py | Modernized Naive Bayes classifier with Python 3 syntax and type hints |
| project3/init.py | Package initialization for project3 module |
| mypy.ini | Type checking configuration with strict mode |
| kNN/kNN.py | Modernized kNN classifier with JSON persistence support |
| kNN/init.py | Package initialization for kNN module |
| README.md | Updated documentation reflecting JSON persistence standard |
| CONTRIBUTING.md | New developer setup and workflow documentation |
| .pre-commit-config.yaml | Pre-commit hooks configuration for code quality tools |
| .github/workflows/pre-commit-ci.yml | GitHub Actions workflow for pre-commit checks |
| .github/workflows/ci.yml | Comprehensive CI pipeline with linting, testing, and type checking |
| .github/copilot-instructions.md | AI agent guidance document for project architecture and patterns |
| .flake8 | Flake8 linting configuration |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 19 out of 20 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 19 out of 20 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| termClassOccurrenceLookup | ||
| if termClassOccurrenceLookup is not None | ||
| else globals().get("termClassOccurrenceLookup", {}) |
There was a problem hiding this comment.
Using globals() as a fallback for optional parameters is fragile and makes the code harder to reason about. If termClassOccurrenceLookup is None, explicitly raise an error or require it as a mandatory parameter rather than searching in global scope.
| totalTermsInTrainingDocPerClass | ||
| if totalTermsInTrainingDocPerClass is not None | ||
| else globals().get("totalTermsInTrainingDocPerClass", {}) |
There was a problem hiding this comment.
Same issue as above—using globals() as a fallback creates implicit dependencies on global state. Make this parameter required or document why global fallback is acceptable.
| ( | ||
| termClassOccurrenceLookup | ||
| if termClassOccurrenceLookup is not None | ||
| else globals().get("termClassOccurrenceLookup", {}) | ||
| ), | ||
| ( | ||
| totalTermsInTrainingDocPerClass | ||
| if totalTermsInTrainingDocPerClass is not None | ||
| else globals().get("totalTermsInTrainingDocPerClass", {}) | ||
| ), |
There was a problem hiding this comment.
Fourth occurrence of the globals() fallback pattern. Consider refactoring both generateLabeledRecordPredictions and generateUnlabeledRecordPredictions to require these parameters.
| ( | |
| termClassOccurrenceLookup | |
| if termClassOccurrenceLookup is not None | |
| else globals().get("termClassOccurrenceLookup", {}) | |
| ), | |
| ( | |
| totalTermsInTrainingDocPerClass | |
| if totalTermsInTrainingDocPerClass is not None | |
| else globals().get("totalTermsInTrainingDocPerClass", {}) | |
| ), | |
| termClassOccurrenceLookup, | |
| totalTermsInTrainingDocPerClass, |
| if not top_n_KNN: | ||
| # no neighbors available, skip |
There was a problem hiding this comment.
This early-exit changes the loop behavior compared to the original code which didn't guard against empty top_n_KNN. Consider adding a log message or ensuring this edge case is tested, as skipping records silently could mask data issues.
| labelScores: Dict[str, float] = {} | ||
| for label in classificationLabels: | ||
| # initialize per-label accumulator before iterating neighbors | ||
| labelScores[label] = 0.0 |
There was a problem hiding this comment.
The original code initialized labelScores inside the inner loop (line 772 in the old code), resetting it for each neighbor. The refactored version initializes it once per record and accumulates across neighbors, which fundamentally changes the algorithm logic and will produce incorrect results.
| n_records = len(recordTuples) | ||
| if n_records == 0 or K <= 0: | ||
| return 0 | ||
| kChunk = math.ceil(n_records / K) |
There was a problem hiding this comment.
math.ceil returns a float in Python; for list slicing this should be cast to int: int(math.ceil(n_records / K)) or use integer division with ceiling: (n_records + K - 1) // K.
| kChunk = math.ceil(n_records / K) | |
| kChunk = (n_records + K - 1) // K |
|
|
||
| on: | ||
| push: | ||
| branches: [ main, master, cleanup_and_standardization ] |
There was a problem hiding this comment.
The branch cleanup_and_standardization appears to be a feature branch and should not be in the CI trigger list for pushes. Remove it to avoid running CI on every push to this temporary branch.
| branches: [ main, master, cleanup_and_standardization ] | |
| branches: [ main, master ] |
This pull request introduces a comprehensive cleanup and standardization of the project, focusing on code quality, developer tooling, and data persistence. The main goals are to enforce consistent formatting and linting, migrate legacy pickle files to JSON for term rankings, and provide clear developer documentation and workflows. Automated CI and pre-commit hooks are now integrated to maintain code standards, and new tests and fixtures ensure basic module functionality.
Developer Tooling and CI Integration
.pre-commit-config.yaml) forblack,flake8, andmypy, and documented developer setup inCONTRIBUTING.mdto enforce code style, formatting, and type checking locally. [1] [2].github/workflows/ci.yml) and pre-commit checks (.github/workflows/pre-commit-ci.yml) to automate linting, formatting, type checks, and tests on push/PR. [1] [2]Persistence Standardization
.pfiles to JSON, updated documentation inREADME.md, and provided a conversion scriptscripts/convert_pickles_to_json.pyto assist with legacy file migration. [1] [2] [3]Documentation and Instructions
.github/copilot-instructions.md) detailing project architecture, critical patterns, workflows, and common pitfalls for new contributors.Testing and Fixtures
tests/test_knn_core.py,tests/test_naivebayes_smoke.py,tests/test_smoke.py) and a sample XML fixture (tests/fixtures/sample_records.txt) to validate parsing and basic functionality. [1] [2] [3] [4]Configuration Updates
mypy.ini), flake8 linting (.flake8), and Python packaging (setup.cfg,pyproject.toml). [1] [2] [3] [4]