Skip to content

Cleanup and standardization#1

Open
kingrichard2005 wants to merge 7 commits intomasterfrom
cleanup_and_standardization
Open

Cleanup and standardization#1
kingrichard2005 wants to merge 7 commits intomasterfrom
cleanup_and_standardization

Conversation

@kingrichard2005
Copy link
Copy Markdown
Owner

This pull request introduces a comprehensive cleanup and standardization of the project, focusing on code quality, developer tooling, and data persistence. The main goals are to enforce consistent formatting and linting, migrate legacy pickle files to JSON for term rankings, and provide clear developer documentation and workflows. Automated CI and pre-commit hooks are now integrated to maintain code standards, and new tests and fixtures ensure basic module functionality.

Developer Tooling and CI Integration

  • Added pre-commit configuration (.pre-commit-config.yaml) for black, flake8, and mypy, and documented developer setup in CONTRIBUTING.md to enforce code style, formatting, and type checking locally. [1] [2]
  • Introduced GitHub Actions workflows for CI (.github/workflows/ci.yml) and pre-commit checks (.github/workflows/pre-commit-ci.yml) to automate linting, formatting, type checks, and tests on push/PR. [1] [2]

Persistence Standardization

  • Migrated term-ranking persistence from pickle .p files to JSON, updated documentation in README.md, and provided a conversion script scripts/convert_pickles_to_json.py to assist with legacy file migration. [1] [2] [3]

Documentation and Instructions

  • Added a comprehensive AI agent instruction file (.github/copilot-instructions.md) detailing project architecture, critical patterns, workflows, and common pitfalls for new contributors.

Testing and Fixtures

  • Added smoke and core tests for both classifiers (tests/test_knn_core.py, tests/test_naivebayes_smoke.py, tests/test_smoke.py) and a sample XML fixture (tests/fixtures/sample_records.txt) to validate parsing and basic functionality. [1] [2] [3] [4]

Configuration Updates

  • Added and updated configuration files for type checking (mypy.ini), flake8 linting (.flake8), and Python packaging (setup.cfg, pyproject.toml). [1] [2] [3] [4]

- Created requirements.txt with necessary dependencies.
- Added a script to convert legacy .p pickle files to JSON format.
- Introduced sample_records.txt fixture for testing.
- Implemented unit tests for kNN functionality including vector similarity and Hamming distance.
- Added smoke tests for Naive Bayes module and basic import checks.
…efactor kNN and Naive Bayes code for better error handling and import efficiency
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces comprehensive cleanup and standardization to the project, modernizing Python 2 code to Python 3, establishing automated code quality checks, and migrating from pickle to JSON for data persistence.

Changes:

  • Added pre-commit configuration and GitHub Actions CI workflows for automated code quality enforcement (formatting, linting, type checking)
  • Migrated term-ranking persistence from pickle .p files to JSON format with backward compatibility and a conversion script
  • Added developer documentation, AI agent instructions, smoke tests, and test fixtures

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/test_smoke.py Basic import validation test for module loading
tests/test_naivebayes_smoke.py Smoke tests for Naive Bayes parsing and prediction flow
tests/test_knn_core.py Core functionality tests for kNN similarity and sampling functions
tests/fixtures/sample_records.txt Sample XML fixture for testing record parsing
setup.cfg Python package metadata configuration
scripts/convert_pickles_to_json.py Utility script to convert legacy pickle files to JSON
pyproject.toml Build system configuration for setuptools
project3/naiveBayes.py Modernized Naive Bayes classifier with Python 3 syntax and type hints
project3/init.py Package initialization for project3 module
mypy.ini Type checking configuration with strict mode
kNN/kNN.py Modernized kNN classifier with JSON persistence support
kNN/init.py Package initialization for kNN module
README.md Updated documentation reflecting JSON persistence standard
CONTRIBUTING.md New developer setup and workflow documentation
.pre-commit-config.yaml Pre-commit hooks configuration for code quality tools
.github/workflows/pre-commit-ci.yml GitHub Actions workflow for pre-commit checks
.github/workflows/ci.yml Comprehensive CI pipeline with linting, testing, and type checking
.github/copilot-instructions.md AI agent guidance document for project architecture and patterns
.flake8 Flake8 linting configuration

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +270 to +272
termClassOccurrenceLookup
if termClassOccurrenceLookup is not None
else globals().get("termClassOccurrenceLookup", {})
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using globals() as a fallback for optional parameters is fragile and makes the code harder to reason about. If termClassOccurrenceLookup is None, explicitly raise an error or require it as a mandatory parameter rather than searching in global scope.

Copilot uses AI. Check for mistakes.
Comment on lines +275 to +277
totalTermsInTrainingDocPerClass
if totalTermsInTrainingDocPerClass is not None
else globals().get("totalTermsInTrainingDocPerClass", {})
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as above—using globals() as a fallback creates implicit dependencies on global state. Make this parameter required or document why global fallback is acceptable.

Copilot uses AI. Check for mistakes.
Comment on lines +426 to +435
(
termClassOccurrenceLookup
if termClassOccurrenceLookup is not None
else globals().get("termClassOccurrenceLookup", {})
),
(
totalTermsInTrainingDocPerClass
if totalTermsInTrainingDocPerClass is not None
else globals().get("totalTermsInTrainingDocPerClass", {})
),
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fourth occurrence of the globals() fallback pattern. Consider refactoring both generateLabeledRecordPredictions and generateUnlabeledRecordPredictions to require these parameters.

Suggested change
(
termClassOccurrenceLookup
if termClassOccurrenceLookup is not None
else globals().get("termClassOccurrenceLookup", {})
),
(
totalTermsInTrainingDocPerClass
if totalTermsInTrainingDocPerClass is not None
else globals().get("totalTermsInTrainingDocPerClass", {})
),
termClassOccurrenceLookup,
totalTermsInTrainingDocPerClass,

Copilot uses AI. Check for mistakes.
Comment on lines +594 to +595
if not top_n_KNN:
# no neighbors available, skip
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This early-exit changes the loop behavior compared to the original code which didn't guard against empty top_n_KNN. Consider adding a log message or ensuring this edge case is tested, as skipping records silently could mask data issues.

Copilot uses AI. Check for mistakes.
Comment on lines +597 to +600
labelScores: Dict[str, float] = {}
for label in classificationLabels:
# initialize per-label accumulator before iterating neighbors
labelScores[label] = 0.0
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code initialized labelScores inside the inner loop (line 772 in the old code), resetting it for each neighbor. The refactored version initializes it once per record and accumulates across neighbors, which fundamentally changes the algorithm logic and will produce incorrect results.

Copilot uses AI. Check for mistakes.
n_records = len(recordTuples)
if n_records == 0 or K <= 0:
return 0
kChunk = math.ceil(n_records / K)
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

math.ceil returns a float in Python; for list slicing this should be cast to int: int(math.ceil(n_records / K)) or use integer division with ceiling: (n_records + K - 1) // K.

Suggested change
kChunk = math.ceil(n_records / K)
kChunk = (n_records + K - 1) // K

Copilot uses AI. Check for mistakes.

on:
push:
branches: [ main, master, cleanup_and_standardization ]
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The branch cleanup_and_standardization appears to be a feature branch and should not be in the CI trigger list for pushes. Remove it to avoid running CI on every push to this temporary branch.

Suggested change
branches: [ main, master, cleanup_and_standardization ]
branches: [ main, master ]

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants