kingrichard2005 · Dec 10, 2025 · Jan 23, 2026
diff --git a/.flake8 b/.flake8
@@ -0,0 +1,3 @@
+[flake8]
+max-line-length = 120
+extend-ignore = E203, E265, F841, E712
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
@@ -0,0 +1,111 @@
+# AI Coding Agent Instructions for cs578-project3
+
+## Project Overview
+This is a medical record classification system that predicts patient smoking status (SMOKER, NON-SMOKER, UNKNOWN) from scrubbed medical text using Naive Bayes and K-Nearest Neighbor (kNN) algorithms. The project is structured as a CS578 machine learning course assignment with strict XML input format requirements.
+
+## Architecture
+
+### Module Structure
+- **`project3/naiveBayes.py`**: Naive Bayes classifier with Bayesian/Dirichlet smoothing (615 lines)
+- **`kNN/kNN.py`**: kNN classifier with multiple similarity/association functions (1245 lines)
+- **`lib/persist.py`**: JSON persistence utilities (replaces legacy pickle format)
+- **`tests/`**: Pytest-based smoke tests and fixture data
+- **`scripts/convert_pickles_to_json.py`**: Migration tool for legacy `.p` files
+
+### Data Flow
+1. XML records parsed via regex into 3-tuples: `(Record ID, Label, Term String)`
+2. Text preprocessing: strip numbers/punctuation using regex `[^0-9\.\-\s\/\:\,_\[\]\%]+`
+3. Feature extraction: term frequencies, chi-square/dice coefficient for term ranking
+4. Classification: Naive Bayes uses term probability; kNN uses vector similarity
+5. Results written to output files (not test evaluated)
+
+## Critical Patterns
+
+### XML Input Format
+```xml
+<RECORD ID="123">
+  <SMOKING STATUS="SMOKER"></SMOKING>
+  <TEXT>patient medical notes...</TEXT>
+</RECORD>
+```
+- Training data includes `<SMOKING STATUS>`, test data omits it
+- Parse with regex patterns in `getTrainingSetTuples()` / `getTestSetTuples()`
+- Record text content is space-delimited after preprocessing
+
+### Persistence Migration
+**CRITICAL**: This project is migrating from pickle (`.p`) to JSON format:
+- Always use `lib.persist.save_json()` and `load_json()` for new code
+- `kNN.py` functions `storeObject()`/`loadObject()` handle format detection:
+  - Prefer `.json` extension explicitly
+  - Fall back to pickle only for legacy compatibility
+- Run `scripts/convert_pickles_to_json.py` to migrate old `.p` files
+- README explicitly states: "standardizes term-ranking persistence on JSON files"
+
+### Type Hints
+- Project uses Python 3.9+ type hints extensively (`List`, `Dict`, `Tuple`, `Optional`)
+- `mypy.ini` config: strict mode with `no_implicit_optional = True`, `warn_return_any = True`
+- Test code has `ignore_errors = True` in mypy config
+
+### Classification Labels
+Three-class problem with hardcoded labels:
+```python
+["SMOKER", "NON-SMOKER", "UNKNOWN"]
+```
+Never modify these strings - they match XML attribute values.
+
+### kNN Specific Patterns
+- **Association functions**: `chi-square` (default) or `dice` for term-to-class relevance
+- **Similarity functions**: `euclidean` (default), `manhattan`, `minkowski` for vector distance
+- **Sampling strategies**: `Krandom` (default) or `topK` for neighbor selection
+- Term rankings stored as nested dicts: `{classLabel: [(term, score), ...]}`
+
+### Naive Bayes Specific Patterns
+- **Smoothing**: Dirichlet (default, uses `mu` parameter) or Bayesian
+- `mu` defaults to the number of unique terms across all training documents
+- Uses log probabilities to avoid underflow: `math.log()` throughout
+
+## Developer Workflows
+
+### Environment Setup
+```bash
+py -3 -m pip install -r requirements.txt pre-commit
+pre-commit install
+```
+
+### Running Classifiers
+```bash
+# Naive Bayes
+python -m project3.naiveBayes -t path/to/training.txt -c path/to/test.txt -m 500 -s
+
+# kNN
+python -m kNN.kNN -t path/to/training.txt -r termRankings.json -a chi-square -K 5 -s euclidean
+```
+Both modules are executable via `if __name__ == "__main__"` blocks with argparse.
+
+### Testing
+```bash
+pytest                          # Run all tests
+pre-commit run --all-files      # Run black, mypy, flake8
+```
+- Tests are smoke tests (import validation), not functional tests
+- Fixture data in `tests/fixtures/sample_records.txt`
+
+### Code Quality
+- **black**: Auto-formatting (enforced in pre-commit)
+- **mypy**: Type checking with `--ignore-missing-imports` in pre-commit hook
+- **flake8**: Style linting
+- CI runs pre-commit checks on push/PR via GitHub Actions (`.github/workflows/pre-commit-ci.yml`)
+
+## Common Pitfalls
+
+1. **Persistence**: Don't use `pickle.dump()` directly - use `lib.persist` or `storeObject()` wrapper
+2. **Regex parsing**: XML parsing uses fragile regex patterns - test against `tests/fixtures/sample_records.txt` format
+3. **Feature vectors**: Multiple functions create term vectors with different schemas - check parameter signatures
+4. **Error handling**: Most functions silently catch exceptions and print messages - no structured logging
+5. **Windows paths**: Commented code shows Windows absolute paths (e.g., `C:\temp\datasets\`) - avoid hardcoding
+
+## Key Files for Reference
+- [README.md](../README.md) - CLI usage and XML schema
+- [CONTRIBUTING.md](../CONTRIBUTING.md) - Developer setup process
+- [mypy.ini](../mypy.ini) - Type checking configuration
+- [tests/fixtures/sample_records.txt](../tests/fixtures/sample_records.txt) - XML format examples
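Note on the data flow described in the instructions file above: a minimal sketch of the parse-and-preprocess step, assuming records follow the XML shape shown there. `RECORD_RE` and `parse_training_records` are hypothetical names introduced for illustration; the project's own `getTrainingSetTuples()` is not shown in this diff and may differ. Only the term pattern is taken verbatim from the instructions.

```python
import re
from typing import List, Tuple

# Hypothetical record pattern for illustration; the real project regex may differ.
RECORD_RE = re.compile(
    r'<RECORD ID="(?P<id>[^"]+)">\s*'
    r'<SMOKING STATUS="(?P<label>[^"]+)">\s*</SMOKING>\s*'
    r"<TEXT>(?P<text>.*?)</TEXT>\s*</RECORD>",
    re.DOTALL,
)

# Term pattern quoted in the instructions: it matches runs of characters that
# are NOT digits, whitespace, or the listed punctuation, so findall() returns
# the surviving terms.
TERM_RE = re.compile(r"[^0-9\.\-\s\/\:\,_\[\]\%]+")


def parse_training_records(raw: str) -> List[Tuple[str, str, str]]:
    """Return (record ID, label, space-delimited term string) 3-tuples."""
    records = []
    for match in RECORD_RE.finditer(raw):
        terms = TERM_RE.findall(match.group("text"))
        records.append((match.group("id"), match.group("label"), " ".join(terms)))
    return records


sample = """<RECORD ID="123">
  <SMOKING STATUS="SMOKER"></SMOKING>
  <TEXT>patient smokes 2 packs/day, b.p. 120-80</TEXT>
</RECORD>"""

print(parse_training_records(sample))
```

Because `/` and `.` are in the excluded set, `packs/day` and `b.p.` split into separate terms, which is worth keeping in mind when checking changes against the fixture file.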
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,114 @@
+name: CI
+
+on:
+  push:
+    branches: [ main, master, cleanup_and_standardization ]
+  pull_request:
+    branches: [ main, master ]
+
+jobs:
+  install-deps:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: [3.9]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Upgrade pip
+        run: pip install --upgrade pip
+      - name: Install system build deps
+        run: sudo apt-get update && sudo apt-get install -y build-essential libpq-dev
+      - name: Install Python deps
+        run: |
+          pip install -r requirements.txt
+      - name: Install dev tools
+        run: |
+          pip install flake8 black mypy
+
+  lint:
+    needs: install-deps
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: [3.9]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Upgrade pip
+        run: pip install --upgrade pip
+      - name: Install flake8
+        run: pip install flake8
+      - name: Lint (flake8)
+        run: flake8
+
+  tests:
+    needs: install-deps
+    runs-on: ubuntu-latest
+    env:
+      PYTHONPATH: ${{ github.workspace }}
+    strategy:
+      matrix:
+        python-version: [3.9]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Export PYTHONPATH for job
+        run: |
+          echo "PYTHONPATH=$(pwd)" >> $GITHUB_ENV
+      - name: Install Python deps
+        run: |
+          pip install -r requirements.txt
+      - name: Install package (editable)
+        run: |
+          pip install -e .
+      - name: Run tests
+        run: |
+          pytest -q
+
+  format:
+    needs: install-deps
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: [3.9]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Upgrade pip
+        run: pip install --upgrade pip
+      - name: Install black
+        run: pip install black
+      - name: Check formatting (black)
+        run: black --check .
+
+  types:
+    needs: install-deps
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: [3.9]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Upgrade pip
+        run: pip install --upgrade pip
+      - name: Install mypy
+        run: pip install mypy
+      - name: Run mypy
+        run: mypy .
diff --git a/.github/workflows/pre-commit-ci.yml b/.github/workflows/pre-commit-ci.yml
@@ -0,0 +1,24 @@
+name: pre-commit checks
+
+on:
+  push:
+    branches: [ main, master ]
+  pull_request:
+    branches: [ main, master ]
+
+jobs:
+  pre-commit:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.9'
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements.txt pre-commit
+      - name: Run pre-commit
+        run: |
+          pre-commit run --all-files
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,26 @@
+repos:
+  - repo: https://github.com/psf/black
+    rev: 25.11.0
+    hooks:
+      - id: black
+        language_version: python3
+
+  - repo: local
+    hooks:
+      - id: mypy
+        name: mypy (local)
+        entry: python -m mypy
+        language: system
+        types: [python]
+        args: ["--ignore-missing-imports"]
+
+  - repo: local
+    hooks:
+      - id: flake8
+        name: flake8 (local)
+        entry: python -m flake8
+        language: system
+        types: [python]
+        args: ["--max-line-length=160", "--extend-ignore=E203,E265,F841,E712"]
+
+
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,28 @@
+Developer setup
+----------------
+
+Run the following to install developer tooling and enable pre-commit hooks locally:
+
+1. Install dev tools (recommended to use a virtualenv):
+
+```bash
+py -3 -m pip install --upgrade pip
+py -3 -m pip install -r requirements.txt pre-commit
+```
+
+2. Install the pre-commit hooks for this repository:
+
+```bash
+pre-commit install
+```
+
+3. To run hooks against all files (useful for CI or before your first commit):
+
+```bash
+pre-commit run --all-files
+```
+
+Notes:
+- `black` enforces consistent formatting.
+- `mypy` performs static type checks; the pre-commit mypy hook is configured with `--ignore-missing-imports`.
+- `flake8` enforces basic style rules.
diff --git a/README.md b/README.md
@@ -54,8 +54,8 @@ optional arguments:
                         Path to the labeled medical record training set file,
                         e.g. ./path/to/training.txt
   -r TERMRANKINGS, --termrankings TERMRANKINGS
-                        Path to term rankings pickle file, e.g.
-                        ./path/to/termRankings.p
+                        Path to term rankings JSON file, e.g.
+                        ./path/to/termRankings.json
   -a ASSOCFUNC, --associationFunction ASSOCFUNC
                         The association function used to compare the relevancy
                         of a term to a specific class label [default=chi-
@@ -74,3 +74,8 @@ optional arguments:
                         distance similarity.
 
 ```
+
+Note: This repository standardizes term-ranking persistence on JSON files (e.g. `termRankings.json`).
+If you previously used legacy pickle `.p` files, convert them to JSON. The code prefers JSON and will
+fall back to legacy formats only when explicitly required; in this cleanup branch we target JSON.
+
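Note on the pickle-to-JSON conversion mentioned in the README change above: a minimal sketch of what such a conversion could look like, assuming term rankings have the `{classLabel: [(term, score), ...]}` shape described in the instructions file. `convert_pickle_to_json` and `load_term_rankings` are hypothetical helpers, not the contents of `scripts/convert_pickles_to_json.py` or `lib/persist.py`, which this diff does not show.

```python
import json
import pickle
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple

Rankings = Dict[str, List[Tuple[str, float]]]


def convert_pickle_to_json(src: Path, dst: Path) -> None:
    """Hypothetical helper: rewrite a legacy pickle file as JSON."""
    # Only unpickle files you produced yourself; pickle can run code on load.
    with src.open("rb") as fh:
        data = pickle.load(fh)
    with dst.open("w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)


def load_term_rankings(path: Path) -> Rankings:
    """Hypothetical helper: load rankings and restore (term, score) tuples."""
    with path.open(encoding="utf-8") as fh:
        raw = json.load(fh)
    # JSON has no tuple type, so pairs come back as two-element lists.
    return {label: [(term, score) for term, score in pairs] for label, pairs in raw.items()}


rankings: Rankings = {"SMOKER": [("tobacco", 12.5), ("cigarette", 9.0)]}
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "termRankings.p"
    target = Path(tmp) / "termRankings.json"
    legacy.write_bytes(pickle.dumps(rankings))
    convert_pickle_to_json(legacy, target)
    restored = load_term_rankings(target)

print(restored == rankings)
```

The tuple restoration is the design point here: code that compares or unpacks `(term, score)` pairs may behave differently if it receives lists after the migration.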
diff --git a/kNN/__init__.py b/kNN/__init__.py
@@ -0,0 +1,3 @@
+"""kNN package init"""
+
+from . import kNN  # noqa: F401 - exported for convenience