Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .flake8
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
[flake8]
max-line-length = 120
extend-ignore = E203, E265, F841, E712
111 changes: 111 additions & 0 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
# AI Coding Agent Instructions for cs578-project3

## Project Overview
This is a medical record classification system that predicts patient smoking status (SMOKER, NON-SMOKER, UNKNOWN) from scrubbed medical text using Naive Bayes and K-Nearest Neighbor (kNN) algorithms. The project is structured as a CS578 machine learning course assignment with strict XML input format requirements.

## Architecture

### Module Structure
- **`project3/naiveBayes.py`**: Naive Bayes classifier with Bayesian/Dirichlet smoothing (615 lines)
- **`kNN/kNN.py`**: kNN classifier with multiple similarity/association functions (1245 lines)
- **`lib/persist.py`**: JSON persistence utilities (replaces legacy pickle format)
- **`tests/`**: Pytest-based smoke tests and fixture data
- **`scripts/convert_pickles_to_json.py`**: Migration tool for legacy `.p` files

### Data Flow
1. XML records parsed via regex into 3-tuples: `(Record ID, Label, Term String)`
2. Text preprocessing: strip numbers/punctuation using regex `[^0-9\.\-\s\/\:\,_\[\]\%]+`
3. Feature extraction: term frequencies, chi-square/dice coefficient for term ranking
4. Classification: Naive Bayes uses term probability; kNN uses vector similarity
5. Results written to output files (not test evaluated)

## Critical Patterns

### XML Input Format
```xml
<RECORD ID="123">
<SMOKING STATUS="SMOKER"></SMOKING>
<TEXT>patient medical notes...</TEXT>
</RECORD>
```
- Training data includes `<SMOKING STATUS>`, test data omits it
- Parse with regex patterns in `getTrainingSetTuples()` / `getTestSetTuples()`
- Record text content is space-delimited after preprocessing

### Persistence Migration
**CRITICAL**: This project is migrating from pickle (`.p`) to JSON format:
- Always use `lib.persist.save_json()` and `load_json()` for new code
- `kNN.py` functions `storeObject()`/`loadObject()` handle format detection:
- Prefer `.json` extension explicitly
- Fall back to pickle only for legacy compatibility
- Run `scripts/convert_pickles_to_json.py` to migrate old `.p` files
- README explicitly states: "standardizes term-ranking persistence on JSON files"

### Type Hints
- Project uses Python 3.9+ type hints extensively (`List`, `Dict`, `Tuple`, `Optional`)
- `mypy.ini` config: strict mode with `no_implicit_optional = True`, `warn_return_any = True`
- Test code has `ignore_errors = True` in mypy config

### Classification Labels
Three-class problem with hardcoded labels:
```python
["SMOKER", "NON-SMOKER", "UNKNOWN"]
```
Never modify these strings - they match XML attribute values.

### kNN Specific Patterns
- **Association functions**: `chi-square` (default) or `dice` for term-to-class relevance
- **Similarity functions**: `euclidean` (default), `manhattan`, `minkowski` for vector distance
- **Sampling strategies**: `Krandom` (default) or `topK` for neighbor selection
- Term rankings stored as nested dicts: `{classLabel: [(term, score), ...]}`

### Naive Bayes Specific Patterns
- **Smoothing**: Dirichlet (default, uses `mu` parameter) or Bayesian
- `mu` defaults to length of unique terms across all training documents
- Uses log probabilities to avoid underflow: `math.log()` throughout

## Developer Workflows

### Environment Setup
```bash
py -3 -m pip install -r requirements.txt pre-commit
pre-commit install
```

### Running Classifiers
```bash
# Naive Bayes
python -m project3.naiveBayes -t path/to/training.txt -c path/to/test.txt -m 500 -s

# kNN
python -m kNN.kNN -t path/to/training.txt -r termRankings.json -a chi-square -K 5 -s euclidean
```
Both modules are executable via `if __name__ == "__main__"` blocks with argparse.

### Testing
```bash
pytest # Run all tests
pre-commit run --all-files # Run black, mypy, flake8
```
- Tests are smoke tests (import validation), not functional tests
- Fixture data in `tests/fixtures/sample_records.txt`

### Code Quality
- **black**: Auto-formatting (enforced in pre-commit)
- **mypy**: Type checking with `--ignore-missing-imports` in pre-commit hook
- **flake8**: Style linting
- CI runs pre-commit checks on push/PR via GitHub Actions (`.github/workflows/pre-commit-ci.yml`)

## Common Pitfalls

1. **Persistence**: Don't use `pickle.dump()` directly - use `lib.persist` or `storeObject()` wrapper
2. **Regex parsing**: XML parsing uses fragile regex patterns - test against `tests/fixtures/sample_records.txt` format
3. **Feature vectors**: Multiple functions create term vectors with different schemas - check parameter signatures
4. **Error handling**: Most functions silently catch exceptions and print messages - no structured logging
5. **Windows paths**: Commented code shows Windows absolute paths (e.g., `C:\temp\datasets\`) - avoid hardcoding

## Key Files for Reference
- [README.md](../README.md) - CLI usage and XML schema
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Developer setup process
- [mypy.ini](../mypy.ini) - Type checking configuration
- [tests/fixtures/sample_records.txt](../tests/fixtures/sample_records.txt) - XML format examples
114 changes: 114 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,114 @@
name: CI

on:
push:
branches: [ main, master, cleanup_and_standardization ]
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The branch cleanup_and_standardization appears to be a feature branch and should not be in the CI trigger list for pushes. Remove it to avoid running CI on every push to this temporary branch.

Suggested change
branches: [ main, master, cleanup_and_standardization ]
branches: [ main, master ]

Copilot uses AI. Check for mistakes.
pull_request:
branches: [ main, master ]

jobs:
install-deps:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.9]
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Upgrade pip
run: pip install --upgrade pip
- name: Install system build deps
run: sudo apt-get update && sudo apt-get install -y build-essential libpq-dev
- name: Install Python deps
run: |
pip install -r requirements.txt
- name: Install dev tools
run: |
pip install flake8 black mypy

lint:
needs: install-deps
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.9]
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Upgrade pip
run: pip install --upgrade pip
- name: Install flake8
run: pip install flake8
- name: Lint (flake8)
run: flake8

tests:
needs: install-deps
runs-on: ubuntu-latest
env:
PYTHONPATH: $GITHUB_WORKSPACE
strategy:
matrix:
python-version: [3.9]
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Export PYTHONPATH for job
run: |
echo "PYTHONPATH=$(pwd)" >> $GITHUB_ENV
- name: Install Python deps
run: |
pip install -r requirements.txt
- name: Install package (editable)
run: |
pip install -e .
- name: Run tests
run: |
pytest -q

format:
needs: install-deps
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.9]
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Upgrade pip
run: pip install --upgrade pip
- name: Install black
run: pip install black
- name: Check formatting (black)
run: black --check .

types:
needs: install-deps
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.9]
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Upgrade pip
run: pip install --upgrade pip
- name: Install mypy
run: pip install mypy
- name: Run mypy
run: mypy .
24 changes: 24 additions & 0 deletions .github/workflows/pre-commit-ci.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
name: pre-commit checks

on:
push:
branches: [ main, master ]
pull_request:
branches: [ main, master ]

jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.9'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt pre-commit
- name: Run pre-commit
run: |
pre-commit run --all-files
26 changes: 26 additions & 0 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
repos:
- repo: https://github.com/psf/black
rev: 25.11.0
hooks:
- id: black
language_version: python3

- repo: local
hooks:
- id: mypy
name: mypy (local)
entry: python -m mypy
language: system
types: [python]
args: ["--ignore-missing-imports"]

- repo: local
hooks:
- id: flake8
name: flake8 (local)
entry: python -m flake8
language: system
types: [python]
args: ["--max-line-length=160", "--extend-ignore=E203,E265,F841,E712"]


28 changes: 28 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
Developer setup
----------------

Run the following to install developer tooling and enable pre-commit hooks locally:

1. Install dev tools (recommended to use a virtualenv):

```bash
py -3 -m pip install --upgrade pip
py -3 -m pip install -r requirements.txt pre-commit
```

2. Install the pre-commit hooks for this repository:

```bash
pre-commit install
```

3. To run hooks against all files (useful for CI or before your first commit):

```bash
pre-commit run --all-files
```

Notes:
- `black` enforces consistent formatting.
- `mypy` performs static type checks; the pre-commit mypy hook is configured with `--ignore-missing-imports`.
- `flake8` enforces basic style rules.
9 changes: 7 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,8 +54,8 @@ optional arguments:
Path to the labeled medical record training set file,
e.g. ./path/to/training.txt
-r TERMRANKINGS, --termrankings TERMRANKINGS
Path to term rankings pickle file, e.g.
./path/to/termRankings.p
Path to term rankings JSON file, e.g.
./path/to/termRankings.json
-a ASSOCFUNC, --associationFunction ASSOCFUNC
The association function used to compare the relevancy
of a term to a specific class label [default=chi-
Expand All @@ -74,3 +74,8 @@ optional arguments:
distance similarity.

```

Note: This repository standardizes term-ranking persistence on JSON files (e.g. `termRankings.json`).
If you previously used legacy pickle `.p` files, convert them to JSON. The code prefers JSON and will
fall back to legacy formats only when explicitly required; in this cleanup branch we target JSON.

3 changes: 3 additions & 0 deletions kNN/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
"""kNN package init"""

from . import kNN # noqa: F401 - exported for convenience
Loading