-
Notifications
You must be signed in to change notification settings - Fork 0
Cleanup and standardization #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
kingrichard2005
wants to merge
7
commits into
master
Choose a base branch
from
cleanup_and_standardization
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
024dea6
Add requirements, conversion script, and tests for kNN and Naive Bayes
kingrichard2005 f7cf675
Refactor kNN and Naive Bayes code for type hinting and error handling…
kingrichard2005 322d607
added initial copilot-instructions
kingrichard2005 cdd25d0
Enhance CI configuration, improve Python dependency management, and r…
kingrichard2005 0702b00
updates from collab pr feedback
kingrichard2005 87fb7be
fixed format errors from ci pipeline feedback
kingrichard2005 675b9a9
more updates from collab pr feedback
kingrichard2005 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| [flake8] | ||
| max-line-length = 120 | ||
| extend-ignore = E203, E265, F841, E712 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,111 @@ | ||
| # AI Coding Agent Instructions for cs578-project3 | ||
|
|
||
| ## Project Overview | ||
| This is a medical record classification system that predicts patient smoking status (SMOKER, NON-SMOKER, UNKNOWN) from scrubbed medical text using Naive Bayes and K-Nearest Neighbor (kNN) algorithms. The project is structured as a CS578 machine learning course assignment with strict XML input format requirements. | ||
|
|
||
| ## Architecture | ||
|
|
||
| ### Module Structure | ||
| - **`project3/naiveBayes.py`**: Naive Bayes classifier with Bayesian/Dirichlet smoothing (615 lines) | ||
| - **`kNN/kNN.py`**: kNN classifier with multiple similarity/association functions (1245 lines) | ||
| - **`lib/persist.py`**: JSON persistence utilities (replaces legacy pickle format) | ||
| - **`tests/`**: Pytest-based smoke tests and fixture data | ||
| - **`scripts/convert_pickles_to_json.py`**: Migration tool for legacy `.p` files | ||
|
|
||
| ### Data Flow | ||
| 1. XML records parsed via regex into 3-tuples: `(Record ID, Label, Term String)` | ||
| 2. Text preprocessing: strip numbers/punctuation using regex `[^0-9\.\-\s\/\:\,_\[\]\%]+` | ||
| 3. Feature extraction: term frequencies, chi-square/dice coefficient for term ranking | ||
| 4. Classification: Naive Bayes uses term probability; kNN uses vector similarity | ||
| 5. Results written to output files (not test evaluated) | ||
|
|
||
| ## Critical Patterns | ||
|
|
||
| ### XML Input Format | ||
| ```xml | ||
| <RECORD ID="123"> | ||
| <SMOKING STATUS="SMOKER"></SMOKING> | ||
| <TEXT>patient medical notes...</TEXT> | ||
| </RECORD> | ||
| ``` | ||
| - Training data includes `<SMOKING STATUS>`, test data omits it | ||
| - Parse with regex patterns in `getTrainingSetTuples()` / `getTestSetTuples()` | ||
| - Record text content is space-delimited after preprocessing | ||
|
|
||
| ### Persistence Migration | ||
| **CRITICAL**: This project is migrating from pickle (`.p`) to JSON format: | ||
| - Always use `lib.persist.save_json()` and `load_json()` for new code | ||
| - `kNN.py` functions `storeObject()`/`loadObject()` handle format detection: | ||
| - Prefer `.json` extension explicitly | ||
| - Fall back to pickle only for legacy compatibility | ||
| - Run `scripts/convert_pickles_to_json.py` to migrate old `.p` files | ||
| - README explicitly states: "standardizes term-ranking persistence on JSON files" | ||
|
|
||
| ### Type Hints | ||
| - Project uses Python 3.9+ type hints extensively (`List`, `Dict`, `Tuple`, `Optional`) | ||
| - `mypy.ini` config: strict mode with `no_implicit_optional = True`, `warn_return_any = True` | ||
| - Test code has `ignore_errors = True` in mypy config | ||
|
|
||
| ### Classification Labels | ||
| Three-class problem with hardcoded labels: | ||
| ```python | ||
| ["SMOKER", "NON-SMOKER", "UNKNOWN"] | ||
| ``` | ||
| Never modify these strings - they match XML attribute values. | ||
|
|
||
| ### kNN Specific Patterns | ||
| - **Association functions**: `chi-square` (default) or `dice` for term-to-class relevance | ||
| - **Similarity functions**: `euclidean` (default), `manhattan`, `minkowski` for vector distance | ||
| - **Sampling strategies**: `Krandom` (default) or `topK` for neighbor selection | ||
| - Term rankings stored as nested dicts: `{classLabel: [(term, score), ...]}` | ||
|
|
||
| ### Naive Bayes Specific Patterns | ||
| - **Smoothing**: Dirichlet (default, uses `mu` parameter) or Bayesian | ||
| - `mu` defaults to length of unique terms across all training documents | ||
| - Uses log probabilities to avoid underflow: `math.log()` throughout | ||
|
|
||
| ## Developer Workflows | ||
|
|
||
| ### Environment Setup | ||
| ```bash | ||
| py -3 -m pip install -r requirements.txt pre-commit | ||
| pre-commit install | ||
| ``` | ||
|
|
||
| ### Running Classifiers | ||
| ```bash | ||
| # Naive Bayes | ||
| python -m project3.naiveBayes -t path/to/training.txt -c path/to/test.txt -m 500 -s | ||
|
|
||
| # kNN | ||
| python -m kNN.kNN -t path/to/training.txt -r termRankings.json -a chi-square -K 5 -s euclidean | ||
| ``` | ||
| Both modules are executable via `if __name__ == "__main__"` blocks with argparse. | ||
|
|
||
| ### Testing | ||
| ```bash | ||
| pytest # Run all tests | ||
| pre-commit run --all-files # Run black, mypy, flake8 | ||
| ``` | ||
| - Tests are smoke tests (import validation), not functional tests | ||
| - Fixture data in `tests/fixtures/sample_records.txt` | ||
|
|
||
| ### Code Quality | ||
| - **black**: Auto-formatting (enforced in pre-commit) | ||
| - **mypy**: Type checking with `--ignore-missing-imports` in pre-commit hook | ||
| - **flake8**: Style linting | ||
| - CI runs pre-commit checks on push/PR via GitHub Actions (`.github/workflows/pre-commit-ci.yml`) | ||
|
|
||
| ## Common Pitfalls | ||
|
|
||
| 1. **Persistence**: Don't use `pickle.dump()` directly - use `lib.persist` or `storeObject()` wrapper | ||
| 2. **Regex parsing**: XML parsing uses fragile regex patterns - test against `tests/fixtures/sample_records.txt` format | ||
| 3. **Feature vectors**: Multiple functions create term vectors with different schemas - check parameter signatures | ||
| 4. **Error handling**: Most functions silently catch exceptions and print messages - no structured logging | ||
| 5. **Windows paths**: Commented code shows Windows absolute paths (e.g., `C:\temp\datasets\`) - avoid hardcoding | ||
|
|
||
| ## Key Files for Reference | ||
| - [README.md](../README.md) - CLI usage and XML schema | ||
| - [CONTRIBUTING.md](../CONTRIBUTING.md) - Developer setup process | ||
| - [mypy.ini](../mypy.ini) - Type checking configuration | ||
| - [tests/fixtures/sample_records.txt](../tests/fixtures/sample_records.txt) - XML format examples |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,114 @@ | ||
| name: CI | ||
|
|
||
| on: | ||
| push: | ||
| branches: [ main, master, cleanup_and_standardization ] | ||
| pull_request: | ||
| branches: [ main, master ] | ||
|
|
||
| jobs: | ||
| install-deps: | ||
| runs-on: ubuntu-latest | ||
| strategy: | ||
| matrix: | ||
| python-version: [3.9] | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| - name: Set up Python | ||
| uses: actions/setup-python@v4 | ||
| with: | ||
| python-version: ${{ matrix.python-version }} | ||
| - name: Upgrade pip | ||
| run: pip install --upgrade pip | ||
| - name: Install system build deps | ||
| run: sudo apt-get update && sudo apt-get install -y build-essential libpq-dev | ||
| - name: Install Python deps | ||
| run: | | ||
| pip install -r requirements.txt | ||
| - name: Install dev tools | ||
| run: | | ||
| pip install flake8 black mypy | ||
|
|
||
| lint: | ||
| needs: install-deps | ||
| runs-on: ubuntu-latest | ||
| strategy: | ||
| matrix: | ||
| python-version: [3.9] | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| - name: Set up Python | ||
| uses: actions/setup-python@v4 | ||
| with: | ||
| python-version: ${{ matrix.python-version }} | ||
| - name: Upgrade pip | ||
| run: pip install --upgrade pip | ||
| - name: Install flake8 | ||
| run: pip install flake8 | ||
| - name: Lint (flake8) | ||
| run: flake8 | ||
|
|
||
| tests: | ||
| needs: install-deps | ||
| runs-on: ubuntu-latest | ||
| env: | ||
| PYTHONPATH: $GITHUB_WORKSPACE | ||
| strategy: | ||
| matrix: | ||
| python-version: [3.9] | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| - name: Set up Python | ||
| uses: actions/setup-python@v4 | ||
| with: | ||
| python-version: ${{ matrix.python-version }} | ||
| - name: Export PYTHONPATH for job | ||
| run: | | ||
| echo "PYTHONPATH=$(pwd)" >> $GITHUB_ENV | ||
| - name: Install Python deps | ||
| run: | | ||
| pip install -r requirements.txt | ||
| - name: Install package (editable) | ||
| run: | | ||
| pip install -e . | ||
| - name: Run tests | ||
| run: | | ||
| pytest -q | ||
|
|
||
| format: | ||
| needs: install-deps | ||
| runs-on: ubuntu-latest | ||
| strategy: | ||
| matrix: | ||
| python-version: [3.9] | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| - name: Set up Python | ||
| uses: actions/setup-python@v4 | ||
| with: | ||
| python-version: ${{ matrix.python-version }} | ||
| - name: Upgrade pip | ||
| run: pip install --upgrade pip | ||
| - name: Install black | ||
| run: pip install black | ||
| - name: Check formatting (black) | ||
| run: black --check . | ||
|
|
||
| types: | ||
| needs: install-deps | ||
| runs-on: ubuntu-latest | ||
| strategy: | ||
| matrix: | ||
| python-version: [3.9] | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| - name: Set up Python | ||
| uses: actions/setup-python@v4 | ||
| with: | ||
| python-version: ${{ matrix.python-version }} | ||
| - name: Upgrade pip | ||
| run: pip install --upgrade pip | ||
| - name: Install mypy | ||
| run: pip install mypy | ||
| - name: Run mypy | ||
| run: mypy . | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,24 @@ | ||
| name: pre-commit checks | ||
|
|
||
| on: | ||
| push: | ||
| branches: [ main, master ] | ||
| pull_request: | ||
| branches: [ main, master ] | ||
|
|
||
| jobs: | ||
| pre-commit: | ||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| - name: Set up Python | ||
| uses: actions/setup-python@v4 | ||
| with: | ||
| python-version: '3.9' | ||
| - name: Install dependencies | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip install -r requirements.txt pre-commit | ||
| - name: Run pre-commit | ||
| run: | | ||
| pre-commit run --all-files |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| repos: | ||
| - repo: https://github.com/psf/black | ||
| rev: 25.11.0 | ||
| hooks: | ||
| - id: black | ||
| language_version: python3 | ||
|
|
||
| - repo: local | ||
| hooks: | ||
| - id: mypy | ||
| name: mypy (local) | ||
| entry: python -m mypy | ||
| language: system | ||
| types: [python] | ||
| args: ["--ignore-missing-imports"] | ||
|
|
||
| - repo: local | ||
| hooks: | ||
| - id: flake8 | ||
| name: flake8 (local) | ||
| entry: python -m flake8 | ||
| language: system | ||
| types: [python] | ||
| args: ["--max-line-length=160", "--extend-ignore=E203,E265,F841,E712"] | ||
|
|
||
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| Developer setup | ||
| ---------------- | ||
|
|
||
| Run the following to install developer tooling and enable pre-commit hooks locally: | ||
|
|
||
| 1. Install dev tools (recommended to use a virtualenv): | ||
|
|
||
| ```bash | ||
| py -3 -m pip install --upgrade pip | ||
| py -3 -m pip install -r requirements.txt pre-commit | ||
| ``` | ||
|
|
||
| 2. Install the pre-commit hooks for this repository: | ||
|
|
||
| ```bash | ||
| pre-commit install | ||
| ``` | ||
|
|
||
| 3. To run hooks against all files (useful for CI or before your first commit): | ||
|
|
||
| ```bash | ||
| pre-commit run --all-files | ||
| ``` | ||
|
|
||
| Notes: | ||
| - `black` enforces consistent formatting. | ||
| - `mypy` performs static type checks; the pre-commit mypy hook is configured with `--ignore-missing-imports`. | ||
| - `flake8` enforces basic style rules. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| """kNN package init""" | ||
|
|
||
| from . import kNN # noqa: F401 - exported for convenience |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The branch
cleanup_and_standardizationappears to be a feature branch and should not be in the CI trigger list for pushes. Remove it to avoid running CI on every push to this temporary branch.