Merged
47 commits
c5ea6ab
Create pyproject.toml
joewandy Jun 4, 2025
19f6737
Switch to MIT license
joewandy Jun 4, 2025
ccee63f
Merge pull request #35 from joewandy/codex/update-pyproject-for-hlda
joewandy Jun 4, 2025
4be3909
Update pyproject.toml
joewandy Jun 4, 2025
7a36a33
Update poetry.lock
joewandy Jun 4, 2025
e32ae93
Merge pull request #37 from joewandy/codex/update-poetry-lock-file
joewandy Jun 4, 2025
4e1ad54
Add invariants test for tree properties
joewandy Jun 4, 2025
6874c3a
Merge pull request #38 from joewandy/codex/fix-bugs-in-ncrpnode.drop_…
joewandy Jun 4, 2025
11c1a1b
Make BBC demo deterministic and add regression test
joewandy Jun 4, 2025
b290af4
Merge pull request #39 from joewandy/codex/create-command-line-python…
joewandy Jun 4, 2025
0e448f0
Add GitHub CI workflow for tests
joewandy Jun 4, 2025
702402f
Merge pull request #40 from joewandy/codex/add-github-ci-for-unit-tests
joewandy Jun 4, 2025
f835a25
Fix tests for src layout
joewandy Jun 4, 2025
03ce453
Merge pull request #41 from joewandy/codex/refactor-project-structure…
joewandy Jun 4, 2025
95c72ac
Expand project README
joewandy Jun 4, 2025
e5c4241
Merge pull request #42 from joewandy/codex/update-readme-for-clarity
joewandy Jun 4, 2025
a2b19dc
Expose version and main class
joewandy Jun 4, 2025
6e9853f
Merge pull request #43 from joewandy/codex/create-__version__-constan…
joewandy Jun 4, 2025
607f90e
Add docstrings for sampler and hLDA class
joewandy Jun 4, 2025
b0c506d
Merge pull request #44 from joewandy/codex/add-module-and-class-docst…
joewandy Jun 4, 2025
f8f059d
Add dataset citation info
joewandy Jun 4, 2025
08531df
Merge pull request #45 from joewandy/codex/add-citation-for-bbc-dataset
joewandy Jun 4, 2025
11913ff
Add pre-commit config and update README
joewandy Jun 4, 2025
f28e2b0
Merge pull request #46 from joewandy/codex/add-.pre-commit-config.yam…
joewandy Jun 4, 2025
3836770
Clarify loop comments and wrap lines
joewandy Jun 4, 2025
4121431
Use text mode when loading vocab and corpus
joewandy Jun 4, 2025
e4cae99
Add tests for run_hlda utility functions
joewandy Jun 4, 2025
42a4a28
Merge pull request #47 from joewandy/codex/update-comments-and-ensure…
joewandy Jun 4, 2025
63340cf
Merge pull request #48 from joewandy/codex/update-file-opening-in-loa…
joewandy Jun 4, 2025
3061629
Merge pull request #50 from joewandy/codex/create-tests-for-load_docu…
joewandy Jun 4, 2025
e925a87
Add CLI entry point and document demo command
joewandy Jun 4, 2025
a9334aa
Merge pull request #52 from joewandy/06r27l-codex/add-poetry-script-m…
joewandy Jun 4, 2025
01088d0
Fix flake8 issues and wrap long lines
joewandy Jun 4, 2025
34f35cc
Merge pull request #53 from joewandy/codex/fix-code-style-issues
joewandy Jun 4, 2025
a0d3ba1
Add tests for vocabulary and corpus CSV loaders
joewandy Jun 5, 2025
679f1c0
Allow word column names in synthetic generator
joewandy Jun 5, 2025
5d79c6d
Add tree export feature
joewandy Jun 5, 2025
d4f4b33
Merge pull request #55 from joewandy/codex/investigate-integer-column…
joewandy Jun 5, 2025
7bfc86c
Merge pull request #54 from joewandy/codex/add-tests-for-csv-fixtures…
joewandy Jun 5, 2025
ad9c3c1
Merge pull request #56 from joewandy/codex/implement-tree-export-meth…
joewandy Jun 5, 2025
c73ca86
Add sklearn estimator wrapper
joewandy Jun 5, 2025
381fe90
Merge pull request #57 from joewandy/codex/create-sklearn_wrapper-mod…
joewandy Jun 5, 2025
cf67ddc
Replace notebooks with example scripts
joewandy Jun 5, 2025
06b51b2
Address feedback
joewandy Jun 5, 2025
db38380
Merge pull request #58 from joewandy/codex/replace-notebooks-with-exa…
joewandy Jun 5, 2025
97dd66b
Update lock file
joewandy Jun 5, 2025
946931e
Merge pull request #59 from joewandy/codex/update-poetry-lock-file
joewandy Jun 5, 2025
21 changes: 21 additions & 0 deletions .github/workflows/python-package.yml
@@ -0,0 +1,21 @@
name: Python Package using Poetry

on: [push]

jobs:
  build-linux:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install Poetry
        run: pip install poetry==2.1.3
      - name: Install dependencies
        run: poetry install --no-interaction --no-root
      - name: Lint with flake8
        run: poetry run flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics && poetry run flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
      - name: Test with pytest
        run: poetry run pytest
16 changes: 16 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,16 @@
repos:
  - repo: https://github.com/psf/black
    rev: 25.1.0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/flake8
    rev: 7.2.0
    hooks:
      - id: flake8
  - repo: local
    hooks:
      - id: pytest
        name: pytest
        entry: pytest
        language: system
        pass_filenames: false
15 changes: 15 additions & 0 deletions CITATION.cff
@@ -0,0 +1,15 @@
cff-version: 1.2.0
message: >
  If you use the sample BBC dataset included with this project,
  please cite the following publication:
  Greene, D., and Cunningham, P. (2006). Practical Solutions to the Problem of
  Diagonal Dominance in Kernel Document Clustering. Proceedings of the
  23rd International Conference on Machine Learning.
title: "BBC News Dataset"
authors:
  - family-names: Greene
    given-names: Derek
  - family-names: Cunningham
    given-names: P.
version: "1.0"
date-released: 2004-01-01
693 changes: 19 additions & 674 deletions LICENSE.txt

Large diffs are not rendered by default.

16 changes: 0 additions & 16 deletions Pipfile

This file was deleted.

961 changes: 0 additions & 961 deletions Pipfile.lock

This file was deleted.

133 changes: 122 additions & 11 deletions README.md
@@ -5,22 +5,133 @@ Hierarchical Latent Dirichlet Allocation

---

Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic hierarchies from data. The model relies on a non-parametric prior called the nested Chinese restaurant process, which allows for arbitrarily large branching factors and readily accommodates growing
data collections. The hLDA model combines this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation.
Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic
hierarchies from data. The model relies on a non‑parametric prior called the nested
Chinese restaurant process, which allows for arbitrarily large branching factors and
easily accommodates growing data collections. The hLDA model combines this prior with a
likelihood based on a hierarchical variant of Latent Dirichlet Allocation.
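
The nested Chinese restaurant process described above can be illustrated with a small standalone sketch. It is not taken from this package: the `Node` class and `sample_path` function are hypothetical names, and the code only draws fixed-depth paths from the prior (no words, no Gibbs updates). At each level an existing child is chosen with probability proportional to the number of documents that already passed through it, and a new child with probability proportional to `gamma`.

```python
import random


class Node:
    """Illustrative tree node; `customers` counts documents routed through it."""

    def __init__(self):
        self.customers = 0
        self.children = []


def sample_path(root, num_levels, gamma, rng):
    """Draw one root-to-leaf path of fixed depth from the nCRP prior."""
    node = root
    node.customers += 1
    path = [node]
    for _ in range(num_levels - 1):
        # Existing children are weighted by their customer counts,
        # a brand-new child by the concentration parameter gamma.
        weights = [child.customers for child in node.children] + [gamma]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(node.children):
            node.children.append(Node())
        node = node.children[choice]
        node.customers += 1
        path.append(node)
    return path


rng = random.Random(0)
root = Node()
paths = [sample_path(root, num_levels=3, gamma=1.0, rng=rng) for _ in range(20)]
print(len(root.children), "branches under the root after 20 documents")
```

Larger values of `gamma` make new branches more likely, which is how the prior allows the branching factor to grow with the data.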

[Hierarchical Topic Models and the Nested Chinese Restaurant Process](http://www.cs.columbia.edu/~blei/papers/BleiGriffithsJordanTenenbaum2003.pdf)
The original papers describing the algorithm are:

[The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies](http://cocosci.berkeley.edu/tom/papers/ncrp.pdf)
- [Hierarchical Topic Models and the Nested Chinese Restaurant Process](http://www.cs.columbia.edu/~blei/papers/BleiGriffithsJordanTenenbaum2003.pdf)
- [The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies](http://cocosci.berkeley.edu/tom/papers/ncrp.pdf)

Implementation
--------------
## Overview

- [hlda/sampler.py](hlda/sampler.py) is the Gibbs sampler for hLDA inference, based on the implementation from [Mallet](http://mallet.cs.umass.edu/topics.php) having a fixed depth on the nCRP tree.
This repository contains a pure Python implementation of the Gibbs sampler for hLDA.
It is intended for experimentation and as a reference implementation. The code follows
the approach used in the original [Mallet](http://mallet.cs.umass.edu/topics.php)
implementation but with a simplified interface and a fixed depth for the tree.

Key features include:

Installation
------------
- **Python 3.11+** support with minimal third‑party dependencies.
- A small set of example scripts demonstrating how to run the sampler.
- Utilities for visualising the resulting topic hierarchy.
- Test suite for verifying the sampler on synthetic data and a small BBC corpus.

- Simply use `pip install hlda` to install the package.
- An example notebook that infers the hierarchical topics on the BBC Insight corpus can be found in [notebooks/bbc_test.ipynb](notebooks/bbc_test.ipynb).
## Installation

The package can be installed directly from PyPI:

```bash
pip install hlda
```

Alternatively, to develop locally, clone this repository and install it in editable mode:

```bash
git clone https://github.com/joewandy/hlda.git
cd hlda
pip install -e .
pre-commit install
```

## Usage

The easiest way to get started is by using the sample BBC dataset provided in the
`data/` directory. You can run the full demonstration from the command line:

```bash
python examples/bbc_demo.py --data-dir data/bbc/tech --iterations 20
```

If you installed the package from PyPI you can run the same demo via the
`hlda-run` command:

```bash
hlda-run --data-dir data/bbc/tech --iterations 20
```

To write the learned hierarchy to disk in JSON format, pass
`--export-tree <file>` to `scripts/run_hlda.py`:

```bash
python scripts/run_hlda.py --data-dir data/bbc/tech --export-tree tree.json
```
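
The layout of the exported JSON is not documented in this README, so the following sketch makes no assumption about key names. It walks any parsed JSON value and reports how many objects it contains and how deeply they nest, which can serve as a quick sanity check on an exported tree. The `summarise` helper and the sample string (including its `topic` and `children` keys) are purely illustrative and are not the package's actual schema.

```python
import json


def summarise(value, depth=0):
    """Return (number of JSON objects, maximum object nesting depth)."""
    if isinstance(value, dict):
        count, deepest = 1, depth + 1
        children, next_depth = value.values(), depth + 1
    elif isinstance(value, list):
        count, deepest = 0, depth
        children, next_depth = value, depth
    else:
        # Strings, numbers, booleans and null contribute nothing.
        return 0, depth
    for child in children:
        child_count, child_depth = summarise(child, next_depth)
        count += child_count
        deepest = max(deepest, child_depth)
    return count, deepest


# Hypothetical stand-in for the contents of an exported file.
sample = '{"topic": "root", "children": [{"topic": "a", "children": []}, {"topic": "b", "children": []}]}'
objects, max_depth = summarise(json.loads(sample))
print(objects, max_depth)  # → 3 2
```

For a real export, replace `json.loads(sample)` with `json.load(open("tree.json"))`.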

If you make use of the BBC dataset, please cite the publication by Greene and
Cunningham (2006) as detailed in [`CITATION.cff`](CITATION.cff).

Example scripts for the BBC dataset and synthetic data are available in the
[`examples/`](examples) directory.

Within Python you can also construct the sampler directly:

```python
from hlda.sampler import HierarchicalLDA

corpus = [["word", "word", ...], ...] # list of tokenised documents
vocab = sorted({w for doc in corpus for w in doc})

hlda = HierarchicalLDA(corpus, vocab, alpha=1.0, gamma=1.0, eta=0.1,
                       num_levels=3, seed=0)
hlda.estimate(iterations=50, display_topics=10)
```

### Integration with scikit-learn

The package provides a `HierarchicalLDAEstimator` that follows the scikit-learn API. This allows using the sampler inside a standard `Pipeline`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from hlda.sklearn_wrapper import HierarchicalLDAEstimator

vectorizer = CountVectorizer()
prep = FunctionTransformer(
    lambda X: (
        [[i for i, c in enumerate(row) for _ in range(int(c))] for row in X.toarray()],
        list(vectorizer.get_feature_names_out()),
    ),
    validate=False,
)

pipeline = Pipeline([
    ("vect", vectorizer),
    ("prep", prep),
    ("hlda", HierarchicalLDAEstimator(num_levels=3, iterations=10, seed=0)),
])

pipeline.fit(documents)
assignments = pipeline.transform(documents)
```
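
The lambda in the `prep` step above is dense, so here is a minimal standalone sketch of what its first element computes. It uses plain Python lists in place of the array returned by `X.toarray()`, and the helper name `counts_to_token_ids` is hypothetical (it is not part of the package). Each row of per-word counts is expanded into a list that repeats every word index as many times as the word occurs.

```python
def counts_to_token_ids(count_rows):
    """Expand each row of per-word counts into a flat list of word indices."""
    docs = []
    for row in count_rows:
        doc = []
        for word_id, count in enumerate(row):
            # A word that occurs `count` times contributes `count` copies of its index.
            doc.extend([word_id] * int(count))
        docs.append(doc)
    return docs


counts = [[2, 0, 1], [0, 3, 0]]
print(counts_to_token_ids(counts))  # → [[0, 0, 2], [1, 1, 1]]
```

Word order within a document is lost in this conversion, which is acceptable for a bag-of-words model.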


## Running the tests

The repository includes a small test suite that checks the sampler on both the BBC
corpus and synthetic data. After installing the development dependencies you can run:

```bash
pytest -q
```

All tests should pass in a few seconds.

## License

This project is licensed under the terms of the MIT license. See
[`LICENSE.txt`](LICENSE.txt) for details.

File renamed without changes.