
Commit 04fef44

chore: release 0.3.1 (CLI updater, docs, refreshed catalogs)
1 parent c03251c commit 04fef44

11 files changed: +2340 -14 lines changed

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -2,6 +2,12 @@
 
 All notable changes to this project will be documented here.
 
+## [0.3.1] - 2025-09-10
+- CLI: add `update-catalogs` subcommand (fetch exact or latest 2.x/3.x)
+- Script: `scripts/update_catalogs.py` robust TSV header detection
+- README: document catalog updater (script and CLI)
+- Data: refreshed catalogs (3.1 and 2.2)
+
 ## [0.3.0] - 2025-09-09
 - Vectors catalogs and output wiring
 - Configurable cattax flag
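
Review note: the 0.3.1 entry mentions "robust TSV header detection" without saying how it works. The following is a minimal sketch of one way such detection could be done, assuming the TSV exports may carry title or empty rows before the real header. The helper name `find_header_row` and the column names in `required` are hypothetical and are not taken from `scripts/update_catalogs.py`.

```python
import csv
import io

def find_header_row(tsv_text, required=("unique id", "name"), max_scan=10):
    """Return (index, header_cells) for the first row containing every required column.

    Hypothetical helper: the names in `required` are assumptions for
    illustration, not the columns used by scripts/update_catalogs.py.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for index, row in enumerate(reader):
        if index >= max_scan:
            break
        cells = [cell.strip().lower() for cell in row]
        if all(name in cells for name in required):
            return index, cells
    raise ValueError("no header row found in the first %d rows" % max_scan)

# Two junk rows (a title row and an empty row) precede the real header.
sample = "Content Taxonomy (example export)\t\t\n\t\t\nUnique ID\tParent\tName\n1\t\tAutomotive\n"
header_index, header = find_header_row(sample)
print(header_index, header)
```

Scanning only the first few rows keeps the check cheap and makes a malformed file fail early instead of being parsed with the wrong columns.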

README.md

Lines changed: 11 additions & 7 deletions
@@ -42,7 +42,11 @@ Outputs are **IAB‑3.0–compatible IDs** for OpenRTB/VAST, with optional **vec
 Use the bundled fetcher to sync to the latest Content Taxonomy files from the official IAB GitHub repository. It will locate the latest 2.x and 3.x datasets and normalize them into this tool’s schemas.
 
 ```bash
+# via Python script (direct)
 python scripts/update_catalogs.py
+
+# or via CLI command
+mixpeek-iab-mapper update-catalogs --exact3 "3.1" --exact2 "2.2"
 # Optional: use a GitHub token to raise rate limits
 # export GITHUB_TOKEN=ghp_...
 ```
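
Review note: the script defaults to the latest 2.x and 3.x datasets, while the CLI example pins versions with `--exact3` and `--exact2`. The sketch below shows one plausible selection rule, assuming versions can be parsed from file names. The file names, the `pick_catalog` helper, and the regular expression are illustrative assumptions, not code from `iab_mapper/updater.py`.

```python
import re

VERSION_RE = re.compile(r"(\d+)\.(\d+)")

def pick_catalog(filenames, major, exact=None):
    """Pick one file name: an exact substring match if given, else the highest minor for `major`."""
    if exact is not None:
        matches = [name for name in filenames if exact in name]
        if not matches:
            raise LookupError(f"no file name contains {exact!r}")
        return sorted(matches)[0]
    candidates = []
    for name in filenames:
        found = VERSION_RE.search(name)
        if found and int(found.group(1)) == major:
            # Compare minors as integers so that 3.10 would rank above 3.9.
            candidates.append((int(found.group(2)), name))
    if not candidates:
        raise LookupError(f"no {major}.x file found")
    return max(candidates)[1]

# Hypothetical file names for illustration only.
files = ["Content Taxonomy 2.1.tsv", "Content Taxonomy 2.2.tsv",
         "Content Taxonomy 3.0.tsv", "Content Taxonomy 3.1.tsv"]
latest3 = pick_catalog(files, major=3)
pinned2 = pick_catalog(files, major=2, exact="2.1")
print(latest3, pinned2)
```

Pinning by substring is simple but loose: a string such as "2.1" can also occur elsewhere in a file name, so a stricter match may be preferable in practice.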
@@ -312,13 +316,13 @@ Each value maps to a **stable IAB 3.0 ID** that is appended to the `cat` array.
 
 ## 📎 Official IAB References
 
-- Content Taxonomy 3.0 Implementation Guide (PDF): `https://iabtechlab.com/wp-content/uploads/2021/09/Implementation-Guide-Content-Taxonomy-3-0-pc-Sept2021.pdf`
-- IAB Tech Lab Content Taxonomy page: `https://iabtechlab.com/standards/content-taxonomy/`
+- Content Taxonomy 3.0 Implementation Guide (PDF): https://iabtechlab.com/wp-content/uploads/2021/09/Implementation-Guide-Content-Taxonomy-3-0-pc-Sept2021.pdf
+- IAB Tech Lab Content Taxonomy page: https://iabtechlab.com/standards/content-taxonomy/
 - Implementation guidance (historic mappings and migration notes):
-  - `https://github.com/InteractiveAdvertisingBureau/Taxonomies/blob/develop/implementation.md#content-21-to-ad-product-20-taxonomy-mapping-implementation-guidance`
-  - `https://github.com/InteractiveAdvertisingBureau/Taxonomies/blob/develop/Taxonomy%20Mappings/Ad%20Product%202.0%20to%20Content%202.1.tsv`
-  - `https://github.com/katieshell/Taxonomies/blob/main/implementation.md#implementation-guidance-for-content-1--content-2-mapping`
-  - `https://github.com/katieshell/Taxonomies/blob/main/implementation.md#migrating-from-content-taxonomy-10`
+  - https://github.com/InteractiveAdvertisingBureau/Taxonomies/blob/develop/implementation.md#content-21-to-ad-product-20-taxonomy-mapping-implementation-guidance
+  - https://github.com/InteractiveAdvertisingBureau/Taxonomies/blob/develop/Taxonomy%20Mappings/Ad%20Product%202.0%20to%20Content%202.1.tsv
+  - https://github.com/katieshell/Taxonomies/blob/main/implementation.md#implementation-guidance-for-content-1--content-2-mapping
+  - https://github.com/katieshell/Taxonomies/blob/main/implementation.md#migrating-from-content-taxonomy-10
 
 ---
 
@@ -364,7 +368,7 @@ mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json --use-embeddings --drop-sc
 ---
 
 ## 📜 License
-MIT. See `LICENSE`.
+MIT. See [LICENSE](LICENSE).
 
 Include IAB attribution in your deployed UI/footer:
 > “IAB is a registered trademark of the Interactive Advertising Bureau. This tool is an independent utility built by Mixpeek for interoperability with IAB Content Taxonomy standards.”

iab_mapper.egg-info/PKG-INFO

Lines changed: 182 additions & 4 deletions
@@ -2,8 +2,25 @@ Metadata-Version: 2.4
 Name: iab-mapper
 Version: 0.3.0
 Summary: Local IAB Content Taxonomy 2.x -> 3.0 mapper with vectors, SCD, OpenRTB/VAST exporters.
+Author: Mixpeek
+License: MIT
+Project-URL: Homepage, https://github.com/mixpeek/iab-mapper
+Project-URL: Repository, https://github.com/mixpeek/iab-mapper
+Project-URL: Issues, https://github.com/mixpeek/iab-mapper/issues
+Keywords: iab,taxonomy,content,openrtb,ctv,classification
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Operating System :: OS Independent
+Classifier: Intended Audience :: Developers
+Classifier: Topic :: Software Development :: Libraries
 Requires-Python: >=3.9
 Description-Content-Type: text/markdown
+License-File: LICENSE
 Requires-Dist: pandas>=2.1
 Requires-Dist: rapidfuzz>=3.0
 Requires-Dist: typer>=0.12
@@ -15,13 +32,66 @@ Requires-Dist: scikit-learn>=1.4
 Requires-Dist: requests>=2.31
 Provides-Extra: emb
 Requires-Dist: sentence-transformers>=3.0; extra == "emb"
+Dynamic: license-file
+
+<p align="center">
+<img src="assets/header.png" alt="IAB Taxonomy Mapper" width="900" />
+</p>
 
 # IAB Content Taxonomy Mapper (Local CLI)
 
-Map **IAB Content Taxonomy 2.x** labels/codes to **IAB 3.0** locally with a deterministic→fuzzy→(optional) local-embeddings pipeline.
-Outputs are **IAB-3.0–compatible IDs** suitable for OpenRTB/VAST, with optional **vector attributes** (Channel, Type, Format, Language, Source, Environment) and **SCD** awareness.
+<p align="center">
+<a href="https://pypi.org/project/iab-mapper/"><img alt="PyPI" src="https://img.shields.io/pypi/v/iab-mapper.svg"></a>
+<a href="https://github.com/mixpeek/iab-mapper/actions"><img alt="CI" src="https://github.com/mixpeek/iab-mapper/actions/workflows/ci.yml/badge.svg"></a>
+<a href="https://github.com/mixpeek/iab-mapper/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/license-MIT-blue.svg"></a>
+</p>
+
+Map **IAB Content Taxonomy 2.x** labels/codes to **IAB 3.0** locally with a deterministic → fuzzy → (optional) semantic pipeline.
+Outputs are **IAB‑3.0–compatible IDs** for OpenRTB/VAST, with optional **vector attributes** (Channel, Type, Format, Language, Source, Environment) and **SCD** awareness.
+
+> Local-first by default. No external APIs are required; LLM re‑rank is optional.
+
+---
+
+## 📚 Table of Contents
+
+- [✨ Features](#-features)
+- [Why migrate to IAB 3.0?](#-why-migrate-to-iab-30)
+- [How it works](#-how-it-works)
+- [🔧 Install](#-install)
+- [🚀 Quick Start](#-quick-start)
+- [🐍 Python API](#-python-api-alternative-to-cli)
+- [📥 Input Formats](#-input-formats)
+- [📤 Output Formats](#-output-formats)
+- [⚙️ Useful Flags](#️-useful-flags)
+- [🧩 Vectors](#-vectors-orthogonal-attributes)
+- [✅ IAB 3.0 Conformance Notes](#-iab-30-conformance-notes)
+- [📎 Official IAB References](#-official-iab-references)
+- [🧯 Troubleshooting](#-troubleshooting)
+- [📦 Example Commands](#-example-commands)
+- [📜 License](#-license)
+
+---
+
+### Update catalogs (fetch latest from IAB)
+
+Use the bundled fetcher to sync to the latest Content Taxonomy files from the official IAB GitHub repository. It will locate the latest 2.x and 3.x datasets and normalize them into this tool’s schemas.
+
+```bash
+# via Python script (direct)
+python scripts/update_catalogs.py
+
+# or via CLI command
+mixpeek-iab-mapper update-catalogs --exact3 "3.1" --exact2 "2.2"
+# Optional: use a GitHub token to raise rate limits
+# export GITHUB_TOKEN=ghp_...
+```
+
+Outputs:
+- `iab_mapper/data/iab_2x.json` → `[{"code","label"}]`
+- `iab_mapper/data/iab_3x.json` → `[{"id","label","path":[],"scd":bool}]`
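
Review note: the two output shapes above are easy to sanity-check after an update. This is a minimal sketch, assuming `code`, `label`, and `id` are strings, `path` is a list, and `scd` is a boolean, as the inline shape hints suggest. The `check_catalog` helper and the sample rows are hypothetical.

```python
import json

# Expected field types, inferred from the shape hints in the README text (an assumption).
SCHEMA_2X = {"code": str, "label": str}
SCHEMA_3X = {"id": str, "label": str, "path": list, "scd": bool}

def check_catalog(rows, schema):
    """Return a list of human-readable problems; an empty list means the rows fit the schema."""
    problems = []
    for position, row in enumerate(rows):
        for key, expected in schema.items():
            if key not in row:
                problems.append(f"row {position}: missing {key!r}")
            elif not isinstance(row[key], expected):
                problems.append(f"row {position}: {key!r} is not {expected.__name__}")
    return problems

# Made-up rows: the second one lacks "path" and "scd".
rows_3x = json.loads('[{"id": "1", "label": "Automotive", "path": ["Automotive"], "scd": false},'
                     ' {"id": "2", "label": "Books"}]')
problems = check_catalog(rows_3x, SCHEMA_3X)
print(problems)
```

To check real files, load them with `json.loads(Path(...).read_text(encoding="utf-8"))` and pass the result to the same function.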
 
-> No external APIs. Runs fully local. LLMs are **not required**. You can enable local embeddings for tougher matches.
+Replace or extend `synonyms_*.json` and `vectors_*.json` as needed for your org.
 
 ---
 
@@ -35,8 +105,34 @@ Outputs are **IAB-3.0–compatible IDs** suitable for OpenRTB/VAST, with optiona
 
 ---
 
+## 🔎 Why migrate to IAB 3.0?
+
+- 3.0 introduces clearer separation of primary topic “aboutness” vs. orthogonal vectors (e.g., news vs. opinion, formats, channels).
+- Better support for CTV/video, podcasts, games, and app stores.
+- Non‑backwards compatible in areas like News/Opinion and entertainment genres; careful migration is required.
+
+This tool makes migration practical: it emits valid 3.0 IDs and helps curate edge cases with overrides, synonyms, thresholds, and audit outputs.
+
+---
+
+## 🧠 How it works
+
+1) Normalize text and apply alias/exact matches via synonyms.
+2) Fuzzy retrieval (rapidfuzz | TF‑IDF | BM25) with configurable thresholds.
+3) Optional semantic augmentation with local embeddings (Sentence‑Transformers or TF‑IDF KNN).
+4) Optional local LLM re‑ranking (Ollama) for ordering only.
+5) Assemble outputs: topic IDs + vector IDs → OpenRTB `content.cat` with configurable `cattax`.
+6) SCD flags are surfaced and can be excluded with `--drop-scd`.
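
Review note: steps 1 and 2 of the list above can be pictured as a cascade in which the cheap, deterministic match is tried before the fuzzy one. The sketch below uses the standard library's `difflib` as a stand-in for rapidfuzz, TF-IDF, or BM25. The catalog entries, placeholder IDs, synonym table, and the 0.8 cutoff are all made up for illustration, and none of this is the package's actual implementation.

```python
import difflib

CATALOG = {"food & drink": "ID-A", "sports": "ID-B", "automotive": "ID-C"}  # placeholder IDs
SYNONYMS = {"cars": "automotive"}  # alias -> catalog label (illustrative)

def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def map_label(label, cutoff=0.8):
    """Return (id, stage), trying the exact/alias match first and the fuzzy match second."""
    key = normalize(label)
    key = SYNONYMS.get(key, key)
    if key in CATALOG:
        return CATALOG[key], "exact"
    close = difflib.get_close_matches(key, list(CATALOG), n=1, cutoff=cutoff)
    if close:
        return CATALOG[close[0]], "fuzzy"
    return None, "unmapped"

results = [map_label("Cars"), map_label("Sprots"), map_label("Quantum gravity")]
print(results)
```

Returning the stage alongside the ID makes it easy to audit how many rows needed the fuzzy fallback and how many stayed unmapped.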
+
+---
+
 ## 🔧 Install
 
+### From PyPI (recommended)
+```bash
+pip install iab-mapper
+```
+
 ### 1) Clone / unpack
 ```bash
 unzip iab-mapper.zip && cd iab-mapper
@@ -102,6 +198,74 @@ The output contains for each input row:
 
 ---
 
+## 🐍 Python API (alternative to CLI)
+
+Install:
+```bash
+pip install iab-mapper
+```
+
+Basic usage:
+```python
+from pathlib import Path
+from iab_mapper.pipeline import Mapper, MapConfig
+import iab_mapper as pkg
+
+# Use packaged stub catalogs or point data_dir to your own
+data_dir = Path(pkg.__file__).parent / "data"
+
+cfg = MapConfig(
+    fuzzy_method="bm25",  # rapidfuzz|tfidf|bm25
+    fuzzy_cut=0.92,
+    use_embeddings=False,  # set True and choose emb_model to enable
+    max_topics=3,
+    drop_scd=False,
+    cattax="2",  # OpenRTB content.cattax enum
+    overrides_path=None  # path to JSON overrides if desired
+)
+
+mapper = Mapper(cfg, str(data_dir))
+
+# Single record with optional vectors
+rec = {
+    "code": "2-12",
+    "label": "Food & Drink",
+    "channel": "editorial",
+    "type": "article",
+    "format": "video",
+    "language": "en",
+    "source": "professional",
+    "environment": "ctv",
+}
+
+out = mapper.map_record(rec)
+print(out["out_ids"])  # topic + vector IDs
+print(out["openrtb"])  # {"content": {"cat": [...], "cattax": "2"}}
+print(out["vast_contentcat"])  # "id1","id2",...
+
+# Or just map topics
+topics = mapper.map_topics("Cooking how-to")
+
+# Batch over a list of dicts
+rows = [rec, {"label": "Sports"}]
+mapped = [mapper.map_record(r) for r in rows]
+```
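
Review note: the inline comments in the snippet above show the shapes of `out["openrtb"]` and `out["vast_contentcat"]`. The sketch below shows how values of that shape could be assembled from a list of IDs, assuming duplicates are dropped while order is preserved. The helper names and sample IDs are hypothetical and this is not the package's code. Separately, the AdCOM category-taxonomy list assigns 2 to Content Taxonomy 2.0 and 7 to Content Taxonomy 3.0, so the `"2"` mirrored here from the snippet is worth verifying against the target spec before use.

```python
def to_openrtb(ids, cattax="2"):
    """Build the OpenRTB fragment in the shape shown above, de-duplicating IDs in order."""
    unique_ids = list(dict.fromkeys(ids))
    return {"content": {"cat": unique_ids, "cattax": cattax}}

def to_vast_contentcat(ids):
    """Render IDs as a comma-separated list of double-quoted strings."""
    return ",".join(f'"{value}"' for value in dict.fromkeys(ids))

# Made-up IDs with one duplicate.
ids = ["id1", "id2", "id1"]
openrtb = to_openrtb(ids)
vast = to_vast_contentcat(ids)
print(openrtb, vast)
```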
+
+Enable local embeddings (optional):
+```python
+cfg = MapConfig(fuzzy_method="rapidfuzz", use_embeddings=True, emb_model="tfidf", emb_cut=0.8)
+mapper = Mapper(cfg, str(data_dir))
+out = mapper.map_record({"label": "Cooking how-to"})
+```
+
+Use overrides (force mapping before matching):
+```python
+cfg = MapConfig(overrides_path="overrides.json")  # [{"code":"1-4","label":null,"ids":["2-3-18"]}]
+mapper = Mapper(cfg, str(data_dir))
+```
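
Review note: the inline comment above gives the shape of one override entry. The sketch below writes a file in that shape and looks an entry up by code, assuming the comment shows the complete entry format. The `lookup_override` helper is hypothetical; how `Mapper` actually consumes the file is not shown in this diff.

```python
import json
import tempfile
from pathlib import Path

# One entry in the shape given by the inline comment: match on code, leave label null.
overrides = [{"code": "1-4", "label": None, "ids": ["2-3-18"]}]

overrides_path = Path(tempfile.mkdtemp()) / "overrides.json"
overrides_path.write_text(json.dumps(overrides, indent=2), encoding="utf-8")

def lookup_override(path, code):
    """Hypothetical helper: return the forced IDs for `code`, or None if no entry matches."""
    for entry in json.loads(Path(path).read_text(encoding="utf-8")):
        if entry.get("code") == code:
            return entry["ids"]
    return None

forced = lookup_override(overrides_path, "1-4")
missing = lookup_override(overrides_path, "9-9")
print(forced, missing)
```

Python's `None` serializes to JSON `null`, which matches the `"label":null` in the comment.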
+
+---
+
 ## 📥 Input Formats
 
 ### CSV
@@ -186,6 +350,18 @@ Each value maps to a **stable IAB 3.0 ID** that is appended to the `cat` array.
 
 ---
 
+## 📎 Official IAB References
+
+- Content Taxonomy 3.0 Implementation Guide (PDF): https://iabtechlab.com/wp-content/uploads/2021/09/Implementation-Guide-Content-Taxonomy-3-0-pc-Sept2021.pdf
+- IAB Tech Lab Content Taxonomy page: https://iabtechlab.com/standards/content-taxonomy/
+- Implementation guidance (historic mappings and migration notes):
+  - https://github.com/InteractiveAdvertisingBureau/Taxonomies/blob/develop/implementation.md#content-21-to-ad-product-20-taxonomy-mapping-implementation-guidance
+  - https://github.com/InteractiveAdvertisingBureau/Taxonomies/blob/develop/Taxonomy%20Mappings/Ad%20Product%202.0%20to%20Content%202.1.tsv
+  - https://github.com/katieshell/Taxonomies/blob/main/implementation.md#implementation-guidance-for-content-1--content-2-mapping
+  - https://github.com/katieshell/Taxonomies/blob/main/implementation.md#migrating-from-content-taxonomy-10
+
+---
+
 ## 🔬 Evaluation (recommended)
 Create a small gold set for your domain and run periodic checks:
 ```bash
@@ -228,5 +404,7 @@ mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json --use-embeddings --drop-sc
 ---
 
 ## 📜 License
-TBD by Mixpeek. Include IAB attribution in your deployed UI/footer:
+MIT. See [LICENSE](LICENSE).
+
+Include IAB attribution in your deployed UI/footer:
 > “IAB is a registered trademark of the Interactive Advertising Bureau. This tool is an independent utility built by Mixpeek for interoperability with IAB Content Taxonomy standards.”

iab_mapper.egg-info/SOURCES.txt

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,4 @@
+LICENSE
 README.md
 pyproject.toml
 iab_mapper/__init__.py
@@ -8,6 +9,7 @@ iab_mapper/llm.py
 iab_mapper/matching.py
 iab_mapper/normalize.py
 iab_mapper/pipeline.py
+iab_mapper/updater.py
 iab_mapper.egg-info/PKG-INFO
 iab_mapper.egg-info/SOURCES.txt
 iab_mapper.egg-info/dependency_links.txt

iab_mapper/cli.py

Lines changed: 15 additions & 0 deletions
@@ -3,6 +3,7 @@
 from rich.console import Console
 from rich.table import Table
 from iab_mapper.pipeline import Mapper, MapConfig
+from iab_mapper.updater import update_catalogs
 
 app = typer.Typer(add_completion=False)
 con = Console()
@@ -78,3 +79,17 @@ def run(
         Path(unmapped_out).write_text(json.dumps(unmapped, ensure_ascii=False), encoding="utf-8")
         con.print(f"Unmapped → {unmapped_out} ({len(unmapped)})")
 if __name__=="__main__": app()
+
+
+@app.command(name="update-catalogs")
+def update_catalogs_cmd(
+    data_dir: Path = typer.Option(Path(__file__).parent / "data", help="Data dir to write catalogs"),
+    major3: int = typer.Option(3, help="Major version to pick for Content Taxonomy 3.x (e.g., 3)"),
+    major2: int = typer.Option(2, help="Major version to pick for Content Taxonomy 2.x (e.g., 2)"),
+    exact3: str = typer.Option(None, help="Exact filename substring for 3.x (e.g., '3.1')"),
+    exact2: str = typer.Option(None, help="Exact filename substring for 2.x (e.g., '2.2')"),
+    token: str = typer.Option(None, help="GitHub token (overrides env GITHUB_TOKEN)"),
+):
+    """Fetch latest IAB catalogs from IAB GitHub and normalize into JSON."""
+    update_catalogs(str(data_dir), major3=major3, major2=major2, exact3=exact3, exact2=exact2, token=token)
+    con.print(f"Updated catalogs in {data_dir}")
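
Review note: in this hunk the `if __name__=="__main__": app()` guard stays above the new command. When the file is executed directly, `app()` therefore runs before the `update-catalogs` decorator has registered anything. The toy registry below is a stdlib-only stand-in for a CLI app (the `Registry` class is hypothetical and is not Typer's API); it only illustrates why the guard usually belongs at the end of the file.

```python
class Registry:
    """Toy stand-in for a CLI app: decorators register commands, calling it lists them."""

    def __init__(self):
        self.commands = {}

    def command(self, name):
        def register(func):
            self.commands[name] = func
            return func
        return register

    def __call__(self):
        return sorted(self.commands)

app = Registry()

@app.command("run")
def run():
    return "mapped"

# Position of the guard in the hunk above: only "run" is registered at this point.
visible_at_guard = app()

@app.command("update-catalogs")
def update_catalogs_cmd():
    return "updated"

# Position at the end of the file: both commands are registered.
visible_at_end = app()
print(visible_at_guard, visible_at_end)
```

If the installed `mixpeek-iab-mapper` entry point imports the module and then calls `app`, both commands are registered; the ordering only affects direct execution of the file.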

iab_mapper/data/iab_2x.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

iab_mapper/data/iab_3x.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.
