Commit 1928938

docs: buyer-safe README (consistent CLI, flags table, version banner, examples); add entrypoint
1 parent 8551205 commit 1928938

2 files changed: +84 -18 lines changed

README.md

Lines changed: 83 additions & 18 deletions
@@ -6,6 +6,7 @@

 <p align="center">
 <a href="https://pypi.org/project/iab-mapper/"><img alt="PyPI" src="https://img.shields.io/pypi/v/iab-mapper.svg"></a>
+<a href="https://img.shields.io/pypi/dm/iab-mapper"><img alt="Downloads" src="https://img.shields.io/pypi/dm/iab-mapper"></a>
 <a href="https://github.com/mixpeek/iab-mapper/actions"><img alt="CI" src="https://github.com/mixpeek/iab-mapper/actions/workflows/ci.yml/badge.svg"></a>
 <a href="https://github.com/mixpeek/iab-mapper/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/license-MIT-blue.svg"></a>
 </p>
@@ -37,6 +38,14 @@ Outputs are **IAB‑3.0–compatible IDs** for OpenRTB/VAST, with optional **vec

 ---

+### Versioning snapshot
+
+| IAB 2.x supported | IAB 3.x supported | Updated |
+|-------------------|-------------------|---------------|
+| 2.2 | 3.1 | 2025-09-12 |
+
+---
+
 ### Update catalogs (fetch latest from IAB)

 Use the bundled fetcher to sync to the latest Content Taxonomy files from the official IAB GitHub repository. It will locate the latest 2.x and 3.x datasets and normalize them into this tool’s schemas.
@@ -147,11 +156,21 @@ Replace the stub `data/*.json` with your **full IAB catalogs** (include `id`, `l
 ## 🚀 Quick Start

 ```bash
-# map the sample CSV using fuzzy matching only
-mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json
+# simplest path: fuzzy only, CSV in → JSON out
+iab-mapper sample_2x_codes.csv -o mapped.json
+
+# enable local embeddings (improves recall on free‑text labels)
+iab-mapper sample_2x_codes.csv -o mapped.json --use-embeddings
+```
+
+OpenRTB and VAST helpers (example output):
+
+```json
+{"content":{"cat":["3-5-2","1026","1068"],"cattax":"2"}}
+```

-# enable local embeddings (improves recall on free-text labels)
-mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json --use-embeddings
+```text
+"3-5-2","1026","1068"
 ```

 The output contains for each input row:
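
The "OpenRTB and VAST helpers" example output in the hunk above can be reproduced from a list of mapped topic IDs. The following is a minimal sketch, not the tool's own helper: the function names `openrtb_content` and `vast_category_list` are hypothetical, and `cattax` is kept as a string only because the README example shows it that way (check the OpenRTB specification for the type and enum value your integration expects).

```python
import json


def openrtb_content(topic_ids, cattax="2"):
    """Build an OpenRTB-style content object from mapped topic IDs.

    Hypothetical helper: the shape mirrors the README's example output.
    """
    # Drop duplicate IDs while preserving the order they arrived in.
    cats = list(dict.fromkeys(topic_ids))
    return {"content": {"cat": cats, "cattax": cattax}}


def vast_category_list(topic_ids):
    """Render the same IDs as a comma-separated list of quoted strings."""
    return ",".join(f'"{t}"' for t in dict.fromkeys(topic_ids))


ids = ["3-5-2", "1026", "1068", "1026"]  # duplicate on purpose
print(json.dumps(openrtb_content(ids), separators=(",", ":")))
print(vast_category_list(ids))
```

With the sample IDs, the two printed lines match the `json` and `text` examples shown in the README diff.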
@@ -276,18 +295,17 @@ code,label,channel,type,format,language,source,environment

 ## ⚙️ Useful Flags

-```bash
-# thresholds
---fuzzy-cut 0.92 # 0..1 (higher = stricter)
---use-embeddings # enable local embeddings
---emb-model all-MiniLM-L6-v2
---emb-cut 0.80 # cosine similarity cut
---max-topics 3 # max topic categories per row
---drop-scd # exclude SCD nodes from results
---cattax 2 # set OpenRTB content.cattax enum for Content Taxonomy
---overrides overrides.json# JSON overrides applied before matching
---unmapped-out misses.json# write rows with no topic_ids to file
-```
+| Flag | Default | What it does |
+|------|---------|--------------|
+| `--fuzzy-cut` | `0.92` | Stricter = fewer, higher-confidence matches |
+| `--use-embeddings` | off | Enable local embeddings for near-miss labels |
+| `--emb-model` | `all-MiniLM-L6-v2` | Sentence-Transformers model or `tfidf` |
+| `--emb-cut` | `0.80` | Cosine similarity threshold for embeddings |
+| `--max-topics` | `3` | Cap topic IDs per row |
+| `--drop-scd` | off | Exclude Sensitive Content nodes |
+| `--cattax` | `2` | OpenRTB `content.cattax` enum |
+| `--unmapped-out` || Write misses to file for audit |
+| `--overrides` || Force mappings before match |

 ---

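
To make the `--fuzzy-cut` row concrete, here is a small sketch of how a 0..1 similarity cutoff filters candidates. It uses the standard library's `difflib.SequenceMatcher` as a stand-in; the diff does not say which fuzzy matcher `iab-mapper` uses, so the scores here are illustrative only, and `fuzzy_matches` and the tiny catalog are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_matches(label, catalog, cut=0.92):
    """Return (score, candidate) pairs whose similarity is >= cut, best first.

    Stand-in scorer: difflib's ratio, which is not necessarily what the
    real CLI computes.
    """
    scored = []
    for cand in catalog:
        score = SequenceMatcher(None, label.lower(), cand.lower()).ratio()
        if score >= cut:
            scored.append((round(score, 3), cand))
    return sorted(scored, reverse=True)


catalog = ["Sports", "Motor Sports", "Sporting Goods"]  # made-up labels

# A strict cutoff rejects the near-miss "Sport" vs "Sports"...
print(fuzzy_matches("Sport", catalog, cut=0.92))
# ...while a looser cutoff lets it through.
print(fuzzy_matches("Sport", catalog, cut=0.60))
```

Raising the cutoff trades recall for precision, which is the behavior the table describes as "Stricter = fewer, higher-confidence matches".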
@@ -334,6 +352,25 @@ python scripts/eval.py mapped.json gold.json
 ```
 Gate releases on accuracy deltas so behavior stays stable for audits.

+Minimal starter:
+
+```json
+// scripts/gold.json
+[{"in_label":"Sports","topic_ids":["483"]}]
+```
+
+```python
+# scripts/eval.py (toy example)
+import json, sys
+pred = { (r.get('in_label')): set(r.get('topic_ids',[])) for r in json.load(open(sys.argv[1])) }
+gold = { (r.get('in_label')): set(r.get('topic_ids',[])) for r in json.load(open(sys.argv[2])) }
+tp=fp=fn=0
+for k in gold:
+    g=gold[k]; p=pred.get(k,set())
+    tp += len(g & p); fp += len(p - g); fn += len(g - p)
+print({'tp':tp,'fp':fp,'fn':fn})
+```
+
 ---

 ## 🛠️ Updating Catalogs
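
The toy `eval.py` in the hunk above prints raw `tp`/`fp`/`fn` counts, while the README asks to "gate releases on accuracy deltas". One possible way to connect the two is sketched below: derive precision, recall, and F1 from the counts and compare against a stored baseline. The `gate` function, the `max_drop` tolerance, and the sample counts are all assumptions for illustration, not part of the commit.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def gate(current, baseline, max_drop=0.01):
    """Return True if F1 fell by no more than max_drop versus the baseline.

    max_drop is a hypothetical tolerance; pick one that suits your audits.
    """
    return prf(**current)[2] >= prf(**baseline)[2] - max_drop


# Made-up counts in the same shape eval.py prints.
baseline = {"tp": 90, "fp": 10, "fn": 10}
candidate = {"tp": 85, "fp": 10, "fn": 15}

print(prf(**candidate))
print("release ok:", gate(candidate, baseline))
```

In CI, a non-passing gate could be turned into a non-zero exit code so the release job fails.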
@@ -347,6 +384,34 @@ Commit with a version bump and note `taxonomy_version` in your release notes.

 ---

+## 🔐 Security & operations
+
+- Local-first: processing happens on your machine; no external APIs needed.
+- No PII required; CSV/JSON processed in-memory.
+- Air‑gapped: prebundle ST model and run `iab-mapper` fully offline.
+
+---
+
+## 🤝 Using Mixpeek API (optional)
+
+If you prefer managing catalogs, outputs, and audits centrally, you can run mapping locally and then persist results via Mixpeek for auditability.
+
+```http
+# 1) create collection
+POST /collections { "name": "iab-taxonomy" }
+
+# 2) create 'document' with 2.x codes
+POST /collections/{id}/documents { "document_id":"iab-2x", "properties": { ... } }
+
+# 3) run taxonomy feature extractor (2.x → 3.0)
+POST /collections/{id}/documents/{doc}/features { "extractor":"taxonomy", "params":{"target_version":"3.0"} }
+
+# 4) fetch enriched doc
+GET /collections/{id}/documents/{doc}
+```
+
+See also: [Taxonomy Mapper tool](/tools/iab-taxonomy-mapper), [Taxonomy audit tool](/tools/taxonomy-audit), [Video guide](/education/videos/taxonomies-guide), and the landing page at [mxp.co/taxonomy](https://mxp.co/taxonomy).
+
 ## 🧯 Troubleshooting
 - **No matches:** lower `--fuzzy-cut` or enable `--use-embeddings`.
 - **Weird matches:** raise thresholds; add synonyms into `synonyms_*.json`.
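
The troubleshooting advice above to "raise thresholds" also applies to `--emb-cut`, which the flags table describes as a cosine similarity threshold. The sketch below shows what such a cutoff means numerically. The three vectors are invented toy values, not output from `all-MiniLM-L6-v2` or any real model, and `passes_emb_cut` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def passes_emb_cut(u, v, cut=0.80):
    """True when the pair is similar enough to count as a match."""
    return cosine(u, v) >= cut


# Toy "embeddings": real ones have hundreds of dimensions.
query = [1.0, 0.0, 1.0]
near = [0.9, 0.1, 1.0]   # points almost the same way as query
far = [0.0, 1.0, 0.2]    # mostly orthogonal to query

print(round(cosine(query, near), 3), passes_emb_cut(query, near))
print(round(cosine(query, far), 3), passes_emb_cut(query, far))
```

A higher cut admits only vectors pointing in nearly the same direction, which is why raising it tends to remove odd matches at the cost of more unmapped rows.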
@@ -359,10 +424,10 @@ Commit with a version bump and note `taxonomy_version` in your release notes.
 ## 📦 Example Commands
 ```bash
 # Strict fuzzy only
-mixpeek-iab-mapper sample_2x_codes.csv -o mapped.csv --fuzzy-cut 0.95
+iab-mapper sample_2x_codes.csv -o mapped.csv --fuzzy-cut 0.95

 # Embeddings on, drop SCD, max 2 topics, custom cattax, collect unmapped
-mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json --use-embeddings --drop-scd --max-topics 2 --cattax 2 --unmapped-out misses.json
+iab-mapper sample_2x_codes.csv -o mapped.json --use-embeddings --drop-scd --max-topics 2 --cattax 2 --unmapped-out misses.json
 ```

 ---
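
When the example commands above are driven from a script rather than typed by hand, it can help to assemble the argument list programmatically. This is a sketch under the assumption that the flags behave as listed in the README diff; `build_command` is a hypothetical wrapper, and nothing here actually invokes the CLI.

```python
import shlex


def build_command(src, out, fuzzy_cut=None, use_embeddings=False,
                  drop_scd=False, max_topics=None, cattax=None,
                  unmapped_out=None):
    """Assemble an iab-mapper argv list using only flags shown in the README."""
    argv = ["iab-mapper", src, "-o", out]
    if fuzzy_cut is not None:
        argv += ["--fuzzy-cut", str(fuzzy_cut)]
    if use_embeddings:
        argv.append("--use-embeddings")
    if drop_scd:
        argv.append("--drop-scd")
    if max_topics is not None:
        argv += ["--max-topics", str(max_topics)]
    if cattax is not None:
        argv += ["--cattax", str(cattax)]
    if unmapped_out is not None:
        argv += ["--unmapped-out", unmapped_out]
    return argv


cmd = build_command("sample_2x_codes.csv", "mapped.json",
                    use_embeddings=True, drop_scd=True,
                    max_topics=2, cattax=2, unmapped_out="misses.json")
# shlex.join renders the list as a copy-pasteable shell line.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it without shell quoting pitfalls, assuming `iab-mapper` is installed and on `PATH`.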

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ emb = [

 [project.scripts]
 mixpeek-iab-mapper = "iab_mapper.cli:app"
+iab-mapper = "iab_mapper.cli:app"

 [build-system]
 requires = ["setuptools>=68", "wheel"]
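
The added `[project.scripts]` line registers `iab-mapper` as a second console command pointing at the same `module:attribute` target as `mixpeek-iab-mapper`, so both names should run identical code once the package is reinstalled. The sketch below illustrates roughly how such a target string is resolved. Because `iab_mapper` may not be installed where this runs, the demo resolves a standard-library target instead; `resolve_entry_point` is a hypothetical helper, not an installer API.

```python
from importlib import import_module


def resolve_entry_point(spec):
    """Resolve a 'package.module:attr.path' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj = import_module(module_name)
    # Walk dotted attributes after the colon, if any were given.
    for attr in (attr_path.split(".") if attr_path else []):
        obj = getattr(obj, attr)
    return obj


# The two script names from pyproject.toml share one target string.
scripts = {
    "mixpeek-iab-mapper": "iab_mapper.cli:app",
    "iab-mapper": "iab_mapper.cli:app",
}
print("distinct targets:", len(set(scripts.values())))

# Demo with a stdlib target, since iab_mapper may not be importable here.
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))
```

Keeping the old name alongside the new one means existing scripts that call `mixpeek-iab-mapper` continue to work while the README switches to the shorter command.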
