A modern, embeddable corpus query engine with first-class support for aligned corpora.
montre (/mɔ̃tʁ/): “shows,” “reveals,” “makes visible” — from French montrer, “to show.” The Latin root is monstrare, “to point out, indicate.”
No server, external services, or prerequisites.
A corpus is a self-contained directory with its own data, indexes, and (optionally) alignments. Build it in one line from your annotation files, or from a TOML manifest describing multiple components.
Designed to be used from the CLI or embedded directly in Julia or Python.
curl -fsSL https://raw.githubusercontent.com/myersm0/montre/main/install.sh | sh
# Build a corpus from a directory of CoNLL-U files:
montre build -i data/maupassant/ -o my-corpus/
# Query
montre query my-corpus/ '[pos="ADJ"] [pos="NOUN"]'
# Count
montre count my-corpus/ '[pos="ADJ"] [pos="NOUN"]'
montre count my-corpus/ '[pos="NOUN"]' --by-document
montre count my-corpus/ '[pos="NOUN"]' --by-component
# Filter
montre query my-corpus/ '[pos="ADJ"] [pos="NOUN"]' --document la-parure
montre query my-corpus/ '[pos="ADJ"] [pos="NOUN"]' --component fr
# Inspect
montre info my-corpus/
montre docs my-corpus/
montre layers my-corpus/
montre vocab my-corpus/ pos
montre vocab my-corpus/ lemma --top 50 --component fr
Montre uses a CQL-based language, extended with labels, constraints, and alignment-aware operations.
# Token queries
[pos="NOUN"]
[lemma="maison"]
[word="chat" & pos="NOUN"]
[lemma=/^un.*/]
[pos!="PUNCT"]
# Sequences
[pos="DET"] [pos="ADJ"]* [pos="NOUN"]
# Quantifiers
[pos="ADJ"]+
[pos="ADJ"]*
[pos="ADJ"]?
[pos="ADJ"]{2,4}
# Alternation
([pos="ADJ"] | [pos="ADV"])+ [pos="NOUN"]
[pos="DET"] [pos="NOUN"] within s
[lemma="chat"] within doc
Feature queries require the `--decompose-feats` flag at build time.
[pos="NOUN" & feats.Number="Plur"]
[feats.Gender="Masc" & feats.Tense="Past"]
[pos="NOUN"] within component:fr
[pos="ADJ"] [pos="NOUN"] within doc:"la-parure","boule-de-suif"
a:[pos="NOUN"] []* b:[pos="NOUN"] :: a.lemma = b.lemma
a:[pos="ADJ"] b:[pos="NOUN"] :: a.lemma != b.lemma
a:[] []{0,20} b:[] :: distance(a,b) >= 5
Constraints are evaluated over full matches using labeled spans.
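To make the idea concrete, here is a minimal Python sketch of how constraints such as `a.lemma = b.lemma` or `distance(a,b) >= 5` could be checked once a match has bound its labels. It is an illustration only, not Montre's implementation: the `Token` type, the toy data, and the helper functions are all hypothetical, and `distance` is assumed to mean the difference in token positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str
    lemma: str
    pos: str


# Toy corpus (hypothetical data, for illustration only).
tokens = [
    Token("le", "le", "DET"),
    Token("chat", "chat", "NOUN"),
    Token("voit", "voir", "VERB"),
    Token("un", "un", "DET"),
    Token("chat", "chat", "NOUN"),
    Token("et", "et", "CCONJ"),
    Token("une", "un", "DET"),
    Token("maison", "maison", "NOUN"),
]


def candidate_matches(tokens):
    """Enumerate label bindings for a:[pos="NOUN"] []* b:[pos="NOUN"]."""
    nouns = [i for i, token in enumerate(tokens) if token.pos == "NOUN"]
    return [{"a": i, "b": j} for i in nouns for j in nouns if i < j]


def same_lemma(match, tokens):
    """Constraint a.lemma = b.lemma, checked on a complete match."""
    return tokens[match["a"]].lemma == tokens[match["b"]].lemma


def min_distance(match, n):
    """Constraint distance(a,b) >= n, assuming distance is b's position minus a's."""
    return match["b"] - match["a"] >= n


candidates = candidate_matches(tokens)
# The pattern is matched first; the constraint then filters whole matches.
same_lemma_matches = [m for m in candidates if same_lemma(m, tokens)]
distant_matches = [m for m in candidates if min_distance(m, 5)]
```

The point of the sketch is the ordering: the token pattern produces complete matches with labels attached, and only then is each constraint applied as a filter over those labels.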
Montre was designed from the ground up specifically for parallel corpora.
Montre treats a parallel corpus as a single object with multiple components and explicit alignment relations, rather than as separate corpora joined at query time.
- Multiple components (languages, editions, translations)
- Named alignments at any span level (sentence, paragraph, stanza)
- Multiple competing alignment sets (LaBSE, vecalign, manual)
- Alignment projection between components
# Query French, project to English
[lemma="maison"] within component:fr =labse=>
This enables:
- tracing translations across languages
- detecting omissions or expansions
- comparing editions or variants
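As a rough picture of what projecting a hit through a sentence-level alignment involves, the following Python sketch maps a token position in a source component to the aligned sentence spans in a target component. The data layout is an assumption made for illustration (sentence spans as `(start, end)` token offsets, edges as a dictionary from source sentence index to target sentence indexes); how Montre actually stores alignments is not described here.

```python
from bisect import bisect_right

# Hypothetical sentence spans per component: (start, end) token offsets, end exclusive.
fr_sentences = [(0, 5), (5, 12), (12, 20)]
en_sentences = [(0, 6), (6, 10), (10, 15), (15, 22)]

# Hypothetical alignment edges: source sentence index -> target sentence indexes.
# An empty list models an omission; several targets model an expansion.
edges = {0: [0], 1: [1, 2], 2: []}


def containing_sentence(position, sentences):
    """Return the index of the sentence whose span contains the token position."""
    starts = [start for start, _ in sentences]
    index = bisect_right(starts, position) - 1
    start, end = sentences[index]
    if not (start <= position < end):
        raise ValueError(f"position {position} is outside every sentence")
    return index


def project(hit_position, source_sentences, target_sentences, edges):
    """Map a source-side hit to the target spans aligned with its sentence."""
    source_index = containing_sentence(hit_position, source_sentences)
    return [target_sentences[t] for t in edges.get(source_index, [])]


expanded = project(7, fr_sentences, en_sentences, edges)   # one-to-many edge
omitted = project(14, fr_sentences, en_sentences, edges)   # no target at all
```

In this toy setup, a hit that projects to several target spans suggests an expansion in translation, and a hit that projects to none suggests an omission.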
[corpus]
name = "isosceles"
decompose_feats = true
[components.maupassant-fr]
path = "data/maupassant/fr/conllu/"
language = "fr"
[components.maupassant-en]
path = "data/maupassant/en/conllu/"
language = "en"
[alignments.labse]
source = "maupassant-fr"
target = "maupassant-en"
edges = "alignments/labse/"
source_layer = "sentence"
target_layer = "sentence"
montre build -m corpus.toml -o my-corpus/
Montre is competitive with established corpus engines while prioritizing structural flexibility and embeddability.
On a 1.5M token corpus (Maupassant French/English, Apple M4 Max):
| Query | Matches | Time |
|---|---|---|
| `[pos="NOUN"]` | 244,184 | 0.6 ms |
| `[pos="ADJ"] [pos="NOUN"]` | 30,672 | 12 ms |
| `[pos="ADJ"]? [pos="NOUN"]` | 272,019 | 71 ms |
| `([pos="ADJ"] \| [pos="ADV"])+ [pos="NOUN"]` | 33,444 | 27 ms |
| `([pos="ADJ"] \| [pos="DET"])+ [pos="NOUN"]` | 198,735 | 71 ms |
- Quantifiers use a run-based execution model (scales with matches, not corpus size)
- `--count-only` avoids hit allocation entirely (nanosecond-scale for simple queries)
- Memory-mapped indexes reduce load time and memory footprint by an order of magnitude
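One way to picture the first two points, offered as a sketch rather than a description of Montre's internals: if the positions matching `[pos="ADJ"]` are kept as a strictly increasing list, consecutive positions can be grouped into runs, and `[pos="ADJ"]+ [pos="NOUN"]` can be counted by inspecting only the token that follows each run. The function names and toy positions below are hypothetical, and the sketch assumes one match per possible start position within a run.

```python
def runs(sorted_positions):
    """Group strictly increasing positions into (start, length) runs."""
    result = []
    for position in sorted_positions:
        if result and result[-1][0] + result[-1][1] == position:
            start, length = result[-1]
            result[-1] = (start, length + 1)
        else:
            result.append((position, 1))
    return result


def count_adj_plus_noun(adj_positions, noun_positions):
    """Count matches of [pos="ADJ"]+ [pos="NOUN"] without building hit objects."""
    nouns = set(noun_positions)
    total = 0
    for start, length in runs(adj_positions):
        # Only the token right after the run decides whether the run matches.
        if start + length in nouns:
            total += length  # assumed: one match per start position in the run
    return total


# Toy posting lists (hypothetical token positions).
adj_positions = [2, 3, 4, 9, 15, 16]
noun_positions = [5, 12, 17]

adj_runs = runs(adj_positions)
match_count = count_adj_plus_noun(adj_positions, noun_positions)
```

The work done here depends on the number of adjective positions and runs, not on the total number of tokens, and the count is accumulated as an integer without allocating a record per hit.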
Montre exposes a C FFI for embedding in other languages.
using Montre
corpus = open_corpus("./my-corpus")
hits = query(corpus, "[pos=\"ADJ\"] [pos=\"NOUN\"]")
for line in concordance(corpus, hits)
    println(line)
end
Python bindings via PyO3 are in progress.
import montre
corpus = montre.open("./my-corpus")
for hit in corpus.query('[pos="DET"] [pos="NOUN"]'):
    print(hit.start, hit.end)
Coming soon:
- Statistics: group, collocation
- Python bindings (feature-complete, pip install)
- REPL (persistent corpus session)
- TUI for interactive exploration
- Support for additional input formats (VRT, Stanza JSON, TEI)
A paper describing Montre is in preparation. In the meantime, if you use Montre in published research, please cite:
@software{myers-montre,
author = {Myers, Michael J.},
title = {Montre: A Modern Corpus Query Engine for Aligned Corpora},
year = {2026},
url = {https://github.com/myersm0/montre},
version = {0.4.0}
}
License: Apache-2.0