A unified, open, linguistically structured framework for constructing multilingual Biblical text datasets under a canonical, machine-readable schema.
It defines a canonical data model and supporting tooling for producing interoperable, linguistically annotated Biblical text corpora.
The initial focus is on Hebrew, Greek, and English, but the schema is designed to support additional languages and textual traditions.
The project is designed as a neutral foundation that can serve as a common standard for Biblical text computing.
Project Version: 0.2.0
⚠️ Dataset Status: No complete corpus is currently distributed. This repository provides the schema, tooling, documentation, and validated examples required to produce conformant datasets.
- Getting Started
- Documentation
- Versioning
- Purpose
- Intended Audience
- Design Goals
- Non-Goals
- Architecture Overview
- Repository Layout
- Project Status
- Reference Datasets
- Data Sources
- Data Provenance
- Citation
- FAQ
- Contributing
- Project Sustainability
- License
- Source Data Licensing
- Planned Initial Data Stack
Although no full dataset is currently distributed, you can:
- Review the formal schema in /spec/schema.md
- Examine validated example outputs in /examples/
- Run the validation tool on sample data:
      python tools/validate.py examples/
- Use the schema to design or convert your own datasets
- Prepare tooling for future corpus releases
For an overview of repository structure and documentation roles, see the Documentation Map.
This project tracks three independent version axes.
| Component | Current Version | Scope | Changes When |
|---|---|---|---|
| Schema | 0.2.0 | Data structure specification | Structure or validation rules change |
| Dataset | — (not yet released) | Corpus content | Data is added, corrected, or rebuilt |
| Repository | 0.2.0 | Documentation and tooling | Project files change |
Datasets declare the schema version they conform to. The three versions evolve independently and should not be assumed to move in lockstep.
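As an illustration of how a consumer might check a dataset's declared schema version, here is a minimal sketch. The `schema_version` metadata key and the exact-match policy are assumptions made for this example; the actual field name and compatibility rules are whatever /spec/schema.md defines.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def conforms_to(dataset_meta: dict, supported_schema: str) -> bool:
    """Return True if the dataset declares exactly the supported schema version.

    'schema_version' is a hypothetical metadata key used only in this sketch.
    """
    declared: Optional[str] = dataset_meta.get("schema_version")
    if declared is None:
        return False
    return parse_version(declared) == parse_version(supported_schema)


# A dataset that declares schema 0.2.0 and has no dataset release version yet.
meta = {"schema_version": "0.2.0", "dataset_version": None}
print(conforms_to(meta, "0.2.0"))  # True
```

Comparing parsed tuples rather than raw strings avoids false mismatches from formatting differences, and keeps the schema version check separate from any dataset or repository version.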
This project provides a foundational data infrastructure layer for Biblical computing that supports:
- Linguistic analysis
- Cross-language alignment
- Translation studies
- Bible software development
- Digital humanities research
- AI training datasets
This project is not a translation, commentary, or theological resource. It models textual structure only.
This project is primarily intended for:
- Biblical scholars and textual researchers
- Computational linguists
- Digital humanities practitioners
- Bible software developers
- AI/ML researchers working with religious texts
- Open and Redistributable - All datasets are legally redistributable (public domain or open academic resources)
- Canonical Verse Anchoring - Verse IDs are stable across all languages and editions
- Separation of Concerns - Canonical structure separated from edition-specific text segmentation
- Deterministic Identifiers - Stable, machine-safe IDs for all entities
- Language-Scoped Lemmas - Prevent cross-language collisions
- External References as Metadata - Strong’s numbers attached to lemmas, not replacing them
- Morpheme-Level Modeling - Supports Hebrew prefixes, suffixes, and Greek compounds
- Interpretation-Neutral - No theology, semantics, or doctrinal tagging in the core dataset
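To make the language-scoped lemma and external-reference goals concrete, the following sketch shows one possible shape for a lemma record. The class name, field names, language codes, and the `language:form` ID format are all hypothetical; the authoritative identifier rules are in the schema specification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Lemma:
    language: str                  # hypothetical language code, e.g. "heb" or "grc"
    form: str                      # dictionary form of the lemma
    strongs: Tuple[str, ...] = ()  # external references, kept as optional metadata

    @property
    def lemma_id(self) -> str:
        # Prefixing the language keeps identical spellings in different
        # languages from colliding on the same identifier.
        return f"{self.language}:{self.form}"


# Artificial example: the same spelling in two languages yields distinct IDs.
a = Lemma("heb", "ab", strongs=("H1",))
b = Lemma("grc", "ab")
print(a.lemma_id)  # heb:ab
print(b.lemma_id)  # grc:ab
```

Because the Strong's number is an attribute and not part of the identifier, a lemma without a Strong's entry is still fully addressable.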
The core dataset does not aim to provide:
- Theological interpretation or doctrinal tagging
- Commentary or study notes
- New translations of Biblical texts
- Manuscript transcription or paleographic data (future extension)
- End-user Bible software features
These may be built as separate layers on top of the core schema.
The data flows through layered entities:
TextSource (edition)
↓
Verse (canonical anchor)
↓
Word (orthographic unit)
↓
Morpheme (linguistic unit)
↓
Lemma (dictionary form)
↓
Alignment (cross-language mapping)
Each layer represents a progressively more abstract linguistic unit derived from the underlying textual source.
Notes:
- Verses exist independently of any specific text edition
- Words and morphemes are edition-specific
- Alignment operates primarily at the morpheme level (or word level where morphemes are not modeled) within a verse.
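The notes above can be sketched as a small data model with a verse-scoping check for alignments. All class and field names and the `GEN.1.1` style identifiers are assumptions for illustration, not the schema's actual definitions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Morpheme:
    morpheme_id: str
    verse_id: str    # canonical anchor, independent of any edition
    source_id: str   # edition that this segmentation belongs to
    surface: str


@dataclass(frozen=True)
class Alignment:
    verse_id: str
    members: Tuple[Morpheme, ...]


def alignment_is_verse_scoped(alignment: Alignment) -> bool:
    """Return True if every aligned morpheme belongs to the alignment's verse."""
    return all(m.verse_id == alignment.verse_id for m in alignment.members)


# Illustrative units from two editions anchored to the same verse.
heb = Morpheme("m1", "GEN.1.1", "WLC", "be")
eng = Morpheme("m2", "GEN.1.1", "KJV", "In")
stray = Morpheme("m3", "GEN.1.2", "KJV", "And")

print(alignment_is_verse_scoped(Alignment("GEN.1.1", (heb, eng))))    # True
print(alignment_is_verse_scoped(Alignment("GEN.1.1", (heb, stray))))  # False
```

Keeping `verse_id` and `source_id` as separate fields reflects the split between the canonical structure and edition-specific segmentation.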
.
├── ATTRIBUTION.md
├── CHANGELOG.md
├── CITATION.cff
├── LICENSE
├── README.md
├── SUPPORT.md
├── VERSION
│
├── data/
│ ├── core/
│ ├── texts/
│ ├── lemmas/
│ └── alignments/
│
├── examples/
│
├── spec/
│ └── schema.md
│
├── tools/
│ └── validate.py
│
├── scripts/
└── docs/
The data/ directory defines the canonical dataset structure but does not currently contain a complete corpus.
Phase 1 — Core Dataset Architecture (Complete)
The foundational data model and conformance rules are finalized.
- Schema v0.2.0 finalized and locked
- Canonical reference system defined
- Word and morpheme segmentation model defined
- Lemma system defined
- Alignment specification defined
- Conformance rules established
Schema v0.1.0 served as the initial architectural baseline. Schema v0.2.0 introduces clarifications, refinements, and supporting specifications without altering the core design.
Phase 2 — Supporting Infrastructure (Substantially Complete)
Materials required to produce conformant datasets have been developed:
- Detailed specification documents (complete)
- Reference examples (complete)
- Validation rules and tooling (complete)
- Repository documentation (complete)
Automated production pipelines are currently under development.
Phase 3 — Dataset Production (Beginning)
Focus is shifting toward generating the first fully conformant datasets:
- Data production pipelines (in development)
- Reference corpus construction (pending)
- Quality assurance workflows (pending)
- Reproducible build processes (in development)
No canonical dataset release exists yet. The first conformant reference corpus is planned as the primary deliverable for the next milestone.
Datasets produced by this project are intended to be reproducible from documented upstream sources using deterministic build pipelines.
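One way to check that two builds produced the same content is to hash a canonical serialization of the output. This is only a sketch of the general technique, not the project's pipeline; the function name and record layout are hypothetical.

```python
import hashlib
import json


def dataset_digest(records: list) -> str:
    """Hash records in canonical JSON form so equal content gives equal digests."""
    canonical = json.dumps(
        records,
        sort_keys=True,          # key order must not affect the digest
        ensure_ascii=False,      # keep Hebrew and Greek text as-is
        separators=(",", ":"),   # fixed whitespace
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run_a = [{"id": "GEN.1.1", "text": "example"}]
run_b = [{"text": "example", "id": "GEN.1.1"}]  # same content, different key order
print(dataset_digest(run_a) == dataset_digest(run_b))  # True
```

Note that the order of records in the list still affects the digest, so a build that uses this approach would also need to emit records in a fixed order.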
This repository includes complete Chapter 1 reference datasets for selected biblical texts. These are provided as authoritative examples of the schema in practice and as validation targets for tooling.
- Genesis Chapter 1 (Hebrew/English)
  - Hebrew: Westminster Leningrad Codex (WLC)
  - English: King James Version (1769)
- John Chapter 1 (Greek/English)
  - Greek: SBL Greek New Testament (SBLGNT)
  - English: King James Version (1769)
Each reference dataset includes:
- Structured verse text
- Tokenized text with positions
- Morphological and lexical annotations
- Strong’s identifiers where applicable
- Alignment groups between languages
- Explicit handling of punctuation and markers
These datasets are intended to:
- Demonstrate correct schema usage across realistic text spans
- Provide reproducible examples for documentation and study
- Serve as test fixtures for validators and downstream tools
- Illustrate alignment behavior across languages
- Enable early experimentation before the full corpus is released
They are not intended to represent the complete dataset or a canonical release.
Reference datasets may differ from the eventual full corpus in:
- Coverage (limited to Chapter 1)
- Source normalization details
- Annotation completeness
- Alignment refinements
- Metadata richness
Future releases may update or replace these examples as the dataset production pipeline matures.
Reference datasets are stored within the examples/ directory following the
standard repository layout. See the examples/README for details.
The planned initial dataset will incorporate material derived from multiple third-party textual resources, including:
- Open Scriptures Hebrew Bible (OSHB)
- SBL Greek New Testament (SBLGNT)
- King James Version (1769) with Strong’s numbers
- Strong’s Hebrew Dictionary (1890)
- Strong’s Greek Dictionary (1890)
Several of these resources are distributed via modules from the CrossWire Bible Society (SWORD Project). The SBLGNT source is distributed separately via the Faithlife GitHub repository.
Comprehensive source attribution, provenance, licensing terms, and required acknowledgments are documented in ATTRIBUTION.md.
Upstream sources are documented in:
- docs/SOURCE_INDEX.md
- source_texts/*/SOURCE.md
Raw source artifacts are not stored in this repository due to licensing, size, and reproducibility considerations.
If you use this dataset in research, please cite it using the information provided in CITATION.cff.
If referencing the schema or framework without a dataset release, cite the repository itself.
The verse is the smallest textual unit that is:
- Stable across manuscript traditions
- Widely recognized by humans
- Used by nearly all Biblical software
- Practical for cross-language alignment
Smaller units (sentences, clauses) vary significantly across editions and languages.
Many Biblical languages encode meaning within word-internal elements:
- Hebrew prefixes and suffixes
- Greek inflectional endings
- Compound constructions
Morpheme-level modeling allows accurate linguistic analysis and precise cross-language alignment.
The goal is a neutral base layer describing textual facts only.
Interpretive datasets can be built on top without constraining the core schema or introducing bias.
Verse-level alignment provides a consistent, tractable scope while still capturing the vast majority of translation correspondences.
Cross-verse alignment introduces substantial ambiguity and complexity and can be layered on in future extensions if needed.
Version 0.2.0 focuses on a clean foundation using established scholarly editions.
Manuscript-level modeling is planned as a future extension once the core structure is stable.
Chapters are derivable from verse identifiers and are not strictly required for most computational tasks.
Making Chapter optional simplifies lightweight datasets while allowing materialization when useful for navigation or storage.
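As a sketch of how chapters can be derived on demand, assume verse identifiers of the form `BOOK.CHAPTER.VERSE` (a hypothetical format used only here; the real identifier syntax is defined in the schema specification).

```python
from typing import Dict, Iterable, List


def chapter_of(verse_id: str) -> str:
    """Derive a chapter identifier from a 'BOOK.CHAPTER.VERSE' verse ID."""
    book, chapter, _verse = verse_id.split(".")
    return f"{book}.{chapter}"


def group_by_chapter(verse_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Materialize chapters as groups of verse IDs, preserving input order."""
    chapters: Dict[str, List[str]] = {}
    for verse_id in verse_ids:
        chapters.setdefault(chapter_of(verse_id), []).append(verse_id)
    return chapters


print(group_by_chapter(["GEN.1.1", "GEN.1.2", "GEN.2.1"]))
# {'GEN.1': ['GEN.1.1', 'GEN.1.2'], 'GEN.2': ['GEN.2.1']}
```

A lightweight dataset can skip storing chapters entirely and compute them this way when navigation needs them.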
Strong’s numbers are a valuable reference system, but they are:
- Not language-neutral
- Not comprehensive
- Not stable across scholarly traditions
Treating them as metadata preserves usefulness without constraining the dataset design.
Datasets may include an optional Token layer (separate from Word/Morpheme entities) for punctuation and legacy compatibility.
Word-Morpheme datasets may omit punctuation entirely.
This project is not a new translation. It provides a structured representation of existing texts, not new textual content.
External code and data contributions are not currently being accepted while the project architecture stabilizes.
The core schema (v0.2.0) is locked. Any future changes will follow a formal versioning and review process.
Feedback is welcome and encouraged. You can contribute by:
- Reporting bugs or errors
- Requesting clarifications
- Suggesting improvements
- Discussing design decisions
Please use:
- GitHub Issues for specific problems
- GitHub Discussions for questions or broader ideas
External contributions may be opened once the project reaches a stable architecture and governance model.
For full details, see CONTRIBUTING.md.
OpenBiblicalDataset is maintained as an independent open data infrastructure project.
Organizations or individuals interested in supporting its development, maintenance, or dataset production efforts are invited to review SUPPORT.md.
See the LICENSE file for details.
OpenBiblicalDataset redistributes texts and reference data derived from sources believed to be in the public domain or otherwise legally redistributable.
Not all external source texts are licensed under MIT.
Licensing of redistributed source texts is independent of the license applied to the structural dataset produced by this project.
Where applicable:
- Original source texts may have their own public-domain status or licensing terms
- This project does not claim copyright over the underlying ancient texts
- Newly created structural data (identifiers, segmentation, annotations, alignments, etc.) is released under the MIT License
Users are responsible for ensuring compliance with any applicable terms when combining this dataset with other resources.
Specific source attributions and provenance information will be documented alongside each dataset release.
Redistribution rights vary by source and jurisdiction.
The initial dataset release is intended to be constructed from the following textual sources and reference works:
- Hebrew Bible — Open Scriptures Hebrew Bible (WLC-based)
- Greek New Testament — SBL Greek New Testament (Faithlife GitHub distribution, CC BY 4.0)
- English — King James Version (1769, CrossWire module)
- Lexical Reference — Strong’s Hebrew Dictionary and Strong’s Greek Dictionary (Strong’s Exhaustive Concordance, 1890, CrossWire modules)
This stack represents a foundational multilingual corpus suitable for alignment, linguistic analysis, and cross-language study.
Full provenance, licensing details, and required source acknowledgments are provided in ATTRIBUTION.md.
Users of redistributed datasets should review that document to ensure compliance with applicable terms.
This stack may evolve as production pipelines mature.