OpenBiblicalDataset

A unified, open, linguistically structured framework for constructing multilingual Biblical text datasets under a canonical, machine-readable schema.

It defines a canonical data model and supporting tooling for producing interoperable, linguistically annotated Biblical text corpora.

The initial focus is on Hebrew, Greek, and English, but the schema is designed to support additional languages and textual traditions.

The project is designed as a neutral foundation that can serve as a common standard for Biblical text computing.

Project Version: 0.2.0

⚠️ Dataset Status: No complete corpus is currently distributed. This repository provides the schema, tooling, documentation, and validated examples required to produce conformant datasets.

Getting Started

Although no full dataset is currently distributed, you can:

Review the formal schema in /spec/schema.md
Examine validated example outputs in /examples/
Run the validation tool on sample data:

  python tools/validate.py examples/

Use the schema to design or convert your own datasets
Prepare tooling for future corpus releases

↑ Back to contents

Documentation

For an overview of repository structure and documentation roles, see the Documentation Map.

↑ Back to contents

Versioning

This project tracks three independent version axes.

Component	Current Version	Scope	Changes When
Schema	0.2.0	Data structure specification	Structure or validation rules change
Dataset	— (not yet released)	Corpus content	Data is added, corrected, or rebuilt
Repository	0.2.0	Documentation and tooling	Project files change

Each dataset declares the schema version it conforms to. The three versions evolve independently and should not be assumed to move in lockstep.
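
As an illustration of how a consumer might act on a declared schema version, the sketch below checks a dataset manifest before loading it. The `schema_version` key, the manifest layout, and the rule that major and minor components must match are assumptions made for this example; they are not taken from the published schema.

```python
# Hypothetical sketch: manifest keys and the compatibility rule are assumptions.
SUPPORTED_SCHEMA = "0.2.0"


def parse_version(text: str) -> tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(declared: str, supported: str = SUPPORTED_SCHEMA) -> bool:
    """Treat versions as compatible when major and minor components match."""
    return parse_version(declared)[:2] == parse_version(supported)[:2]


# The dataset version is tracked separately and may be absent before a release.
manifest = {"schema_version": "0.2.0", "dataset_version": None}
print(is_compatible(manifest["schema_version"]))  # True
```

Keeping the dataset version out of the check reflects the independence of the version axes: a rebuilt corpus can change its own version without affecting schema compatibility.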

⚠️ No official dataset release is currently included in this repository. Validated example outputs are provided in /examples/.

↑ Back to contents

Purpose

This project provides a foundational data infrastructure layer for Biblical computing that supports:

Linguistic analysis
Cross-language alignment
Translation studies
Bible software development
Digital humanities research
AI training datasets

This project is not a translation, commentary, or theological resource. It models textual structure only.

↑ Back to contents

Intended Audience

This project is primarily intended for:

Biblical scholars and textual researchers
Computational linguists
Digital humanities practitioners
Bible software developers
AI/ML researchers working with religious texts

↑ Back to contents

Design Goals

Open and Redistributable - All datasets are legally redistributable (public domain or open academic resources)
Canonical Verse Anchoring - Verse IDs are stable across all languages and editions
Separation of Concerns - Canonical structure separated from edition-specific text segmentation
Deterministic Identifiers - Stable, machine-safe IDs for all entities
Language-Scoped Lemmas - Prevent cross-language collisions
External References as Metadata - Strong’s numbers attached to lemmas, not replacing them
Morpheme-Level Modeling - Supports Hebrew prefixes, suffixes, and Greek compounds
Interpretation-Neutral - No theology, semantics, or doctrinal tagging in the core dataset
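
Several of these goals can be made concrete with a small sketch. The ID formats below (`BOOK.CHAPTER.VERSE` for verses, `language:lemma` for lemmas) and the field names are hypothetical and chosen only for illustration; the actual formats are defined in /spec/schema.md.

```python
# Hypothetical sketch: ID formats and field names are illustrative assumptions.
from dataclasses import dataclass


def verse_id(book: str, chapter: int, verse: int) -> str:
    """Build an edition-independent verse ID from its canonical coordinates."""
    return f"{book.upper()}.{chapter}.{verse}"


def lemma_id(language: str, lemma: str) -> str:
    """Prefix the lemma with its language so identical strings cannot collide."""
    return f"{language}:{lemma}"


@dataclass(frozen=True)
class Lemma:
    id: str
    language: str
    form: str
    strongs: tuple[str, ...] = ()  # external reference kept as metadata, not as the ID


logos = Lemma(lemma_id("grc", "λόγος"), "grc", "λόγος", strongs=("G3056",))
print(verse_id("jhn", 1, 1))  # JHN.1.1
print(logos.id)               # grc:λόγος
```

Because the Strong's number lives in a separate field, a lemma without a Strong's entry still receives a valid identifier.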

↑ Back to contents

Non-Goals

The core dataset does not aim to provide:

Theological interpretation or doctrinal tagging
Commentary or study notes
New translations of Biblical texts
Manuscript transcription or paleographic data (future extension)
End-user Bible software features

These may be built as separate layers on top of the core schema.

↑ Back to contents

Architecture Overview

The data flows through layered entities:

TextSource (edition)
   ↓
Verse (canonical anchor)
   ↓
Word (orthographic unit)
   ↓
Morpheme (linguistic unit)
   ↓
Lemma (dictionary form)
   ↓
Alignment (cross-language mapping)

Each layer represents a progressively more abstract linguistic unit derived from the underlying textual source.

Notes:

Verses exist independently of any specific text edition
Words and morphemes are edition-specific
Alignment operates primarily at the morpheme level (or at the word level where morphemes are not modeled) and always within a single verse
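
The layering described above can be sketched as plain data classes. This is a minimal illustration, not the project's data model: every class field, the `GEN.1.1` verse ID, and the `heb:` lemma prefix are assumptions, and the Hebrew is shown unpointed for readability.

```python
# Hypothetical sketch: field names and ID formats are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSource:
    id: str        # edition identifier, e.g. "WLC"
    language: str


@dataclass(frozen=True)
class Verse:
    id: str        # canonical anchor, independent of any edition


@dataclass(frozen=True)
class Word:
    source_id: str  # words belong to one specific edition
    verse_id: str
    position: int   # 1-based order within the verse
    surface: str


@dataclass(frozen=True)
class Morpheme:
    word: Word
    position: int   # 1-based order within the word
    surface: str
    lemma_id: str


wlc = TextSource("WLC", "heb")
gen_1_1 = Verse("GEN.1.1")
word = Word(wlc.id, gen_1_1.id, 1, "בראשית")
morphemes = [
    Morpheme(word, 1, "ב", "heb:ב"),          # prefixed preposition
    Morpheme(word, 2, "ראשית", "heb:ראשית"),  # noun
]

# The morpheme surfaces concatenate back to the word surface.
print("".join(m.surface for m in morphemes) == word.surface)  # True
```

Note that `Verse` carries no reference to a `TextSource`, while `Word` does; this mirrors the distinction between canonical anchors and edition-specific segmentation.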

↑ Back to contents

Repository Layout

.
├── ATTRIBUTION.md
├── CHANGELOG.md
├── CITATION.cff
├── CONTRIBUTING.md
├── DERIVATION.md
├── LICENSE
├── README.md
├── SUPPORT.md
├── VERSION
│
├── data/
│   ├── core/
│   ├── texts/
│   ├── lemmas/
│   └── alignments/
│
├── examples/
│
├── source_texts/
│
├── spec/
│   └── schema.md
│
├── tools/
│   └── validate.py
│
├── scripts/
└── docs/

The data/ directory defines the canonical dataset structure but does not currently contain a complete corpus.

↑ Back to contents

Project Status

Phase 1 — Core Dataset Architecture (Complete)

The foundational data model and conformance rules are finalized.

Schema v0.2.0 finalized and locked
Canonical reference system defined
Word and morpheme segmentation model defined
Lemma system defined
Alignment specification defined
Conformance rules established

Schema v0.1.0 served as the initial architectural baseline. Schema v0.2.0 introduces clarifications, refinements, and supporting specifications without altering the core design.

Phase 2 — Supporting Infrastructure (Substantially Complete)

Materials required to produce conformant datasets have been developed:

Detailed specification documents (complete)
Reference examples (complete)
Validation rules and tooling (complete)
Repository documentation (complete)

Automated production pipelines are currently under development.

Phase 3 — Dataset Production (Beginning)

Focus is shifting toward generating the first fully conformant datasets:

Data production pipelines (in development)
Reference corpus construction (pending)
Quality assurance workflows (pending)
Reproducible build processes (in development)

No canonical dataset release exists yet. The first conformant reference corpus is planned as the primary deliverable for the next milestone.

Reproducibility

Datasets produced by this project are intended to be reproducible from documented upstream sources using deterministic build pipelines.
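
One way to check that a build is deterministic is to fingerprint its output in a canonical form and compare fingerprints across runs. The sketch below assumes records are JSON-serializable dictionaries with an `id` key; both that record shape and the choice of SHA-256 are assumptions for illustration, not a description of the project's pipeline.

```python
# Hypothetical sketch: the record shape and hashing scheme are assumptions.
import hashlib
import json


def build_fingerprint(records: list[dict]) -> str:
    """Hash records after fixing both record order and key order."""
    ordered = sorted(records, key=lambda record: record["id"])
    payload = json.dumps(
        ordered, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


run_a = [{"id": "GEN.1.1", "text": "a"}, {"id": "GEN.1.2", "text": "b"}]
run_b = [{"text": "b", "id": "GEN.1.2"}, {"text": "a", "id": "GEN.1.1"}]

# Same content in a different order yields the same fingerprint.
print(build_fingerprint(run_a) == build_fingerprint(run_b))  # True
```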

↑ Back to contents

Reference Datasets

This repository includes complete Chapter 1 reference datasets for selected Biblical texts. These are provided as authoritative examples of the schema in practice and as validation targets for tooling.

Included Reference Sets

Genesis Chapter 1 (Hebrew/English)
Hebrew: Westminster Leningrad Codex (WLC)
English: King James Version (1769)
John Chapter 1 (Greek/English)
Greek: SBL Greek New Testament (SBLGNT)
English: King James Version (1769)

Each reference dataset includes:

Structured verse text
Tokenized text with positions
Morphological and lexical annotations
Strong’s identifiers where applicable
Alignment groups between languages
Explicit handling of punctuation and markers

Purpose

These datasets are intended to:

Demonstrate correct schema usage across realistic text spans
Provide reproducible examples for documentation and study
Serve as test fixtures for validators and downstream tools
Illustrate alignment behavior across languages
Enable early experimentation before the full corpus is released

They are not intended to represent the complete dataset or a canonical release.

Limitations

Reference datasets may differ from the eventual full corpus in:

Coverage (limited to Chapter 1 only)
Source normalization details
Annotation completeness
Alignment refinements
Metadata richness

Future releases may update or replace these examples as the dataset production pipeline matures.

Location

Reference datasets are stored within the examples/ directory following the standard repository layout. See the examples/README for details.

↑ Back to contents

Data Sources

The planned initial dataset will incorporate material derived from multiple third-party textual resources, including:

Open Scriptures Hebrew Bible (OSHB)
SBL Greek New Testament (SBLGNT)
King James Version (1769) with Strong’s numbers
Strong’s Hebrew Dictionary (1890)
Strong’s Greek Dictionary (1890)

Several of these resources are distributed via modules from the CrossWire Bible Society (SWORD Project). The SBLGNT source is distributed separately via the Faithlife GitHub repository.

Comprehensive source attribution, provenance, licensing terms, and required acknowledgments are documented in ATTRIBUTION.md.

↑ Back to contents

Data Provenance

Upstream sources are documented in:

docs/SOURCE_INDEX.md
source_texts/*/SOURCE.md

Raw source artifacts are not stored in this repository due to licensing, size, and reproducibility considerations.

↑ Back to contents

Citation

If you use this dataset in research, please cite it using the information provided in CITATION.cff.

If referencing the schema or framework without a dataset release, cite the repository itself.

↑ Back to contents

Frequently Asked Questions

Why verse-level anchoring?

The verse is the smallest textual unit that is:

Stable across manuscript traditions
Widely recognized by humans
Used by nearly all Biblical software
Practical for cross-language alignment

Smaller units (sentences, clauses) vary significantly across editions and languages.

Why morpheme-level modeling instead of word-level only?

The Biblical languages encode much of their meaning in word-internal elements:

Hebrew prefixes and suffixes
Greek inflectional endings
Compound constructions

Morpheme-level modeling allows accurate linguistic analysis and precise cross-language alignment.

Why not include theology, semantics, or interpretation?

The goal is a neutral base layer describing textual facts only.

Interpretive datasets can be built on top without constraining the core schema or introducing bias.

Why are alignments restricted to within a verse?

Verse-level alignment provides a consistent, tractable scope while still capturing the vast majority of translation correspondences.

Cross-verse alignment introduces substantial ambiguity and complexity and can be layered on in future extensions if needed.
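
The within-verse restriction is straightforward to check mechanically. The sketch below assumes each aligned unit is a dictionary carrying a `verse_id` key; that key and the `JHN.1.1` ID format are hypothetical and used only for illustration.

```python
# Hypothetical sketch: the "verse_id" key and ID strings are illustrative assumptions.
def alignment_within_verse(members: list[dict]) -> bool:
    """Return True when every aligned unit belongs to one and the same verse."""
    return len({member["verse_id"] for member in members}) == 1


group = [
    {"unit": "λόγος", "language": "grc", "verse_id": "JHN.1.1"},
    {"unit": "Word", "language": "eng", "verse_id": "JHN.1.1"},
]
print(alignment_within_verse(group))  # True
```

An empty group is rejected as well, since it contains no verse at all.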

Why is there no manuscript layer yet?

Version 0.2.0 focuses on a clean foundation using established scholarly editions.

Manuscript-level modeling is planned as a future extension once the core structure is stable.

Why is the Chapter entity optional?

Chapters are derivable from verse identifiers and are not strictly required for most computational tasks.

Making Chapter optional simplifies lightweight datasets while allowing materialization when useful for navigation or storage.
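
To illustrate why chapters are derivable, the sketch below groups verse IDs by chapter. It assumes a hypothetical `BOOK.CHAPTER.VERSE` ID format; the actual identifier format is defined in /spec/schema.md.

```python
# Hypothetical sketch: assumes verse IDs of the form "BOOK.CHAPTER.VERSE".
from collections import defaultdict


def group_by_chapter(verse_ids: list[str]) -> dict[str, list[str]]:
    """Materialize chapters from verse IDs without a stored Chapter entity."""
    chapters: dict[str, list[str]] = defaultdict(list)
    for vid in verse_ids:
        book, chapter, _verse = vid.split(".")
        chapters[f"{book}.{chapter}"].append(vid)
    return dict(chapters)


print(group_by_chapter(["GEN.1.1", "GEN.1.2", "GEN.2.1"]))
# {'GEN.1': ['GEN.1.1', 'GEN.1.2'], 'GEN.2': ['GEN.2.1']}
```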

Why attach Strong’s numbers to lemmas instead of using them as IDs?

Strong’s numbers are a valuable reference system, but they are:

Not language-neutral
Not comprehensive
Not stable across scholarly traditions

Treating them as metadata preserves usefulness without constraining the dataset design.

How are punctuation and non-word tokens handled?

Datasets may include an optional Token layer (separate from Word/Morpheme entities) for punctuation and legacy compatibility.

Word-Morpheme datasets may omit punctuation entirely.

Is this intended to replace existing Biblical texts or editions?

No.

This project provides a structured representation of texts, not new textual content or translations.

↑ Back to contents

Contributing

External code and data contributions are not currently being accepted while the project architecture stabilizes.

The core schema (v0.2.0) is locked. Any future changes will follow a formal versioning and review process.

Feedback is welcome and encouraged. You can contribute by:

Reporting bugs or errors
Requesting clarifications
Suggesting improvements
Discussing design decisions

Please use:

GitHub Issues for specific problems
GitHub Discussions for questions or broader ideas

External contributions may be opened once the project reaches a stable architecture and governance model.

For full details, see CONTRIBUTING.md.

↑ Back to contents

Project Sustainability

OpenBiblicalDataset is maintained as an independent open data infrastructure project.

Organizations or individuals interested in supporting its development, maintenance, or dataset production efforts are invited to review SUPPORT.md.

↑ Back to contents

License

See LICENSE file for details.

↑ Back to contents

Source Data Licensing

OpenBiblicalDataset redistributes texts and reference data derived from sources believed to be in the public domain or otherwise legally redistributable.

Not all external source texts are licensed under MIT.

Licensing of redistributed source texts is independent of the license applied to the structural dataset produced by this project.

Where applicable:

Original source texts may have their own public-domain status or licensing terms
This project does not claim copyright over the underlying ancient texts
Newly created structural data (identifiers, segmentation, annotations, alignments, etc.) is released under the MIT License

Users are responsible for ensuring compliance with any applicable terms when combining this dataset with other resources.

Specific source attributions and provenance information will be documented alongside each dataset release.

Redistribution rights vary by source and jurisdiction.

↑ Back to contents

Planned Initial Data Stack

The initial dataset release is intended to be constructed from the following textual sources and reference works:

Hebrew Bible — Open Scriptures Hebrew Bible (WLC-based)
Greek New Testament — SBL Greek New Testament (Faithlife GitHub distribution, CC BY 4.0)
English — King James Version (1769, CrossWire module)
Lexical Reference — Strong’s Hebrew Dictionary and Strong’s Greek Dictionary (Strong’s Exhaustive Concordance, 1890, CrossWire modules)

This stack represents a foundational multilingual corpus suitable for alignment, linguistic analysis, and cross-language study.

Full provenance, licensing details, and required source acknowledgments are provided in ATTRIBUTION.md.

Users of redistributed datasets should review that document to ensure compliance with applicable terms.

This stack may evolve as production pipelines mature.

↑ Back to contents

Contents

Getting Started
Documentation
Versioning
Purpose
Intended Audience
Design Goals
Non-Goals
Architecture Overview
Repository Layout
Project Status
Reproducibility
Reference Datasets
  Included Reference Sets
  Purpose
  Limitations
  Location
Data Sources
Data Provenance
Citation
Frequently Asked Questions
  Why verse-level anchoring?
  Why morpheme-level modeling instead of word-level only?
  Why not include theology, semantics, or interpretation?
  Why are alignments restricted to within a verse?
  Why is there no manuscript layer yet?
  Why is the Chapter entity optional?
  Why attach Strong’s numbers to lemmas instead of using them as IDs?
  How are punctuation and non-word tokens handled?
  Is this intended to replace existing Biblical texts or editions?
Contributing
Project Sustainability
License
Source Data Licensing
Planned Initial Data Stack