Complete API documentation for docschema.
- SchemaBuilder
- SchemaExtractor
- DocumentRegister
- DocumentComparator
- ExtractionPipeline
- ValidationEngine
- CitationTracker
- Storage Adapters
- Built-in Schemas
- Parsers
Fluent builder for defining extraction schemas.

const schema = new SchemaBuilder(name);

| Parameter | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique identifier for the schema |

Add a human-readable description to the schema.

schema.describe('Schema for extracting pension scheme rules');

Set the schema version (semver format recommended).

schema.version('1.0.0');

Add a string field.

schema.string('clauseTitle');

Returns a FieldBuilder for chaining field options.

Add a numeric field.

schema.number('benefitAmount');

Add a boolean field.

schema.boolean('isActive');

Add a date field. Accepts ISO 8601 strings or Date objects.

schema.date('effectiveFrom');

Add a currency field. Extracts numeric value and currency code.

schema.currency('liabilityLimit');
// Extracts: { value: 1000000, currency: 'GBP' }

Add a percentage field (0-100).

schema.percentage('interestRate');

Add an enumerated field with allowed values.

schema.enum('status', ['active', 'pending', 'closed']);

Add an array field.

schema.array('definitions');

Add a nested object field.

schema.object('contactDetails');

Add cross-field validation.

schema.compare('startDate', 'endDate',
  (start, end) => new Date(start) < new Date(end),
  'Start date must be before end date'
);

Compile and return the schema object.

const compiledSchema = schema.build();

These methods are available after adding a field:
Mark field as required.

schema.string('title').required();

Mark field as optional (default).

schema.string('subtitle').optional();

Add extraction pattern (regular expression).

schema.string('ruleNumber')
  .pattern(/Rule\s+(\d+(?:\.\d+)*)/i);

Add extraction hints for LLM-based extraction.

schema.string('partyName')
  .hints([
    'Look for company names after "between"',
    'Usually formatted as "Company Name Ltd"'
  ]);

Apply a named validator.

schema.string('postcode').validate('ukPostcode');

Set numeric bounds.

schema.number('age').min(18).max(120);
schema.percentage('rate').min(0).max(100);

Set default value if not extracted.

schema.string('status').default('pending');

Extract structured data from documents using schemas.
const extractor = new SchemaExtractor(options);

| Option | Type | Default | Description |
|---|---|---|---|
| schema | object | Required | Compiled schema from SchemaBuilder |
| preserveSourceLocation | boolean | true | Track source locations for citations |
| confidenceThreshold | number | 0.7 | Minimum confidence for LLM extractions |
| enableCitations | boolean | true | Generate citation objects |

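To make the source-location options concrete, here is a minimal sketch of how a pattern match can be turned into a citation-shaped object with character offsets. `locateMatch` is a hypothetical helper written for this illustration and is not part of docschema. It assumes `endOffset` is exclusive and that a pattern match carries a confidence of 1.0.

```javascript
// Hypothetical helper, not part of docschema: finds the first match of a
// pattern and reports where it sits in the source text.
function locateMatch(text, fieldName, pattern) {
  const match = pattern.exec(text);
  if (match === null) return null;
  return {
    fieldName,
    value: match[1] ?? match[0],               // first capture group if there is one
    text: match[0],                            // full matched span
    startOffset: match.index,                  // index of the first matched character
    endOffset: match.index + match[0].length,  // assumed exclusive end
    confidence: 1.0                            // assumption: pattern matches are treated as certain
  };
}

const sample = 'Subject to Rule 12.3, benefits are payable monthly.';
const citation = locateMatch(sample, 'ruleNumber', /Rule\s+(\d+(?:\.\d+)*)/i);

console.log(citation.value);                                          // 12.3
console.log(sample.slice(citation.startOffset, citation.endOffset));  // Rule 12.3
```

Slicing the original text with the two offsets returns the cited span, which is what lets a reviewer jump from an extracted value back to its source.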
Extract data from text using the schema.

const result = await extractor.extract(documentText, {
  mode: 'pattern', // 'pattern' | 'llm' | 'hybrid'
  context: { documentType: 'contract' }
});

Returns:

{
  data: {
    // Extracted fields
    clauseNumber: '12.3',
    clauseTitle: 'Liability',
    // ...
  },
  citations: [
    {
      fieldName: 'clauseNumber',
      text: 'Clause 12.3',
      startOffset: 1420,
      endOffset: 1431,
      confidence: 1.0
    }
  ],
  metadata: {
    extractedAt: '2024-01-15T10:30:00Z',
    mode: 'pattern',
    schemaVersion: '1.0.0'
  },
  validation: {
    valid: true,
    errors: [],
    warnings: []
  }
}

Set the LLM provider for AI-based extraction.
// `openai` is assumed to be an already-configured OpenAI SDK client
extractor.setLLMProvider(async (prompt) => {
  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      { role: 'system', content: prompt.system },
      { role: 'user', content: prompt.user }
    ]
  });
  return response.choices[0].message.content;
});

Extract a single field.

const value = await extractor.extractField(text, 'clauseNumber');

Validate extracted data against schema.

const validation = extractor.validate(data);
// { valid: boolean, errors: [], warnings: [] }

Version-controlled storage for extracted documents.
const register = new DocumentRegister(options);

| Option | Type | Default | Description |
|---|---|---|---|
| name | string | Required | Register name |
| storage | Storage | MemoryStorage | Storage adapter |
| enableVersioning | boolean | true | Track version history |
| enableAuditLog | boolean | true | Log all operations |

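Version tracking raises the question of which version applies on a given date. The sketch below shows one plausible rule: pick the version with the latest `effectiveFrom` that is on or before the requested date. `resolveAsOf` is a hypothetical function for illustration only. How docschema itself resolves a date to a version is not specified in this reference.

```javascript
// Hypothetical helper, not part of docschema: returns the version whose
// effectiveFrom is the latest one on or before the requested date.
function resolveAsOf(history, asOf) {
  const cutoff = new Date(asOf).getTime();
  let chosen = null;
  for (const entry of history) {
    const from = new Date(entry.effectiveFrom).getTime();
    const chosenFrom = chosen === null ? -Infinity : new Date(chosen.effectiveFrom).getTime();
    if (from <= cutoff && from >= chosenFrom) {
      chosen = entry;
    }
  }
  return chosen; // null when no version was in effect yet
}

// Illustrative history; the dates are made up for this example.
const history = [
  { version: 1, effectiveFrom: '2023-01-01' },
  { version: 2, effectiveFrom: '2024-01-01' },
  { version: 3, effectiveFrom: '2024-06-01' }
];

console.log(resolveAsOf(history, '2023-06-15').version); // 1
console.log(resolveAsOf(history, '2024-03-01').version); // 2
console.log(resolveAsOf(history, '2022-12-31'));         // null
```

Returning `null` for a date that precedes every version keeps "no version in effect" distinct from "version 1".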
Add a new document to the register.

const entry = await register.add(extractedData, {
  effectiveFrom: '2024-01-01',
  category: 'benefits',
  tags: ['retirement'],
  source: 'trust-deed.pdf'
});

Returns:

{
  id: 'doc_abc123',
  version: 1,
  data: { /* extracted data */ },
  metadata: { /* provided metadata */ },
  createdAt: '2024-01-15T10:30:00Z'
}

Retrieve a document.
// Get current version
const doc = await register.get('doc_abc123');

// Get version effective at a specific date
const historicDoc = await register.get('doc_abc123', {
  asOf: '2023-06-15'
});

// Get specific version
const v2 = await register.get('doc_abc123', { version: 2 });

Update a document (creates new version).

await register.update('doc_abc123', newData, {
  effectiveFrom: '2024-06-01',
  changeReason: 'Regulatory update',
  changedBy: 'legal@example.com'
});

Delete a document.

await register.delete('doc_abc123', {
  reason: 'Superseded',
  deletedBy: 'admin@example.com'
});

Get full version history.

const history = await register.getHistory('doc_abc123');
// [{ version: 1, ... }, { version: 2, ... }]

Search documents.
const results = await register.search({
  category: 'benefits',
  'metadata.effectiveFrom': { $gte: '2024-01-01' }
}, {
  limit: 10,
  sort: { createdAt: -1 }
});

List all documents.

const docs = await register.list({
  category: 'benefits',
  limit: 50
});

Retrieve audit log entries.

const log = register.getAuditLog({
  documentId: 'doc_abc123',
  action: 'update',
  from: '2024-01-01',
  to: '2024-12-31'
});

Compare documents and find differences.
const comparator = new DocumentComparator(options);

| Option | Type | Default | Description |
|---|---|---|---|
| ignoreFields | string[] | [] | Fields to exclude from comparison |
| numericTolerance | number | 0 | Tolerance for numeric comparisons |

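The effect of the two options can be illustrated with a small field-level diff. The sketch below is a stand-in written for this page, not docschema's implementation: `diffFields` is a hypothetical function that only compares top-level fields, treats `numericTolerance` as an absolute difference, and compares non-numeric values by their JSON serialization.

```javascript
// Hypothetical helper, not docschema's implementation: a shallow field diff
// that honours an ignore list and an absolute numeric tolerance.
function diffFields(docA, docB, { ignoreFields = [], numericTolerance = 0 } = {}) {
  const ignored = new Set(ignoreFields);
  const fields = new Set([...Object.keys(docA), ...Object.keys(docB)]);
  const differences = [];

  for (const field of fields) {
    if (ignored.has(field)) continue;
    const valueA = docA[field];
    const valueB = docB[field];

    if (!(field in docA)) {
      differences.push({ field, type: 'added', valueB });
    } else if (!(field in docB)) {
      differences.push({ field, type: 'removed', valueA });
    } else if (typeof valueA === 'number' && typeof valueB === 'number') {
      const delta = valueB - valueA;
      // Differences within the tolerance are treated as equal.
      if (Math.abs(delta) > numericTolerance) {
        differences.push({ field, type: 'modified', valueA, valueB, delta });
      }
    } else if (JSON.stringify(valueA) !== JSON.stringify(valueB)) {
      // Serialization compare is key-order sensitive; enough for a sketch.
      differences.push({ field, type: 'modified', valueA, valueB });
    }
  }
  return { identical: differences.length === 0, differences };
}

const a = { liabilityLimit: 1000000, rate: 4.999, updatedAt: '2024-01-01', oldClause: 'x' };
const b = { liabilityLimit: 2000000, rate: 5.0, updatedAt: '2024-02-01', newClause: 'y' };
const diff = diffFields(a, b, { ignoreFields: ['updatedAt'], numericTolerance: 0.01 });

console.log(diff.differences.map((d) => `${d.field}:${d.type}`));
```

Here `updatedAt` is skipped because it is ignored, and `rate` is skipped because the change of about 0.001 falls inside the 0.01 tolerance.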
Compare two documents.

const diff = comparator.compare(docA, docB);

Returns:

{
  identical: false,
  differences: [
    {
      field: 'liabilityLimit',
      type: 'modified',
      valueA: 1000000,
      valueB: 2000000,
      delta: 1000000
    },
    {
      field: 'newClause',
      type: 'added',
      valueB: '...'
    },
    {
      field: 'oldClause',
      type: 'removed',
      valueA: '...'
    }
  ],
  summary: {
    added: 1,
    removed: 1,
    modified: 1
  }
}

Compare across version history.
const timeline = comparator.compareVersions(doc);
// Shows changes between each version

Find conflicts across multiple documents.

const conflicts = comparator.findConflicts(docs, {
  conflictFields: ['benefitRate', 'eligibilityAge']
});

Generate human-readable diff report.

const report = comparator.generateDiffReport(diff, {
  format: 'text' // 'text' | 'html' | 'json'
});

Orchestrate the full extraction workflow.
const pipeline = new ExtractionPipeline(options);

| Option | Type | Default | Description |
|---|---|---|---|
| name | string | Required | Pipeline name |
| schema | object | Required | Extraction schema |
| autoValidate | boolean | true | Validate after extraction |
| autoRegister | boolean | false | Auto-register extracted docs |
| requireHumanApproval | boolean | false | Require approval for registration |
| confidenceThreshold | number | 0.85 | Min confidence for auto-approval |
| onApprovalRequired | function | - | Callback when approval needed |
| onError | function | - | Error callback |

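One way to read the interaction between `requireHumanApproval` and `confidenceThreshold` is as a small decision rule. The sketch below is an assumption about that rule, not the pipeline's actual logic: `nextStatus` is a hypothetical function in which failed validation ends the run, and a run only waits for a reviewer when approval is required and confidence is below the threshold.

```javascript
// Hypothetical decision rule, not docschema's implementation.
function nextStatus(
  { valid, confidence },
  { requireHumanApproval = false, confidenceThreshold = 0.85 } = {}
) {
  if (!valid) {
    return { status: 'failed', reason: 'Validation failed' };
  }
  // Assumption: runs at or above the threshold are approved automatically.
  if (requireHumanApproval && confidence < confidenceThreshold) {
    return { status: 'awaiting_approval', reason: `Low confidence: ${confidence}` };
  }
  return { status: 'completed' };
}

const lowConfidence = nextStatus(
  { valid: true, confidence: 0.72 },
  { requireHumanApproval: true }
);
console.log(lowConfidence);
// { status: 'awaiting_approval', reason: 'Low confidence: 0.72' }

console.log(nextStatus({ valid: true, confidence: 0.9 }, { requireHumanApproval: true }).status);
// completed
```

Keeping the rule in one pure function makes the approval policy easy to unit test separately from extraction and storage.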
Process a document through the pipeline.

const result = await pipeline.process(documentText, {
  source: 'document.pdf',
  category: 'contracts'
});

Returns:

{
  pipelineRunId: 'run_xyz789',
  status: 'awaiting_approval', // 'completed' | 'failed' | 'awaiting_approval'
  stages: {
    extraction: { status: 'completed', data: { /* extracted data */ }, duration: 1234 },
    validation: { status: 'completed', valid: true },
    approval: { status: 'pending', reason: 'Low confidence: 0.72' },
    registration: { status: 'pending' }
  },
  result: { /* final data if completed */ }
}

Approve a pending extraction.
await pipeline.approve('run_xyz789', {
  approvedBy: 'reviewer@example.com',
  comments: 'Verified against source'
});

Reject a pending extraction.

await pipeline.reject('run_xyz789', {
  rejectedBy: 'reviewer@example.com',
  reason: 'Data does not match source document'
});

Get pipeline run status.

const status = pipeline.getStatus('run_xyz789');

Get pipeline statistics.

const stats = pipeline.getStats();
// { total: 100, completed: 95, failed: 3, pending: 2, successRate: '95%' }

Validate extracted data.

const validator = new ValidationEngine(options);

Validate data against a schema.

const result = validator.validate(data, schema);
// { valid: boolean, errors: [], warnings: [] }

Register a custom validator.
validator.registerValidator('ukNino', (value) => {
  const pattern = /^[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]$/i;
  return {
    valid: pattern.test(value),
    message: 'Invalid UK National Insurance number'
  };
});

| Name | Description |
|---|---|
| ukNino | UK National Insurance number |
| ukPostcode | UK postcode |
| sortCode | UK bank sort code |
| companyNumber | UK company registration number |
| email | Email address |
| url | Valid URL |
| phone | Phone number |
| dateRange | Valid date range |

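As a further illustration of the validator shape, here is a sketch of a function that returns the same `{ valid, message }` object as the `ukNino` example. The rules applied by the built-in `sortCode` validator are not documented here, so `sortCodeValidator` is a hypothetical stand-in that only checks the common six-digit layout, with or without hyphens.

```javascript
// Hypothetical validator, not the built-in one: accepts six digits,
// optionally grouped in pairs with hyphens.
function sortCodeValidator(value) {
  const pattern = /^\d{2}-?\d{2}-?\d{2}$/;
  return {
    valid: typeof value === 'string' && pattern.test(value),
    message: 'Invalid UK bank sort code'
  };
}

console.log(sortCodeValidator('12-34-56').valid); // true
console.log(sortCodeValidator('123456').valid);   // true
console.log(sortCodeValidator('1234').valid);     // false
console.log(sortCodeValidator(123456).valid);     // false (not a string)
```

A function of this shape could presumably be passed to `validator.registerValidator` under a new name, in the same way as the `ukNino` example.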
Track and manage source citations.

const tracker = new CitationTracker();

Add a citation.

tracker.addCitation({
  fieldName: 'clauseTitle',
  text: 'Liability Limitation',
  startOffset: 1420,
  endOffset: 1440,
  confidence: 0.95,
  source: 'contract.pdf'
});

Get citations for a field.

const citations = tracker.getCitations('clauseTitle');

Get all citations.

const all = tracker.getAllCitations();

Export citations as JSON.

const json = tracker.toJSON();

In-memory storage (default, for testing).
const { MemoryStorage } = require('docschema');
const storage = new MemoryStorage();

File-based storage.

const { FileStorage } = require('docschema');
const storage = new FileStorage({
  path: './data/documents',
  format: 'json' // 'json' | 'yaml'
});

Implement the Storage interface:

class DatabaseStorage {
  async write(key, value) { /* ... */ }
  async read(key) { /* ... */ }
  async list(options) { /* ... */ }
  async delete(key) { /* ... */ }
  async exists(key) { /* ... */ }
}

Schema for pension scheme rules.
const { builtInSchemas } = require('docschema');
const schema = builtInSchemas.pensionSchemeRule;

Fields: ruleNumber, ruleTitle, ruleText, effectiveFrom, effectiveTo, definitions, crossReferences

Schema for investment due diligence.

const schema = builtInSchemas.dueDiligenceDocument;

Fields: companyName, industry, revenue, ebitda, employees, riskFactors, keyFindings

Schema for ESG/sustainability reports.

const schema = builtInSchemas.esgReport;

Fields: reportingPeriod, scope1Emissions, scope2Emissions, scope3Emissions, energyConsumption, waterUsage, wasteGenerated, employeeDiversity

Schema for contract clauses.

const schema = builtInSchemas.contractClause;

Fields: clauseNumber, clauseTitle, clauseText, clauseType, parties, effectiveDate, terminationConditions

Schema for regulatory requirements.

const schema = builtInSchemas.regulatoryCompliance;

Fields: regulationId, requirement, applicability, deadline, complianceStatus, evidence

Parse plain text documents.

const { TextParser } = require('docschema');
const parser = new TextParser();
const sections = parser.parseSections(text, {
  sectionPattern: /^(?:Section|Article)\s+\d+/m
});

Parse structured documents (JSON, XML).

const { StructuredParser } = require('docschema');
const parser = new StructuredParser();
const data = parser.parse(jsonText, { format: 'json' });

All async methods may throw these error types:
const {
  SchemaError,
  ExtractionError,
  ValidationError,
  StorageError
} = require('docschema');

try {
  await extractor.extract(text);
} catch (error) {
  if (error instanceof ValidationError) {
    console.log('Validation failed:', error.errors);
  } else if (error instanceof ExtractionError) {
    console.log('Extraction failed:', error.message);
  }
}

Full TypeScript definitions are included:
import {
  SchemaBuilder,
  SchemaExtractor,
  ExtractionResult,
  Citation
} from 'docschema';

const result: ExtractionResult = await extractor.extract(text);
const citations: Citation[] = result.citations;