docschema API Reference

Complete API documentation for docschema.

SchemaBuilder

Fluent builder for defining extraction schemas.

Constructor

const schema = new SchemaBuilder(name);

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| name | string | Yes | Unique identifier for the schema |

Methods

.describe(description)

Add a human-readable description to the schema.

schema.describe('Schema for extracting pension scheme rules');

.version(version)

Set the schema version (semver format recommended).

schema.version('1.0.0');

.string(fieldName)

Add a string field.

schema.string('clauseTitle');

Returns a FieldBuilder for chaining field options.

.number(fieldName)

Add a numeric field.

schema.number('benefitAmount');

.boolean(fieldName)

Add a boolean field.

schema.boolean('isActive');

.date(fieldName)

Add a date field. Accepts ISO 8601 strings or Date objects.

schema.date('effectiveFrom');

.currency(fieldName)

Add a currency field. Extracts numeric value and currency code.

schema.currency('liabilityLimit');
// Extracts: { value: 1000000, currency: 'GBP' }

.percentage(fieldName)

Add a percentage field (0-100).

schema.percentage('interestRate');

.enum(fieldName, values)

Add an enumerated field with allowed values.

schema.enum('status', ['active', 'pending', 'closed']);

.array(fieldName)

Add an array field.

schema.array('definitions');

.object(fieldName)

Add a nested object field.

schema.object('contactDetails');

.compare(fieldA, fieldB, compareFn, message)

Add cross-field validation.

schema.compare('startDate', 'endDate',
  (start, end) => new Date(start) < new Date(end),
  'Start date must be before end date'
);
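To illustrate what the comparison callback receives, here is a minimal standalone sketch. `checkCrossField` and the rule object are hypothetical names made up for this example and are not part of docschema; the sketch only assumes that the callback is called with the two field values and returns a boolean.

```javascript
// Hypothetical helper, not part of docschema: applies one cross-field rule
// to a plain data object and reports the outcome.
function checkCrossField(data, rule) {
  const { fieldA, fieldB, compareFn, message } = rule;
  const ok = compareFn(data[fieldA], data[fieldB]);
  return ok ? { valid: true, errors: [] } : { valid: false, errors: [message] };
}

// Same predicate and message as the .compare() example above.
const dateOrderRule = {
  fieldA: 'startDate',
  fieldB: 'endDate',
  compareFn: (start, end) => new Date(start) < new Date(end),
  message: 'Start date must be before end date'
};

const good = checkCrossField(
  { startDate: '2024-01-01', endDate: '2024-12-31' },
  dateOrderRule
);
const bad = checkCrossField(
  { startDate: '2025-01-01', endDate: '2024-12-31' },
  dateOrderRule
);

console.log(good.valid); // true
console.log(bad.errors); // [ 'Start date must be before end date' ]
```

With this predicate a missing date also fails the check, because `new Date(undefined)` is an invalid date and any comparison against it is false.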

.build()

Compile and return the schema object.

const compiledSchema = schema.build();

FieldBuilder Methods

These methods are available after adding a field:

.required()

Mark field as required.

schema.string('title').required();

.optional()

Mark field as optional (default).

schema.string('subtitle').optional();

.pattern(regex)

Add extraction pattern (regular expression).

schema.string('ruleNumber')
  .pattern(/Rule\s+(\d+(?:\.\d+)*)/i);
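The pattern above can be tried in plain JavaScript before attaching it to a field. This sketch uses only standard `String.prototype.match`; whether docschema stores the full match or the first capture group is not stated in this reference, so both are shown.

```javascript
const rulePattern = /Rule\s+(\d+(?:\.\d+)*)/i;

// Sample sentence invented for this example.
const sampleText = 'Benefits are payable as set out in rule 12.3.1 of the Scheme.';
const match = sampleText.match(rulePattern);

const fullMatch = match ? match[0] : null;  // the whole matched span
const ruleNumber = match ? match[1] : null; // the first capture group only

console.log(fullMatch);  // 'rule 12.3.1'
console.log(ruleNumber); // '12.3.1'
```

The `i` flag is what lets the lowercase "rule" match here.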

.hints(hintArray)

Add extraction hints for LLM-based extraction.

schema.string('partyName')
  .hints([
    'Look for company names after "between"',
    'Usually formatted as "Company Name Ltd"'
  ]);

.validate(validatorName)

Apply a named validator.

schema.string('postcode').validate('ukPostcode');

.min(value) / .max(value)

Set numeric bounds.

schema.number('age').min(18).max(120);
schema.percentage('rate').min(0).max(100);

.default(value)

Set default value if not extracted.

schema.string('status').default('pending');

SchemaExtractor

Extract structured data from documents using schemas.

Constructor

const extractor = new SchemaExtractor(options);

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| schema | object | Required | Compiled schema from SchemaBuilder |
| preserveSourceLocation | boolean | true | Track source locations for citations |
| confidenceThreshold | number | 0.7 | Minimum confidence for LLM extractions |
| enableCitations | boolean | true | Generate citation objects |

Methods

.extract(text, options)

Extract data from text using the schema.

const result = await extractor.extract(documentText, {
  mode: 'pattern',  // 'pattern' | 'llm' | 'hybrid'
  context: { documentType: 'contract' }
});

Returns:

{
  data: {
    // Extracted fields
    clauseNumber: '12.3',
    clauseTitle: 'Liability',
    // ...
  },
  citations: [
    {
      fieldName: 'clauseNumber',
      text: 'Clause 12.3',
      startOffset: 1420,
      endOffset: 1431,
      confidence: 1.0
    }
  ],
  metadata: {
    extractedAt: '2024-01-15T10:30:00Z',
    mode: 'pattern',
    schemaVersion: '1.0.0'
  },
  validation: {
    valid: true,
    errors: [],
    warnings: []
  }
}
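Citation offsets can be checked against the source text. The sketch below is a hypothetical helper, not a docschema function. It assumes `startOffset` is inclusive and `endOffset` is exclusive, which matches `String.prototype.slice` but is not confirmed by this reference.

```javascript
// Hypothetical helper: checks that a citation's offsets point at its text.
// Assumes startOffset is inclusive and endOffset is exclusive.
function citationMatchesSource(source, citation) {
  return source.slice(citation.startOffset, citation.endOffset) === citation.text;
}

// Source text invented for this example.
const source = 'Preamble. Clause 12.3 Liability. The Supplier shall...';
const start = source.indexOf('Clause 12.3');

const citation = {
  fieldName: 'clauseNumber',
  text: 'Clause 12.3',
  startOffset: start,
  endOffset: start + 'Clause 12.3'.length,
  confidence: 1.0
};

console.log(citationMatchesSource(source, citation)); // true
```

A check like this is a cheap way to catch offsets that drifted because the text was normalised after extraction.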

.setLLMProvider(providerFn)

Set the LLM provider for AI-based extraction.

// Assumes `openai` is an already initialised OpenAI client.
extractor.setLLMProvider(async (prompt) => {
  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      { role: 'system', content: prompt.system },
      { role: 'user', content: prompt.user }
    ]
  });
  return response.choices[0].message.content;
});
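For tests it can help to replace the network call with a stub. The sketch below follows the `{ system, user }` prompt shape from the example above. `createStubProvider` is a hypothetical name, and the JSON reply format is an assumption, since this reference does not say what the provider must return beyond a string.

```javascript
// Stub provider for tests: records each prompt and returns a canned reply.
// The JSON reply format is an assumption made for this sketch.
function createStubProvider(cannedReply) {
  const calls = [];
  const provider = async (prompt) => {
    calls.push({ system: prompt.system, user: prompt.user });
    return JSON.stringify(cannedReply);
  };
  return { provider, calls };
}

const stub = createStubProvider({ clauseTitle: 'Liability' });

// The stub has the same signature as the OpenAI-backed provider above,
// so it could be passed to extractor.setLLMProvider(stub.provider).
stub
  .provider({ system: 'Extract fields.', user: 'Clause 12.3 Liability' })
  .then((reply) => console.log(reply, stub.calls.length));
```

Recording the prompts in `calls` makes it possible to assert on what was sent without contacting a model.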

.extractField(text, fieldName, options)

Extract a single field.

const value = await extractor.extractField(text, 'clauseNumber');

.validate(data)

Validate extracted data against schema.

const validation = extractor.validate(data);
// { valid: boolean, errors: [], warnings: [] }

DocumentRegister

Version-controlled storage for extracted documents.

Constructor

const register = new DocumentRegister(options);

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| name | string | Required | Register name |
| storage | Storage | MemoryStorage | Storage adapter |
| enableVersioning | boolean | true | Track version history |
| enableAuditLog | boolean | true | Log all operations |

Methods

.add(data, metadata)

Add a new document to the register.

const entry = await register.add(extractedData, {
  effectiveFrom: '2024-01-01',
  category: 'benefits',
  tags: ['retirement'],
  source: 'trust-deed.pdf'
});

Returns:

{
  id: 'doc_abc123',
  version: 1,
  data: { /* extracted data */ },
  metadata: { /* provided metadata */ },
  createdAt: '2024-01-15T10:30:00Z'
}

.get(id, options)

Retrieve a document.

// Get current version
const doc = await register.get('doc_abc123');

// Get version effective at a specific date
const historicDoc = await register.get('doc_abc123', {
  asOf: '2023-06-15'
});

// Get specific version
const v2 = await register.get('doc_abc123', { version: 2 });
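One way to think about an `asOf` lookup is "the latest version whose effective date is on or before the requested date". The sketch below illustrates that rule over a plain array. It is not docschema's implementation, `resolveAsOf` is a hypothetical name, and it assumes each version carries `metadata.effectiveFrom` as an ISO date string.

```javascript
// Hypothetical sketch of "as of" resolution over a version history.
// Assumes each version has metadata.effectiveFrom as an ISO date string.
function resolveAsOf(history, asOf) {
  const cutoff = new Date(asOf).getTime();
  const effective = (v) => new Date(v.metadata.effectiveFrom).getTime();

  const candidates = history.filter((v) => effective(v) <= cutoff);
  if (candidates.length === 0) return null; // nothing was in force yet

  // Latest effective date wins; ties go to the higher version number.
  return candidates.reduce((best, v) => {
    const newer = effective(v) > effective(best);
    const tie = effective(v) === effective(best) && v.version > best.version;
    return newer || tie ? v : best;
  });
}

const history = [
  { version: 1, metadata: { effectiveFrom: '2023-01-01' } },
  { version: 2, metadata: { effectiveFrom: '2024-06-01' } }
];

console.log(resolveAsOf(history, '2023-06-15').version); // 1
console.log(resolveAsOf(history, '2022-01-01'));         // null
```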

.update(id, data, metadata)

Update a document (creates new version).

await register.update('doc_abc123', newData, {
  effectiveFrom: '2024-06-01',
  changeReason: 'Regulatory update',
  changedBy: 'legal@example.com'
});

.delete(id, options)

Delete a document.

await register.delete('doc_abc123', {
  reason: 'Superseded',
  deletedBy: 'admin@example.com'
});

.getHistory(id)

Get full version history.

const history = await register.getHistory('doc_abc123');
// [{ version: 1, ... }, { version: 2, ... }]

.search(query, options)

Search documents.

const results = await register.search({
  category: 'benefits',
  'metadata.effectiveFrom': { $gte: '2024-01-01' }
}, {
  limit: 10,
  sort: { createdAt: -1 }
});

.list(options)

List all documents.

const docs = await register.list({
  category: 'benefits',
  limit: 50
});

.getAuditLog(options)

Retrieve audit log entries.

const log = register.getAuditLog({
  documentId: 'doc_abc123',
  action: 'update',
  from: '2024-01-01',
  to: '2024-12-31'
});

DocumentComparator

Compare documents and find differences.

Constructor

const comparator = new DocumentComparator(options);

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| ignoreFields | string[] | [] | Fields to exclude from comparison |
| numericTolerance | number | 0 | Tolerance for numeric comparisons |

Methods

.compare(docA, docB, options)

Compare two documents.

const diff = comparator.compare(docA, docB);

Returns:

{
  identical: false,
  differences: [
    {
      field: 'liabilityLimit',
      type: 'modified',
      valueA: 1000000,
      valueB: 2000000,
      delta: 1000000
    },
    {
      field: 'newClause',
      type: 'added',
      valueB: '...'
    },
    {
      field: 'oldClause',
      type: 'removed',
      valueA: '...'
    }
  ],
  summary: {
    added: 1,
    removed: 1,
    modified: 1
  }
}
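The result shape above can be reproduced with a small shallow diff, which also shows how `ignoreFields` and `numericTolerance` could plausibly behave. This is a hypothetical sketch, not docschema's algorithm: `shallowDiff` is an invented name, nested objects are compared by reference only, and treating the tolerance as an absolute difference is an assumption.

```javascript
// Hypothetical sketch, not docschema's implementation: a shallow field diff
// that produces the result shape shown above.
function shallowDiff(a, b, { ignoreFields = [], numericTolerance = 0 } = {}) {
  const differences = [];
  const fields = new Set([...Object.keys(a), ...Object.keys(b)]);

  for (const field of fields) {
    if (ignoreFields.includes(field)) continue;
    if (!(field in a)) {
      differences.push({ field, type: 'added', valueB: b[field] });
    } else if (!(field in b)) {
      differences.push({ field, type: 'removed', valueA: a[field] });
    } else if (typeof a[field] === 'number' && typeof b[field] === 'number') {
      // Assumption: tolerance is an absolute difference.
      const delta = b[field] - a[field];
      if (Math.abs(delta) > numericTolerance) {
        differences.push({ field, type: 'modified', valueA: a[field], valueB: b[field], delta });
      }
    } else if (a[field] !== b[field]) {
      differences.push({ field, type: 'modified', valueA: a[field], valueB: b[field] });
    }
  }

  const count = (type) => differences.filter((d) => d.type === type).length;
  return {
    identical: differences.length === 0,
    differences,
    summary: { added: count('added'), removed: count('removed'), modified: count('modified') }
  };
}

const diff = shallowDiff(
  { liabilityLimit: 1000000, oldClause: 'x' },
  { liabilityLimit: 2000000, newClause: 'y' }
);
console.log(diff.summary); // { added: 1, removed: 1, modified: 1 }
```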

.compareVersions(documentWithHistory)

Compare across version history.

const timeline = comparator.compareVersions(doc);
// Shows changes between each version

.findConflicts(documents, options)

Find conflicts across multiple documents.

const conflicts = comparator.findConflicts(docs, {
  conflictFields: ['benefitRate', 'eligibilityAge']
});

.generateDiffReport(diff, options)

Generate human-readable diff report.

const report = comparator.generateDiffReport(diff, {
  format: 'text'  // 'text' | 'html' | 'json'
});

ExtractionPipeline

Orchestrate the full extraction workflow.

Constructor

const pipeline = new ExtractionPipeline(options);

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| name | string | Required | Pipeline name |
| schema | object | Required | Extraction schema |
| autoValidate | boolean | true | Validate after extraction |
| autoRegister | boolean | false | Auto-register extracted docs |
| requireHumanApproval | boolean | false | Require approval for registration |
| confidenceThreshold | number | 0.85 | Min confidence for auto-approval |
| onApprovalRequired | function | - | Callback when approval needed |
| onError | function | - | Error callback |

Methods

.process(text, options)

Process a document through the pipeline.

const result = await pipeline.process(documentText, {
  source: 'document.pdf',
  category: 'contracts'
});

Returns:

{
  pipelineRunId: 'run_xyz789',
  status: 'awaiting_approval',  // 'completed' | 'failed' | 'awaiting_approval'
  stages: {
    extraction: { status: 'completed', data: { /* extracted fields */ }, duration: 1234 },
    validation: { status: 'completed', valid: true },
    approval: { status: 'pending', reason: 'Low confidence: 0.72' },
    registration: { status: 'pending' }
  },
  result: { /* final data if completed */ }
}
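Callers usually branch on `status`. The sketch below is a hypothetical routing helper built only on the result shape shown above. `nextAction` and its return values are invented for illustration.

```javascript
// Hypothetical routing helper based on the result shape shown above.
function nextAction(result) {
  switch (result.status) {
    case 'completed':
      return { action: 'use', data: result.result };
    case 'awaiting_approval':
      return {
        action: 'review',
        runId: result.pipelineRunId,
        reason: result.stages.approval.reason
      };
    case 'failed':
      return { action: 'investigate', runId: result.pipelineRunId };
    default:
      throw new Error(`Unknown pipeline status: ${result.status}`);
  }
}

// Mock result mirroring the example above.
const pending = {
  pipelineRunId: 'run_xyz789',
  status: 'awaiting_approval',
  stages: { approval: { status: 'pending', reason: 'Low confidence: 0.72' } }
};

console.log(nextAction(pending));
// { action: 'review', runId: 'run_xyz789', reason: 'Low confidence: 0.72' }
```

A run routed to `review` would then be resolved with `.approve()` or `.reject()`, described next.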

.approve(pipelineRunId, options)

Approve a pending extraction.

await pipeline.approve('run_xyz789', {
  approvedBy: 'reviewer@example.com',
  comments: 'Verified against source'
});

.reject(pipelineRunId, options)

Reject a pending extraction.

await pipeline.reject('run_xyz789', {
  rejectedBy: 'reviewer@example.com',
  reason: 'Data does not match source document'
});

.getStatus(pipelineRunId)

Get pipeline run status.

const status = pipeline.getStatus('run_xyz789');

.getStats()

Get pipeline statistics.

const stats = pipeline.getStats();
// { total: 100, completed: 95, failed: 3, pending: 2, successRate: '95%' }

ValidationEngine

Validate extracted data.

Constructor

const validator = new ValidationEngine(options);

Methods

.validate(data, schema)

Validate data against a schema.

const result = validator.validate(data, schema);
// { valid: boolean, errors: [], warnings: [] }

.registerValidator(name, fn)

Register a custom validator.

validator.registerValidator('ukNino', (value) => {
  const pattern = /^[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]$/i;
  return {
    valid: pattern.test(value),
    message: 'Invalid UK National Insurance number'
  };
});
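A validator function like the one above can be exercised on its own before it is registered. This sketch reuses the same pattern and message. The sample values are invented for the example.

```javascript
// Same validator body as above, as a standalone function.
const ukNino = (value) => {
  const pattern = /^[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]$/i;
  return {
    valid: pattern.test(value),
    message: 'Invalid UK National Insurance number'
  };
};

const samples = ['AB123456C', 'ab123456c', 'QQ123456C', 'AB12345C', 'AB 12 34 56 C'];
const outcomes = samples.map((sample) => [sample, ukNino(sample).valid]);

console.log(outcomes);
// AB123456C -> true, ab123456c -> true (case-insensitive),
// QQ123456C -> false (Q is excluded), AB12345C -> false (five digits),
// 'AB 12 34 56 C' -> false (spaces are not allowed by the pattern)
```

Because the pattern rejects spaces, values copied from printed documents may need whitespace stripped before validation. The validator also returns `message` when `valid` is true, so callers should read it only on failure.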

Built-in Validators

| Name | Description |
|------|-------------|
| ukNino | UK National Insurance number |
| ukPostcode | UK postcode |
| sortCode | UK bank sort code |
| companyNumber | UK company registration number |
| email | Email address |
| url | Valid URL |
| phone | Phone number |
| dateRange | Valid date range |

CitationTracker

Track and manage source citations.

Constructor

const tracker = new CitationTracker();

Methods

.addCitation(citation)

Add a citation.

tracker.addCitation({
  fieldName: 'clauseTitle',
  text: 'Liability Limitation',
  startOffset: 1420,
  endOffset: 1440,
  confidence: 0.95,
  source: 'contract.pdf'
});

.getCitations(fieldName)

Get citations for a field.

const citations = tracker.getCitations('clauseTitle');

.getAllCitations()

Get all citations.

const all = tracker.getAllCitations();

.toJSON()

Export citations as JSON.

const json = tracker.toJSON();

Storage Adapters

MemoryStorage

In-memory storage (default, for testing).

const { MemoryStorage } = require('docschema');
const storage = new MemoryStorage();

FileStorage

File-based storage.

const { FileStorage } = require('docschema');

const storage = new FileStorage({
  path: './data/documents',
  format: 'json'  // 'json' | 'yaml'
});

Custom Storage

Implement the Storage interface:

class DatabaseStorage {
  async write(key, value) { /* ... */ }
  async read(key) { /* ... */ }
  async list(options) { /* ... */ }
  async delete(key) { /* ... */ }
  async exists(key) { /* ... */ }
}
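As a concrete starting point, here is a minimal Map-backed adapter exposing the five methods listed above. The return conventions (`undefined` for a missing key, a boolean from `delete`, an optional `prefix` filter in `list`) are assumptions made for this sketch, since this reference does not spell out the interface contract.

```javascript
// Minimal Map-backed adapter with the five methods listed above.
// Return conventions and the `prefix` option are assumptions.
class MapStorage {
  constructor() {
    this.items = new Map();
  }

  async write(key, value) {
    // Store a JSON copy so later mutation by the caller cannot alter the record.
    this.items.set(key, JSON.stringify(value));
  }

  async read(key) {
    const raw = this.items.get(key);
    return raw === undefined ? undefined : JSON.parse(raw);
  }

  async list({ prefix = '' } = {}) {
    return [...this.items.keys()].filter((key) => key.startsWith(prefix));
  }

  async delete(key) {
    return this.items.delete(key); // true if the key existed
  }

  async exists(key) {
    return this.items.has(key);
  }
}

const storage = new MapStorage();
storage
  .write('doc_abc123', { version: 1 })
  .then(() => storage.read('doc_abc123'))
  .then((doc) => console.log(doc)); // { version: 1 }
```

An adapter of this shape would be passed through the `storage` option of `DocumentRegister`. Swapping the Map for database calls keeps the same method signatures.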

Built-in Schemas

pensionSchemeRule

Schema for pension scheme rules.

const { builtInSchemas } = require('docschema');
const schema = builtInSchemas.pensionSchemeRule;

Fields: ruleNumber, ruleTitle, ruleText, effectiveFrom, effectiveTo, definitions, crossReferences

dueDiligenceDocument

Schema for investment due diligence.

const schema = builtInSchemas.dueDiligenceDocument;

Fields: companyName, industry, revenue, ebitda, employees, riskFactors, keyFindings

esgReport

Schema for ESG/sustainability reports.

const schema = builtInSchemas.esgReport;

Fields: reportingPeriod, scope1Emissions, scope2Emissions, scope3Emissions, energyConsumption, waterUsage, wasteGenerated, employeeDiversity

contractClause

Schema for contract clauses.

const schema = builtInSchemas.contractClause;

Fields: clauseNumber, clauseTitle, clauseText, clauseType, parties, effectiveDate, terminationConditions

regulatoryCompliance

Schema for regulatory requirements.

const schema = builtInSchemas.regulatoryCompliance;

Fields: regulationId, requirement, applicability, deadline, complianceStatus, evidence


Parsers

TextParser

Parse plain text documents.

const { TextParser } = require('docschema');

const parser = new TextParser();
const sections = parser.parseSections(text, {
  sectionPattern: /^(?:Section|Article)\s+\d+/m
});
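To preview what a `sectionPattern` will match, the text can be split at each heading in plain JavaScript. This is a hypothetical sketch and not how `TextParser` is implemented: `splitSections` is an invented name, and the real return shape of `parseSections` is not documented here.

```javascript
// Hypothetical sketch: split text into sections at each heading match.
// Text before the first heading is dropped.
function splitSections(text, sectionPattern) {
  // matchAll requires the global flag, so add it if the caller omitted it.
  const flags = sectionPattern.flags.includes('g')
    ? sectionPattern.flags
    : sectionPattern.flags + 'g';
  const globalPattern = new RegExp(sectionPattern.source, flags);

  const starts = [...text.matchAll(globalPattern)].map((m) => m.index);
  return starts.map((start, i) => {
    const end = i + 1 < starts.length ? starts[i + 1] : text.length;
    return text.slice(start, end).trim();
  });
}

// Sample text invented for this example.
const documentText =
  'Preamble\nSection 1 Definitions\nTerms...\nArticle 2 Scope\nApplies to...';
const sections = splitSections(documentText, /^(?:Section|Article)\s+\d+/m);

console.log(sections.length); // 2
console.log(sections[1]);     // 'Article 2 Scope\nApplies to...'
```

The `m` flag matters here: without it, `^` would only match at the very start of the text.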

StructuredParser

Parse structured documents (JSON, XML).

const { StructuredParser } = require('docschema');

const parser = new StructuredParser();
const data = parser.parse(jsonText, { format: 'json' });

Error Handling

All async methods may throw these error types:

const { 
  SchemaError,
  ExtractionError,
  ValidationError,
  StorageError 
} = require('docschema');

try {
  await extractor.extract(text);
} catch (error) {
  if (error instanceof ValidationError) {
    console.error('Validation failed:', error.errors);
  } else if (error instanceof ExtractionError) {
    console.error('Extraction failed:', error.message);
  } else {
    throw error;  // Re-throw anything this handler does not recognise
  }
}

TypeScript Support

Full TypeScript definitions are included:

import { 
  SchemaBuilder, 
  SchemaExtractor,
  ExtractionResult,
  Citation 
} from 'docschema';

const result: ExtractionResult = await extractor.extract(text);
const citations: Citation[] = result.citations;