NDF Canonicalization Rules

paulmothapo edited this page Jan 3, 2026 · 1 revision

Canonicalization ensures that semantically equivalent NDF documents serialize to identical byte sequences. This is critical for:

  • Deterministic output
  • Round-trip preservation
  • Diff tools and version control
  • Hash-based comparisons

Core Principles

  1. Determinism: The same data structure always produces the same output
  2. Minimalism: Prefer the most compact representation
  3. Readability: When compactness conflicts with readability, prefer readability
  4. Consistency: Apply rules uniformly across all values

Serialization Rules

Key Ordering

Default: Preserve insertion order (as in JavaScript objects)

Canonical mode (sortKeys: true): Sort keys alphabetically (case-sensitive)

# Original (insertion order)
zebra: 1
apple: 2
banana: 3

# Canonical (sorted)
apple: 2
banana: 3
zebra: 1
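
A minimal sketch of how key sorting could be applied before serialization. The helper name `sortKeysDeep` is hypothetical and not part of any NDF API, and it assumes that "alphabetically (case-sensitive)" means plain code-unit order, so uppercase keys sort before lowercase ones.

```javascript
// Hypothetical helper (not an NDF API): returns a copy of `value`
// with object keys sorted at every nesting level.
function sortKeysDeep(value) {
  if (Array.isArray(value)) {
    // Array order is data, so only the elements are normalized.
    return value.map(sortKeysDeep);
  }
  if (value !== null && typeof value === 'object') {
    const sorted = {};
    // Default sort() compares UTF-16 code units, which is case-sensitive.
    for (const key of Object.keys(value).sort()) {
      sorted[key] = sortKeysDeep(value[key]);
    }
    return sorted;
  }
  return value;
}

const doc = { zebra: 1, apple: 2, banana: 3 };
console.log(Object.keys(sortKeysDeep(doc))); // [ 'apple', 'banana', 'zebra' ]
```

Note that JavaScript objects always list integer-like keys (such as "1" or "42") first, so a serializer that must sort such keys as strings would need to iterate over a sorted key array rather than rely on object order.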

Indentation

Standard: 2 spaces per level

Canonical: Always use 2 spaces (never tabs, never mixed)

# Non-canonical (tabs)
user:
		name: Alice

# Canonical (spaces)
user:
  name: Alice

Boolean Values

Canonical: Use yes/no (not true/false)

# Non-canonical
enabled: true
disabled: false

# Canonical
enabled: yes
disabled: no

Null Values

Canonical: Use null (not none or -)

# Non-canonical
optional: none
empty: -

# Canonical
optional: null
empty: null
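
The boolean and null rules above can be sketched as a small mapping on the serializer side. `formatKeyword` is a hypothetical name used only for illustration.

```javascript
// Hypothetical helper: maps JavaScript booleans and null to the
// canonical NDF spellings described above.
function formatKeyword(value) {
  if (value === true) return 'yes';
  if (value === false) return 'no';
  if (value === null) return 'null';
  throw new TypeError('formatKeyword expects a boolean or null');
}

console.log(`enabled: ${formatKeyword(true)}`);  // enabled: yes
console.log(`optional: ${formatKeyword(null)}`); // optional: null
```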

String Quoting

Rule: Quote only when necessary (see ESCAPING.md for details)

Canonical decisions:

  • Don't quote if value is unquoted-safe
  • Use double quotes (") when quoting is needed
  • Escape only required characters

# Non-canonical (over-quoted)
name: "Alice"
age: "30"

# Canonical (minimal quoting)
name: Alice
age: 30

Array Representation

Rule: Use the most compact form that fits within inlineThreshold (default: 60 chars)

Precedence:

  1. Inline comma-separated: tags: a, b, c
  2. Inline bracketed: tags: [a, b, c]
  3. Multiline with dashes: tags:\n - a\n - b\n - c

# Short array - inline comma
tags: python, ai, ml

# Medium array - inline bracket
coordinates: [10.5, 20.3, 30.1, 40.2]

# Long array - multiline
long_list:
  - item1
  - item2
  - item3
  - item4
  - item5

Canonical threshold: If total length ≤ 60 chars, use inline. Otherwise, multiline.
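
A minimal sketch of the threshold check, under two assumptions that the rules above do not settle: the length is measured over the whole line including the key, and only the comma-separated and multiline forms are chosen between (when the bracketed form is preferred is not specified here). `formatArray` is a hypothetical helper, and `items` are assumed to be already-formatted scalar strings.

```javascript
// Hypothetical helper: picks inline or multiline form for an array.
// Assumption: the threshold applies to the full "key: a, b, c" line.
function formatArray(key, items, { inlineThreshold = 60, indent = '  ' } = {}) {
  const inline = `${key}: ${items.join(', ')}`;
  if (inline.length <= inlineThreshold) {
    return inline;
  }
  // Too long for one line: one dash-prefixed item per line.
  const lines = items.map((item) => `${indent}- ${item}`);
  return [`${key}:`, ...lines].join('\n');
}

console.log(formatArray('tags', ['python', 'ai', 'ml']));
// tags: python, ai, ml
```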

Object Representation

Rule: Prefer nested blocks over inline objects

Exception: Very small objects (1-2 keys, total < 40 chars) can be inline

# Small object - inline OK
meta: {version: "1.0", author: "John"}

# Larger object - nested
user:
  name: Alice
  email: alice@example.com
  settings:
    theme: dark
    notifications: yes
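
The inline exception can be sketched as a check that returns the inline line when it qualifies and null otherwise. `formatInlineObject` is hypothetical; it assumes strings are double-quoted inside inline objects (as in the meta example) and borrows JSON escaping as a stand-in for the rules in ESCAPING.md.

```javascript
// Formats one scalar for use inside an inline object (assumed syntax).
function inlineScalar(value) {
  if (typeof value === 'string') return JSON.stringify(value); // stand-in escaping
  if (typeof value === 'boolean') return value ? 'yes' : 'no';
  return String(value); // numbers and null
}

// Hypothetical helper: returns the inline form, or null when the object
// should be written as a nested block instead.
function formatInlineObject(key, obj, maxLength = 40) {
  const entries = Object.entries(obj);
  const allScalar = entries.every(([, v]) => v === null || typeof v !== 'object');
  if (entries.length === 0 || entries.length > 2 || !allScalar) return null;
  const body = entries.map(([k, v]) => `${k}: ${inlineScalar(v)}`).join(', ');
  const line = `${key}: {${body}}`;
  return line.length < maxLength ? line : null;
}

console.log(formatInlineObject('meta', { version: '1.0', author: 'John' }));
// meta: {version: "1.0", author: "John"}
```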

Multiline Strings

Rule: Use multiline (|) when string contains newlines

Indentation: Content indented 2 spaces relative to key

Trailing newlines: Strip trailing empty lines

# Canonical multiline
description: |
  Line 1
  Line 2
  Line 3

# Not canonical (escaped newlines)
description: "Line 1\nLine 2\nLine 3"
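
A minimal sketch of emitting the multiline form, assuming the caller has already decided that the string contains newlines. `formatMultiline` is a hypothetical helper; it leaves interior blank lines unindented so that no trailing whitespace is produced.

```javascript
// Hypothetical helper: writes `text` as a `|` block under `key`.
function formatMultiline(key, text, indent = '  ') {
  const lines = text.split('\n');
  // Strip trailing empty lines, per the rule above.
  while (lines.length > 0 && lines[lines.length - 1] === '') {
    lines.pop();
  }
  // Indent content relative to the key; keep blank lines truly empty.
  const body = lines.map((line) => (line === '' ? '' : indent + line));
  return [`${key}: |`, ...body].join('\n');
}

console.log(formatMultiline('description', 'Line 1\nLine 2\nLine 3\n\n'));
// description: |
//   Line 1
//   Line 2
//   Line 3
```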

Numbers

Rule: Preserve original format (integer vs float)

Canonical decisions:

  • 30 not 30.0
  • 3.14 not 3.140
  • Scientific notation only when necessary: 1e10 not 10000000000

# Canonical
count: 30
pi: 3.14
large: 1e10

# Non-canonical
count: 30.0
pi: 3.140
large: 10000000000
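
One possible reading of these number rules, sketched in JavaScript: take the shortest plain decimal form, and switch to scientific notation only when it is strictly shorter. That cutoff is an assumption made for this sketch, not something the rules above define, and `formatNumber` is a hypothetical name.

```javascript
// Hypothetical helper: shortest of plain and scientific notation.
// Assumption: scientific notation is used only when strictly shorter.
function formatNumber(n) {
  if (!Number.isFinite(n)) {
    throw new RangeError('only finite numbers are handled in this sketch');
  }
  const plain = String(n); // drops ".0" and trailing zeros
  const scientific = n.toExponential().replace('e+', 'e');
  return scientific.length < plain.length ? scientific : plain;
}

console.log(formatNumber(30.0));  // 30
console.log(formatNumber(3.140)); // 3.14
console.log(formatNumber(1e10));  // 1e10
```

Under this cutoff, 1000 would also be written as 1e3; an implementation that wants to keep small integers in plain form would need a different threshold.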

References

Rule: Preserve reference definitions and usages

Canonical mode (includeReferences: false): Omit reference definitions, resolve all usages

# With references
$base: https://api.example.com
endpoint: $base/v1

# Canonical (resolved)
endpoint: https://api.example.com/v1
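
Reference resolution can be sketched for a flat document as follows. This assumes the parsed data exposes definitions as keys starting with `$` and that reference names consist of letters, digits, and underscores; both are assumptions, and `resolveReferences` is a hypothetical helper.

```javascript
// Hypothetical helper: inlines `$name` usages and omits the definitions.
// Assumption: definitions appear as top-level keys prefixed with "$".
function resolveReferences(doc) {
  const definitions = new Map();
  for (const [key, value] of Object.entries(doc)) {
    if (key.startsWith('$')) definitions.set(key.slice(1), String(value));
  }
  const resolved = {};
  for (const [key, value] of Object.entries(doc)) {
    if (key.startsWith('$')) continue; // definitions are not emitted
    resolved[key] = typeof value === 'string'
      ? value.replace(/\$([A-Za-z_][A-Za-z0-9_]*)/g, (match, name) =>
          definitions.has(name) ? definitions.get(name) : match)
      : value;
  }
  return resolved;
}

console.log(resolveReferences({
  $base: 'https://api.example.com',
  endpoint: '$base/v1',
}));
// { endpoint: 'https://api.example.com/v1' }
```

Unknown references are left untouched here; whether a real serializer should raise an error instead is a design choice this sketch does not make.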

Type Hints

Rule: Preserve type hints if supported, otherwise strip

Canonical: Include type hints in output if parser supports them

# With type hint
timestamp: @time 2024-01-15T10:30:00Z

# Without type hint support
timestamp: 2024-01-15T10:30:00Z

Whitespace

Rule:

  • No trailing whitespace on lines
  • Single newline between top-level entries
  • No blank lines at start/end of document

# Canonical
key1: value1
key2: value2
key3: value3

# Non-canonical
key1: value1

key2: value2

key3: value3
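
The whitespace rules above can be sketched as a line-based cleanup. `normalizeWhitespace` is a hypothetical helper with a known limitation: it drops every blank line, so it is only safe for documents without blank lines inside multiline strings. It also returns text with no final newline, which is an assumption.

```javascript
// Hypothetical helper: strips trailing whitespace and blank lines.
// Limitation: also removes blank lines inside multiline (|) strings.
function normalizeWhitespace(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.replace(/[ \t]+$/, ''))
    .filter((line) => line !== '')
    .join('\n');
}

const messy = '\nkey1: value1  \n\nkey2: value2\n\nkey3: value3\n';
console.log(normalizeWhitespace(messy));
// key1: value1
// key2: value2
// key3: value3
```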

Comments

Rule: Comments are not preserved in canonical form (they're metadata, not data)

Exception: If preserveComments option is enabled, preserve comments with their original formatting

# Original
name: Alice  # User's name
age: 30

# Canonical (comments stripped)
name: Alice
age: 30

Round-Trip Guarantees

What Is Preserved

Preserved:

  • Key-value pairs
  • Nested structure
  • Array order
  • String content (including newlines)
  • Number precision
  • Boolean values
  • Null values

What May Change

⚠️ May change (but semantically equivalent):

  • Key ordering (only when sortKeys: true; the default preserves insertion order)
  • Boolean representation (true → yes)
  • Null representation (none → null)
  • Array formatting (inline vs multiline)
  • String quoting (if unnecessary)
  • Whitespace normalization
  • Comments (stripped by default)

Round-Trip Test

A document is round-trip safe if:

const original = parser.parse(text);
const serialized = parser.dumps(original);
const reparsed = parser.parse(serialized);
assert(deepEqual(original, reparsed));

Note: text !== serialized is expected and OK, as long as original and reparsed are deeply equal.
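
A runnable version of the round-trip check, as a sketch. It uses Node's `util.isDeepStrictEqual` for the deep comparison and a JSON-based stand-in parser so that it runs without an NDF library; `isRoundTripSafe` and `standInParser` are hypothetical names, and any object with `parse` and `dumps` methods could be passed instead.

```javascript
const { isDeepStrictEqual } = require('node:util');

// `parser` is any object with parse(text) and dumps(value) methods.
function isRoundTripSafe(parser, text) {
  const original = parser.parse(text);
  const serialized = parser.dumps(original);
  const reparsed = parser.parse(serialized);
  return isDeepStrictEqual(original, reparsed);
}

// Stand-in built on JSON, used only to make the sketch executable.
const standInParser = {
  parse: (text) => JSON.parse(text),
  dumps: (value) => JSON.stringify(value),
};

const text = '{ "name": "Alice",   "age": 30 }';
console.log(isRoundTripSafe(standInParser, text)); // true
// The serialized text differs from the input, which is expected.
console.log(standInParser.dumps(standInParser.parse(text)) === text); // false
```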

Implementation

Default Options (Non-Canonical)

{
  indent: '  ',
  indentLevel: 0,
  inlineThreshold: 60,
  sortKeys: false,
  includeReferences: true
}

Canonical Options

{
  indent: '  ',
  indentLevel: 0,
  inlineThreshold: 60,
  sortKeys: true,        // Alphabetical key order
  includeReferences: false  // Resolve all references
}
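
To illustrate why the canonical options matter for hash-based comparison, here is a small sketch. `dumpsFlat` is a stand-in serializer that handles only flat objects with scalar values and honors only `sortKeys`; it is not the NDF serializer, and `canonicalHash` is a hypothetical helper.

```javascript
const { createHash } = require('node:crypto');

const defaultOptions = {
  indent: '  ',
  indentLevel: 0,
  inlineThreshold: 60,
  sortKeys: false,
  includeReferences: true,
};
// Canonical options differ from the defaults in two fields only.
const canonicalOptions = { ...defaultOptions, sortKeys: true, includeReferences: false };

// Stand-in serializer (assumption): flat objects, scalar values, sortKeys only.
function dumpsFlat(obj, options) {
  const keys = Object.keys(obj);
  if (options.sortKeys) keys.sort();
  return keys.map((key) => `${key}: ${obj[key]}`).join('\n');
}

function canonicalHash(obj) {
  const text = dumpsFlat(obj, canonicalOptions);
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

const a = { zebra: 1, apple: 2 };
const b = { apple: 2, zebra: 1 };
console.log(canonicalHash(a) === canonicalHash(b)); // true
console.log(dumpsFlat(a, defaultOptions) === dumpsFlat(b, defaultOptions)); // false
```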

Examples

Example 1: Array Formatting

Input:

tags: python ai ml

Canonical output:

tags: python, ai, ml

Reason: The comma-separated form is more explicit: item boundaries do not depend on whitespace, so items that contain spaces stay unambiguous.

Example 2: Boolean Normalization

Input:

enabled: true
disabled: false

Canonical output:

enabled: yes
disabled: no

Reason: yes/no is the preferred NDF boolean format.

Example 3: Key Ordering

Input:

zebra: 1
apple: 2
banana: 3

Canonical output (with sortKeys: true):

apple: 2
banana: 3
zebra: 1

Example 4: Reference Resolution

Input:

$base: https://api.example.com
endpoint: $base/v1

Canonical output (with includeReferences: false):

endpoint: https://api.example.com/v1

Best Practices

  1. Use canonical mode for:

    • Version control
    • Automated tooling
    • Hash-based comparisons
    • Testing
  2. Use default mode for:

    • Human editing
    • Preserving user formatting
    • Development workflows
  3. Always test round-trip when implementing serialization changes

  4. Document any canonicalization choices that affect user-visible behavior