fix(deserialization): Fix deserialization of special unicode characters #7

Merged
alesanfra merged 3 commits into alesanfra:main from medvekoma:fix/unicode-deserialization on Feb 13, 2026
@medvekoma (Contributor) commented Feb 13, 2026

Description

Deserialization failed for special Unicode characters such as ®.
The complex_test data now includes such a character to reproduce the error and verify the fix.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Related Issues

Fixes #8

Testing

  • Tests pass locally
  • New tests added for new functionality
  • Documentation updated

Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Comments added for complex code
  • Documentation updated
  • No new warnings generated
  • Tests added and passing
  • Conventional commit format used

@medvekoma medvekoma marked this pull request as draft February 13, 2026 15:32
@medvekoma medvekoma marked this pull request as ready for review February 13, 2026 16:32
@alesanfra alesanfra requested a review from Copilot February 13, 2026 20:39
Copilot AI left a comment

Pull request overview

This PR fixes a Unicode-related deserialization bug where parsing logic used character indices (from .chars().enumerate()) as if they were byte offsets, causing incorrect slicing/positioning when keys/values contain multi-byte characters (e.g., ®). It also extends the integration test corpus to cover the regression.
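
To make the mismatch concrete, here is a minimal sketch of how a character index and a byte offset diverge once a multi-byte character precedes the position of interest. The helper names `char_position` and `byte_position` are hypothetical and are not taken from the project's source; only standard-library calls are used.

```rust
/// Position of the first `delim`, counted in characters.
/// This is the kind of index that `.chars().enumerate()` produces.
fn char_position(s: &str, delim: char) -> Option<usize> {
    s.chars().position(|c| c == delim)
}

/// Position of the first `delim`, counted in bytes.
/// This is the kind of index that string slicing expects.
fn byte_position(s: &str, delim: char) -> Option<usize> {
    s.find(delim)
}

/// Slice everything before `index` without panicking.
/// `str::get` returns `None` when `index` falls inside a multi-byte character,
/// whereas `&s[..index]` would panic in the same situation.
fn prefix_before(s: &str, index: usize) -> Option<&str> {
    s.get(..index)
}
```

For an input such as `"name®: x"`, the two position helpers disagree by one because ® occupies two bytes in UTF-8, and slicing at the character index lands in the middle of ®.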

Changes:

  • Update deserialization scanning logic to iterate with char_indices() so returned positions are valid byte offsets for string slicing.
  • Add new smoke integration tests covering Unicode in keys, fields, values, and arrays.
  • Extend complex_test fixtures (.toon and .json) with an additional Unicode edge-case value (®).
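
The first change above can be sketched as follows. This is an illustrative scanner, not the code from `src/deserialization.rs`: the function `split_key_value` and its quoting rule are assumptions made for the example. The point is only that `char_indices()` yields byte offsets, so they can be passed straight to slicing.

```rust
/// Hypothetical helper: split `line` at the first `:` that is not inside
/// double quotes, returning the trimmed key and value.
/// Escaped quotes are deliberately not handled in this sketch.
fn split_key_value(line: &str) -> Option<(&str, &str)> {
    let mut in_quotes = false;
    // `char_indices` yields (byte_offset, char), unlike `chars().enumerate()`,
    // which yields (char_count, char).
    for (byte_idx, ch) in line.char_indices() {
        match ch {
            '"' => in_quotes = !in_quotes,
            ':' if !in_quotes => {
                // Both bounds are valid char boundaries: `byte_idx` is where
                // `ch` starts and `byte_idx + ch.len_utf8()` is where it ends.
                let key = &line[..byte_idx];
                let value = &line[byte_idx + ch.len_utf8()..];
                return Some((key.trim(), value.trim()));
            }
            _ => {}
        }
    }
    None
}
```

Calling `split_key_value("trademark®: Acme® Corp")` splits cleanly even though ® appears before the delimiter, which is the case that a character-count index would get wrong.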

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/deserialization.rs | Switches several parsers/scanners to byte-safe indices for Unicode correctness. |
| tests/integration/test_smoke.py | Adds direct regression tests for Unicode round-tripping through dumps/loads. |
| tests/data/complex_test.toon | Adds ® to the Unicode edge-cases in TOON fixture. |
| tests/data/complex_test.json | Keeps JSON fixture in sync with the updated TOON fixture. |

@alesanfra (Owner) commented

Excellent work, thanks a lot for this PR

@alesanfra alesanfra merged commit d3d2026 into alesanfra:main Feb 13, 2026
22 checks passed
Development

Successfully merging this pull request may close these issues.

Deserialization of certain unicode characters fail

3 participants