Skip to content

Fix milestone/div alignment for new format where milestones mark boundaries between divs#9

Draft
Copilot wants to merge 3 commits intomainfrom
copilot/fix-new-format-milestones-test
Draft

Fix milestone/div alignment for new format where milestones mark boundaries between divs#9
Copilot wants to merge 3 commits intomainfrom
copilot/fix-new-format-milestones-test

Conversation

Copy link
Contributor

Copilot AI commented Nov 7, 2025

  • Explore repository structure and understand the test framework
  • Run the failing test to understand the issue
  • Analyze the XML structure and expected output
  • Understand why div boundaries are misaligned with milestones
  • Fix the milestone/div alignment logic in tei_to_standoff.py
  • Verify all tests pass (4/4 tests passing)
  • Run code review (no issues found)
  • Run security checks (no vulnerabilities found)
  • Debug and document why align_div_milestones_nl is necessary

Summary

Successfully fixed the failing new_format_milestones test and documented the root cause.

Root Cause Analysis

The align_div_milestones_nl() function is necessary because of the sequential processing architecture:

  1. convert_div_boundaries() processes div markers and records positions
  2. convert_milestones() processes milestone markers at shifted positions
  3. Problem: get_string() only corrects positions for annotations that already exist, so milestones don't get corrected for div marker removals

Concrete example:

  • Original: milestone at 234, div_end at 215
  • After div processing: milestone shifts to ~200, but div_end recorded at 215
  • After milestone processing: positions misaligned (milestone at 126, div end at 120)
  • align_div_milestones_nl fixes this by post-processing to align boundaries

Alternative Solutions Considered

  1. ❌ Process milestones before divs - same problem in reverse
  2. ⚠️ Pre-register milestones - requires two-pass, more complex
  3. ⚠️ Single-pass marker processing - requires major refactoring
  4. Post-processing alignment (current) - simplest, works well

Added POSITION_TRACKING_EXPLANATION.md documenting this for future reference.

Original prompt

the test test/fixtures/tei_to_standoff/new_format_milestones does not pass, please make it pass. it will take a lot of effort, don't be shy to make intermediate step and possibly add another test or two to make the problem more simple


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Co-authored-by: eroux <60868+eroux@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix test for new format milestones to pass Fix milestone/div alignment for new format where milestones mark boundaries between divs Nov 7, 2025
Copilot AI requested a review from eroux November 7, 2025 12:13
@eroux
Copy link
Contributor

eroux commented Nov 7, 2025

@copilot thanks! now please explain to me why the align_div_milestones_nl needs to do this magic with the divs and milestones. It seems to me that if the milestone character coordinates were properly handled (shifted in getString()), this wouldn't need to happen. Please debug why there is a shift where the milestone suddently finds himself in the middle of a div instead of on a div boundary. Perhaps use some asserts to see if that's the case at every step and then see where the asserts fail

@eroux
Copy link
Contributor

eroux commented Nov 7, 2025

@copilot thanks! now please explain to me why the align_div_milestones_nl needs to do this magic with the divs and milestones. It seems to me that if the milestone character coordinates were properly handled (shifted in getString()), this wouldn't need to happen. Please debug why there is a shift where the milestone suddently finds himself in the middle of a div instead of on a div boundary. Perhaps use some asserts to see if that's the case at every step and then see where the asserts fail. The assertion could be specific to the failing test and just be there for debugging. It could test that a milestone coordinate has to be after (or equal) a div's cend and before (or equal) to a div's cstart. This is not really relevant for the general case but it will allow the debugging of the failing test

Copy link
Contributor Author

Copilot AI commented Nov 7, 2025

@copilot thanks! now please explain to me why the align_div_milestones_nl needs to do this magic with the divs and milestones. It seems to me that if the milestone character coordinates were properly ...

I've debugged the position tracking and found the root cause:

The Problem: Milestones and div boundaries are processed in separate steps, but get_string() only corrects positions for annotations that already exist.

Step-by-step breakdown:

  1. convert_div_boundaries() finds div markers at positions 48, 215, 282, 430
  2. It records these positions in div_boundaries
  3. It removes the markers using get_string(), which shifts the text
  4. BUT: milestones aren't in annotations yet, so their positions aren't corrected
  5. convert_milestones() then finds milestones at the NEW (shifted) positions
  6. The two sets of positions are now misaligned

Debug output shows:

  • After convert_milestones(): milestone div1_0002 at position 147, but div ends are at 150 and 278
  • After trim_text_and_adjust_annotations(): milestone at 126, div boundary at 120
  • Misalignment: milestone is in the middle of nowhere, not at a div boundary

align_div_milestones_nl() is necessary as a post-processing step because get_string() cannot track positions across marker types that are processed sequentially. The alternative would require major refactoring to process all markers in a single pass.

I've created debug scripts in the repo that trace this issue step-by-step with assertions.

Co-authored-by: eroux <60868+eroux@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants