Fixing Russian Legal Code Article Number Parsing #2
mikhashev
started this conversation in
Show and tell
The Problem
When importing Russian legal documents from official government websites, we ran into a puzzle: the article numbers were ambiguous. The number `531` could mean article 531, or it could be sub-article 53.1. The number `12316` could be article 12316, or sub-article 123.16. Even more confusing, `1061` appeared right after article 104, with no 105 or 106 in sight.

The Mystery
Russian legal codes have a complex structure. Each code has main articles (Part 1 of the Civil Code has articles 1-453), but also sub-articles like 53.1 and appendices like 12316-1. Sometimes entire articles are deleted by law while their sub-articles remain.
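To make the ambiguity concrete, here is a minimal sketch that lists every way a run-together number could be read. The function name `candidate_readings` is hypothetical, and the sketch assumes a number has at most one dot and that a sub-article part never starts with `0`.

```python
def candidate_readings(raw: str) -> list[str]:
    """List the plain reading plus every single-dot reading of a raw number."""
    # Keep a hyphenated suffix such as "-1" attached to whatever reading we build.
    base, sep, suffix = raw.partition("-")
    tail = sep + suffix
    readings = [base + tail]  # plain article, no dot inserted
    for i in range(1, len(base)):
        main, sub = base[:i], base[i:]
        if sub.startswith("0"):
            continue  # assumption: sub-article parts do not start with 0
        readings.append(f"{main}.{sub}{tail}")
    return readings


print(candidate_readings("531"))      # ['531', '5.31', '53.1']
print(candidate_readings("12316-1"))
```

Even a three-digit number has three plausible readings, which is why the digits alone cannot settle the question.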
The government websites don't use dots or any other special formatting; the digits are simply run together. Our import system was getting confused, leaving some articles as `531` when they should be `53.1`, or breaking `12316-1` into `123.1.6-1` instead of `123.16-1`.

The Solution
We built a three-part detective system:
- Context Clues - Look at neighboring articles. If `531` appears between `53` and `54`, it must be `53.1`.
- Known Boundaries - Each code has a valid range of article numbers. Part 1 of the Civil Code only goes to 453, so `601` must be `60.1`.
- Deleted Article Handling - When articles 105-106 were deleted by law, their sub-articles remained. The system learned that `1061` after `104` means `106.1`, a sub-article of a deleted main article.

The Result
All 23 Russian legal codes (~6,700 articles) now import correctly:
- `531` becomes `53.1`
- `12316` becomes `123.16`
- `12316-1` becomes `123.16-1`
- `1061` becomes `106.1`
- `602` becomes `60.2`

The system can now handle the complex and ambiguous encoding used in official Russian government sources.
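The three checks described under The Solution could be combined roughly as in the sketch below. This is not the actual import code: the function name `resolve`, its parameters, and the "smallest gap from the previous main article" rule are assumptions made for illustration, and the upper bounds 500 and 2000 in the usage lines are made-up values.

```python
def resolve(raw: str, prev_main: int, max_article: int) -> str:
    """Pick the most plausible reading of a run-together article number.

    raw          digits as scraped, optionally with a "-N" suffix
    prev_main    main number of the article that came just before
    max_article  highest main article number in this code
    """
    base, sep, suffix = raw.partition("-")
    tail = sep + suffix
    best = None  # (gap from prev_main, reading)
    for i in range(1, len(base) + 1):
        main_txt, sub_txt = base[:i], base[i:]
        if sub_txt.startswith("0"):
            continue  # assumption: sub-article parts do not start with 0
        main = int(main_txt)
        # Known boundaries: the main article must exist in this code.
        # Context clues: articles appear in order, so it cannot precede prev_main.
        if not (prev_main <= main <= max_article):
            continue
        # Deleted articles: the main number may skip ahead (104 -> 106),
        # so prefer the smallest jump rather than requiring an exact match.
        gap = main - prev_main
        reading = main_txt + ("." + sub_txt if sub_txt else "") + tail
        if best is None or gap < best[0]:
            best = (gap, reading)
    return best[1] if best else raw  # leave unresolvable input untouched


print(resolve("531", prev_main=53, max_article=453))       # 53.1
print(resolve("12316-1", prev_main=123, max_article=500))  # 123.16-1
print(resolve("1061", prev_main=104, max_article=2000))    # 106.1
```

Preferring the smallest forward jump covers all three cases with one rule, but it is a heuristic: a code whose numbering really does jump from 1 to 11, for example, would be misread as 1.1.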