Description
Minutely .osc replication files contain all of the changes to OSM in a roughly one-minute window. They are structured as sequences of edits, not diffs—in other words, if an element is modified several times within a one-minute window, the replication file will contain multiple versions of that element. This can happen in two ways:
- When a single changeset modifies the same element multiple times. Some editing software (for example StreetComplete) will open relatively long-lived changesets and upload each action the user takes as a change, so multiple modifications to the same element will occur if the user makes several successive edits, or makes an edit and then later uses "undo" to reverse that edit.
- When multiple changesets happen to modify the same element within a short time window (either by a single user or multiple users).
augmented_diff.py turns an .osc file into an augmented diff, using a local OSMExpress database that mirrors the planet file. For each replication file, the database is assumed to contain the previous (old) version of all elements. The augmented diff is constructed by combining each new version of an element from the replication file with its previous version from the OSMExpress database. Afterwards, the database is updated from the replication file, so that the invariant holds for the next replication file.
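The pairing step described above can be sketched roughly as follows. This is a simplified illustration, not the actual code in augmented_diff.py: a plain dict stands in for the OSMExpress database, the `Element` class and `build_actions` function are hypothetical names, deletions are ignored, and the input is assumed to contain one version per element.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    key: str        # "type/id", e.g. "node/1" (made-up IDs)
    version: int
    changeset: int


def build_actions(db, new_elements):
    """Pair each new element with its old version, then update the database.

    `db` is a plain dict standing in for the OSMExpress database.
    """
    actions = []
    for new in new_elements:
        old = db.get(new.key)  # None if the element did not exist before
        kind = "create" if old is None else "modify"
        actions.append((kind, old, new))
    # Update the stand-in database only after all actions are built, so that
    # it holds the "old" versions for the next replication file.
    for new in new_elements:
        db[new.key] = new
    return actions


db = {"node/1": Element("node/1", 1, 100)}
actions = build_actions(
    db, [Element("node/1", 2, 101), Element("node/2", 1, 101)]
)
```

Here the first action pairs version 2 of node/1 with version 1 from the database, and the second is a creation with no old version.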
Currently, augmented_diff.py does not emit intermediate versions of an element in the augmented diff. In its first pass over the input file, it loads all elements into a dictionary keyed by the elements' IDs as strings (e.g. way/12345). If multiple versions are present, only the latest version is kept. The output augmented diff therefore contains at most one <action> for each element, representing the final diff of the element before and after all edits in the replication file are applied.
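A minimal sketch of that first pass, to make the loss of intermediate versions concrete. The function name `keep_latest`, the dict-based element representation, and the IDs are all made up for illustration. Whether the real script compares version numbers or simply relies on file order is an assumption here; the sketch compares version numbers.

```python
def keep_latest(elements):
    """Collapse a list of element versions to one entry per "type/id" key."""
    latest = {}
    for elem in elements:
        key = f"{elem['type']}/{elem['id']}"
        current = latest.get(key)
        # Assumption: "latest" means the highest version number.
        if current is None or elem["version"] > current["version"]:
            latest[key] = elem
    return latest


# Hypothetical replication file contents: way/12345 is edited twice.
edits = [
    {"type": "way", "id": 12345, "version": 2, "changeset": 100},
    {"type": "node", "id": 7, "version": 5, "changeset": 100},
    {"type": "way", "id": 12345, "version": 3, "changeset": 101},
]
latest = keep_latest(edits)
```

After this pass, version 2 of way/12345 (and with it any trace of changeset 100 touching that way) is no longer available to later stages.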
This isn't wrong on its own, but it is lossy if you then want to split up the actions into one augmented diff file per changeset. If an element was edited several times by different changesets within one replication file, only the last changeset's augmented diff will contain any reference to that element. When we visualize these changes in OSMCha, the result is that some changesets appear to be missing edits, because a later changeset that happened to fall in the same replication file "stole" those edits when we divided the actions up.
Here is an example. Changeset 164430940 added crossing:island=no to five highway=crossing nodes. But the augmented diff for that changeset only contains three actions, because two of the five nodes (642814890 and 7066369165) were modified again in changesets 164430943 and 164430946, which both fell within the same replication file (6539963). This results in all three changesets having incorrect augmented diffs:
- 164430940.adiff is missing two of the five edits it should have
- 164430943.adiff is missing entirely (404 not found), because all of its elements were again modified in 164430946, so it ended up with no content of its own
- 164430946.adiff incorrectly claims that it added crossing:island and tactile_paving to several nodes, when in fact those changes were made in the two previous changesets, and 164430946 only added kerb=* tags.
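The effect can be reproduced with a small simulation. Everything in it is hypothetical: the element keys, the changeset numbers 100 to 102, and the function names are invented, and a dict stands in for the OSMExpress database. `actions_latest_only` mimics the current behavior, while `actions_all_versions` shows one possible alternative that pairs each version with its immediate predecessor. It is a sketch of the idea under those assumptions, not a proposed patch.

```python
from collections import defaultdict


def split_by_changeset(actions):
    """Group (old, new) pairs by the changeset of the new version."""
    grouped = defaultdict(list)
    for old, new in actions:
        grouped[new["changeset"]].append((old, new))
    return dict(grouped)


def actions_latest_only(db, edits):
    """Current behavior: one action per element, old version from the db."""
    latest = {}
    for e in edits:  # edits are in file order, so later versions overwrite
        latest[e["key"]] = e
    return [(db.get(key), e) for key, e in latest.items()]


def actions_all_versions(db, edits):
    """Alternative: chain versions so each edit sees its true predecessor."""
    prev = dict(db)  # copy, so the caller's db is left untouched
    actions = []
    for e in edits:
        actions.append((prev.get(e["key"]), e))
        prev[e["key"]] = e
    return actions


def elem(key, version, changeset):
    return {"key": key, "version": version, "changeset": changeset}


db = {k: elem(k, 1, 90) for k in ("node/1", "node/2", "node/3")}
edits = [
    elem("node/1", 2, 100), elem("node/2", 2, 100), elem("node/3", 2, 100),
    elem("node/1", 3, 101),  # changeset 101 edits node/1 again
    elem("node/1", 4, 102),  # and so does changeset 102
]

lossy = split_by_changeset(actions_latest_only(db, edits))
full = split_by_changeset(actions_all_versions(db, edits))
```

With the latest-only approach, changeset 100 keeps two of its three actions, changeset 101 has none, and changeset 102 is credited with the whole change from version 1 to version 4 of node/1. With chained versions, each changeset gets exactly the edits it made.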