
augmented_diff.py should emit multiple actions if an element is modified several times in a single replication file #1

@jake-low


Minutely .osc replication files contain all of the changes to OSM in a roughly one-minute window. They are structured as sequences of edits, not diffs: if an element is modified several times within that window, the replication file will contain multiple versions of that element. This can happen in two ways:

  1. When a single changeset modifies the same element multiple times. Some editing software (for example StreetComplete) will open relatively long-lived changesets and upload each action the user takes as a change, so multiple modifications to the same element will occur if the user makes several successive edits, or makes an edit and then later uses "undo" to reverse that edit.
  2. When multiple changesets happen to modify the same element within a short time window (either by a single user or multiple users).

augmented_diff.py turns an .osc file into an augmented diff, using a local OSMExpress database that mirrors the planet file. For each replication file, the database is assumed to contain the previous (old) version of all elements. The augmented diff is constructed by combining each new version of an element from the replication file with its previous version from the OSMExpress database. Afterwards, the database is updated from the replication file, so that the invariant holds for the next replication file.
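The pairing step described above can be sketched roughly as follows. This is a simplified illustration, not the actual code in augmented_diff.py: a plain dict stands in for the OSMExpress database, elements are plain dicts, and the names `element_key` and `build_augmented_diff` are hypothetical. It assumes each element appears at most once in the input and ignores deletions.

```python
from typing import Dict, List, Optional, Tuple

Element = dict  # e.g. {"type": "node", "id": 1, "version": 2, "tags": {...}}
Action = Tuple[Optional[Element], Element]  # (old version or None, new version)


def element_key(el: Element) -> str:
    """Build a key such as 'way/12345' from an element's type and ID."""
    return f"{el['type']}/{el['id']}"


def build_augmented_diff(db: Dict[str, Element], new_elements: List[Element]) -> List[Action]:
    """Pair each new element with its previous version, then update the database.

    `db` is a stand-in for the local OSMExpress database (an assumption made
    for this sketch); it must hold the old version of every modified element.
    """
    # Look up the old version of every element before touching the database.
    actions = [(db.get(element_key(new)), new) for new in new_elements]

    # Afterwards, apply the replication file so that the invariant
    # "db holds the previous version" is true for the next file.
    for new in new_elements:
        db[element_key(new)] = new
    return actions


db = {"node/1": {"type": "node", "id": 1, "version": 1, "tags": {}}}
osc = [
    {"type": "node", "id": 1, "version": 2, "tags": {"highway": "crossing"}},
    {"type": "node", "id": 2, "version": 1, "tags": {}},  # newly created, no old version
]
actions = build_augmented_diff(db, osc)
```

After the call, `db` contains version 2 of node 1 and version 1 of node 2, ready for the next replication file.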

Currently, augmented_diff.py does not emit intermediate versions of an element in the augmented diff. In its first pass over the input file, it loads all elements into a dictionary keyed by a string combining each element's type and ID (e.g. way/12345). If multiple versions are present, only the latest version is kept. The output augmented diff therefore contains at most one <action> for each element, representing the final diff of the element before and after all edits in the replication file are applied.
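To make the difference concrete, here is a rough sketch contrasting the collapsing behavior with one possible way to emit an action per version, where each intermediate version becomes the "old" side of the next action. The function names, the dict-based database, and the changeset numbers 100 and 101 are illustrative assumptions, not the script's real structure.

```python
def element_key(el):
    """Build a key such as 'way/12345' from an element's type and ID."""
    return f"{el['type']}/{el['id']}"


def actions_collapsed(db, versions):
    """Mimic the current behavior: keep only the latest version per element."""
    latest = {}
    for el in versions:
        key = element_key(el)
        if key not in latest or el["version"] > latest[key]["version"]:
            latest[key] = el
    return [(db.get(key), el) for key, el in latest.items()]


def actions_per_version(db, versions):
    """Emit one action per version, chaining each version to the one before it."""
    previous = dict(db)  # working copy, so the caller's database is not modified
    actions = []
    # Stable sort by version number so each element's versions are applied in order.
    for el in sorted(versions, key=lambda e: e["version"]):
        key = element_key(el)
        actions.append((previous.get(key), el))
        previous[key] = el  # this version is the "old" side of the next action
    return actions


db = {"way/12345": {"type": "way", "id": 12345, "version": 3, "changeset": 99}}
osc = [
    {"type": "way", "id": 12345, "version": 4, "changeset": 100},
    {"type": "way", "id": 12345, "version": 5, "changeset": 101},
]

collapsed = actions_collapsed(db, osc)
chained = actions_per_version(db, osc)
```

Here `collapsed` holds a single action (version 3 to version 5, attributed to changeset 101), while `chained` holds two actions (3 to 4, then 4 to 5), so changeset 100 keeps its own edit.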

This isn't wrong on its own, but it is lossy if you then want to split up the actions into one augmented diff file per changeset. If an element was edited several times by different changesets within one replication file, only the last changeset's augmented diff will contain any reference to that element. When we visualize these changes in OSMCha, the result is that some changesets appear to be missing edits, because a later changeset that happened to fall in the same replication file "stole" those edits when we divided the actions up.

Here is an example. Changeset 164430940 added crossing:island=no to five highway=crossing nodes. But the augmented diff for that changeset only contains three actions, because two of the five nodes (642814890 and 7066369165) were modified again in changesets 164430943 and 164430946, which both fell within the same replication file (6539963). This results in all three changesets having incorrect augmented diffs:

  • 164430940.adiff is missing two of the five edits it should have
  • 164430943.adiff is missing entirely (404 not found), because all of its elements were again modified in 164430946, so it ended up with no content of its own
  • 164430946.adiff incorrectly claims that it added crossing:island and tactile_paving to several nodes, when in fact those changes were made in the two previous changesets, and 164430946 only added kerb=* tags.
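The action counts in this example can be reproduced with a small model. This sketch makes several assumptions: the three nodes that were not re-edited get placeholder IDs (1, 2, 3), the version numbers are invented, and 164430943 and 164430946 are assumed to touch only the two re-edited nodes, which is the simplest reading of the description above. The function `count_actions_by_changeset` is hypothetical.

```python
from collections import defaultdict

FIRST, SECOND, THIRD = 164430940, 164430943, 164430946
REEDITED = [642814890, 7066369165]  # node IDs named in the issue
OTHER = [1, 2, 3]                   # placeholder IDs for the other three nodes

# Each entry is (node_id, version, changeset); version numbers are made up.
versions = [(n, 2, FIRST) for n in OTHER + REEDITED]
versions += [(n, 3, SECOND) for n in REEDITED]
versions += [(n, 4, THIRD) for n in REEDITED]


def count_actions_by_changeset(versions, keep_all_versions):
    """Count how many actions each changeset's augmented diff would receive."""
    if not keep_all_versions:
        # Mimic the current behavior: only the latest version of each node survives.
        latest = {}
        for node_id, version, changeset in versions:
            if node_id not in latest or version > latest[node_id][1]:
                latest[node_id] = (node_id, version, changeset)
        versions = list(latest.values())

    counts = defaultdict(int)
    for _node_id, _version, changeset in versions:
        counts[changeset] += 1
    return dict(counts)


collapsed = count_actions_by_changeset(versions, keep_all_versions=False)
complete = count_actions_by_changeset(versions, keep_all_versions=True)
```

Under these assumptions, `collapsed` gives 164430940 three actions, gives 164430946 two, and has no entry at all for 164430943, matching the symptoms listed above. `complete` gives the three changesets five, two, and two actions respectively.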
