Inconsistent IDs lead to distributed computing woes.

When trying to work with these data via Dataflow, I noticed a few things:

- The ID field key is inconsistent between files: it is `id` in minhash and signals, but `doc_id` in duplicates.
- IDs are not present as an explicit field in documents. They must be reconstructed from the file path and line number.

This creates a lot of unnecessary friction in big data pipelines, since distributed readers usually don't expose the line number. I'm finding myself writing a custom reader (sort of a bummer, if you've ever had to do it).
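For anyone hitting the same problem, here is a rough sketch of the kind of reader I mean, in plain Python rather than Dataflow code. It assumes the documents are JSON Lines and that an ID is the file path plus a zero-based line number joined by `/`; that format is my guess for illustration, and `make_doc_id`, `read_documents`, and `normalize_key` are hypothetical names.

```python
import io
import json
from typing import Iterator, TextIO, Tuple


def make_doc_id(path: str, line_number: int) -> str:
    # Assumption: IDs look like "<file path>/<zero-based line number>".
    # The real format used by the dataset may differ.
    return f"{path}/{line_number}"


def read_documents(path: str, handle: TextIO) -> Iterator[Tuple[str, dict]]:
    """Yield (doc_id, record) pairs from one JSON Lines document file."""
    # The file has to be read sequentially from the start, otherwise
    # the line numbers (and therefore the IDs) come out wrong.
    for line_number, line in enumerate(handle):
        yield make_doc_id(path, line_number), json.loads(line)


def normalize_key(record: dict) -> Tuple[str, dict]:
    """Key a metadata record by its ID, whether stored as `id` or `doc_id`."""
    doc_id = record["id"] if "id" in record else record["doc_id"]
    return doc_id, record


# Usage with an in-memory file standing in for a real shard.
shard = io.StringIO('{"text": "first"}\n{"text": "second"}\n')
for doc_id, doc in read_documents("some/shard.json", shard):
    print(doc_id, doc["text"])
```

The catch is that each file must be handled whole by a single worker, which gives up the split-level parallelism a text reader would normally provide.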

For future data releases, please consider embedding a consistent key across all file groups for easier joining at scale. A plain UUID would be fine.
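As one possible way to do this, a name-based UUID (version 5) could be derived from the existing path and line number, so the same document gets the same key in every file group. This is only a sketch: the namespace URL and the `stable_doc_uuid` helper are made up for illustration.

```python
import uuid

# Hypothetical namespace; a real release would pick and publish one fixed value.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/dataset")


def stable_doc_uuid(path: str, line_number: int) -> str:
    """Return a deterministic UUID for the document at (path, line_number)."""
    return str(uuid.uuid5(NAMESPACE, f"{path}/{line_number}"))


# The same inputs always produce the same key, so it can be computed once
# at release time and written into every file group as an explicit field.
print(stable_doc_uuid("some/shard.json", 0))
```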

Inconsistent IDs lead to distributed computing woes. #111
