I work at DBT and have been improving an ETL pipeline we run over GOV.UK content, built around the parameters the department needs. I'd like to configure it so it ingests and overwrites only data that has changed, rather than re-ingesting everything on every run.
My plan is:
- Use the Search API and the `updated_at` field to return results changed in the last few days
- Use the Content API to fetch the content, recursing through related pages to pick up collection children etc., again filtering on `updated_at` for new content
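For concreteness, the incremental search step could be sketched as below. This is only a sketch of my intent, not confirmed API usage: the `filter_public_timestamp` parameter name and its `from:` syntax are assumptions on my part, as is the organisation slug.

```python
from datetime import date, timedelta

# Assumed endpoint; the parameter names below are my guesses at the
# Search API's filter syntax, not confirmed behaviour.
SEARCH_URL = "https://www.gov.uk/api/search.json"

def build_search_params(since: date, organisation: str, count: int = 100) -> dict:
    """Build query params for documents updated on or after `since`."""
    return {
        # restrict to our department's documents (slug is a placeholder)
        "filter_organisations": organisation,
        # assumed date-range filter syntax: "from:YYYY-MM-DD"
        "filter_public_timestamp": f"from:{since.isoformat()}",
        "count": count,
        "order": "-public_timestamp",
    }

# e.g. everything updated in the last three days
params = build_search_params(
    date.today() - timedelta(days=3),
    "department-for-business-and-trade",
)
```

The pipeline would then page through the results and hand each document path to the Content API step.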
From the other side of the API, is that a good plan?
- Is `updated_at` reliably updated? Is it safe to base a pipeline on?
- Do I actually need to recurse through the children once this is incremental? It's in there because we found that filtering on our department in the Search API missed lots of documents our department published in related pages
- In testing I'll sometimes get `JSONDecodeError` for very new items, which makes me think I'm picking up drafts. Is there a field I'm missing that would let me ignore these until they're ready?
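In the meantime I'm handling those items defensively, roughly as below: skip anything that doesn't come back as valid JSON and let the next incremental run pick it up. This is just my current workaround, on the assumption that a not-yet-ready item returns a non-200 status or a non-JSON body.

```python
import json

def parse_content_response(status: int, content_type: str, body: str):
    """Parse a Content API response, returning None for items that
    don't look ready to ingest, so the pipeline can skip them and
    retry on the next incremental run."""
    if status != 200:
        return None  # assumption: drafts/unpublished items aren't 200
    if "application/json" not in content_type:
        return None  # e.g. an HTML error page, which would fail json parsing
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None  # empty or truncated body: treat as not ready yet

# ready item parses; anything else is skipped
assert parse_content_response(200, "application/json", '{"title": "x"}')
assert parse_content_response(200, "text/html", "<html>") is None
```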