Description
During the publish phase we have everything we need to create an external schema in Redshift or register the metadata for Athena. Since we know the data lives in AWS, this would be a hugely powerful addition to current functionality.
Pseudocode
parq.register(target="Redshift")
Why?
Registering a schema at publish makes the written data immediately queryable via any SQL workbench tool. We should standardize that the external schema is everything in the path leading up to the dataset name, and the table is the dataset name. So for the path
s3://bananabucket/this/is/a/prefix/dataset/id=123/name=steve/asf809dg8jkljsd12.parquet
the external schema to register would be bananabucket_this_is_a_prefix and the table would be dataset. So querying it via Spectrum / Athena would be
SELECT * FROM bananabucket_this_is_a_prefix.dataset WHERE id > 122 ... WOAH.
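The naming convention could be sketched roughly as follows. This is only an illustration, not part of the library: the function name `schema_and_table` is hypothetical, and it assumes the dataset directory is the segment immediately before the first `key=value` partition directory (or the file's parent directory when there are no partitions).

```python
from urllib.parse import urlparse


def schema_and_table(s3_path):
    """Derive (external_schema, table) from an s3:// object path.

    Hypothetical helper, not an existing parq function.
    """
    parsed = urlparse(s3_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// path: {s3_path!r}")

    # Bucket name first, then every non-empty key segment.
    segments = [parsed.netloc] + [s for s in parsed.path.split("/") if s]

    # Assumption: partitions are Hive-style "key=value" directories, and the
    # dataset directory sits right before the first one.
    first_partition = next(
        (i for i, s in enumerate(segments) if "=" in s), None
    )
    if first_partition is None:
        # Assumption: with no partitions, the last segment is the file and
        # its parent directory is the dataset.
        first_partition = len(segments) - 1
    if first_partition < 2:
        raise ValueError(f"no dataset directory found in {s3_path!r}")

    # Everything before the dataset name becomes the schema.
    schema = "_".join(segments[: first_partition - 1])
    table = segments[first_partition - 1]
    return schema, table


path = (
    "s3://bananabucket/this/is/a/prefix/dataset/"
    "id=123/name=steve/asf809dg8jkljsd12.parquet"
)
schema, table = schema_and_table(path)
print(f"SELECT * FROM {schema}.{table} WHERE id > 122")
```

Note that bucket names may contain characters such as `-` or `.` that are not valid in unquoted SQL identifiers, so a real implementation would probably need to sanitize the schema name; that is left out of this sketch.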