- Dataframe operator now allows a user to either append to a table or replace a table with the `if_exists` parameter. #1379
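The difference between the two `if_exists` behaviours can be illustrated with a minimal sqlite3 sketch (plain Python, not SDK code; the table, column, and helper names here are made up for illustration):

```python
import sqlite3

def write_rows(conn, table, rows, if_exists="append"):
    # "replace" drops and recreates the table; "append" adds to the existing one.
    cur = conn.cursor()
    if if_exists == "replace":
        cur.execute(f"DROP TABLE IF EXISTS {table}")
    cur.execute(f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER)")
    cur.executemany(f"INSERT INTO {table} VALUES (?)", [(r,) for r in rows])
    conn.commit()

conn = sqlite3.connect(":memory:")
write_rows(conn, "t", [1, 2])                    # creates t with 2 rows
write_rows(conn, "t", [3], if_exists="append")   # t now has 3 rows
count_after_append = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
write_rows(conn, "t", [9], if_exists="replace")  # t recreated with 1 row
count_after_replace = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```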
- Fix the `aql.cleanup()` operator failing because the attribute `output` was only implemented in 2.4.0. #1359
- Fix backward compatibility with `apache-airflow-providers-snowflake==4.0.2`. #1351
- LoadFile operator returns a dataframe if not using an XCom backend. #1348, #1337
- Fix the functionality to create region-specific temporary schemas when they don't exist in the same region. #1369
- Cross-link to the API reference page from the Operators page. #1383
- Improve the integration tests to count the number of rows impacted for database operations. #1273
- Run python-sdk tests with Airflow 2.5.0 and fix the CI failures. #1232, #1351, #1317, #1337
- Deprecate `export_file` before renaming it to `export_table_to_file`. #1411
- Remove the need to use a custom XCom backend for storing dataframes when XCom pickling is disabled. #1334, #1331, #1319
- Add support for Google Drive to be used as a `FileLocation`. Example loading a file from Google Drive to Snowflake: #1044

  ```python
  aql.load_file(
      input_file=File(
          path="gdrive://sample-google-drive/sample.csv", conn_id="gdrive_conn"
      ),
      output_table=Table(
          conn_id=SNOWFLAKE_CONN_ID,
          metadata=Metadata(
              database=os.environ["SNOWFLAKE_DATABASE"],
              schema=os.environ["SNOWFLAKE_SCHEMA"],
          ),
      ),
  )
  ```
- Use `DefaultExtractor` from OpenLineage. Users need not set the environment variable `OPENLINEAGE_EXTRACTORS` to use OpenLineage. #1223, #1292
- Generate constraints files for multiple Python and Airflow versions that display the set of "installable" constraints for a particular Python (3.7, 3.8, 3.9) and Airflow version (2.2.5, 2.3.4, 2.4.2). #1226
- Improve the logs when a native transfer falls back to Pandas, and add a fallback indication in `LoadFileOperator`. #1263
- Temporary tables should be cleaned up, even with mapped tasks, via `aql.cleanup()`. #963
- Update the name and namespace as per the new OpenLineage conventions introduced here. #1281
- Delete the Snowflake stage when `LoadFileOperator` fails. #1262
- Update the documentation for Google Drive support. #1044
- Update the documentation to remove the environment variable `OPENLINEAGE_EXTRACTORS` to use OpenLineage. #1292
- Fix the GCS path in `aql.export_file` in the example DAGs. #1339
- When `if_exists` is set to `replace` in the Dataframe operator, replace the table rather than appending to it. This change fixes a regression in the Dataframe operator which caused it to append content to an output table instead of replacing it. #1260
- Pass the table metadata `database` value to the underlying Airflow `PostgresHook` instead of `schema`, as `schema` was renamed to `database` in Airflow as per this PR. #1276
- Include a description of pickling and the usage of a custom XCom backend in README.md #1203
- Investigate and fix tests that are filling up Snowflake database with tmp tables as part of our CI execution. #738
- Make `openlineage` an optional dependency #1252
- Update snowflake-sqlalchemy version #1228
- Raise an error if the dataframe is empty #1238
- Raise an error on database mismatch of an operation #1233
- Pass `task_id` to be used for the parent class on `LoadFileOperator` init #1259
- Add support for Minio #750
- Open Lineage support - Add extractors for `ExportFileOperator` and `DataframeOperator` #903, #1183
- Add check for missing conn_id on transform operator. #1152
- Raise an error when the `copy into` query fails in Snowflake. #890
- Transform op - database/schema is not picked from the table's metadata. #1034
- Change the namespace for Open Lineage #1179
- Add `LOAD_FILE_ENABLE_NATIVE_FALLBACK` config to globally disable native fallback #1089
- Add `OPENLINEAGE_EMIT_TEMP_TABLE_EVENT` config to emit events for temporary tables in Open Lineage. #1121
- Fix an issue with fetching the table row count for Snowflake #1145
- Generate a unique Open Lineage namespace for SQLite-based operations #1141
- Include a section in the docs covering file patterns for the native path of GCS to BigQuery. #800
- Add guide for Open Lineage integration with Astro Python SDK #1116
- Pin SQLAlchemy version to >=1.3.18,<1.4.42 #1185
- Remove the dependency on `AIRFLOW__CORE__ENABLE_XCOM_PICKLING`. Users can set new environment variables, namely `AIRFLOW__ASTRO_SDK__XCOM_STORAGE_CONN_ID` and `AIRFLOW__ASTRO_SDK__XCOM_STORAGE_URL`, and use a custom XCom backend, namely `AstroCustomXcomBackend`, which enables the XCom data to be saved to an S3 or GCS location. #795, #997
- Added OpenLineage support for `LoadFileOperator`, `AppendOperator`, `TransformOperator` and `MergeOperator` #898, #899, #902, #901 and #900
- Add `TransformFileOperator` that:
  - parses a SQL file with templating
  - applies all needed parameters
  - runs the SQL to return a table object

  To keep the `aql.transform_file` function, the function can return `TransformFileOperator().output` in a similar fashion to the merge operator. #892
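As a sketch, the two storage variables named above might be set as follows (the connection id and bucket URL are placeholder values, not SDK defaults; only the variable names come from the entry above):

```shell
# Placeholder values for the custom XCom backend storage location.
export AIRFLOW__ASTRO_SDK__XCOM_STORAGE_CONN_ID=aws_default
export AIRFLOW__ASTRO_SDK__XCOM_STORAGE_URL=s3://my-bucket/xcom
```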
- Add the implementation of row count for `BaseTable`. #1073
- Improved handling of Snowflake identifiers for a smooth experience with the `dataframe`, `run_raw_sql` and `load_file` operators. #917, #1098
- Fix `transform_file` to not depend on the `transform` decorator #1004
- Set the CI to run and publish benchmark reports once a week #443
- Fix a cyclic dependency and improve import time. Reduces the import time for `astro/databases/__init__.py` from 23.254 seconds to 0.062 seconds #1013
- Create GETTING_STARTED.md #1036
- Document the Open Lineage facets published by Astro Python SDK. #1086
- Documentation changes to specify permissions needed for running BigQuery jobs. #896
- Document the details on custom XCOM. #1100
- Document the benchmarking process. #1017
- Include a detailed description on the default Dataset concept in Astro Python SDK. #1092
- NFS volume mount in Kubernetes to test benchmarking from local to databases. #883
- Add filetype when resolving path in case of loading into dataframe #881
- Fix postgres performance regression (example from one_gb file - 5.56min to 1.84min) #876
- Add native autodetect schema feature #780
- Allow users to disable auto addition of inlets/outlets via airflow.cfg #858
- Support for Datasets introduced in Airflow 2.4 #786, #808, #862, #871
- `inlets` and `outlets` will be automatically set for all the operators.
- Users can now schedule DAGs on `File` and `Table` objects. Example:

  ```python
  input_file = File(
      path="https://raw.githubusercontent.com/astronomer/astro-sdk/main/tests/data/imdb_v2.csv"
  )
  imdb_movies_table = Table(name="imdb_movies", conn_id="sqlite_default")
  top_animations_table = Table(name="top_animation", conn_id="sqlite_default")
  START_DATE = datetime(2022, 9, 1)


  @aql.transform()
  def get_top_five_animations(input_table: Table):
      return """
          SELECT title, rating
          FROM {{input_table}}
          WHERE genre1='Animation'
          ORDER BY rating desc
          LIMIT 5;
      """


  with DAG(
      dag_id="example_dataset_producer",
      schedule=None,
      start_date=START_DATE,
      catchup=False,
  ) as load_dag:
      imdb_movies = aql.load_file(
          input_file=input_file,
          task_id="load_csv",
          output_table=imdb_movies_table,
      )

  with DAG(
      dag_id="example_dataset_consumer",
      schedule=[imdb_movies_table],
      start_date=START_DATE,
      catchup=False,
  ) as transform_dag:
      top_five_animations = get_top_five_animations(
          input_table=imdb_movies_table,
          output_table=top_animations_table,
      )
  ```
- Dynamic Task Templates: Tasks that can be used with Dynamic Task Mapping (Airflow 2.3+)
- Create an `upstream_tasks` parameter for dependencies independent of data transfers #585
- Avoid loading whole file into memory with load_operator for schema detection #805
- Directly pass the file to native library when native support is enabled #802
- Create file type for patterns for schema auto-detection #872
- Add a compat module for typing the execute `context` in operators #770
- Fix SQL injection issues #807
- Add `response_size` to `run_raw_sql` and warn about db thrashing #815
- Update quick start example #819
- Add links to docs from README #832
- Fix Astro CLI doc link #842
- Add configuration details from settings.py #861
- Add section explaining table metadata #774
- Fix docstring for run_raw_sql #817
- Add missing docs for Table class #788
- Add the readme.md example dag to example dags folder #681
- Add reason for enabling XCOM pickling #747
- Skip folders while processing paths in load_file operator when file pattern is passed. #733
- Limit Google Protobuf for compatibility with bigquery client. #742
- Added a check to create a table only when `if_exists` is `replace` in `aql.load_file` for Snowflake. #729
- Fix the file type for NDJSON files in the data transfer job from AWS S3 to Google BigQuery. #724
- Create a new version of imdb.csv with lowercase column names and update the examples to use it, so this change is backwards-compatible. #721, #727
- Skip folders while processing paths in load_file operator when file patterns is passed. #733
- Updated the Benchmark docs for GCS to Snowflake and S3 to Snowflake for `aql.load_file` #712, #707
- Restructured the documentation in the `project.toml`, quickstart, readthedocs and README.md #698, #704, #706
- Make astro-sdk-python compatible with major versions of Google Providers. #703
- Consolidate the documentation requirements for sphinx. #699
- Add CI/CD triggers on release branches with dependency on tests. #672
- Improved the performance of `aql.load_file` by supporting database-specific (native) load methods. This is now the default behaviour. Previously, the Astro SDK Python would always use Pandas to load files to SQL databases, which passed the data through the worker node and slowed the performance. #557, #481

  Introduced new arguments to `aql.load_file`:

  - `use_native_support` for data transfer if available on the destination (defaults to `use_native_support=True`)
  - `native_support_kwargs` is a keyword argument to be used by the method involved in the native support flow
  - `enable_native_fallback` can be used to fall back to the default transfer (defaults to `enable_native_fallback=True`)

  Now, there are three modes:

  - `Native`: the default; uses a BigQuery load job in the case of BigQuery, and Snowflake `COPY INTO` using an external stage in the case of Snowflake.
  - `Pandas`: how datasets were previously loaded. To enable this mode, use the argument `use_native_support=False` in `aql.load_file`.
  - `Hybrid`: attempts to use the native strategy to load a file to the database and, (i) if the native strategy fails, falls back to Pandas (ii) with relevant log warnings. #557
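The choice between the three modes can be sketched in plain Python (illustrative pseudologic, not the SDK's internals; only the two argument names come from the changelog entry above):

```python
# Sketch of the mode selection described above; the function and
# the native_succeeds flag are hypothetical, for illustration only.
def choose_strategy(use_native_support=True, native_succeeds=True,
                    enable_native_fallback=True):
    """Return which loader effectively runs: 'native' or 'pandas'."""
    if not use_native_support:
        return "pandas"              # Pandas mode
    if native_succeeds:
        return "native"              # Native mode (the default)
    if enable_native_fallback:
        return "pandas"              # Hybrid mode: fall back with a warning
    raise RuntimeError("Native load failed and fallback is disabled")
```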
- Allow users to specify the table schema (column types) into which a file is being loaded by using `table.columns`. If this table attribute is not set, the Astro SDK still tries to infer the schema by using Pandas (which is the previous behaviour). #532
- Add an example DAG for Dynamic Task Mapping with the Astro SDK. #377, airflow-2.3.0
- The `aql.dataframe` argument `identifiers_as_lower` (which was `boolean`, with the default set to `False`) was replaced by the argument `columns_names_capitalization` (`string` with possible values `["upper", "lower", "original"]`; the default is `lower`). #564
- `aql.load_file` previously changed the capitalization of all column titles to uppercase by default; now it makes them lowercase by default. The old behaviour can be achieved by using the argument `columns_names_capitalization="upper"`. #564
- `aql.load_file` attempts to load files to BigQuery and Snowflake by using native methods, which may have pre-requirements to work. To disable this mode, use the argument `use_native_support=False` in `aql.load_file`. #557, #481
- `aql.dataframe` will raise an exception if the default Airflow XCom backend is being used. To solve this, either use an external XCom backend, such as S3 or GCS, or set the configuration `AIRFLOW__ASTRO_SDK__DATAFRAME_ALLOW_UNSAFE_STORAGE=True`. #444
- Change the declaration of the default Astro SDK temporary schema from `AIRFLOW__ASTRO__SQL_SCHEMA` to `AIRFLOW__ASTRO_SDK__SQL_SCHEMA` #503
- Renamed `aql.truncate` to `aql.drop_table` #554
- Fix missing Airflow task terminal states in `CleanupOperator` #525
- Allow chaining `aql.drop_table` (previously `truncate`) tasks using the Task Flow API syntax. #554, #515
- Improved the performance of `aql.load_file` for the files below:
- Get configurations via the Airflow Configuration manager. #503
- Change catching `ValueError` and `AttributeError` to `DatabaseCustomError` #595
- Unpin the pandas upper-bound dependency #620
- Remove markupsafe from dependencies #623
- Added `extend_existing` to the SQLA `Table` object #626
- Move the config to store DataFrames in XCom to the settings file #537
- Make the operator names consistent #634
- Use `exc_info` for exception logging #643
- Update the query for getting the BigQuery table schema #661
- Use lazy evaluated Type Annotations from PEP 563 #650
- Provide Google Cloud Credentials env var for bigquery #679
- Handle breaking changes for Snowflake provider versions 3.2.0 and 3.1.0 #686
Feature:
Internals:
Enhancement:
- Fail LoadFileOperator operator when input_file does not exist #467
- Create scripts to launch benchmark testing to Google cloud #432
- Bump Google Provider for google extra #294
Feature:
Breaking Change:
`aql.merge` interface changed. The argument `merge_table` changed to `target_table`; `target_columns` and `merge_column` were combined into the `column` argument; `merge_keys` changed to `target_conflict_columns`; `conflict_strategy` changed to `if_conflicts`. More details can be found at #422, #466
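The simple renames above can be sketched as a plain mapping (the helper function is hypothetical, for illustration only; the combining of `target_columns` and `merge_column` into `column` is not covered by this one-to-one mapping):

```python
# Old-to-new keyword names for aql.merge, per the entry above.
MERGE_ARG_RENAMES = {
    "merge_table": "target_table",
    "merge_keys": "target_conflict_columns",
    "conflict_strategy": "if_conflicts",
}

def translate_merge_kwargs(old_kwargs):
    """Hypothetical helper: rewrite pre-change keyword arguments."""
    return {MERGE_ARG_RENAMES.get(k, k): v for k, v in old_kwargs.items()}
```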
Enhancement:
- Document (new) load_file benchmark datasets #449
- Made improvement to benchmark scripts and configurations #458, #434, #461, #460, #437, #462
- Performance evaluation for loading datasets with Astro Python SDK 0.9.2 into BigQuery #437
Bug fix:
- Change `export_file` to return a `File` object #454.
Bug fix:
- Table unable to have Airflow templated names #413
Enhancements:
- Introduction of the user-facing `Table`, `Metadata` and `File` classes
Breaking changes:
- The operator `save_file` became `export_file`
- The tasks `load_file`, `export_file` (previously `save_file`) and `run_raw_sql` should be used with `Table`, `Metadata` and `File` instances
- The decorators `dataframe`, `run_raw_sql` and `transform` should be used with `Table` and `Metadata` instances
- The operators `aggregate_check`, `boolean_check`, `render` and `stats_check` were temporarily removed
- The class `TempTable` was removed. It is possible to declare temporary tables by using `Table(temp=True)`. All temporary table names are prefixed with `_tmp_`. If the user decides to name a `Table`, it is no longer temporary, unless the user enforces it to be.
- The only mandatory property of a `Table` instance is `conn_id`. If no metadata is given, the library will try to extract the schema and other information from the connection object. If it is missing, it will default to the `AIRFLOW__ASTRO__SQL_SCHEMA` environment variable.
Internals:
- Major refactor introducing the `Database`, `File`, `FileType` and `FileLocation` concepts.
Enhancements:
- Add support for Airflow 2.3 #367.
Breaking change:
- We have renamed the artifacts we release to `astro-sdk-python` from `astro-projects`. `0.8.4` is the last version for which we have published both `astro-sdk-python` and `astro-projects`.
Bug fix:
- Do not attempt to create a schema if it already exists #329.
Bug fix:
- Support dataframes from different databases in dataframe operator #325
Enhancements:
- Add an integration test case for `SqlDecoratedOperator` to test execution of raw SQL #316
Bug fix:
- Snowflake transform without `input_table` #319
Feature:
- `load_file` support for nested NDJSON files #257
Breaking change:
`aql.dataframe` switches the capitalization to lowercase by default. This behaviour can be changed by using `identifiers_as_lower` #154
Documentation:
- Fix commands in README.md #242
- Add scripts to auto-generate Sphinx documentation
Enhancements:
- Improve type hints coverage
- Improve Amazon S3 example DAG, so it does not rely on pre-populated data #293
- Add example DAG to load/export from BigQuery #265
- Fix usages of mutable default args #267
- Enable DeepSource validation #299
- Improve code quality and coverage
Bug fixes:
- Support `gcpbigquery` connections #294
- Support the `params` argument in `aql.render` to override SQL Jinja template values #254
- Fix `aql.dataframe` when the table arg is absent #259
Others:
- Refactor integration tests, so they can run across all supported databases #229, #234, #235, #236, #206, #217
Feature:
`load_file` to a Pandas dataframe, without SQL database dependencies #77
Documentation:
- Simplify README #101
- Add Release Guidelines #160
- Add Code of Conduct #101
- Add Contribution Guidelines #101
Enhancements:
- Add SQLite example #149
- Allow customization of `task_id` when using `dataframe` #126
- Use standard AWS environment variables, as opposed to `AIRFLOW__ASTRO__CONN_AWS_DEFAULT` #175
Bug fixes:
- Fix `merge` `XComArg` support #183
- Fixes to `load_file`:
- Fixes to `render`:
- Fix `transform`, so it works with SQLite #159
Others:
Features:
- Support SQLite #86
- Support users who can't create schemas #121
- Ability to install optional dependencies (amazon, google, snowflake) #82
Enhancements:
- Change `render` so it creates a DAG as opposed to a TaskGroup #143
- Allow users to specify a custom version of `snowflake_sqlalchemy` #127
Bug fixes:
Others: