diff --git a/docs/api.rst b/docs/api.rst index 8c69f8d2..79c47e2f 100644 --- a/docs/api.rst +++ b/docs/api.rst @@ -1,6 +1,6 @@ .. _api: -API Reference +API reference ============= This section provides comprehensive API documentation for all PyAthena classes and functions, organized by functionality. @@ -10,21 +10,21 @@ This section provides comprehensive API documentation for all PyAthena classes a :caption: API Documentation: api/connection + api/converters + api/utilities + api/errors + api/filesystem + api/models + api/sqlalchemy api/pandas api/arrow api/s3fs api/spark - api/converters - api/sqlalchemy - api/filesystem - api/models - api/utilities - api/errors -Quick Reference +Quick reference --------------- -Core Functionality +Core functionality ~~~~~~~~~~~~~~~~~~ - :ref:`api_connection` - Connection management and basic cursors @@ -32,17 +32,17 @@ Core Functionality - :ref:`api_utilities` - Utility functions and base classes - :ref:`api_errors` - Exception handling and error classes -Specialized Integrations +Infrastructure +~~~~~~~~~~~~~~~ + +- :ref:`api_filesystem` - S3 filesystem integration and object management +- :ref:`api_models` - Athena query execution and metadata models + +Specialized integrations ~~~~~~~~~~~~~~~~~~~~~~~~ +- :ref:`api_sqlalchemy` - SQLAlchemy dialect implementations - :ref:`api_pandas` - pandas DataFrame integration - :ref:`api_arrow` - Apache Arrow columnar data integration - :ref:`api_s3fs` - Lightweight S3FS-based cursor (no pandas/pyarrow required) - :ref:`api_spark` - Apache Spark integration for big data processing -- :ref:`api_sqlalchemy` - SQLAlchemy dialect implementations - -Infrastructure -~~~~~~~~~~~~~~~ - -- :ref:`api_filesystem` - S3 filesystem integration and object management -- :ref:`api_models` - Athena query execution and metadata models \ No newline at end of file diff --git a/docs/cursor.rst b/docs/cursor.rst index e7d60a44..53ea6db8 100644 --- a/docs/cursor.rst +++ b/docs/cursor.rst @@ -329,9 +329,6 @@ 
AsyncS3FSCursor See :ref:`async-s3fs-cursor`. -API Reference -------------- - SparkCursor ----------- diff --git a/docs/index.rst b/docs/index.rst index 29000d80..a372eebe 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -16,8 +16,8 @@ Documentation introduction usage - sqlalchemy cursor + sqlalchemy pandas arrow s3fs diff --git a/docs/introduction.rst b/docs/introduction.rst index dc1e90fa..b28bf509 100644 --- a/docs/introduction.rst +++ b/docs/introduction.rst @@ -43,7 +43,7 @@ PyAthena provides comprehensive support for Amazon Athena's data types and featu **Core Features:** - **DB API 2.0 Compliance**: Full PEP 249 compatibility for database operations - **SQLAlchemy Integration**: Native dialect support with table reflection and ORM capabilities - - **Multiple Cursor Types**: Standard, Pandas, Arrow, and Spark cursor implementations + - **Multiple Cursor Types**: Standard, Pandas, Arrow, S3FS, and Spark cursor implementations - **Async Support**: Asynchronous query execution for non-blocking operations **Data Type Support:** diff --git a/docs/s3fs.rst b/docs/s3fs.rst index d090f939..3f2b7faa 100644 --- a/docs/s3fs.rst +++ b/docs/s3fs.rst @@ -111,7 +111,7 @@ Execution information of the query can also be retrieved. print(cursor.service_processing_time_in_millis) print(cursor.output_location) -Type Conversion +Type conversion ~~~~~~~~~~~~~~~ S3FSCursor converts Athena data types to Python types using the built-in converter. 
@@ -172,7 +172,7 @@ Then specify an instance of this class in the converter argument when creating a cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/", region_name="us-west-2").cursor(S3FSCursor, converter=CustomS3FSTypeConverter()) -CSV Reader Options +CSV reader options ~~~~~~~~~~~~~~~~~~ S3FSCursor supports pluggable CSV reader implementations to control how NULL values and empty strings diff --git a/docs/sqlalchemy.rst b/docs/sqlalchemy.rst index f55fd81c..1c83d5af 100644 --- a/docs/sqlalchemy.rst +++ b/docs/sqlalchemy.rst @@ -54,6 +54,8 @@ Dialect & driver +-----------+--------+------------------+----------------------+ | awsathena | arrow | awsathena+arrow | :ref:`arrow-cursor` | +-----------+--------+------------------+----------------------+ +| awsathena | s3fs | awsathena+s3fs | :ref:`s3fs-cursor` | ++-----------+--------+------------------+----------------------+ Dialect options --------------- @@ -504,6 +506,7 @@ The ``on_start_query_execution`` callback is supported by all PyAthena SQLAlchem * ``awsathena`` and ``awsathena+rest`` (default cursor) * ``awsathena+pandas`` (pandas cursor) * ``awsathena+arrow`` (arrow cursor) +* ``awsathena+s3fs`` (S3FS cursor) Usage with different dialects: @@ -521,15 +524,15 @@ Usage with different dialects: connect_args={"on_start_query_execution": query_callback} ) -Complex Data Types +Complex data types ------------------ -STRUCT Type Support +STRUCT type support ~~~~~~~~~~~~~~~~~~~ PyAthena provides comprehensive support for Amazon Athena's STRUCT (also known as ROW) data types, enabling you to work with complex nested data structures in your Python applications. -Basic Usage +Basic usage ^^^^^^^^^^^ .. 
code:: python @@ -564,7 +567,7 @@ This generates the following SQL structure: settings ROW(theme STRING, notifications ROW(email STRING, push STRING)) ) -Querying STRUCT Data +Querying STRUCT data ^^^^^^^^^^^^^^^^^^^^ PyAthena automatically converts STRUCT data between different formats: @@ -579,11 +582,11 @@ PyAthena automatically converts STRUCT data between different formats: text("SELECT ROW('John Doe', 30, 'john@example.com') as profile") ) ).fetchone() - + # Access STRUCT fields as dictionary profile = result.profile # {"0": "John Doe", "1": 30, "2": "john@example.com"} -Named STRUCT Fields +Named STRUCT fields ^^^^^^^^^^^^^^^^^^^ For better readability, use JSON casting to get named fields: @@ -596,12 +599,12 @@ For better readability, use JSON casting to get named fields: text("SELECT CAST(ROW('John', 30) AS JSON) as user_data") ) ).fetchone() - + # Parse JSON result import json user_data = json.loads(result.user_data) # ["John", 30] -Data Format Support +Data format support ^^^^^^^^^^^^^^^^^^^ PyAthena supports multiple STRUCT data formats: @@ -617,7 +620,7 @@ PyAthena supports multiple STRUCT data formats: .. code:: python - # Input: '{"name": "John", "age": 30}' + # Input: '{"name": "John", "age": 30}' # Output: {"name": "John", "age": 30} **Unnamed STRUCT Format:** @@ -627,14 +630,14 @@ PyAthena supports multiple STRUCT data formats: # Input: "{Alice, 25}" # Output: {"0": "Alice", "1": 25} -Performance Considerations +Performance considerations ^^^^^^^^^^^^^^^^^^^^^^^^^^ - **JSON Format**: Recommended for complex nested structures - **Native Format**: Optimized for simple key-value pairs - **Smart Detection**: PyAthena automatically detects the format to avoid unnecessary parsing overhead -Best Practices +Best practices ^^^^^^^^^^^^^^ 1. 
**Use JSON casting** for complex nested structures: @@ -666,7 +669,7 @@ Best Practices # Process struct data field_value = result.struct_column.get('field_name') -Migration from Raw Strings +Migration from raw strings ^^^^^^^^^^^^^^^^^^^^^^^^^^^ **Before (raw string handling):** @@ -686,12 +689,12 @@ Migration from Raw Strings struct_data = result[0] # {"name": "John", "age": 30} - automatically converted name = struct_data['name'] # Direct access -MAP Type Support +MAP type support ~~~~~~~~~~~~~~~~ PyAthena provides comprehensive support for Amazon Athena's MAP data types, enabling you to work with key-value data structures in your Python applications. -Basic Usage +Basic usage ^^^^^^^^^^^ .. code:: python @@ -718,7 +721,7 @@ This generates the following SQL structure: categories MAP ) -Querying MAP Data +Querying MAP data ^^^^^^^^^^^^^^^^^ PyAthena automatically converts MAP data between different formats: @@ -733,11 +736,11 @@ PyAthena automatically converts MAP data between different formats: text("SELECT MAP(ARRAY['name', 'category'], ARRAY['Laptop', 'Electronics']) as product_info") ) ).fetchone() - + # Access MAP data as dictionary product_info = result.product_info # {"name": "Laptop", "category": "Electronics"} -Advanced MAP Operations +Advanced MAP operations ^^^^^^^^^^^^^^^^^^^^^^^ For complex MAP operations, use JSON casting: @@ -750,12 +753,12 @@ For complex MAP operations, use JSON casting: text("SELECT CAST(MAP(ARRAY['price', 'rating'], ARRAY['999', '4.5']) AS JSON) as data") ) ).fetchone() - + # Parse JSON result import json data = json.loads(result.data) # {"price": "999", "rating": "4.5"} -Data Format Support +Data format support ^^^^^^^^^^^^^^^^^^^ PyAthena supports multiple MAP data formats: @@ -771,17 +774,17 @@ PyAthena supports multiple MAP data formats: ..
code:: python - # Input: '{"name": "Laptop", "category": "Electronics"}' + # Input: '{"name": "Laptop", "category": "Electronics"}' # Output: {"name": "Laptop", "category": "Electronics"} -Performance Considerations +Performance considerations ^^^^^^^^^^^^^^^^^^^^^^^^^^ - **JSON Format**: Recommended for complex nested structures - **Native Format**: Optimized for simple key-value pairs - **Smart Detection**: PyAthena automatically detects the format to avoid unnecessary parsing overhead -Best Practices +Best practices ^^^^^^^^^^^^^^ 1. **Use JSON casting** for complex nested structures: @@ -805,7 +808,7 @@ Best Practices # Process map data value = result.map_column.get('key_name') -Migration from Raw Strings +Migration from raw strings ^^^^^^^^^^^^^^^^^^^^^^^^^^^ **Before (raw string handling):** @@ -825,7 +828,7 @@ Migration from Raw Strings map_data = result[0] # {"key1": "value1", "key2": "value2"} - automatically converted value = map_data['key1'] # Direct access -ARRAY Type Support +ARRAY type support ~~~~~~~~~~~~~~~~~~ PyAthena provides comprehensive support for Amazon Athena's ARRAY data types, enabling you to work with ordered collections of data in your Python applications.
@@ -857,7 +860,7 @@ This creates a table definition equivalent to: categories ARRAY ) -Querying ARRAY Data +Querying ARRAY data ^^^^^^^^^^^^^^^^^^^ PyAthena automatically converts ARRAY data between different formats: @@ -872,11 +875,11 @@ PyAthena automatically converts ARRAY data between different formats: text("SELECT ARRAY[1, 2, 3, 4, 5] as item_ids") ) ).fetchone() - + # Access ARRAY data as Python list item_ids = result.item_ids # [1, 2, 3, 4, 5] -Complex ARRAY Operations +Complex ARRAY operations ^^^^^^^^^^^^^^^^^^^^^^^^ For arrays containing complex data types: @@ -889,7 +892,7 @@ For arrays containing complex data types: text("SELECT ARRAY[ROW('Alice', 25), ROW('Bob', 30)] as users") ) ).fetchone() - + users = result.users # [{"0": "Alice", "1": 25}, {"0": "Bob", "1": 30}] # Using CAST AS JSON for complex ARRAY operations @@ -898,7 +901,7 @@ For arrays containing complex data types: text("SELECT CAST(ARRAY[1, 2, 3] AS JSON) as data") ) ).fetchone() - + # Parse JSON result import json if isinstance(result.data, str): @@ -906,7 +909,7 @@ For arrays containing complex data types: else: array_data = result.data # Already converted to list -Data Format Support +Data format support ^^^^^^^^^^^^^^^^^^^ PyAthena supports multiple ARRAY data formats: @@ -918,7 +921,7 @@ PyAthena supports multiple ARRAY data formats: # Input: '[1, 2, 3]' # Output: [1, 2, 3] - # Input: '[apple, banana, cherry]' + # Input: '[apple, banana, cherry]' # Output: ["apple", "banana", "cherry"] **JSON Format:** @@ -927,7 +930,7 @@ PyAthena supports multiple ARRAY data formats: # Input: '[1, 2, 3]' # Output: [1, 2, 3] - + # Input: '["apple", "banana", "cherry"]' # Output: ["apple", "banana", "cherry"] @@ -938,7 +941,7 @@ PyAthena supports multiple ARRAY data formats: # Input: '[{name=John, age=30}, {name=Jane, age=25}]' # Output: [{"name": "John", "age": 30}, {"name": "Jane", "age": 25}] -Type Definitions +Type definitions ^^^^^^^^^^^^^^^^ AthenaArray supports various item types: @@ -950,15 
+953,15 @@ AthenaArray supports various item types: # Simple arrays AthenaArray(String) # ARRAY AthenaArray(Integer) # ARRAY - + # Arrays of complex types AthenaArray(AthenaStruct(...)) # ARRAY> AthenaArray(AthenaMap(...)) # ARRAY> - + # Nested arrays AthenaArray(AthenaArray(Integer)) # ARRAY> -Best Practices +Best practices ^^^^^^^^^^^^^^ 1. **Use appropriate item types** in AthenaArray definitions: @@ -983,7 +986,7 @@ Best Practices # Process array data first_item = result.array_column[0] if result.array_column else None -Migration from Raw Strings +Migration from raw strings ^^^^^^^^^^^^^^^^^^^^^^^^^^^ **Before (raw string handling):** @@ -1003,12 +1006,12 @@ Migration from Raw Strings array_data = result[0] # [1, 2, 3] - automatically converted first_item = array_data[0] # Direct access -JSON Type Support +JSON type support ~~~~~~~~~~~~~~~~~ PyAthena provides support for Amazon Athena's JSON data type, enabling you to work with JSON data in your SQLAlchemy applications. The JSON type is primarily used with Data Manipulation Language (DML) operations in Athena. -Basic Usage +Basic usage ^^^^^^^^^^^ .. code:: python @@ -1023,7 +1026,7 @@ Basic Usage Column('config', JSON) ) -Querying JSON Data +Querying JSON data ^^^^^^^^^^^^^^^^^^ When querying JSON data, PyAthena automatically parses JSON strings into Python dictionaries: @@ -1048,7 +1051,7 @@ When querying JSON data, PyAthena automatically parses JSON strings into Python print(result.json_col) # {"name": "test", "value": 123} print(type(result.json_col)) # -Important Limitations +Important limitations ^^^^^^^^^^^^^^^^^^^^^ Athena's JSON type support has specific limitations: @@ -1075,7 +1078,7 @@ Athena's JSON type support has specific limitations: # This will raise InvalidRequestException # CAST('[1, 2, 3]' AS JSON) -Best Practices +Best practices ^^^^^^^^^^^^^^ 1.
**Use with SELECT queries** - JSON type works best for querying existing data diff --git a/docs/usage.rst b/docs/usage.rst index 705ef54a..e0ec51ff 100644 --- a/docs/usage.rst +++ b/docs/usage.rst @@ -180,14 +180,14 @@ The S3 staging directory is not checked, so it's possible that the location of t .. _query-execution-callback: -Query Execution Callback +Query execution callback ------------------------- -PyAthena provides a callback mechanism that allows you to get immediate access to the query ID +PyAthena provides a callback mechanism that allows you to get immediate access to the query ID as soon as the ``start_query_execution`` API call is made, before waiting for query completion. This is useful for monitoring, logging, or cancelling long-running queries from another thread. -The ``on_start_query_execution`` callback can be configured at both the connection level and +The ``on_start_query_execution`` callback can be configured at both the connection level and the execute level. When both are set, both callbacks will be invoked. 
Connection-level callback @@ -208,7 +208,7 @@ You can set a default callback for all queries executed through a connection: region_name="us-west-2", on_start_query_execution=query_callback ).cursor() - + cursor.execute("SELECT * FROM many_rows") # Callback will be invoked Execute-level callback @@ -227,9 +227,9 @@ You can also specify a callback for individual query executions: s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/", region_name="us-west-2" ).cursor() - + cursor.execute( - "SELECT * FROM many_rows", + "SELECT * FROM many_rows", on_start_query_execution=specific_callback ) @@ -246,7 +246,7 @@ A common use case is to cancel long-running analytical queries after a timeout: def cancel_long_running_query(): """Example: Cancel a complex analytical query after 10 minutes.""" - + def track_query_start(query_id): print(f"Long-running analysis started: {query_id}") return query_id @@ -269,27 +269,27 @@ A common use case is to cancel long-running analytical queries after a timeout: # Complex analytical query that might run for a long time long_query = """ WITH daily_metrics AS ( - SELECT + SELECT date_trunc('day', timestamp_col) as day, user_id, COUNT(*) as events, AVG(duration) as avg_duration - FROM large_events_table + FROM large_events_table WHERE timestamp_col >= current_date - interval '1' year GROUP BY 1, 2 ), user_segments AS ( - SELECT + SELECT user_id, - CASE + CASE WHEN AVG(events) > 100 THEN 'high_activity' - WHEN AVG(events) > 10 THEN 'medium_activity' + WHEN AVG(events) > 10 THEN 'medium_activity' ELSE 'low_activity' END as segment FROM daily_metrics GROUP BY user_id ) - SELECT + SELECT segment, COUNT(DISTINCT user_id) as users, AVG(events) as avg_daily_events @@ -307,13 +307,13 @@ A common use case is to cancel long-running analytical queries after a timeout: try: print("Starting complex analytical query (10-minute timeout)...") cursor.execute(long_query) - + # Process results results = cursor.fetchall() print(f"Analysis completed successfully: 
{len(results)} segments found") for row in results: print(f" {row[0]}: {row[1]} users, {row[2]:.1f} avg events") - + except Exception as e: print(f"Query failed or was cancelled: {e}") finally: @@ -329,7 +329,7 @@ A common use case is to cancel long-running analytical queries after a timeout: Multiple callbacks ~~~~~~~~~~~~~~~~~~~ -When both connection-level and execute-level callbacks are specified, +When both connection-level and execute-level callbacks are specified, both callbacks will be invoked: .. code:: python @@ -349,10 +349,10 @@ both callbacks will be invoked: region_name="us-west-2", on_start_query_execution=connection_callback ).cursor() - + # This will invoke both connection_callback and execute_callback cursor.execute( - "SELECT 1", + "SELECT 1", on_start_query_execution=execute_callback ) @@ -362,11 +362,11 @@ Supported cursor types The ``on_start_query_execution`` callback is supported by the following cursor types: * ``Cursor`` (default cursor) -* ``DictCursor`` +* ``DictCursor`` * ``ArrowCursor`` * ``PandasCursor`` -Note: ``AsyncCursor`` and its variants do not support this callback as they already +Note: ``AsyncCursor`` and its variants do not support this callback as they already return the query ID immediately through their different execution model. Environment variables