Hi @joyceannie,
Excellent work. I learned a lot from your project. I tried to reproduce the pipeline, and when I connected to the Airflow GUI and inspected the DAG, the first task, extract_reddit_data, failed with the error below:
*** Reading local file: /opt/airflow/logs/dag_id=elt_reddit_pipeline/run_id=scheduled__2023-08-13T00:00:00+00:00/task_id=extract_reddit_data/attempt=1.log
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1159} INFO - Dependencies all met for <TaskInstance: elt_reddit_pipeline.extract_reddit_data scheduled__2023-08-13T00:00:00+00:00 [queued]>
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1159} INFO - Dependencies all met for <TaskInstance: elt_reddit_pipeline.extract_reddit_data scheduled__2023-08-13T00:00:00+00:00 [queued]>
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1356} INFO -
--------------------------------------------------------------------------------
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1357} INFO - Starting attempt 1 of 1
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1358} INFO -
--------------------------------------------------------------------------------
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1377} INFO - Executing <Task(BashOperator): extract_reddit_data> on 2023-08-13 00:00:00+00:00
[2023-08-14, 12:20:55 UTC] {standard_task_runner.py:52} INFO - Started process 103 to run task
[2023-08-14, 12:20:55 UTC] {standard_task_runner.py:79} INFO - Running: ['***', 'tasks', 'run', 'elt_reddit_pipeline', 'extract_reddit_data', 'scheduled__2023-08-13T00:00:00+00:00', '--job-id', '3', '--raw', '--subdir', 'DAGS_FOLDER/elt_reddit_pipeline.py', '--cfg-path', '/tmp/tmpvs1xefxw', '--error-file', '/tmp/tmp4abobruo']
[2023-08-14, 12:20:55 UTC] {standard_task_runner.py:80} INFO - Job 3: Subtask extract_reddit_data
[2023-08-14, 12:20:55 UTC] {task_command.py:370} INFO - Running <TaskInstance: elt_reddit_pipeline.extract_reddit_data scheduled__2023-08-13T00:00:00+00:00 [running]> on host 56fecb3fcaf8
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1571} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=elt_reddit_pipeline
AIRFLOW_CTX_TASK_ID=extract_reddit_data
AIRFLOW_CTX_EXECUTION_DATE=2023-08-13T00:00:00+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2023-08-13T00:00:00+00:00
[2023-08-14, 12:20:55 UTC] {subprocess.py:62} INFO - Tmp dir root location:
/tmp
[2023-08-14, 12:20:55 UTC] {subprocess.py:74} INFO - Running command: ['bash', '-c', 'python /opt/***/extraction/reddit_extract_data.py 2023-08-14']
[2023-08-14, 12:20:55 UTC] {subprocess.py:85} INFO - Output:
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - Successfully connected to Reddit
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - Traceback (most recent call last):
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - File "/opt/***/extraction/reddit_extract_data.py", line 99, in <module>
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - main()
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - File "/opt/***/extraction/reddit_extract_data.py", line 48, in main
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - load_to_csv(transformed_data)
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - File "/opt/***/extraction/reddit_extract_data.py", line 95, in load_to_csv
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - data.to_csv(f'tmp/{output_name}.csv', index = False)
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - File "/home/***/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 3482, in to_csv
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - storage_options=storage_options,
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - File "/home/***/.local/lib/python3.7/site-packages/pandas/io/formats/format.py", line 1105, in to_csv
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - csv_formatter.save()
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - File "/home/***/.local/lib/python3.7/site-packages/pandas/io/formats/csvs.py", line 243, in save
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - storage_options=self.storage_options,
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - File "/home/***/.local/lib/python3.7/site-packages/pandas/io/common.py", line 707, in get_handle
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - newline="",
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - FileNotFoundError: [Errno 2] No such file or directory: 'tmp/2023-08-14.csv'
[2023-08-14, 12:21:03 UTC] {subprocess.py:96} INFO - Command exited with return code 1
[2023-08-14, 12:21:03 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/bash.py", line 195, in execute
f'Bash command failed. The command returned a non-zero exit code {result.exit_code}.'
airflow.exceptions.AirflowException: Bash command failed. The command returned a non-zero exit code 1.
[2023-08-14, 12:21:03 UTC] {taskinstance.py:1400} INFO - Marking task as FAILED. dag_id=elt_reddit_pipeline, task_id=extract_reddit_data, execution_date=20230813T000000, start_date=20230814T122055, end_date=20230814T122103
[2023-08-14, 12:21:03 UTC] {standard_task_runner.py:97} ERROR - Failed to execute job 3 for task extract_reddit_data (Bash command failed. The command returned a non-zero exit code 1.; 103)
[2023-08-14, 12:21:03 UTC] {local_task_job.py:156} INFO - Task exited with return code 1
[2023-08-14, 12:21:03 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
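If I read the traceback correctly, the script writes to the relative path tmp/2023-08-14.csv, which only resolves when the current working directory already contains a tmp/ folder; inside the container it apparently does not. Here is a minimal sketch of the change I would try (assuming load_to_csv receives the DataFrame and the output name; in the repo the name may come from the enclosing scope instead):

```python
import os

def load_to_csv(data, output_name):
    # Write to the absolute /tmp directory instead of the relative
    # path 'tmp/', which breaks when the working directory has no
    # tmp/ folder (as seems to be the case inside the container).
    output_dir = "/tmp"
    os.makedirs(output_dir, exist_ok=True)  # no-op if it already exists
    data.to_csv(os.path.join(output_dir, f"{output_name}.csv"), index=False)
```

Alternatively, creating a tmp/ folder in the BashOperator's working directory would probably also work, but an absolute path seems less fragile.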
I ran the reddit_extract_data.py script locally and it worked and produced the CSV file. The s3_data_upload_etl.py script also worked when run locally, and the CSV file was loaded to the S3 bucket as it should be. However, there is no redshift_data_upload_etl.py script, so I cannot upload the table to Redshift.
Can you please help me solve these issues? Thank you in advance for your understanding.
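In case it is useful, below is roughly the kind of script I was expecting to find for the Redshift step: a standard COPY from S3 into a table. The table name, environment variables, and IAM role are my own placeholders, not names from this repo.

```python
import os
import psycopg2

# Placeholder table/bucket/role names; adjust to the real pipeline.
COPY_SQL = """
    COPY reddit_posts
    FROM 's3://{bucket}/{key}'
    IAM_ROLE '{iam_role}'
    CSV
    IGNOREHEADER 1;
"""

def main():
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            # COPY runs server-side: Redshift itself pulls the CSV
            # from S3 using the IAM role's permissions.
            cur.execute(COPY_SQL.format(
                bucket=os.environ["S3_BUCKET"],
                key="2023-08-14.csv",
                iam_role=os.environ["REDSHIFT_IAM_ROLE"],
            ))
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    main()
```

Is this more or less what the missing redshift_data_upload_etl.py was supposed to do?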