
Task Instance: extract_reddit_data failed to execute #1


Description

@HazmilH

Hi @joyceannie,

Excellent work. I learned a lot from it. I tried to reproduce the pipeline, and when I connected to the Airflow GUI and inspected the DAG, the first task, extract_reddit_data, failed with the error below:

*** Reading local file: /opt/airflow/logs/dag_id=elt_reddit_pipeline/run_id=scheduled__2023-08-13T00:00:00+00:00/task_id=extract_reddit_data/attempt=1.log
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1159} INFO - Dependencies all met for <TaskInstance: elt_reddit_pipeline.extract_reddit_data scheduled__2023-08-13T00:00:00+00:00 [queued]>
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1159} INFO - Dependencies all met for <TaskInstance: elt_reddit_pipeline.extract_reddit_data scheduled__2023-08-13T00:00:00+00:00 [queued]>
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1356} INFO - 
--------------------------------------------------------------------------------
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1357} INFO - Starting attempt 1 of 1
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1358} INFO - 
--------------------------------------------------------------------------------
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1377} INFO - Executing <Task(BashOperator): extract_reddit_data> on 2023-08-13 00:00:00+00:00
[2023-08-14, 12:20:55 UTC] {standard_task_runner.py:52} INFO - Started process 103 to run task
[2023-08-14, 12:20:55 UTC] {standard_task_runner.py:79} INFO - Running: ['***', 'tasks', 'run', 'elt_reddit_pipeline', 'extract_reddit_data', 'scheduled__2023-08-13T00:00:00+00:00', '--job-id', '3', '--raw', '--subdir', 'DAGS_FOLDER/elt_reddit_pipeline.py', '--cfg-path', '/tmp/tmpvs1xefxw', '--error-file', '/tmp/tmp4abobruo']
[2023-08-14, 12:20:55 UTC] {standard_task_runner.py:80} INFO - Job 3: Subtask extract_reddit_data
[2023-08-14, 12:20:55 UTC] {task_command.py:370} INFO - Running <TaskInstance: elt_reddit_pipeline.extract_reddit_data scheduled__2023-08-13T00:00:00+00:00 [running]> on host 56fecb3fcaf8
[2023-08-14, 12:20:55 UTC] {taskinstance.py:1571} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=elt_reddit_pipeline
AIRFLOW_CTX_TASK_ID=extract_reddit_data
AIRFLOW_CTX_EXECUTION_DATE=2023-08-13T00:00:00+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2023-08-13T00:00:00+00:00
[2023-08-14, 12:20:55 UTC] {subprocess.py:62} INFO - Tmp dir root location: 
 /tmp
[2023-08-14, 12:20:55 UTC] {subprocess.py:74} INFO - Running command: ['bash', '-c', 'python /opt/***/extraction/reddit_extract_data.py 2023-08-14']
[2023-08-14, 12:20:55 UTC] {subprocess.py:85} INFO - Output:
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - Successfully connected to Reddit
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - Traceback (most recent call last):
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -   File "/opt/***/extraction/reddit_extract_data.py", line 99, in <module>
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -     main()
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -   File "/opt/***/extraction/reddit_extract_data.py", line 48, in main
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -     load_to_csv(transformed_data)
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -   File "/opt/***/extraction/reddit_extract_data.py", line 95, in load_to_csv
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -     data.to_csv(f'tmp/{output_name}.csv', index = False)
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -   File "/home/***/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 3482, in to_csv
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -     storage_options=storage_options,
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -   File "/home/***/.local/lib/python3.7/site-packages/pandas/io/formats/format.py", line 1105, in to_csv
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -     csv_formatter.save()
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -   File "/home/***/.local/lib/python3.7/site-packages/pandas/io/formats/csvs.py", line 243, in save
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -     storage_options=self.storage_options,
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -   File "/home/***/.local/lib/python3.7/site-packages/pandas/io/common.py", line 707, in get_handle
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO -     newline="",
[2023-08-14, 12:21:03 UTC] {subprocess.py:92} INFO - FileNotFoundError: [Errno 2] No such file or directory: 'tmp/2023-08-14.csv'
[2023-08-14, 12:21:03 UTC] {subprocess.py:96} INFO - Command exited with return code 1
[2023-08-14, 12:21:03 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/bash.py", line 195, in execute
    f'Bash command failed. The command returned a non-zero exit code {result.exit_code}.'
airflow.exceptions.AirflowException: Bash command failed. The command returned a non-zero exit code 1.
[2023-08-14, 12:21:03 UTC] {taskinstance.py:1400} INFO - Marking task as FAILED. dag_id=elt_reddit_pipeline, task_id=extract_reddit_data, execution_date=20230813T000000, start_date=20230814T122055, end_date=20230814T122103
[2023-08-14, 12:21:03 UTC] {standard_task_runner.py:97} ERROR - Failed to execute job 3 for task extract_reddit_data (Bash command failed. The command returned a non-zero exit code 1.; 103)
[2023-08-14, 12:21:03 UTC] {local_task_job.py:156} INFO - Task exited with return code 1
[2023-08-14, 12:21:03 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
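
Looking at the traceback, my guess is that load_to_csv writes to the relative path tmp/{output_name}.csv, which resolves against the Airflow worker's current working directory rather than the project folder, so the tmp/ directory does not exist there. A minimal sketch of the kind of change I have in mind (the signature here is my assumption, not the exact one in the repo):

```python
import os
import pandas as pd

# Assumption: resolve tmp/ relative to this script's own location instead of
# the process's current working directory, and create it if it is missing.
TMP_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "tmp")

def load_to_csv(data: pd.DataFrame, output_name: str) -> None:
    os.makedirs(TMP_DIR, exist_ok=True)  # avoids FileNotFoundError on a missing folder
    data.to_csv(os.path.join(TMP_DIR, f"{output_name}.csv"), index=False)
```

This would also explain why the script works for me locally: run from the project root, tmp/ exists relative to the shell's working directory.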

I ran the reddit_extract_data.py script locally and it worked, producing the CSV file. The s3_data_upload_etl.py script also worked when run locally, and the CSV file was loaded to the S3 bucket as expected. However, there is no redshift_data_upload_etl.py script, so I cannot upload the table to Redshift.
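
In case it helps to show what I mean, here is a purely hypothetical sketch of what a redshift_data_upload_etl.py could look like, using psycopg2 and Redshift's COPY command. Every name below (table, bucket, cluster endpoint, IAM role) is a placeholder, not something from this repo:

```python
import psycopg2

# Hypothetical sketch of a Redshift load step. All connection details, the
# table name, the S3 key, and the IAM role are placeholders.
COPY_SQL = """
    COPY reddit_posts
    FROM 's3://my-bucket/2023-08-14.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV IGNOREHEADER 1;
"""

def load_to_redshift() -> None:
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="dev",
        user="awsuser",
        password="...",
    )
    # The connection context manager commits the transaction on success.
    with conn, conn.cursor() as cur:
        cur.execute(COPY_SQL)
    conn.close()
```

COPY is Redshift's standard way to bulk-load a CSV from S3, which is why I expected a script like this as the last step of the pipeline.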

Can you please help me solve these issues? Thank you in advance for your understanding.
