✨ Add `adls_to_df()` and `adls_list()` Prefect tasks by Diego-H-S · Pull Request #1037 · dyvenia/viadot

Diego-H-S · 2024-09-18T06:28:44Z

Summary

Added Prefect 2.0 tasks to the ADLS utils: one lists the files under a given path, and the other creates a pandas DataFrame from a file at an ADLS path.

Importance

Required by the migration project.

Checklist

This PR:

follows the guidelines laid out in CONTRIBUTING.md
links relevant issue(s)
adds/updates tests (if appropriate)
adds/updates docstrings (if appropriate)
adds an entry in CHANGELOG.md

trymzet · 2024-09-18T13:18:51Z

+def adls_to_df(
+    path: str,
+    sep: str = "\t",
+    credentials_secret: str | None = None,
+    config_key: str | None = None,
+) -> pd.DataFrame:
+    r"""Load file(s) from the Azure Data Lake to a pandas DataFrame.
+
+    Note: Currently supports CSV and parquet files.
+
+    Args:
+        path (str): The path from which to load the DataFrame.
+        sep (str, optional): The separator to use when reading a CSV file.
+            Defaults to "\t".
+        credentials_secret (str, optional): The name of the Azure Key Vault secret
+            storing the credentials.
+        config_key (str, optional): The key in the viadot config holding relevant
+            credentials.
+
+    Raises:
+        MissingSourceCredentialsError: If credentials were not provided.
+
+    Returns:
+        pd.DataFrame: The HTTP response object.
+    """
+    logger = get_run_logger()
+
+    if not (credentials_secret or config_key):
+        raise MissingSourceCredentialsError
+
+    credentials = get_credentials(credentials_secret)
+    lake = AzureDataLake(credentials=credentials, config_key=config_key)
+
+    full_dl_path = str(Path(credentials["ACCOUNT_NAME"], path))
+    logger.info(f"Downloading data from {full_dl_path} to a DataFrame...")
+
+    name = Path(path).stem
+    suffixes = "".join(Path(path).suffixes)
+    file_name = f"{name}{suffixes}"
+
+    lake.download(to_path=file_name, from_path=path, recursive=False)
+
+    if ".csv" in suffixes:
+        df = pd.read_csv(file_name, sep=sep)
+    elif ".parquet" in suffixes:
+        df = pd.read_parquet(file_name)
+
+    Path.unlink(file_name)
+
+    logger.info("Successfully loaded data.")
+
+    return df


There's already an ADLS.to_df() method that you can use for this task, so there's no need to reinvent the wheel.

That is not working. I created it because Fabio asked me to.

What's not working? Can you link the issue? Also, you should fix the issue and not add another function that does the same thing.
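Separately from the duplication question, two details in the quoted diff look fragile: a path ending in neither `.csv` nor `.parquet` skips both branches, so `return df` would raise `UnboundLocalError`, and `Path.unlink(file_name)` passes a plain string where a `Path` instance is expected. Below is a minimal, dependency-free sketch of one way to handle both. The helper names `detect_format` and `read_then_delete` are hypothetical and are not part of viadot; the `reader` argument stands in for `pd.read_csv` or `pd.read_parquet`.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")

# Formats the quoted docstring says are supported.
SUPPORTED_FORMATS = (".csv", ".parquet")


def detect_format(path: str) -> str:
    """Return the first supported suffix found in `path`, or raise ValueError."""
    for suffix in Path(path).suffixes:
        if suffix.lower() in SUPPORTED_FORMATS:
            return suffix.lower()
    msg = f"Unsupported file format: {path!r}. Expected one of {SUPPORTED_FORMATS}."
    raise ValueError(msg)


def read_then_delete(file_name: str, reader: Callable[[str], T]) -> T:
    """Run `reader` on a local file and always remove the file afterwards."""
    local_file = Path(file_name)
    try:
        return reader(file_name)
    finally:
        # Instance call on a Path object; missing_ok avoids a second error
        # if the file was never created.
        local_file.unlink(missing_ok=True)
```

Calling `detect_format(path)` before the download would fail fast on unsupported files, and the `finally` block removes the temporary file even when parsing raises.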

trymzet · 2024-09-18T13:21:34Z

+        path (str): The path to the directory which contents you want to list.
+        recursive (bool, optional): If True, recursively list all subdirectories
+            and files. Defaults to False.
+        file_to_match (str, optional): If exist it only returns files with that name.


What is this param for? If you want to check if a file exists, you can use the existing AzureDataLake.exists() method (and add a task for it if needed). Also, any logic should be defined in the viadot source; tasks and flows should only use viadot sources, without adding any additional ingestion logic on top.

This is a task to list files in ADLS. I ported it from viadot 1 because Fabio needed it. If you want it placed somewhere else, talk to Fabio.

It doesn't matter where the code is from, I'm only checking if it's good enough to merge or not. You forgot to link the issue?
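To illustrate the layering requested in this thread, here is a minimal sketch in which listing and existence checks live in the source class and the task body only delegates. `StubLake` is a hypothetical in-memory stand-in, not viadot's `AzureDataLake`, and its `ls()` signature is an assumption made for the example. In a real task module the function would also carry Prefect's `@task` decorator, which is omitted here to keep the sketch dependency-free.

```python
class StubLake:
    """Hypothetical in-memory stand-in for a data lake source (not viadot's class)."""

    def __init__(self, paths: list[str]) -> None:
        self._paths = list(paths)

    def exists(self, path: str) -> bool:
        """Existence checks belong to the source, not to the task."""
        return path in self._paths

    def ls(self, path: str, recursive: bool = False) -> list[str]:
        """List files under `path`; descend into subdirectories if `recursive`."""
        prefix = path.rstrip("/") + "/"
        children = [p for p in self._paths if p.startswith(prefix)]
        if recursive:
            return children
        # Keep only direct children: nothing after the prefix contains a slash.
        return [p for p in children if "/" not in p[len(prefix):]]


def list_files(lake: StubLake, path: str, recursive: bool = False) -> list[str]:
    """Thin task body: delegate to the source and add no extra filtering logic."""
    return lake.ls(path, recursive=recursive)
```

With this split, a caller that only wants to know whether one file is present uses `lake.exists(...)` rather than a name-matching parameter on the listing task.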

Rafalz13 · 2024-09-26T15:11:44Z

@fdelgadodyvenia Should this be finished or removed? Where is this functionality needed?

trymzet

Pending more info

Diego-H-S added 4 commits September 12, 2024 14:51

✨ added new tasks to adls.py.

33f6a1f

⚡️ updated adls functions names.

7a816d0

🐛 bypassed pyarrow error in azure to df.

e2fd5dc

📝 updated commented rows.

9e475ae

Diego-H-S requested a review from trymzet September 18, 2024 06:29

Diego-H-S added 2 commits September 18, 2024 08:36

🎨 adde precommit suggestions.

0bea12c

📝 updated comments.

200236d

trymzet requested changes Sep 18, 2024

Diego-H-S requested a review from trymzet September 19, 2024 10:37

trymzet changed the title ~~Adls improvements 2.0~~ ✨ Add adls_to_df() and adls_list() Prefect tasks Sep 27, 2024

trymzet requested changes Oct 8, 2024

Diego-H-S wants to merge 6 commits into 2.0 from adls_improvements_2.0

3 participants
