contributing-docs/05_pull_requests.rst

Pull Requests

This document describes how to create Pull Requests (PRs) and the coding standards we use when implementing them.

Once a PR is merged, the commit author's (and any co-author's) name and email address become permanently public as part of the commit history and cannot be changed or removed afterward.

GitHub provides a built-in option to anonymize your commit email address (Setting your commit email address) by using a GitHub-provided email; however, this setting applies only to future commits and cannot be applied retrospectively.

On computers used with multiple GitHub accounts (for example, shared or work computers), the local Git configuration may be set to a different name or email than expected. This can result in unintended personal or work-related details being permanently published. If you are using the Git CLI, you can verify your current configuration by running the following commands in your terminal within your local Airflow repository:

git config --get user.name
git config --get user.email

To keep your email address private on GitHub, enable Keep my email addresses private in your GitHub settings and configure git to use your GitHub-provided noreply address (typically <id>+<username>@users.noreply.github.com) as user.email for future commits.

If you use the Git CLI, you can set the appropriate configuration by running the following commands (substituting your own GitHub username and noreply address):

git config user.name "<username>"
git config user.email "<id>+<username>@users.noreply.github.com"

Notes:

  • GitHub's anonymization does not apply to co-author entries or to email addresses explicitly included in the commit message.
  • GitHub Desktop and other third-party Git clients manage author and email settings separately; verify them using the tool's own configuration or preferences.

Before opening a pull request, please ensure you are comfortable with the author and co-author details that will be publicly visible. Maintainers might point out unexpected personal information before merging, but they are not obliged to do so.

Before you submit a Pull Request (PR) from your forked repo, check that it meets these guidelines:

  • Start with Draft: Until you are sure that your PR passes all the quality checks and tests, keep it in Draft status. This signals to maintainers that the PR is not yet ready for review and prevents them from accidentally merging it before it's ready. Once you are sure that your PR is ready for review, you can mark it as "Ready for review" in the GitHub UI. Our regular check converts all PRs from non-collaborators that do not pass our quality gates back to Draft status, so if your PR is in Draft status and you did not set it yourself, check the comments to see what needs to be fixed.

  • Include tests, either as doctests, unit tests, or both, in your pull request.

    The Airflow repo uses GitHub Actions to run the tests and codecov to track coverage. You can set up both for free on your fork. They will help you make sure you do not break the build with your PR and help increase coverage. We also advise installing the prek hooks locally; they apply various checks, code generation, and formatting at the time you make a local commit, which gives you near-immediate feedback on things you need to fix before you push your code to the PR - and in many cases they will even fix the issue for you locally so that you can add and commit the fix straight away.
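    As a sketch of the doctest option mentioned above, a minimal docstring test might look like this (the function is hypothetical and used only to illustrate the format):

    ```python
    # Hypothetical helper; only the doctest format is the point here.
    def days_to_seconds(days: int) -> int:
        """Convert a number of whole days to seconds.

        >>> days_to_seconds(2)
        172800
        """
        return days * 86400
    ```

    Running `python -m doctest` on the file (or pytest with doctest collection enabled) executes the example in the docstring and fails if the output differs.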

  • Follow our project's Coding style and best practices. We usually attempt to enforce these practices via appropriate prek hooks, but some checks are not yet enforced programmatically (either because they are too hard to automate or simply not done yet).

  • Maintainers will not merge a PR that regresses linting or does not pass CI tests (unless you have a good justification that it is a transient error or something that is being fixed in another PR).

  • Maintainers will not merge PRs that have unresolved conversations.

  • We prefer that you rebase your PR (and do it quite often) rather than merge. It leads to easier reviews and cleaner changes where you know exactly what changes you've done. You can learn more about rebase vs. merge workflow in Rebase and merge your pull request and Rebase your fork. Make sure to resolve all conflicts during rebase.

  • When merging PRs, maintainers use Squash and Merge, which means your PR will be merged as a single commit regardless of the number of commits in your PR. During the review cycle, you can keep a commit history for easier review, but if you need to, you can also squash all commits to reduce the maintenance burden during rebase.

  • Add an Apache License header to all new files. If you have prek installed, it will do this automatically for you. If you hesitate to install prek for your local repository - for example because it adds a few seconds to each commit - this feature alone might be a good reason to install it.

  • If your PR adds functionality, make sure to update the docs as part of the same PR, not only code and tests. A docstring is often sufficient. Make sure to follow the Sphinx-compatible standards.

  • Make sure your code fulfills all the static code checks we have in our code base. The easiest way to ensure that is - again - to install the prek hooks.

  • Make sure your PR is small and focused on one change only - avoid adding unrelated changes or mixing feature work with refactoring. Keeping to that rule makes your PR easier to review and makes it easier for release managers if they decide that your change should be cherry-picked into a bug-fix release of Airflow. If you want to add a new feature and refactor the code, it's better to split the work into several smaller PRs. It's also quite a good and common idea to keep a big Draft PR if you have a bigger change that you want to make, and then create smaller PRs from it that are easier to review, merge, and cherry-pick. Reviewing a big PR takes a long time and a lot of reviewer attention and focus, so by splitting it into smaller PRs you actually speed up the review process and make it easier for your change to eventually be merged.

  • To learn more about cherry-pick of PRs, refer to the Airflow 3 development documentation.

  • Run relevant tests locally before opening a PR. Tests are often placed in files corresponding to the changed code (for example, for airflow/cli/cli_parser.py changes you have tests in tests/cli/test_cli_parser.py). However, there are a number of cases where the tests that should run are placed elsewhere - you can either run tests for the whole relevant TEST_TYPE (see the breeze testing core-tests --help or breeze testing providers-tests --help output for available test types for each of the testing commands), run all tests, or simply push your code to the PR and see the results of the tests in CI.

  • You can use any supported Python version to run the tests, but it is best to check whether your change works with the oldest supported version (currently Python 3.10). In rare cases, tests might fail on the oldest version when you use features that are only available in newer Python versions. For that purpose we have the airflow.compat package, where we keep useful features back-ported from newer versions.

  • Adhere to the guidelines for commit messages described in this article. This makes the lives of those who come after you (and your future self) a lot easier.

Every open PR must meet the following minimum quality criteria before maintainers will review it. PRs that do not meet these criteria may be automatically converted to draft status by project tooling, with a comment explaining what needs to be fixed.

  1. Descriptive title — The PR title must clearly describe the change. Generic titles such as "Fix bug", "Update code", "Changes", single-word titles, or titles that only reference an issue number (e.g. "Fixes #12345") do not meet this bar.
  2. Meaningful description — The PR body must contain a meaningful description of what the PR does and why. An empty body, a body consisting only of the PR template checkboxes/headers with no added text, or a body that merely repeats the title is not sufficient.
  3. Passing static checks — The PR's static checks (pre-commit / ruff / mypy) must pass. You can run them locally with prek run --from-ref main before pushing.
  4. Gen-AI disclosure — If the PR was created with the assistance of generative AI tools, the description must include a disclosure (see Gen-AI Assisted contributions below).
  5. Coherent changes — The PR should contain related changes only. Completely unrelated changes bundled together will be flagged.

What happens when a PR is converted to draft?

  • The comment informs you what you need to do.
  • Fix each issue, then mark the PR as "Ready for review" in the GitHub UI - but only after making sure that all the issues are fixed.
  • There is no rush — take your time and work at your own pace. We appreciate your contribution and are happy to wait for updates.
  • Maintainers will then proceed with a normal review.

Converting a PR to draft is not a rejection — it is an invitation to bring the PR up to the project's standards so that maintainer review time is spent productively.

For details on how maintainers use tooling to triage and review PRs, see Maintainer PR Triage and Review.

What happens when a PR is closed for quality violations?

If a contributor has more than 3 open PRs that are flagged for quality issues, maintainers may choose to close the PR instead of converting it to draft. Closed PRs receive the closed because of multiple quality violations label and a comment listing the violations. Contributors are welcome to open a new PR that addresses the issues listed in the comment.

What happens when suspicious changes are detected?

When maintainers review a PR's diff before approving CI workflow runs and determine that it contains suspicious changes (e.g. attempts to exfiltrate secrets, modify CI pipelines maliciously, or inject harmful code), all open PRs by the same author will be closed and labeled suspicious changes detected. A comment is posted on each PR explaining that the closure was triggered by suspicious changes found in the flagged PR.

Generally, it's fine to use Gen-AI tools to help you create Pull Requests for Apache Airflow as long as you adhere to the following guidelines:

  • Follow the Apache Software Foundation (ASF) Generative Tooling guidelines.
  • Ensure that you review and understand all code generated by Gen-AI tools before including it in your PR - do not blindly trust the generated code.
  • Make sure that the generated code adheres to the project's coding standards and best practices described above.
  • Make sure to run all relevant static checks and tests locally to verify that the generated code works as intended and does not introduce any issues.
  • State in your PR description that you have used Gen-AI tools to assist in creating the PR.
  • Be prepared to explain and justify the use of Gen-AI tools if asked by project maintainers or reviewers.
  • Remember that the final responsibility for the code in your PR lies with you, regardless of whether it was generated by a tool or written by you.
  • By contributing code generated by Gen-AI tools, you agree to comply with the project's licensing terms and any applicable laws and regulations.
  • Blindly copying and pasting code from Gen-AI tools is detrimental, as it might introduce security and stability risks to the project. Maintainers who spot such behaviour will have no choice but to close the related PRs. It adds an extra burden on the maintainers and doesn't help the project. The contributor's reputation is also impacted, as maintainers will lose confidence and might block the user from making further contributions.
  • Review your code to make sure it does not contain unrelated changes. Gen-AI tools often introduce changes that are not related to the problem you are trying to solve. Such unrelated changes should be removed from the PR before it is submitted for review. Relying on maintainers to spot them is unfair and adds an extra burden on them.

When a contributor does not follow these guidelines, maintainers might decide to close the PR (and all of that contributor's PRs) without reviewing them - to avoid extra burden on the maintainers and to protect the project from the risk of a tired maintainer merging unvetted code. This should be accompanied by a comment explaining the reason for closing the PR and pointing to this section of the documentation.

If the contributor repeatedly ignores these guidelines, PMC members might decide to block the contributor from making further contributions to the project. This is a last-resort measure to protect the project from the risks of unvetted code and to avoid overburdening the maintainers. Such blocking is accompanied by a LAZY CONSENSUS vote amongst the PMC members to make sure the decision is agreed upon and not vetoed by any PMC member, and it is kept in the records of the Apache Software Foundation. In extreme cases, the PMC might request that the ASF Infrastructure team block the contributor at the organization level - for all ASF projects.

If a contributor is evidently spamming the project with content that violates GitHub's terms and conditions, maintainers might decide to report such behaviour to GitHub for further action, which often results in GitHub deleting the user account or blocking the user from making further contributions at the GitHub level.

We have enabled the requirement to resolve all open conversations in a PR before it can be merged. You will see in the PR status that all conversations need to be resolved before the PR can be merged.

The goal is to make it easier for everyone involved in the PR - the author, reviewers, and the maintainers who try to figure out whether the PR is ready to merge and, eventually, merge it - to see when some conversations are not yet resolved. The goal is also to use conversations more as a "soft" way to request changes, and to limit the use of the Request changes status to only those cases where the maintainer is sure that the PR should not be merged in its current state. This leads to a faster review/merge cycle and fewer problems with stalled PRs that carry a Request changes status even though all the issues have already been solved.

Most of our coding style rules are enforced programmatically by ruff and mypy, which are run automatically with static checks and on every Pull Request (PR), but there are some rules that are not yet automated and are more Airflow specific or semantic than style.

Our community agreed that, for various reasons, we do not use assert in production code of Apache Airflow. For details, check the relevant mailing list thread.

In other words instead of doing:

assert some_predicate()

you should do:

if not some_predicate():
    handle_the_case()

The one exception is when you need an assert for type checking (which should be almost a last resort); in that case you can do this:

if TYPE_CHECKING:
    assert isinstance(x, MyClass)

Explicit is better than implicit. If a function accepts a session parameter it should not commit the transaction itself. Session management is up to the caller.

To make this easier, there is the create_session helper:

from sqlalchemy.orm import Session

from airflow.utils.session import create_session


def my_call(x, y, *, session: Session):
    ...
    # You MUST not commit the session here.


with create_session() as session:
    my_call(x, y, session=session)

Warning

DO NOT add a default to the session argument unless @provide_session is used.

If this function is designed to be called by "end-users" (i.e. Dag authors) then using the @provide_session wrapper is okay:

from sqlalchemy.orm import Session

from airflow.utils.session import NEW_SESSION, provide_session


@provide_session
def my_method(arg, *, session: Session = NEW_SESSION):
    ...
    # You SHOULD not commit the session here. The wrapper will take care of commit()/rollback() if exception

In both cases, the session argument is a keyword-only argument. This is the preferred form where possible, although there are some exceptions in the code base where it cannot be used due to backward-compatibility considerations. In most cases, the session argument should be last in the argument list.

If you wish to compute the time difference between two events within the same process, use time.monotonic(), not time.time() nor timezone.utcnow().

If you are measuring duration for performance reasons, then time.perf_counter() should be used. (On many platforms this uses the same underlying clock mechanism as monotonic, but perf_counter is guaranteed to be the highest-accuracy clock on the system, while monotonic is simply guaranteed not to go backwards.)
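As a short sketch, both clocks are used the same way; only their guarantees differ (the timed work here is a stand-in):

```python
import time

# Time difference between two events in the same process:
# monotonic() never goes backwards, even if the system clock is adjusted.
start = time.monotonic()
time.sleep(0.01)  # stand-in for the interval being measured
elapsed = time.monotonic() - start

# Duration measured for performance reasons: perf_counter() is the
# highest-resolution clock available on the platform.
t0 = time.perf_counter()
total = sum(range(100_000))  # stand-in for the code being profiled
spent = time.perf_counter() - t0
```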

If you wish to time how long a block of code takes, use Stats.timer() -- either with a metric name, which will be timed and submitted automatically:

from airflow.sdk.observability.stats import Stats

...

with Stats.timer("my_timer_metric"):
    ...

or to time but not send a metric:

from airflow.sdk.observability.stats import Stats

...

with Stats.timer() as timer:
    ...

log.info("Code took %.3f ms", timer.duration)

For full docs on timer() check out `shared/observability/src/airflow_shared/observability/metrics/base_stats_logger.py`_.

If the start_date of a duration calculation needs to be stored in a database, then this has to be done using datetime objects. In all other cases, using datetime for duration calculations MUST be avoided, as creating and diffing datetime objects is (comparatively) slow.

Airflow Operators might have some fields added to the list of template_fields. Such fields should be set in the constructor (__init__ method) of the operator, and their values should usually come from the __init__ method arguments. The reason is that templated fields are evaluated at operator execution time: when you pass arguments to the operator in the Dag, the fields set on the class are processed through the templating engine just before the execute method is called, and the field values are replaced by the result of applying the templating engine (if the field is a structure such as a dict or list, the templating engine is applied to all the values of the structure).

That's why we expect two things in case of template fields:

  • with a few exceptions, only self.field = field should happen in the operator's constructor
  • validation of the fields should be done in the execute method, not in the constructor, because in the constructor the field value might still be a templated value, not the final one.

The exceptions are cases where we want to assign an empty default value to a mutable field (list or dict), or where we have a more complex structure that we want to convert into a different format (say, a dict or list) while keeping the original strings in the converted structure.

In such cases we can usually do something like this:

def __init__(self, *, my_field: list[str] | None = None, **kwargs):
    super().__init__(**kwargs)
    my_field = my_field or []
    self.my_field = my_field

The reason for doing it this way is that we are working on cleaning up our code and adding a prek hook that will detect, in PRs, cases where logic (such as validation or complex conversion) is done in the constructor.

Our community has decided to stop adding new raise AirflowException statements and to adopt the following practices when an exception is necessary. For details, check the relevant mailing list thread.

  1. In most cases, we should prioritize using Python's standard exceptions (e.g., ValueError, TypeError, OSError) instead of wrapping everything in AirflowException.
  2. Within airflow-core, we should define and utilize more specific exception classes under airflow-core/src/airflow/exceptions.py.
  3. For provider-specific implementations, exceptions should be defined within providers/<provider>/src/airflow/providers/<provider>/exceptions.py.

The use of points 2 and 3 should only be considered when point 1 is inappropriate, which should be a rare occurrence.

In other words instead of doing:

if key not in conf:
    raise AirflowException(f"Required key {key} is missing")

you should do:

if key not in conf:
    raise ValueError(f"Required key {key} is missing")

If you want to learn about the options for your development environment, refer to the Development environments document.