
Conversation

@tennlee (Collaborator) commented Jun 7, 2025

This is work in progress to advance data pipelines which help to translate between weather timescales and climate timescales. It seems climate modelling typically works at monthly duration, using a 360-day calendar, which is new for me. ERA5 is a re-analysis product available at hourly resolution (and at high spatial resolution). The challenge is to work out how to relate these data sets and capture that in a pipeline. The existing tutorial on working with climate data approaches this but provides an incomplete solution with respect to presenting samples to an ML pipeline.

The work thus far enhances some of the pipeline code needed to do this. However, we need to add an aggregation process to the temporal retrieval which is used when accessing the ERA5 data.
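To make the aggregation step concrete, here is a minimal sketch (not part of this PR) of resampling hourly ERA5-like data to monthly means with xarray and re-indexing onto a 360-day cftime calendar; the file path and variable names are placeholders:

```python
import cftime
import xarray as xr

# Placeholder hourly ERA5-like dataset on a standard calendar.
era5_hourly = xr.open_dataset("era5_2t_hourly.nc")

# Aggregate hourly values to monthly means so they can sit alongside
# monthly climate-model output.
era5_monthly = era5_hourly.resample(time="1MS").mean()

# If the climate data uses a 360-day calendar, one option is to rebuild the
# monthly index with cftime Datetime360Day values so both datasets align.
era5_monthly = era5_monthly.assign_coords(
    time=[cftime.Datetime360Day(t.year, t.month, 1) for t in era5_monthly.indexes["time"]]
)
```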

I am putting the work in progress up as a pull request and will pose the question to some of the other collaborators as to how best to progress from here.



[project]
name = "pyearthtools-models"
Collaborator Author

This isn't relevant to this PR and was an accidental include

@tennlee (Collaborator Author) commented Jun 7, 2025

@jennan @millerjoel I'd be interested in your take on this one, it's not clear exactly what to do next. I might take another run at it tomorrow. Do you have some CMIP5 data on hand to set up the same data archive and look at the problem together?

@nikeethr nikeethr marked this pull request as draft June 10, 2025 02:25

time_query = str(Petdt(querytime))
if isinstance(data.coords[time_dim].values[0], cftime.datetime):
time_query = cftime.datetime(querytime.year,
@nikeethr (Collaborator) commented Jan 6, 2026

[minor] I'm pretty bad at naming, but this is more of a quality-of-life/functional change. It is likely something that needs to be explored throughout the codebase rather than a suggested change in this PR - so apologies for singling this one out (especially since it might be a WIP).

I think it may be easier for search and general ease of dev if the manipulated variable and the original variable share a reasonably similar substring, e.g. pet_querytime, cf_querytime or str_querytime.

querytime itself can be changed to time_query - no issue with that, just as long as the names stay sufficiently similar across both versions of the variable.
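A minimal sketch of the suggested convention, for illustration only (the names below are hypothetical and not part of this PR):

```python
import datetime
import cftime

querytime = datetime.datetime(1990, 6, 1)

# Derived forms keep "querytime" as a substring, so a search for the original
# variable name also finds every manipulated representation of it.
str_querytime = querytime.isoformat()
cf_querytime = cftime.datetime(querytime.year, querytime.month, querytime.day)
```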


# We need to make interp_like ignore the time dimension
# TODO - work out if we want some options here to specify which dims to preserve
if 'time' in self.reference_dataset.coords:
Collaborator

Just dropping a note here that there are some faster interp implementations I have in mind (with the benefit of also being more accurate). Happy to discuss.
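For context, here is a minimal sketch of the idea in the snippet above: drop the time coordinate from the reference dataset so that interp_like only matches the spatial grid. The datasets and variable names below are placeholders, not the PR's actual implementation:

```python
import numpy as np
import xarray as xr

# Placeholder reference and input datasets with different grids.
reference = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), np.zeros((2, 10, 20)))},
    coords={"time": [0, 1], "lat": np.linspace(-45, 45, 10), "lon": np.linspace(0, 95, 20)},
)
data = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), np.zeros((5, 20, 40)))},
    coords={"time": range(5), "lat": np.linspace(-45, 45, 20), "lon": np.linspace(0, 95, 40)},
)

# Drop the time coordinate from the reference so interp_like only interpolates
# over lat/lon and leaves the input's own time dimension untouched.
spatial_reference = reference.isel(time=0, drop=True)
regridded = data.interp_like(spatial_reference)
```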


## A Quick Hands-On Approach

This guide is suitable for scientists or anyone else who wants to start trying things quickly to establish their first model and make a first attempt. More detail is provided below with more detail on the nuances and alternatives for each step.
@nikeethr (Collaborator) commented Jan 6, 2026

Suggested change
This guide is suitable for scientists or anyone else who wants to start trying things quickly to establish their first model and make a first attempt. More detail is provided below with more detail on the nuances and alternatives for each step.
This guide is suitable for scientists or anyone else who wants to quickly establish their first model, hands-on. The following describes the nuances and alternatives for each step, in detail.


1. Use [https://pyearthtools.readthedocs.io/en/latest/notebooks/tutorial/FourCastMini_Demo.html](https://pyearthtools.readthedocs.io/en/latest/notebooks/tutorial/FourCastMini_Demo.html) as a template for what to do.
1. Determine the parameters you want to model, such as `temperature` or `wind`. When these become part of the neural network, they will be called *channels*.
2. Determine the data source they come from, such as ERA5 or another model or re-analysis source
Collaborator

Suggested change
2. Determine the data source they come from, such as ERA5 or another model or re-analysis source
2. Determine the data source they come from, such as ERA5 or another model or re-analysis source.

1. Use [https://pyearthtools.readthedocs.io/en/latest/notebooks/tutorial/FourCastMini_Demo.html](https://pyearthtools.readthedocs.io/en/latest/notebooks/tutorial/FourCastMini_Demo.html) as a template for what to do.
1. Determine the parameters you want to model, such as `temperature` or `wind`. When these become part of the neural network, they will be called *channels*.
2. Determine the data source they come from, such as ERA5 or another model or re-analysis source
3. Develop a `pipeline` which includes data normalisation
Collaborator

Suggested change
3. Develop a `pipeline` which includes data normalisation
3. Develop a `pipeline` which includes data normalisation.

2. Determine the data source they come from, such as ERA5 or another model or re-analysis source
3. Develop a `pipeline` which includes data normalisation
4. Using a bundled model, configure that model to the size required. This may only require the adjustment of `img_size`, `in_channels` and `out_channels` to match the size of your data. The grid dimension must be a multiple of four for this model, so you may need to crop or regrid your data to match (a small sketch of this appears below). In future, a standard approach without this limitation will be added.
5. Run some number of training steps (using the `.fit` method) and visualise the outputs. Visualising predictions from the trained model every 3000 steps or so provides useful insight into the training process as well as helping see when the model might be fully trained. *There is no definite answer to how much training will be required. If your model isn't showing any progress at all after a couple of epochs, there may be a problem. Some models will start to show progress after 3000 steps.*
Collaborator

Suggested change
5. Run some number of training steps (using the `.fit` method) and visualise the outputs. Visualising predictions from the trained model every 3000 steps or so provides useful insight into the training process as well as helping see when the model might be fully trained. *There is no definite answer to how much training will be required. If your model isn't showing any progress at all after a couple of epochs, there may be a problem. Some models will start to show progress after 3000 steps.*
5. Run some number of training steps (using the `.fit` method) and visualise the outputs. Visualising predictions from the trained model every 3000 steps or so is useful in providing insight into the training process, and for understanding when the model might be fully trained. *There is no definite answer to how much training will be required. If your model isn't showing any progress at all after a couple of epochs, there may be a problem. Some models will start to show progress after 3000 steps.*
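As a small illustration of the grid-size constraint mentioned in step 4, here is a hedged sketch (not taken from pyearthtools) of cropping a field so each grid dimension is a multiple of four, and deriving the values that would then be passed as `img_size`, `in_channels` and `out_channels`:

```python
import numpy as np

# Placeholder field, e.g. a 1-degree global grid of a single variable.
field = np.random.rand(181, 360)

# Crop each grid dimension down to the nearest multiple of four.
ny = field.shape[0] - field.shape[0] % 4   # 180
nx = field.shape[1] - field.shape[1] % 4   # 360
cropped = field[:ny, :nx]

img_size = cropped.shape                   # value to pass as `img_size`
in_channels = out_channels = 1             # one variable in, one variable out
print(img_size, in_channels, out_channels)
```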

match = False

if path["interval"] not in self.interval:
if path["interval"] not in self.interval[0]:
Collaborator

Should there be a `len(...) == 1` check on `self.interval`, to be consistent with other places this sort of comparison is done?
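A minimal sketch of what such a guard might look like, written as a standalone helper; `intervals` and `path_interval` are placeholders mirroring the snippet above, not the PR's actual code:

```python
def interval_matches(path_interval, intervals):
    """Illustrative only: check an interval against the configured interval(s).

    When exactly one interval is configured, compare against its contents;
    otherwise compare against the collection itself.
    """
    if len(intervals) == 1:
        return path_interval in intervals[0]
    return path_interval in intervals


# Placeholder usage.
print(interval_matches("1h", [("1h", "6h")]))  # single configured interval -> True
print(interval_matches("1h", ["1h", "6h"]))    # multiple intervals -> True
```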



class FutureFaker:
class ResultCache:
@nikeethr (Collaborator) commented Jan 6, 2026

I understand this isn't part of this PR, but is there any reason not to use functools and its cache options?

I'm not sure whether this is thread-safe, manages duplicates, or even guarantees that the reference counters are appropriately tracked. (Not that functools does all of this, but there is likely an equivalent in some other package that does, if it doesn't.)

For serial implementations and testing/dev it is probably okay, but since this is coming under parallel.py I'm raising this comment to gauge the intent.
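For reference, a minimal sketch of the kind of functools-based caching being suggested; this is illustrative only, not a drop-in replacement for ResultCache:

```python
import functools
import threading

@functools.lru_cache(maxsize=128)
def expensive_lookup(key: str) -> str:
    # Placeholder for an expensive computation or I/O call.
    return key.upper()

# lru_cache serialises its own bookkeeping, so concurrent callers see a
# consistent cache, although the wrapped function may still execute more than
# once for the same key if calls race before the first result is stored.
threads = [threading.Thread(target=expensive_lookup, args=(k,)) for k in ["a", "b", "a"]]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(expensive_lookup.cache_info())
```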
