This repository was archived by the owner on Apr 6, 2026. It is now read-only.

Implement data sentinel #1

@htmlboss

Intro

The idea behind this script is to compare incoming NetCDF files against a set of known, working files in the Navigator. This will help catch the metadata changes that have repeatedly surprised us over the past few years. As of April 30, 2020, there will be no core developers around, so this will help automate some of the day-to-day maintenance of the Navigator and reduce the workload on @dwayne-hart. Basically, if the script rejects a data file, that's not our problem: the file simply won't be ingested into the Navigator.

Yes, I'm using python here so I don't get yelled at...

Architecture

  • Command-line script with 3 required arguments: dataset name, template file, and incoming files.
  • Super minimal conda environment (called data-sentinel), with netcdf4 being the only primary dependency.
  • Dataset name: corresponds to the dataset_key entry in the template file.
  • Template file schema:
{
    "dataset_key": {
        "rules": {
            "check_attrs": [],
            "check_dimensions_identical": true OR false,
            "check_unlimited_time_dim": true OR false,
            "check_variables": true OR false
        },
        "known_files": {
            "filename_regex_pattern": "path_to_valid_file"
        }
    }
}
  • Incoming files: A list of file paths, nothing complicated:
file1.nc
file2.nc
...
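
To make the template schema above concrete, here is a minimal sketch of loading and sanity-checking one dataset entry. The names `EXAMPLE_TEMPLATE`, `REQUIRED_RULES`, and `load_dataset_config` are hypothetical, as are the attribute names, regex, and file path in the example; treating `check_attrs` as a list of attribute names is also an assumption, since the schema only shows an empty list.

```python
import json

# Hypothetical template following the schema above. The attribute names,
# regex pattern, and known-file path are made up for illustration.
EXAMPLE_TEMPLATE = r"""
{
    "giops_daily": {
        "rules": {
            "check_attrs": ["title", "institution"],
            "check_dimensions_identical": true,
            "check_unlimited_time_dim": true,
            "check_variables": true
        },
        "known_files": {
            "giops_.*\\.nc$": "/path/to/known/giops_file.nc"
        }
    }
}
"""

# Expected type of each rule, taken from the schema in this issue.
REQUIRED_RULES = {
    "check_attrs": list,
    "check_dimensions_identical": bool,
    "check_unlimited_time_dim": bool,
    "check_variables": bool,
}


def load_dataset_config(template_text, dataset):
    """Parse the template JSON and return the validated entry for one dataset."""
    template = json.loads(template_text)
    if dataset not in template:
        raise KeyError(f"dataset {dataset!r} not found in template")
    config = template[dataset]
    rules = config.get("rules", {})
    for name, expected_type in REQUIRED_RULES.items():
        if not isinstance(rules.get(name), expected_type):
            raise ValueError(f"rule {name!r} must be of type {expected_type.__name__}")
    if not config.get("known_files"):
        raise ValueError("known_files must map at least one regex to a file path")
    return config
```

Failing early on a malformed template keeps a typo in the JSON from being mistaken for a problem with the incoming data files.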

Usage

  • main.py --dataset DATASET --template TEMPLATE [incoming]
    [incoming] is a list of the files to be tested. This argument may be a file path, or it may be omitted so that the list is read from standard input (pipe).
    e.g.:
python main.py --dataset giops_daily --template ./my_template.json < incoming_files
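
The command line described above could be parsed along these lines. This is only a sketch: `build_parser` and `read_incoming` are hypothetical names, and it assumes the incoming list holds one file path per line, with blank lines ignored.

```python
import argparse
import sys


def build_parser():
    """Build a parser for: main.py --dataset DATASET --template TEMPLATE [incoming]."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Compare incoming NetCDF files against known, working files.",
    )
    parser.add_argument("--dataset", required=True,
                        help="dataset_key entry to use from the template file")
    parser.add_argument("--template", required=True,
                        help="path to the JSON template file")
    # Optional positional: when omitted, the list is read from standard input.
    parser.add_argument("incoming", nargs="?", default=None,
                        help="file listing the paths to test (default: stdin)")
    return parser


def read_incoming(path=None, stdin=sys.stdin):
    """Return the non-empty lines of the incoming list as file paths."""
    if path is None:
        lines = stdin.readlines()
    else:
        with open(path) as handle:
            lines = handle.readlines()
    return [line.strip() for line in lines if line.strip()]
```

With this shape, `python main.py --dataset giops_daily --template ./my_template.json incoming_files` and the piped form shown above would both end up with the same list of paths.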

Behaviour

  • Script will read in the template file (JSON), iterate over the incoming files, match the name of each file against the filename_regex_pattern entries, and compare its metadata to that of the corresponding known file defined in the template file.
  • Will report how many (and which) files passed, failed, or didn't match any filename_regex_pattern.
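
The behaviour above might be sketched as follows. None of this is the actual implementation: all function names are hypothetical, and the meaning of each rule is an assumption (`check_attrs` compares the listed global attributes, `check_dimensions_identical` compares dimension names and fixed sizes, `check_unlimited_time_dim` expects a dimension named `time` to be unlimited, `check_variables` compares variable names). Matching is done with `re.search` on the base file name, which is also an assumption.

```python
import os
import re


def find_known_file(filename, known_files):
    """Return the known-file path whose regex matches the file name, else None."""
    base = os.path.basename(filename)
    for pattern, known_path in known_files.items():
        if re.search(pattern, base):
            return known_path
    return None


def read_netcdf_metadata(path):
    """Read metadata with netCDF4 (imported lazily so the sketch runs without it)."""
    from netCDF4 import Dataset
    with Dataset(path) as ds:
        return {
            # Array-valued attributes would need numpy.array_equal to compare.
            "attrs": {name: ds.getncattr(name) for name in ds.ncattrs()},
            "dimensions": {name: (len(dim), dim.isunlimited())
                           for name, dim in ds.dimensions.items()},
            "variables": list(ds.variables),
        }


def fixed_dims(meta):
    """Sizes of the dimensions that are not unlimited."""
    return {name: size for name, (size, unlimited) in meta["dimensions"].items()
            if not unlimited}


def compare_metadata(incoming, known, rules):
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    for name in rules["check_attrs"]:
        if incoming["attrs"].get(name) != known["attrs"].get(name):
            problems.append(f"attribute {name!r} differs")
    if rules["check_dimensions_identical"]:
        if (set(incoming["dimensions"]) != set(known["dimensions"])
                or fixed_dims(incoming) != fixed_dims(known)):
            problems.append("dimensions differ")
    if rules["check_unlimited_time_dim"]:
        _, unlimited = incoming["dimensions"].get("time", (0, False))
        if not unlimited:
            problems.append("time dimension is not unlimited")
    if rules["check_variables"]:
        if set(incoming["variables"]) != set(known["variables"]):
            problems.append("variables differ")
    return problems


def classify(incoming_files, config, read_metadata=read_netcdf_metadata):
    """Sort incoming files into passed, failed, and unmatched."""
    results = {"passed": [], "failed": [], "unmatched": []}
    for path in incoming_files:
        known_path = find_known_file(path, config["known_files"])
        if known_path is None:
            results["unmatched"].append(path)
            continue
        problems = compare_metadata(read_metadata(path), read_metadata(known_path),
                                    config["rules"])
        results["failed" if problems else "passed"].append(path)
    return results


def summarize(results):
    """One line per status: the count followed by the file list."""
    return "\n".join(f"{status}: {len(paths)} {paths}"
                     for status, paths in results.items())
```

Passing `read_metadata` in as a parameter keeps the comparison logic testable with plain dictionaries, without needing real NetCDF files on disk.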

Infrastructure

  • This script will sit in front of the index tool. That is, when we download new data files, this script will run first and produce a list of passing files, which will then be indexed. Any files that fail can be handled case by case.

Labels: enhancement
