Service to transfer raw telescope data from a local filesystem to a remote data centre.
Clone the repo:

```
git clone https://github.com/GOTO-OBS/rawtransfer
cd rawtransfer
```

Install and set up a Python (3.7+) virtual environment inside this directory (or your favourite place):

```
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
```

Install rawtransfer:

```
pip install .
```

This will create the `raw-transfer` console script, accessible when your `.venv` is active.
It can be run directly, but it is better suited to running as a service. To get help on all possible arguments, run `raw-transfer -h`.
To run as an unattended service (and even when running from the command line), you will need to:

- set up passwordless SSH access to the remote data centre;
- populate the `~/.pgpass` file of the user that will be running `raw-transfer` with the database credentials.
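For example, a minimal sketch using the example host and database details from the service file below (substitute your own values, including the `{password}` placeholder):

```
# Create an SSH key (if needed) and install it on the remote data centre host.
ssh-keygen -t ed25519
ssh-copy-id goto@goto-gateway

# ~/.pgpass takes one "hostname:port:database:username:password" entry per line
# and must be readable only by its owner. Host, port, database, and user below
# are the examples used later in this readme.
echo "goto-gateway:5432:rawtransfer:goto:{password}" >> ~/.pgpass
chmod 600 ~/.pgpass
```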
The service can be run in two modes:
- A watcher mode, best implemented by a systemd service (or similar)
  - This mode will continuously monitor a directory for changes and initiate real-time transfer of files
- A one-off mode
  - This mode will perform a single catch-up operation and transfer any missing files created during a given time-frame
Create the file `/etc/systemd/system/rawtransfer.service` with the contents below.
You will need to replace the placeholder entries given in {}.
```
[Unit]
Description=Raw data transfer
After=network.target postgresql.service
Wants=postgresql.service
[Service]
User={user}
Environment=PYTHONUNBUFFERED=1
Type=simple
ExecStart=/bin/bash -ac '. {/path/to/}.venv/bin/activate && raw-transfer {arguments}'
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
```
As an example:
```
[Unit]
Description=Raw data transfer
After=network.target postgresql.service
Wants=postgresql.service
[Service]
User=goto
Environment=PYTHONUNBUFFERED=1
Type=simple
ExecStart=/home/goto/rawtransfer/.venv/bin/raw-transfer /data/goto/images goto@goto-gateway:/export/gotodata2/raw \
--patterns "t?_r*_ut?.fits" \
--metadata-regex "t(?P<tel>\d)_r(?P<run>\d+)_ut(?P<ut>\d).fits" \
--header-card DATE-OBS \
--db-host goto-gateway \
--db-user goto \
--db-name rawtransfer \
--db-port 5432 \
--db-table goto \
--copy-timeout 60 \
--workers 8 \
--ssh-controlmaster \
--canary-interval 15 \
--backfill-interval 1200 \
--backfill-hours 12 \
--backfill-recency-window 600 \
--overwrite-unpacked-interval 1200 \
--overwrite-unpacked-hours 12 \
--overwrite-unpacked-recency-window 7200 \
--logfile /var/log/rawtransfer/rawtransfer.log \
--slack-webhook-url "https://hooks.slack.com/services/super/secret" \
--verbose
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
```
Note: you may need to set up the logfile directory with the correct permissions.
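For example, assuming the logfile path and service user from the example above:

```
# Create the log directory and hand it to the user running the service.
sudo mkdir -p /var/log/rawtransfer
sudo chown goto:goto /var/log/rawtransfer
```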
Now enable and start the systemd service:

```
sudo systemctl enable rawtransfer
sudo systemctl start rawtransfer
```

Check things are running smoothly with:

```
sudo systemctl status rawtransfer
```

Follow the stream logging with:

```
sudo journalctl -fu rawtransfer
```

To perform a one-off catch-up, from the terminal one can run:

```
source /path/to/rawtransfer/.venv/bin/activate
raw-transfer /data/goto/images goto@goto-gateway:/export/gotodata2/raw \
--backfill-only \
--patterns "t?_r*_ut?.fits" \
--metadata-regex "t(?P<tel>\d)_r(?P<run>\d+)_ut(?P<ut>\d).fits" \
--header-card DATE-OBS \
--db-host goto-gateway \
--db-user goto \
--db-name rawtransfer \
--db-port 5432 \
--db-table goto \
--copy-timeout 60 \
--workers 8 \
--ssh-controlmaster \
--backfill-hours 24 \
--verbose
```

The important arguments here are `--backfill-only`, which instructs the program to perform a single backfill operation and then exit, and `--backfill-hours 24`, which gives the age in hours of the oldest file to transfer as part of the backfill. See Gotchas with respect to packing files.
Alongside the Python requirements (requirements.txt), the following binaries must be available on the local machine:

- `find` (must be GNU and version >= 4.3.3)
- `fpack`
- `fitscheck` (this is installed as part of astropy and so should be available anyway)

A first check of the service is to ensure these binaries are available.
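A quick way to do this by hand (a sketch; adjust as needed):

```
# Confirm the required binaries are on the PATH, and check the find version.
for cmd in find fpack fitscheck; do
    command -v "$cmd" >/dev/null || echo "missing: $cmd"
done
find --version | head -n 1   # should report GNU findutils, version >= 4.3.3
```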
The steps of the programme are described in plain English below; the example arguments from the systemd service file above are given in parentheses to highlight their purpose.
Once running (as a service or otherwise), raw-transfer performs the following actions:
- Sets up a task queue with multiple thread workers (`--workers 8`) to hold and perform the processing tasks.
- Sets up a watchdog observer to monitor the source directory (`/data/goto/images`) for new files matching glob-style patterns (`--patterns "t?_r*_ut?.fits"`). The creation of these files should ideally be atomic, i.e. created by a `mv`-like operation after writing to a temporary filename. If instead they are created and then written to, the event handler will iteratively check the file size, waiting a short time between checks (`--patience`). Once the file size is unchanging, it assumes the writing is complete. This latter method is more prone to backfire, as well as significantly slowing the overall processing.
- A task to process the file and transfer it to a remote location (`goto@goto-gateway:/export/gotodata2/raw`) is passed to the task queue. Once a thread is available, it will start processing the file:
  - If the transfer relies on remote resources of a host (`goto@goto-gateway`) and/or a database, it will check these are available.
  - The file is checked for integrity and packed (compressed); see the sketch after this list.
  - If a database is being used to store transfers, a sha1 hash for the packed file is calculated and inserted into the database. This effectively "locks" that hash and prevents any concurrent processing of the file.
  - The remote sub-directories are created if needed, following a `YYYY/MM/DD` structure. The location of a file within this structure is determined by the value of the datetime stamp in its header card (`--header-card DATE-OBS`).
  - An `scp` call is made to transfer the source file to the remote destination.
  - If a database is being used to store transfers, the row matching the sha1 hash of the file is updated with metadata based on the filename (`--metadata-regex "t(?P<tel>\d)_r(?P<run>\d+)_ut(?P<ut>\d).fits"`), as well as time stamps for the file creation and transfer.
- If a task fails due to an inability to access any remote resource (e.g. a link to the host or database goes down), then it is reinserted into the task queue, and the task queue is paused until the canary next returns successfully.
- On an interval (`--backfill-interval 1200`) a back-filling function is called that checks for files in the source directory with a modification time in the previous 12-hour window (`--backfill-hours 12`) which have no entry in the transfer database (based on a search for matching `source_filepath` entries). Any found to be missing are added for processing. This feature requires a database.
- On an interval (`--canary-interval 15`) a canary function is called to check that remote resources are accessible. Upon failure it raises an exception that triggers the pausing of the task queue until the canary next runs without exception.
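As an illustration, the integrity-check, packing, and hashing steps are roughly equivalent to running the following by hand (a sketch only; the filename is a hypothetical example):

```
f=/data/goto/images/t4_r0123456_ut2.fits   # hypothetical example file
fitscheck "$f"       # verify the FITS checksums are intact
fpack "$f"           # write the compressed copy alongside the original as "$f.fz"
sha1sum "$f.fz"      # the hash inserted into the database to lock the transfer
```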
A database is used as the central source of truth for all transfers that occur. Its presence is not required, but it is strongly encouraged, as it is used to verify successful transfers and prevent collisions in processing. The database operations were written with PostgreSQL in mind but, given the simple nature of the queries, they should work with other major (SQL) database flavours with little to no amendment.
The database given by the connection arguments (`--db-host`, `--db-user`, `--db-name`, `--db-port`)
must have a table `--db-table` in the public schema (here we use the example "goto"). This table
should be defined as follows (or equivalently for whatever database flavour is being used):

```
create table goto ( -- table name to match that given in the arguments when running
sha1_hash varchar(40) primary key, -- sha1sum of raw data file
source_filepath text, -- filepath to original raw data file
file_created_at timestamp without time zone, -- UTC time that original file was created
file_transferred_at timestamp without time zone, -- UTC time that file transfer completed
filepath text, -- filepath to transferred raw data file
tel smallint, -- [metadata column] tel number of raw data scraped from filename
run int, -- [metadata column] run number of raw data scraped from filename
ut smallint, -- [metadata column] ut number of raw data scraped from filename
created_at timestamp without time zone default now() not null -- UTC time that database record was created
);
create index on goto (source_filepath); -- Not necessary, but wise
create index on goto (file_transferred_at); -- Not necessary, but wise
```
Note: the [metadata column] examples above are populated based on the named capture groups in the
`--metadata-regex` argument used in this readme. When running this service, the names of these groups must match same-named columns in the database table.
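With the table in place, recorded transfers can be inspected directly. For example (a sketch, assuming the connection details used above; credentials are read from `~/.pgpass`):

```
# Show the ten most recently transferred files recorded by the service.
psql -h goto-gateway -p 5432 -U goto -d rawtransfer \
  -c "SELECT source_filepath, file_transferred_at FROM goto ORDER BY file_transferred_at DESC NULLS LAST LIMIT 10;"
```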
- The directory to monitor should ideally be on a local device in order to react to inotify events. Monitoring a remote device, mounted via NFS for example, is possible, but this will invoke a simple polling observer, and the events captured are prone to races.
  - For example, if the raw data producer writes to a temporary file and then moves it to a final filename in order to make file creation atomic, this process can look like either a `move` or a `create` event to a polling observer, depending on when it polls the file system with respect to the temporary file creation and move operations. One solution is to create the temporary file in a distinct (unwatched) directory first, and move it into the watched `source_directory` when complete.
- When running in one-off mode, files that are transferred will not have their original file overwritten with the `fpack` version, as occurs normally in watcher mode. This can be done manually (see the sketch after this list), or safely ignored if the source data directory is periodically cleaned for space anyway.
- Multiple glob-style patterns given in the `--patterns` argument must be mutually exclusive, to prevent duplicate transfer attempts.
- When running in `--backfill-only` mode, the task queue will immediately re-attempt failures due to timeouts or an inability to contact the database or ssh to the host. Keep an eye on the transfer; the threads could fill quickly with tasks that cannot complete easily.
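Packing the originals by hand after a one-off run could look something like the following (a sketch; the path, pattern, and 24-hour window match the one-off example above, and the originals are left in place until you have verified the `.fz` copies):

```
# Pack any FITS files from the last 24 hours that raw-transfer left unpacked.
find /data/goto/images -name "t?_r*_ut?.fits" -mtime -1 -exec fpack {} \;
```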
