Service to transfer raw telescope data from a local filesystem to a remote data centre.
Clone the repo:

```
git clone https://github.com/GOTO-OBS/rawtransfer
cd rawtransfer
```

Install and set up a Python (3.7+) virtual environment inside this directory (or your favourite place):

```
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
```

Install rawtransfer:

```
pip install .
```

This will create the `raw-transfer` console script, accessible when your `.venv` is active.
It can be run directly, but it is better suited to running as a service. To get help on all possible arguments, run `raw-transfer -h`.
To run as an unattended service (and even when running from the command line), you will need to:

- set up passwordless SSH access to the remote data centre;
- populate the `~/.pgpass` file of the user that will be running `raw-transfer` with the database credentials.
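For example, a minimal sketch using the example host and database details from the service file below (substitute your own values, including the `{password}` placeholder):

```
# Create an SSH key (if needed) and install it on the remote data centre host.
ssh-keygen -t ed25519
ssh-copy-id goto@goto-gateway

# ~/.pgpass takes one "hostname:port:database:username:password" entry per line
# and must be readable only by its owner. Host, port, database, and user below
# are the examples used later in this readme.
echo "goto-gateway:5432:rawtransfer:goto:{password}" >> ~/.pgpass
chmod 600 ~/.pgpass
```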
The service can be run in two modes:
- A watcher mode, best implemented by a systemd service (or similar)
  - This mode will continuously monitor a directory for changes and initiate real-time transfer of files
- A one-off mode
  - This mode will perform a single catch-up operation and transfer any missing files created during a given time-frame
Create the file `/etc/systemd/system/rawtransfer.service` with the contents below.
You will need to replace the placeholder entries given in {}.
```
[Unit]
Description=Raw data transfer
After=network.target postgresql.service
Wants=postgresql.service
[Service]
User={user}
Environment=PYTHONUNBUFFERED=1
Type=simple
ExecStart=/bin/bash -ac '. {/path/to/}.venv/bin/activate && raw-transfer {arguments}'
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
```
As an example:
```
[Unit]
Description=Raw data transfer
After=network.target postgresql.service
Wants=postgresql.service
[Service]
User=goto
Environment=PYTHONUNBUFFERED=1
Type=simple
ExecStart=/home/goto/rawtransfer/.venv/bin/raw-transfer /data/goto/images goto@goto-gateway:/export/gotodata2/raw \
--patterns "t?_r*_ut?.fits" \
--metadata-regex "t(?P<tel>\d)_r(?P<run>\d+)_ut(?P<ut>\d).fits" \
--header-card DATE-OBS \
--db-host goto-gateway \
--db-user goto \
--db-name rawtransfer \
--db-port 5432 \
--db-table goto \
--copy-timeout 60 \
--workers 8 \
--ssh-controlmaster \
--canary-interval 15 \
--backfill-interval 1200 \
--backfill-hours 12 \
--backfill-recency-window 600 \
--overwrite-unpacked-interval 1200 \
--overwrite-unpacked-hours 12 \
--overwrite-unpacked-recency-window 7200 \
--logfile /var/log/rawtransfer/rawtransfer.log \
--slack-webhook-url "https://hooks.slack.com/services/super/secret" \
--verbose
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
```
Note: you may need to set up the logfile directory with the correct permissions.
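For example, assuming the logfile path and service user from the example above:

```
# Create the log directory and hand it to the user running the service.
sudo mkdir -p /var/log/rawtransfer
sudo chown goto:goto /var/log/rawtransfer
```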
Now enable and start the systemd service:

```
sudo systemctl enable rawtransfer
sudo systemctl start rawtransfer
```

Check things are running smoothly with:

```
sudo systemctl status rawtransfer
```

Follow the stream logging with:

```
sudo journalctl -fu rawtransfer
```

To perform a one-off catch-up, from the terminal one can run:

```
source /path/to/rawtransfer/.venv/bin/activate
raw-transfer /data/goto/images goto@goto-gateway:/export/gotodata2/raw \
--backfill-only \
--patterns "t?_r*_ut?.fits" \
--metadata-regex "t(?P<tel>\d)_r(?P<run>\d+)_ut(?P<ut>\d).fits" \
--header-card DATE-OBS \
--db-host goto-gateway \
--db-user goto \
--db-name rawtransfer \
--db-port 5432 \
--db-table goto \
--copy-timeout 60 \
--workers 8 \
--ssh-controlmaster \
--backfill-hours 24 \
--verbose
```

The important arguments here are `--backfill-only`, which instructs the program to perform a single backfill operation and then exit, and `--backfill-hours 24`, which gives the age in hours of the oldest file to transfer as part of the backfill. See Gotchas with respect to packing files.
Alongside the Python requirements (requirements.txt), the following binaries must be available on the local machine:

- `find` (must be GNU and version >= 4.3.3)
- `fpack`
- `fitscheck` (this is installed as part of astropy and so should be available anyway)

A first check of the service is to ensure these binaries are available.
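A quick way to do this by hand (a sketch; adjust as needed):

```
# Confirm the required binaries are on the PATH, and check the find version.
for cmd in find fpack fitscheck; do
    command -v "$cmd" >/dev/null || echo "missing: $cmd"
done
find --version | head -n 1   # should report GNU findutils, version >= 4.3.3
```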
The steps of the programme are described in plain English below; the example arguments from the systemd service file above are given in parentheses to highlight their purpose.
Once running (as a service or otherwise), raw-transfer performs the following actions:
- Sets up a task queue with multiple thread workers (`--workers 8`) to hold and perform the processing tasks.
- Sets up a watchdog observer to monitor the source directory (`/data/goto/images`) for new files matching glob-style patterns (`--patterns "t?_r*_ut?.fits"`). The creation of these files should ideally be atomic, i.e. created by a `mv`-like operation after writing to a temporary filename. If instead they are created and then written to, the event handler will iteratively check the file size, waiting a short time between checks (`--patience`). Once the file size is unchanging, it assumes the writing is complete. This latter method is more prone to backfire, as well as significantly slowing the overall processing.
- A task to process the file and transfer it to a remote location (`goto@goto-gateway:/export/gotodata2/raw`) is passed to the task queue. Once a thread is available, it will start processing the file:
  - If the transfer relies on remote resources of a host (`goto@goto-gateway`) and/or a database, it will check these are available.
  - The file is checked for integrity and packed (compressed); see the sketch after this list.
  - If a database is being used to store transfers, a sha1 hash for the packed file is calculated and inserted into the database. This effectively "locks" that hash and prevents any concurrent processing of the file.
  - The remote sub-directories are created if needed, following a `YYYY/MM/DD` structure. The location of a file within this structure is determined by the value of the datetime stamp in its header card (`--header-card DATE-OBS`).
  - An `scp` call is made to transfer the source file to the remote destination.
  - If a database is being used to store transfers, the row matching the sha1 hash of the file is updated with metadata based on the filename (`--metadata-regex "t(?P<tel>\d)_r(?P<run>\d+)_ut(?P<ut>\d).fits"`), as well as time stamps for the file creation and transfer.
- If a task fails due to an inability to access any remote resource (e.g. a link to the host or database goes down), then it is reinserted into the task queue, and the task queue is paused until the canary next returns successfully.
- On an interval (`--backfill-interval 1200`) a back-filling function is called that checks for files in the source directory with a modification time in the previous 12-hour window (`--backfill-hours 12`) which have no entry in the transfer database (based on a search for matching `source_filepath` entries). Any found to be missing are added for processing. This feature requires a database.
- On an interval (`--canary-interval 15`) a canary function is called to check that remote resources are accessible. Upon failure it raises an exception that triggers the pausing of the task queue until the canary next runs without exception.
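As an illustration, the integrity-check, packing, and hashing steps are roughly equivalent to running the following by hand (a sketch only; the filename is a hypothetical example):

```
f=/data/goto/images/t4_r0123456_ut2.fits   # hypothetical example file
fitscheck "$f"       # verify the FITS checksums are intact
fpack "$f"           # write the compressed copy alongside the original as "$f.fz"
sha1sum "$f.fz"      # the hash inserted into the database to lock the transfer
```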
A database is used as the central source of truth for all transfers that occur. Its presence is not required, but it is strongly encouraged, as it is used to verify successful transfers and prevent collisions in processing. The database operations were written with PostgreSQL in mind but, given the simple nature of the queries, they should work with other major (SQL) database flavours with little to no amendment.
The database given by the connection arguments (`--db-host`, `--db-user`, `--db-name`, `--db-port`)
must have a table `--db-table` in the public schema (here we use the example "goto"). This table
should be defined as follows (or equivalently for whatever database flavour is being used):

```
create table goto ( -- table name to match that given in the arguments when running
sha1_hash varchar(40) primary key, -- sha1sum of raw data file
source_filepath text, -- filepath to original raw data file
file_created_at timestamp without time zone, -- UTC time that original file was created
file_transferred_at timestamp without time zone, -- UTC time that file transfer completed
filepath text, -- filepath to transferred raw data file
tel smallint, -- [metadata column] tel number of raw data scraped from filename
run int, -- [metadata column] run number of raw data scraped from filename
ut smallint, -- [metadata column] ut number of raw data scraped from filename
created_at timestamp without time zone default now() not null -- UTC time that database record was created
);
create index on goto (source_filepath); -- Not necessary, but wise
create index on goto (file_transferred_at); -- Not necessary, but wise
```
Note: the [metadata column] examples above are populated based on the named capture groups in the
`--metadata-regex` argument used in this readme. When running this service, the names of these groups must match same-named columns in the database table.
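With the table in place, recorded transfers can be inspected directly. For example (a sketch, assuming the connection details used above; credentials are read from `~/.pgpass`):

```
# Show the ten most recently transferred files recorded by the service.
psql -h goto-gateway -p 5432 -U goto -d rawtransfer \
  -c "SELECT source_filepath, file_transferred_at FROM goto ORDER BY file_transferred_at DESC NULLS LAST LIMIT 10;"
```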
- The directory to monitor should ideally be on a local device in order to react to inotify events. Monitoring a remote device, mounted via NFS for example, is possible, but this will invoke a simple polling observer, and the events captured are prone to races.
  - For example, if the raw data producer writes to a temporary file and then moves it to a final filename in order to make file creation atomic, this process can look like either a `move` or a `create` event to a polling observer, depending on when it polls the file system with respect to the temporary file creation and move operations. One solution is to create the temporary file in a distinct (unwatched) directory first, and move it into the watched `source_directory` when complete.
- When running in one-off mode, files that are transferred will not have their original file overwritten with the `fpack` version, as occurs normally in watcher mode. This can be done manually (see the sketch after this list), or safely ignored if the source data directory is periodically cleaned for space anyway.
- Multiple glob-style patterns given in the `--patterns` argument must be mutually exclusive, to prevent duplicate transfer attempts.
- When running in `--backfill-only` mode, the task queue will immediately re-attempt failures due to timeouts or an inability to contact the database or ssh to the host. Keep an eye on the transfer; the threads could fill quickly with tasks that cannot complete easily.
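Packing the originals by hand after a one-off run could look something like the following (a sketch; the path, pattern, and 24-hour window match the one-off example above, and the originals are left in place until you have verified the `.fz` copies):

```
# Pack any FITS files from the last 24 hours that raw-transfer left unpacked.
find /data/goto/images -name "t?_r*_ut?.fits" -mtime -1 -exec fpack {} \;
```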
