ESG Data Replication

Introduction

ESG data replication is the copying of (some) data from one data node to another, followed by publication of the copy. This redundancy protects against data loss and, especially, improves data access and transfer times.

ESG data replication is a three step process.

  1. Use the Replication Client to query an ESG gateway server for information about a dataset and generate a command file to download the dataset files.
  2. A data movement agent moves the files to the local disk.
  3. The Replication Client can then be used to check and publish the new copy of the dataset.

As noted, steps 1 and 3 can be performed using the Replication Client. For step 2, a separate data movement agent must be used. The Replication Client does not transfer files itself.

Tools

Replication Client

The Replication Client uses a configuration file to initialize some of its operating parameters. By default, this is the same file used by the ESGCET software (usually ~/.esgcet/esg.ini). In the configuration file, the replication settings are demarcated with a "[replication]" section label. Some of the keys used by replication have the same names as keys in other sections, for example log_level. Setting a value for log_level in the replication section will not cause other subsystems to use the replication logging level; if no value is set in the replication section, the value set in "[DEFAULT]" is used.

Below is an example section:

  [replication]
  tds_directory = tdstmp/pcmdi
  replica_root = /export/data/replicas
  source_gateway = ESG-PCMDI
  hessian_service_remote_metadata_url = http://host.tld/hess/remoteMetadataService
  log_level = DEBUG

The settings above have these meanings.

tds_directory

This string points to a location on disk where the replication software stores working files. During the metadata query phase, the original THREDDS catalog is stored here for use later in the process. A simple text file holds the mapping of dataset IDs to these catalog files. When a replica THREDDS catalog file is created, it will typically also be stored here. The path may be relative to the run directory or absolute. The example above indicates a directory "tdstmp" in the current directory with the subdirectory "pcmdi". Sites replicating data may find it useful to make a separate subdirectory for each site they replicate from.
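
As a rough sketch of the working files that accumulate here (the file names below are hypothetical placeholders; the actual names are chosen by the replication software):

  tdstmp/pcmdi/
      dataset_map.txt          # maps dataset IDs to catalog files
      original_catalog.xml     # original THREDDS catalog saved during the query phase
      replica_catalog.xml      # replica THREDDS catalog created later in the process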

replica_root

The value assigned to "replica_root" becomes the prefix for the local directory path in the download control files. The tree structure of the originating data node will be replicated from this point. (See the "skip_count" configuration key in the Developers' Documentation for information about trimming the original path strings.)
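
For illustration (the dataset path below is hypothetical), a file stored on the originating data node at

  cmip5/output/PCMDI/model/experiment/run1/tas.nc

would, with the example configuration above, be written by the data movement agent to

  /export/data/replicas/cmip5/output/PCMDI/model/experiment/run1/tas.nc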

source_gateway

This value specifies the code for the gateway fronting the data node with the original datasets. It is used when generating the replica catalogs. It is not the gateway publishing the replica datasets.

hessian_service_remote_metadata_url

The metadata used for generating the download control scripts is retrieved from the URL specified in this configuration parameter. It should point to the gateway from which you will be retrieving datasets, not the gateway to which the replicas will be published.

log_level

The "log_level" does not need to be set by most users. It can be used, however, to increase the verbosity of the logs from replication scripts without making all scripts in ESG log at the same level. See the "log_level" keyword in the Developers' Documentation for more information.

For a complete list of possible key/value pairs and their uses, see the "replication.configurationKeys" section of the Developers' Documentation.

Installation

Using easy_install (from the setuptools Python package):

easy_install http://www.isi.edu/~cward/esg/replication/esgreplication-0.10.1.tar.gz

The "easy_install" utility will check that you have compatible versions of the ESG-CET (2.6.3 or better; 2.8.x recommended) and LXML (2.2 or better) packages. If you do, it will install the scripts in the CDAT installation bin directory, assumed to be "/usr/local/cdat/bin."

Usage Examples

The Replication Client provides two scripts to perform ESG data replication.

The script esgreplicate.py queries an ESG gateway for the metadata of a dataset, which is then used to generate an input file for a data movement agent. The data movement agent performs the download of the dataset's files. (The replication facility itself does not move data. See "Data Movement Agents" below.)

The following simple example illustrates a basic usage of the script:

  esgreplicate.py -i dataset.txt  -o dataset-bdm.xml  -t bdm

The file "dataset.txt" is a text file with dataset ids, each on a separate line. The file "dataset-bdm.xml" is the name for the output file to be processed by the BDM data movement agent. The "-t" option tells the script to generate a command file for BDM, but as that is the default, it is not necessary to include it. Use "--help" as an argument to get the internal usage help text.

  • Note: The replica_root configuration parameter can be used to prefix the location of every file, so if you are using the guc type (globus-url-copy), make sure that the replica_root parameter starts with either file:// for local access or gsiftp:// for third-party transfers.
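
For example (the host name and path below are hypothetical), a configuration intended for the guc type might set:

  [replication]
  # for third-party transfers:
  replica_root = gsiftp://myhost.example.org/export/data/replicas
  # or, for local access:
  # replica_root = file:///export/data/replicas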

With the dataset files downloaded, the esgpublishreplica.py script can be used to perform some basic quality-control checks on them and then issue a publication request to an ESG gateway for general use.

To perform a simple check on the downloaded data set, use the "check" action:

  esgpublishreplica.py -a check -d data_id -l service_name -p project_name -t /data/folder -v 1

This command will perform a scan, catalog generation, and catalog comparison for the dataset with id "data_id" found in /data/folder on a locally mounted file system. The URLs generated for each file will use the protocols configured by the named services in the esg.ini file.

The "check" steps may also be performed one-at-a-time with the actions "scan," "catalog," and "compare" respectively.

  esgpublishreplica.py -a scan -m mapfile.txt -d data_id -p project_name -t /data/folder
  esgpublishreplica.py -a catalog -m mapfile.txt -l service_name -d data_id -p project_name -v 1 -g ESG-GTWY -s original_tds.xml
  esgpublishreplica.py -a compare -s original_tds.xml -r replica_tds.xml

If the initial tests succeed, the replica may then be published to an ESG gateway server with the "publish" action:

  esgpublishreplica.py -a publish -r data_id.replica.v1.xml -d data_id -p project_name -v 1

where the "-r" value is the name of the catalog file generated by the "check" action above. Before running esgpublishreplica.py, you must have properly setup a security certificate for use with the gateway and the gateway use be configured to allow your certificate to publish data for the project indicated.

Note that the Replication Client does not perform all possible data quality checks. The checks it performs only ensure that the files transferred correctly and that a functional TDS catalog can be produced.

There are other options available. Use the "--help" option to see a list.

Links

Data Movement Agents

Data movement agents are not included in this package. Examples of data movement agents are the Bulk Data Mover (BDM) and globus-url-copy. These must be installed separately.

BDM is available from LBL. The globus-url-copy tool is part of the Globus Toolkit.
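
As an illustration only (the hosts and paths below are hypothetical), a single-file transfer with globus-url-copy might look like:

  globus-url-copy -p 4 -cd \
      gsiftp://source.example.org/data/cmip5/tas.nc \
      file:///export/data/replicas/data/cmip5/tas.nc

Here "-p 4" requests four parallel data streams and "-cd" creates any missing destination directories.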

Experimental Data Movement Agents

The command-line help from the esgreplicate.py script indicates that other data movement agents are possible; however, support for these is experimental. Feedback on how they work with the Replication Client is appreciated, but the Replication Client development team cannot promise to address all issues that arise. The experimental data movement agents are listed in the script's help output.
