anvaly/SpaRTa_AirFlow

Running of the SpaRTa pipeline has been automated via Airflow, an open-source platform used to create, schedule, and monitor workflows. The SpaRTa workflow is built around a shell script called “run.sh”, which contains a list of R scripts that process the data. These R scripts are hosted in a GitHub repository and built into a Docker container that is stored in an AWS ECR (Elastic Container Registry) repository. Data can be stored in S3, an AWS (Amazon Web Services) storage system, or on a file system.
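
For reference, “run.sh” is essentially an ordered list of Rscript invocations. A minimal sketch, assuming hypothetical script names and a hypothetical argument convention (the actual list lives in this repository):

```bash
#!/bin/bash
# Sketch only: the script names below are placeholders, not the actual
# SpaRTa scripts. Each step reads from the input directory and writes
# its results to the output directory.
set -e  # stop the pipeline if any script fails

INPUT_DIR=$1
OUTPUT_DIR=$2

Rscript 01_parse_factor_sheet.R "$INPUT_DIR" "$OUTPUT_DIR"
Rscript 02_process_data.R       "$INPUT_DIR" "$OUTPUT_DIR"
Rscript 03_generate_reports.R   "$INPUT_DIR" "$OUTPUT_DIR"
```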

Users log in to the Airflow user interface and trigger this workflow, providing the locations of the input and output directories and whether to download data from AWS S3. The R scripts search the input directory for a file called a factor sheet and parse the locations of the data from that file. If data is to be downloaded from S3, the factor sheet will reference S3 locations for the data. Next, the list of R scripts in “run.sh” is run against the input directory, and any output is written to the output directory. Users can control which R scripts run by editing the list of scripts in “run.sh”. The R scripts execute inside a Docker container: when the Airflow workflow is triggered, the container is pulled from AWS ECR and run in AWS ECS (Elastic Container Service).
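
As an illustration, the same trigger can also be issued from the Airflow CLI by passing the run parameters as a JSON configuration. The DAG ID and parameter names below are assumed placeholders; the actual names are defined in this repository's DAG:

```bash
# Trigger the workflow from the CLI instead of the Airflow UI.
# "sparta_pipeline" and the conf keys are assumed names for illustration.
airflow dags trigger sparta_pipeline \
  --conf '{
    "input_dir": "/data/sparta/input",
    "output_dir": "/data/sparta/output",
    "download_from_s3": true
  }'
```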

Whenever code in the R scripts is updated, the Docker container needs to be rebuilt with the updated code. This is done via Jenkins, an open-source automation server. After the code is updated in the GitHub repository, the user triggers a Jenkins workflow that rebuilds the Docker container with the updated code and pushes it to the AWS ECR repository. The next time the Airflow workflow is triggered, it pulls that latest Docker container.
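
The Jenkins job amounts to a standard build-and-push sequence against ECR. A sketch of the shell steps such a job might run, assuming a placeholder account ID, region, and repository name:

```bash
# Placeholders: substitute the real account ID, region, and repository name.
REGION=us-east-1
ACCOUNT=123456789012
REPO=sparta
REGISTRY=$ACCOUNT.dkr.ecr.$REGION.amazonaws.com

# Authenticate Docker to the ECR registry.
aws ecr get-login-password --region $REGION \
  | docker login --username AWS --password-stdin $REGISTRY

# Rebuild the image from the updated code and push it to ECR.
docker build -t $REPO:latest .
docker tag $REPO:latest $REGISTRY/$REPO:latest
docker push $REGISTRY/$REPO:latest
```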
