From aec6b350cc36e3291bcba91a288678fb2c215826 Mon Sep 17 00:00:00 2001 From: Austin Macdonald Date: Tue, 2 Sep 2025 15:37:27 -0500 Subject: [PATCH 1/9] Add hello-world and basic datalad-pair tutorial Tested against typhon --- docs/source/index.rst | 1 + docs/source/tutorial-ssh.rst | 178 +++++++++++++++++++++++++++++++++++ 2 files changed, 179 insertions(+) create mode 100644 docs/source/tutorial-ssh.rst diff --git a/docs/source/index.rst b/docs/source/index.rst index bd70fa23..604f222b 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -5,6 +5,7 @@ ReproMan |---| tools for reproducible neuroimaging :maxdepth: 1 overview + tutorial-ssh acknowledgements Concepts and technologies diff --git a/docs/source/tutorial-ssh.rst b/docs/source/tutorial-ssh.rst new file mode 100644 index 00000000..c12ba044 --- /dev/null +++ b/docs/source/tutorial-ssh.rst @@ -0,0 +1,178 @@ +.. _tutorial-ssh: + +Tutorial: SSH Resource Workflows +********************************* + +This tutorial walks you through ReproMan workflows using SSH resources, from simple command execution to complex data analysis. +We'll start with a basic hello-world example, then progress to processing neuroimaging data. + +This tutorial demonstrates ReproMan's power in creating reproducible, traceable computational workflows across SSH-accessible computing environments. + +Overview +======== + +We'll cover two workflows: + +**Part 1: Hello World Example** +1. Create a ReproMan SSH resource +2. Execute a simple command remotely +3. Fetch and examine results + +**Part 2: Dataset Analysis Example** +1. Set up a DataLad dataset with input data +2. Execute MRIQC quality control analysis remotely +3. 
Collect and examine results with full provenance + +Prerequisites +============= + +- ReproMan installed (``pip install reproman``) +- Access to a remote server via SSH +- For Part 2: DataLad support (``pip install 'reproman[full]'``) + +Part 1: Hello World Example +============================ + +Step 1: Create an SSH Resource +------------------------------- + +First, let's add an SSH resource to ReproMan's inventory. Replace ``your-server.edu`` with your actual server:: + + reproman create myserver --resource-type ssh --backend-parameters host=your-server.edu + +Verify the resource was created:: + + reproman ls --refresh + +.. note:: + + The ``--refresh`` flag is needed to check the current status of resources. Without it, you'll only see cached status information. + +You should see output similar to:: + + RESOURCE NAME TYPE ID STATUS + ------------- ---- -- ------ + myserver ssh 1a23b456-789c- ONLINE + +Step 2: Execute a Simple Command +--------------------------------- + +Let's start with a simple test to verify our setup works. Create a working directory and run a basic command:: + + mkdir -p hello-world + cd hello-world + + reproman run --resource myserver \ + --sub local \ + --orc plain \ + --output results \ + sh -c 'mkdir -p results && echo "Hello from ReproMan on $(hostname)" > results/hello.txt' + + +Step 3: Fetch Results +--------------------- + +The job will execute on the remote. To check status and fetch results:: + + # Check job status and get job ID + reproman jobs + + # Fetch results for completed job (replace JOB_ID with actual ID) + reproman jobs JOB_ID + +When you run ``reproman jobs JOB_ID``, ReproMan will automatically: + +- Fetch the output files from the remote to your local working directory +- Display job information and logs +- Unregister the completed job + +You should now see the results locally:: + + cat results/hello.txt + +.. note:: + + ReproMan creates a working directory on the remote resource automatically. 
By default, it uses ``~/.reproman/run-root`` on the remote. You can verify the file exists there with ``reproman login myserver``. + +Part 2: Dataset Analysis Example +================================= + +Now let's try a more realistic example with DataLad dataset management and neuroimaging analysis. + +Step 1: Set Up the Analysis Dataset +------------------------------------ + +Create a new DataLad dataset for our analysis:: + + # Create dataset for MRIQC quality control results + datalad create -d demo-mriqc -c text2git + cd demo-mriqc + +Install input data (using a demo BIDS dataset):: + + # TODO does this have to be fetched locally? i think no? + # Install demo neuroimaging dataset + datalad install -d . -s https://github.com/ReproNim/ds000003-demo sourcedata/raw + + +Set up working directory to be ignored:: + + # TODO oneline with datalad run + echo "workdir/" > .gitignore + datalad save -m "Ignore processing workdir" .gitignore + +Step 2: Execute Analysis with DataLad Integration +------------------------------------------------- + +For full provenance tracking with DataLad:: + + reproman run --resource myserver \ + --sub local \ + --orc datalad-pair-run \ + --input sourcedata/raw \ + --output . \ + bash -c 'podman run --rm -v "$(pwd):/work:rw" poldracklab/mriqc:latest /work/sourcedata/raw /work/results participant group --participant-label 02' + +Step 3: Monitor Execution +------------------------- + +ReproMan jobs run in detached mode by default. Monitor progress:: + + # List all jobs + reproman jobs + + # Check specific job status (replace JOB_ID with actual ID) + reproman jobs JOB_ID + + # Fetch completed job results + reproman jobs JOB_ID --fetch + +For attached execution (wait for completion):: + + reproman run --resource myserver --follow \ + [... rest of command ...] 
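+
+As a quick local sanity check after fetching, the same POSIX tools used in
+the remote command can confirm that the output made the round trip (the
+path is the ``results`` directory fetched in Step 3)::
+
+    # Succeeds only if the fetched file exists and is non-empty
+    test -s results/hello.txt && echo "fetch OK"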
+ +Step 4: Examine Results and Provenance +-------------------------------------- + +Once the job completes, examine what was captured:: + + # View the provenance record + git log --oneline -1 + + # Look at captured job information + ls .reproman/jobs/myserver/ + + # View job specification + cat .reproman/jobs/myserver/JOB_ID/spec.yaml + + # Check MRIQC outputs + ls -la results/ + +The DataLad orchestrators create rich provenance records:: + + # View the detailed run record + git show --stat + + # See what files were modified/added + git show --name-status From 9dfebc6cde3aa2335f070c5a220eaa4c9ab720d1 Mon Sep 17 00:00:00 2001 From: Austin Macdonald Date: Wed, 3 Sep 2025 12:06:15 -0500 Subject: [PATCH 2/9] fixup: nipreps repo for mriqc container --- docs/source/execution.rst | 70 ++++++++++++++++++++++-------------- docs/source/tutorial-ssh.rst | 2 +- 2 files changed, 44 insertions(+), 28 deletions(-) diff --git a/docs/source/execution.rst b/docs/source/execution.rst index 474fc292..5905eec3 100644 --- a/docs/source/execution.rst +++ b/docs/source/execution.rst @@ -53,10 +53,8 @@ necessary). Choosing an orchestrator ------------------------ -Before running a command, we need to decide on an orchestrator. The -orchestrator is responsible for the first and third :ref:`tasks above -`, preparing the remote and collecting the results. The complete -set of orchestrators, accompanied by descriptions, can be seen by +Orchestrators are responsible for preparing the remote and collecting the results. + The complete set of orchestrators, accompanied by descriptions, can be seen by calling ``reproman run --list=orchestrators``. .. note:: @@ -66,29 +64,47 @@ calling ``reproman run --list=orchestrators``. only a limited set of functionality is available. If you are new to DataLad, consider reading the `DataLad handbook`_. -The main orchestrator choices are ``datalad-pair``, -``datalad-pair-run``, and ``datalad-local-run``. 
If the remote has -DataLad available, you should go with one of the ``datalad-pair*`` orchestrators. -These will sync your local dataset with a dataset on the remote machine -(using `datalad push`_), creating one if it doesn't already exist -(using `datalad create-sibling`_). - -``datalad-pair`` differs from the ``datalad-*-run`` orchestrators in the -way it captures results. After execution has completed, ``datalad-pair`` -commits the result *on the remote* via DataLad. On fetch, it will pull -that commit down with `datalad update`_. Outputs (specified via -``--outputs`` or as a job parameter) are retrieved with `datalad get`_. - -``datalad-pair-run`` and ``datalad-local-run``, on the other hand, -determine a list of output files based on modification times and -packages these files in a tarball. (This approach is inspired by -`datalad-htcondor`_.) On fetch, this tarball is downloaded locally and -used to create a `datalad run`_ commit in the *local* repository. - -There is one more orchestrator, ``datalad-no-remote``, that is designed -to work only with a local shell resource. It is similar to -``datalad-pair``, except that the command is executed in the same -directory from which ``reproman run`` is invoked. 
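+The orchestrator is picked per invocation with the ``--orchestrator``
+option of ``reproman run``. For example, with an SSH resource named "foo"
+already in the inventory (the input, output, and script names here are
+placeholders)::
+
+    reproman run --resource foo --orchestrator datalad-pair \
+        --input data/raw --output results \
+        ./run-analysis.sh
+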
+Choose the orchestrator based on your setup and needs: + +**For remote resources with DataLad (recommended):** + +- **``datalad-pair``** - Best for persistent remote datasets + + - Creates and maintains DataLad datasets on the remote + - Commits results directly on the remote with full provenance + - Retrieves results using `datalad update`_ and `datalad get`_ + - Marks completed jobs with git refs (refs/reproman/JOBID) + +- **``datalad-pair-run``** - Best for capturing runs in local dataset + + - Prepares remote dataset like ``datalad-pair`` + - Packages results in tarball based on file modification times + - Creates a `datalad run`_ commit in your *local* repository + - Marks local commit with git ref (refs/reproman/JOBID) + +**For remote resources without DataLad:** + +- **``datalad-local-run``** - Remote execution, local DataLad integration + + - Uses plain remote directory (no DataLad on remote required) + - Captures results as `datalad run`_ commit locally + - Good when remote lacks DataLad but you want local provenance + +- **``plain``** - Simple remote execution + + - Basic file transfer using session.put() and session.get() + - No DataLad integration or provenance tracking + - Creates working directory named with job ID + - Sufficient for simple tasks but DataLad orchestrators recommended + +**For local execution:** + +- **``datalad-no-remote``** - Local dataset execution + + - Executes in current local dataset directory + - Behaves like ``datalad-pair`` but stays local + - Available for local shell resources only + - Good for testing workflows locally Revisiting :ref:`our concrete example ` and assuming we have an SSH resource named "foo" in our inventory, here's how we could diff --git a/docs/source/tutorial-ssh.rst b/docs/source/tutorial-ssh.rst index c12ba044..650643f7 100644 --- a/docs/source/tutorial-ssh.rst +++ b/docs/source/tutorial-ssh.rst @@ -131,7 +131,7 @@ For full provenance tracking with DataLad:: --orc datalad-pair-run \ --input 
sourcedata/raw \ --output . \ - bash -c 'podman run --rm -v "$(pwd):/work:rw" poldracklab/mriqc:latest /work/sourcedata/raw /work/results participant group --participant-label 02' + bash -c 'podman run --rm -v "$(pwd):/work:rw" nipreps/mriqc:latest /work/sourcedata/raw /work/results participant group --participant-label 02' Step 3: Monitor Execution ------------------------- From 711e3421069a60e5e21eddcece3c276a16172f7f Mon Sep 17 00:00:00 2001 From: Austin Macdonald Date: Wed, 3 Sep 2025 13:17:14 -0500 Subject: [PATCH 3/9] Use datalad run for workdir setup Co-Authored-By: Claude --- docs/source/tutorial-ssh.rst | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/source/tutorial-ssh.rst b/docs/source/tutorial-ssh.rst index 650643f7..93634a29 100644 --- a/docs/source/tutorial-ssh.rst +++ b/docs/source/tutorial-ssh.rst @@ -117,9 +117,7 @@ Install input data (using a demo BIDS dataset):: Set up working directory to be ignored:: - # TODO oneline with datalad run - echo "workdir/" > .gitignore - datalad save -m "Ignore processing workdir" .gitignore + datalad run -m "Ignore processing workdir" 'echo "workdir/" > .gitignore' Step 2: Execute Analysis with DataLad Integration ------------------------------------------------- From 2f3b48af6fcc1355c6f043c7be37c1ea8b8b05fc Mon Sep 17 00:00:00 2001 From: Austin Macdonald Date: Wed, 3 Sep 2025 13:21:29 -0500 Subject: [PATCH 4/9] Add brief explanation of datalad install to tutorial --- docs/source/tutorial-ssh.rst | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/source/tutorial-ssh.rst b/docs/source/tutorial-ssh.rst index 93634a29..f939eb25 100644 --- a/docs/source/tutorial-ssh.rst +++ b/docs/source/tutorial-ssh.rst @@ -110,10 +110,14 @@ Create a new DataLad dataset for our analysis:: Install input data (using a demo BIDS dataset):: - # TODO does this have to be fetched locally? i think no? # Install demo neuroimaging dataset datalad install -d . 
-s https://github.com/ReproNim/ds000003-demo sourcedata/raw +.. note:: + This only installs the dataset structure - the actual data files are not + downloaded locally. DataLad will automatically fetch any data specified + by `--input` when the analysis runs. + Set up working directory to be ignored:: From d5554b4df6d470cb6d9624956e660ff72d90c694 Mon Sep 17 00:00:00 2001 From: Austin Macdonald Date: Wed, 3 Sep 2025 13:22:52 -0500 Subject: [PATCH 5/9] add newbie docker/podman explanation of volume mounts --- docs/source/tutorial-ssh.rst | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/docs/source/tutorial-ssh.rst b/docs/source/tutorial-ssh.rst index f939eb25..a09ff721 100644 --- a/docs/source/tutorial-ssh.rst +++ b/docs/source/tutorial-ssh.rst @@ -135,6 +135,11 @@ For full provenance tracking with DataLad:: --output . \ bash -c 'podman run --rm -v "$(pwd):/work:rw" nipreps/mriqc:latest /work/sourcedata/raw /work/results participant group --participant-label 02' +.. note:: + The ``-v "$(pwd):/work:rw"`` part mounts your current directory into the + container at ``/work``, allowing the containerized software to access the + top level dataset. + Step 3: Monitor Execution ------------------------- From a1a5e5c8285f4f29ab244341fd31067d9447a052 Mon Sep 17 00:00:00 2001 From: Austin Macdonald Date: Wed, 3 Sep 2025 13:26:27 -0500 Subject: [PATCH 6/9] use full length option names in tutorial --- docs/source/tutorial-ssh.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/tutorial-ssh.rst b/docs/source/tutorial-ssh.rst index a09ff721..0d114a8c 100644 --- a/docs/source/tutorial-ssh.rst +++ b/docs/source/tutorial-ssh.rst @@ -63,8 +63,8 @@ Let's start with a simple test to verify our setup works. 
Create a working direc cd hello-world reproman run --resource myserver \ - --sub local \ - --orc plain \ + --submitter local \ + --orchestrator plain \ --output results \ sh -c 'mkdir -p results && echo "Hello from ReproMan on $(hostname)" > results/hello.txt' @@ -129,8 +129,8 @@ Step 2: Execute Analysis with DataLad Integration For full provenance tracking with DataLad:: reproman run --resource myserver \ - --sub local \ - --orc datalad-pair-run \ + --submitter local \ + --orchestrator datalad-pair-run \ --input sourcedata/raw \ --output . \ bash -c 'podman run --rm -v "$(pwd):/work:rw" nipreps/mriqc:latest /work/sourcedata/raw /work/results participant group --participant-label 02' From e2687d9e049e0b6f137fd8e964474fad2dfe5cf3 Mon Sep 17 00:00:00 2001 From: Austin Macdonald Date: Wed, 3 Sep 2025 13:29:17 -0500 Subject: [PATCH 7/9] fixup list spacing --- docs/source/tutorial-ssh.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/source/tutorial-ssh.rst b/docs/source/tutorial-ssh.rst index 0d114a8c..9c15539c 100644 --- a/docs/source/tutorial-ssh.rst +++ b/docs/source/tutorial-ssh.rst @@ -14,11 +14,13 @@ Overview We'll cover two workflows: **Part 1: Hello World Example** + 1. Create a ReproMan SSH resource 2. Execute a simple command remotely 3. Fetch and examine results **Part 2: Dataset Analysis Example** + 1. Set up a DataLad dataset with input data 2. Execute MRIQC quality control analysis remotely 3. 
Collect and examine results with full provenance From 286744e2cf98267a052a29e579489d9f5579be5f Mon Sep 17 00:00:00 2001 From: Austin Macdonald Date: Wed, 3 Sep 2025 13:33:45 -0500 Subject: [PATCH 8/9] clarify requirements on local vs remote --- docs/source/tutorial-ssh.rst | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/docs/source/tutorial-ssh.rst b/docs/source/tutorial-ssh.rst index 9c15539c..0fd7bb69 100644 --- a/docs/source/tutorial-ssh.rst +++ b/docs/source/tutorial-ssh.rst @@ -15,22 +15,28 @@ We'll cover two workflows: **Part 1: Hello World Example** -1. Create a ReproMan SSH resource +1. Create a ReproMan SSH resource 2. Execute a simple command remotely 3. Fetch and examine results **Part 2: Dataset Analysis Example** 1. Set up a DataLad dataset with input data -2. Execute MRIQC quality control analysis remotely +2. Execute MRIQC quality control analysis remotely 3. Collect and examine results with full provenance Prerequisites ============= -- ReproMan installed (``pip install reproman``) +For Part 1: + +- ReproMan installed on local machine (``pip install reproman``) - Access to a remote server via SSH -- For Part 2: DataLad support (``pip install 'reproman[full]'``) + +For Part 2: + +- DataLad support (``pip install 'reproman[full]'``) +- DataLad installed on remote server Part 1: Hello World Example ============================ @@ -63,7 +69,7 @@ Let's start with a simple test to verify our setup works. Create a working direc mkdir -p hello-world cd hello-world - + reproman run --resource myserver \ --submitter local \ --orchestrator plain \ @@ -85,7 +91,7 @@ The job will execute on the remote. 
To check status and fetch results:: When you run ``reproman jobs JOB_ID``, ReproMan will automatically: - Fetch the output files from the remote to your local working directory -- Display job information and logs +- Display job information and logs - Unregister the completed job You should now see the results locally:: @@ -96,7 +102,7 @@ You should now see the results locally:: ReproMan creates a working directory on the remote resource automatically. By default, it uses ``~/.reproman/run-root`` on the remote. You can verify the file exists there with ``reproman login myserver``. -Part 2: Dataset Analysis Example +Part 2: Dataset Analysis Example ================================= Now let's try a more realistic example with DataLad dataset management and neuroimaging analysis. @@ -112,12 +118,12 @@ Create a new DataLad dataset for our analysis:: Install input data (using a demo BIDS dataset):: - # Install demo neuroimaging dataset + # Install demo neuroimaging dataset datalad install -d . -s https://github.com/ReproNim/ds000003-demo sourcedata/raw .. note:: - This only installs the dataset structure - the actual data files are not - downloaded locally. DataLad will automatically fetch any data specified + This only installs the dataset structure - the actual data files are not + downloaded locally. DataLad will automatically fetch any data specified by `--input` when the analysis runs. @@ -138,7 +144,7 @@ For full provenance tracking with DataLad:: bash -c 'podman run --rm -v "$(pwd):/work:rw" nipreps/mriqc:latest /work/sourcedata/raw /work/results participant group --participant-label 02' .. note:: - The ``-v "$(pwd):/work:rw"`` part mounts your current directory into the + The ``-v "$(pwd):/work:rw"`` part mounts your current directory into the container at ``/work``, allowing the containerized software to access the top level dataset. 
From f4c81aefc5f165b43778ad2d9315540bd28b54e0 Mon Sep 17 00:00:00 2001 From: Austin Macdonald Date: Fri, 12 Sep 2025 07:44:21 -0500 Subject: [PATCH 9/9] fixup: spacing --- docs/source/execution.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/execution.rst b/docs/source/execution.rst index 5905eec3..fec66d62 100644 --- a/docs/source/execution.rst +++ b/docs/source/execution.rst @@ -54,7 +54,7 @@ Choosing an orchestrator ------------------------ Orchestrators are responsible for preparing the remote and collecting the results. - The complete set of orchestrators, accompanied by descriptions, can be seen by +The complete set of orchestrators, accompanied by descriptions, can be seen by calling ``reproman run --list=orchestrators``. .. note::