diff --git a/docs/source/execution.rst b/docs/source/execution.rst index 474fc292..fec66d62 100644 --- a/docs/source/execution.rst +++ b/docs/source/execution.rst @@ -53,10 +53,8 @@ necessary). Choosing an orchestrator ------------------------ -Before running a command, we need to decide on an orchestrator. The -orchestrator is responsible for the first and third :ref:`tasks above -`, preparing the remote and collecting the results. The complete -set of orchestrators, accompanied by descriptions, can be seen by +Orchestrators are responsible for preparing the remote and collecting the results. +The complete set of orchestrators, accompanied by descriptions, can be seen by calling ``reproman run --list=orchestrators``. .. note:: @@ -66,29 +64,47 @@ calling ``reproman run --list=orchestrators``. only a limited set of functionality is available. If you are new to DataLad, consider reading the `DataLad handbook`_. -The main orchestrator choices are ``datalad-pair``, -``datalad-pair-run``, and ``datalad-local-run``. If the remote has -DataLad available, you should go with one of the ``datalad-pair*`` orchestrators. -These will sync your local dataset with a dataset on the remote machine -(using `datalad push`_), creating one if it doesn't already exist -(using `datalad create-sibling`_). - -``datalad-pair`` differs from the ``datalad-*-run`` orchestrators in the -way it captures results. After execution has completed, ``datalad-pair`` -commits the result *on the remote* via DataLad. On fetch, it will pull -that commit down with `datalad update`_. Outputs (specified via -``--outputs`` or as a job parameter) are retrieved with `datalad get`_. - -``datalad-pair-run`` and ``datalad-local-run``, on the other hand, -determine a list of output files based on modification times and -packages these files in a tarball. (This approach is inspired by -`datalad-htcondor`_.) 
On fetch, this tarball is downloaded locally and
-used to create a `datalad run`_ commit in the *local* repository.
-
-There is one more orchestrator, ``datalad-no-remote``, that is designed
-to work only with a local shell resource. It is similar to
-``datalad-pair``, except that the command is executed in the same
-directory from which ``reproman run`` is invoked.
+Choose the orchestrator based on your setup and needs:
+
+**For remote resources with DataLad (recommended):**
+
+- ``datalad-pair`` - Best for persistent remote datasets
+
+  - Creates and maintains a DataLad dataset on the remote
+  - Commits results directly on the remote with full provenance
+  - Retrieves results using `datalad update`_ and `datalad get`_
+  - Marks completed jobs with git refs (``refs/reproman/JOBID``)
+
+- ``datalad-pair-run`` - Best for capturing runs in the local dataset
+
+  - Prepares the remote dataset like ``datalad-pair``
+  - Packages results in a tarball based on file modification times
+  - Creates a `datalad run`_ commit in your *local* repository
+  - Marks the local commit with a git ref (``refs/reproman/JOBID``)
+
+**For remote resources without DataLad:**
+
+- ``datalad-local-run`` - Remote execution, local DataLad integration
+
+  - Uses a plain remote directory (no DataLad required on the remote)
+  - Captures results as a `datalad run`_ commit locally
+  - Good when the remote lacks DataLad but you want local provenance
+
+- ``plain`` - Simple remote execution
+
+  - Basic file transfer using ``session.put()`` and ``session.get()``
+  - No DataLad integration or provenance tracking
+  - Creates a working directory named with the job ID
+  - Sufficient for simple tasks, but the DataLad orchestrators are recommended
+
+**For local execution:**
+
+- ``datalad-no-remote`` - Local dataset execution
+
+  - Executes in the current local dataset directory
+  - Behaves like ``datalad-pair`` but stays local
+  - Available for local shell resources only
+  - Good for testing workflows locally

 Revisiting :ref:`our concrete example `
and assuming we have an SSH resource named "foo" in our inventory, here's how we could
diff --git a/docs/source/index.rst b/docs/source/index.rst
index bd70fa23..604f222b 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -5,6 +5,7 @@ ReproMan |---| tools for reproducible neuroimaging
    :maxdepth: 1

    overview
+   tutorial-ssh
    acknowledgements

 Concepts and technologies
diff --git a/docs/source/tutorial-ssh.rst b/docs/source/tutorial-ssh.rst
new file mode 100644
index 00000000..0fd7bb69
--- /dev/null
+++ b/docs/source/tutorial-ssh.rst
@@ -0,0 +1,193 @@
+.. _tutorial-ssh:
+
+Tutorial: SSH Resource Workflows
+*********************************
+
+This tutorial walks you through ReproMan workflows using SSH resources, from simple command execution to a complete data analysis.
+We'll start with a basic hello-world example, then progress to processing neuroimaging data, showing along the way how ReproMan keeps computational workflows reproducible and traceable across SSH-accessible computing environments.
+
+Overview
+========
+
+We'll cover two workflows:
+
+**Part 1: Hello World Example**
+
+1. Create a ReproMan SSH resource
+2. Execute a simple command remotely
+3. Fetch and examine results
+
+**Part 2: Dataset Analysis Example**
+
+1. Set up a DataLad dataset with input data
+2. Execute MRIQC quality control analysis remotely
+3. Collect and examine results with full provenance
+
+Prerequisites
+=============
+
+For Part 1:
+
+- ReproMan installed on your local machine (``pip install reproman``)
+- Access to a remote server via SSH
+
+For Part 2:
+
+- DataLad support (``pip install 'reproman[full]'``)
+- DataLad installed on the remote server
+
+Part 1: Hello World Example
+============================
+
+Step 1: Create an SSH Resource
+-------------------------------
+
+First, let's add an SSH resource to ReproMan's inventory.
Replace ``your-server.edu`` with your actual server:: + + reproman create myserver --resource-type ssh --backend-parameters host=your-server.edu + +Verify the resource was created:: + + reproman ls --refresh + +.. note:: + + The ``--refresh`` flag is needed to check the current status of resources. Without it, you'll only see cached status information. + +You should see output similar to:: + + RESOURCE NAME TYPE ID STATUS + ------------- ---- -- ------ + myserver ssh 1a23b456-789c- ONLINE + +Step 2: Execute a Simple Command +--------------------------------- + +Let's start with a simple test to verify our setup works. Create a working directory and run a basic command:: + + mkdir -p hello-world + cd hello-world + + reproman run --resource myserver \ + --submitter local \ + --orchestrator plain \ + --output results \ + sh -c 'mkdir -p results && echo "Hello from ReproMan on $(hostname)" > results/hello.txt' + + +Step 3: Fetch Results +--------------------- + +The job will execute on the remote. To check status and fetch results:: + + # Check job status and get job ID + reproman jobs + + # Fetch results for completed job (replace JOB_ID with actual ID) + reproman jobs JOB_ID + +When you run ``reproman jobs JOB_ID``, ReproMan will automatically: + +- Fetch the output files from the remote to your local working directory +- Display job information and logs +- Unregister the completed job + +You should now see the results locally:: + + cat results/hello.txt + +.. note:: + + ReproMan creates a working directory on the remote resource automatically. By default, it uses ``~/.reproman/run-root`` on the remote. You can verify the file exists there with ``reproman login myserver``. + +Part 2: Dataset Analysis Example +================================= + +Now let's try a more realistic example with DataLad dataset management and neuroimaging analysis. 
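To make the mechanics of Part 1 concrete: the ``plain`` orchestrator prepares a job directory on the remote, runs the command there, and copies declared outputs back. The following is a purely local emulation of that round trip, not ReproMan's actual implementation; the directory names and job ID are made up for illustration.

```shell
set -eu
# Stand-in for the remote working directory that ReproMan creates
remote=$(mktemp -d)
jobdir="$remote/job-0000"        # workdir named after a (made-up) job ID
mkdir -p "$jobdir"
# "Execute" the command inside the remote workdir, as in Step 2 above
( cd "$jobdir" && mkdir -p results \
  && echo "Hello from ReproMan on $(hostname)" > results/hello.txt )
# "Fetch" the declared --output path back to a local working directory
local_wd=$(mktemp -d)
cp -r "$jobdir/results" "$local_wd/"
cat "$local_wd/results/hello.txt"
```

The real orchestrator does this over SSH (via ``session.put()``/``session.get()``), but the shape of the workflow is the same: prepare, execute, fetch.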
+
+Step 1: Set Up the Analysis Dataset
+------------------------------------
+
+Create a new DataLad dataset for our analysis::
+
+    # Create dataset for MRIQC quality control results
+    datalad create -d demo-mriqc -c text2git
+    cd demo-mriqc
+
+Install input data (using a demo BIDS dataset)::
+
+    # Install demo neuroimaging dataset
+    datalad install -d . -s https://github.com/ReproNim/ds000003-demo sourcedata/raw
+
+.. note::
+   This only installs the dataset structure - the actual data files are not
+   downloaded locally. DataLad will automatically fetch any data specified
+   by ``--input`` when the analysis runs.
+
+Tell git to ignore the processing working directory::
+
+    datalad run -m "Ignore processing workdir" 'echo "workdir/" > .gitignore'
+
+Step 2: Execute Analysis with DataLad Integration
+-------------------------------------------------
+
+For full provenance tracking with DataLad::
+
+    reproman run --resource myserver \
+        --submitter local \
+        --orchestrator datalad-pair-run \
+        --input sourcedata/raw \
+        --output . \
+        bash -c 'podman run --rm -v "$(pwd):/work:rw" nipreps/mriqc:latest /work/sourcedata/raw /work/results participant group --participant-label 02'
+
+.. note::
+   The ``-v "$(pwd):/work:rw"`` part mounts your current directory into the
+   container at ``/work``, allowing the containerized software to access the
+   top-level dataset.
+
+Step 3: Monitor Execution
+-------------------------
+
+ReproMan jobs run in detached mode by default. Monitor progress::
+
+    # List all jobs
+    reproman jobs
+
+    # Check specific job status (replace JOB_ID with actual ID)
+    reproman jobs JOB_ID
+
+    # Fetch completed job results
+    reproman jobs JOB_ID --fetch
+
+For attached execution (wait for completion)::
+
+    reproman run --resource myserver --follow \
+        [... rest of command ...]
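Before examining the results, it helps to know how ``datalad-pair-run`` decides what to bring back: as described in the orchestrator overview, the ``*-run`` orchestrators detect outputs by file modification time and package them in a tarball. Here is a minimal local sketch of that idea (the stamp-file name is made up; this is not ReproMan's actual code):

```shell
set -eu
wd=$(mktemp -d)
cd "$wd"
echo old > input.txt              # pre-existing file, should NOT be collected
touch .stamp                      # hypothetical marker taken before execution
sleep 1                           # ensure later writes get a newer mtime
echo new > results.txt            # the "command" produces an output
# Collect everything modified after the stamp into a tarball
tarball=$(mktemp -u).tgz
find . -type f -newer .stamp ! -name .stamp -print | tar czf "$tarball" -T -
tar tzf "$tarball"
```

On fetch, ReproMan downloads such a tarball and unpacks it into the local dataset before creating the ``datalad run`` commit.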
+ +Step 4: Examine Results and Provenance +-------------------------------------- + +Once the job completes, examine what was captured:: + + # View the provenance record + git log --oneline -1 + + # Look at captured job information + ls .reproman/jobs/myserver/ + + # View job specification + cat .reproman/jobs/myserver/JOB_ID/spec.yaml + + # Check MRIQC outputs + ls -la results/ + +The DataLad orchestrators create rich provenance records:: + + # View the detailed run record + git show --stat + + # See what files were modified/added + git show --name-status
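The orchestrator overview noted that completed jobs are also marked with git refs under ``refs/reproman/``, so plain git can enumerate those markers. A throwaway-repository sketch (the job ID here is invented for illustration):

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
# Identity passed inline so the sketch works without global git config
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "captured run"
# ReproMan-style job marker (JOB ID made up for this sketch)
git update-ref refs/reproman/demo-job-0000 HEAD
# List all job markers
git for-each-ref --format='%(refname)' 'refs/reproman/*'
```

In a real dataset, run this from the dataset root to see which jobs have been captured.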