
refactor: refine shell and scheduler arguments #805

Merged
aplowman merged 7 commits into develop from fix/shell
May 7, 2025

Conversation

@aplowman
Contributor

@aplowman commented Apr 22, 2025

  • move Scheduler.options to QueuedScheduler and rename it (with a deprecation warning) to directives, since it only applies to schedulers that use a job submission system
  • remove the Scheduler attributes shell_args and shebang_args (and the class variables DEFAULT_SHELL_ARGS and DEFAULT_SHEBANG_ARGS) and replace them with Scheduler.shebang_executable, which will typically not be used, and is a way to override the shell's executable and executable_args in the jobscript shebang line
  • remove the unused DEFAULT_SHELL_EXECUTABLE class variable from Scheduler sub-classes
  • add Shell.executable_args (e.g. for using bash in --login mode) and include it in Shell.executable
  • aims to fix "Executables on the PATH are no longer found" #762

@gcapes please could you test this PR on CSF3 and check if it resolves your issue?

Recommended changes to configuration files

If schedulers.[scheduler-name].defaults.shebang_args is used in a configuration file, it will need to be removed, and the arguments added instead to the shell's new executable_args configuration (e.g. shells.bash.defaults.executable_args), which should be a list of strings. An example change is shown here.
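A minimal sketch of the change, using the keys named above (the sge scheduler name and the --login argument are illustrative):

```yaml
# Before: shebang args configured per scheduler (no longer supported)
schedulers:
  sge:
    defaults:
      shebang_args: --login

# After: arguments configured on the shell, as a list of strings
shells:
  bash:
    defaults:
      executable_args: [--login]
```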

Recommended changes to workflow template files

The Scheduler argument options (for verbatim scheduler directives) has been renamed directives. The options key can still be used, but a deprecation warning will be printed.
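In a workflow template's resources, the rename looks like this (the directive values are illustrative):

```yaml
resources:
  any:
    scheduler_args:
      # previously `options:`; that key still works, but prints a
      # deprecation warning
      directives:
        --time: 00:30:00
        --partition: serial
```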

@aplowman requested a review from a team as a code owner April 22, 2025 15:29
@aplowman marked this pull request as draft April 22, 2025 15:39
@aplowman marked this pull request as ready for review April 22, 2025 16:27
Collaborator

@dkfellows left a comment

Subject to one question, this is fine.

The question: what's the way of handling adding extra bits to a shebang line? The classic example of that is running a script interpreter with PATH searching using: #!/usr/bin/env interpreter though I've also seen an occasional need to pass other options there too. It's more common with Perl scripts I suppose...

In any case, as long as there is a documented answer, whatever it is, I'm happy with this PR.
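For what it's worth, a hedged sketch of how such an override might be expressed, based on the PR description's Scheduler.shebang_executable (whether it accepts a list of strings, and this exact key placement under defaults, are assumptions not confirmed in this thread):

```yaml
schedulers:
  sge:
    defaults:
      # Overrides the shell's executable and executable_args in the
      # jobscript shebang line, e.g. to search PATH for the interpreter
      # via /usr/bin/env. List form is an assumption.
      shebang_executable: ["/usr/bin/env", "bash"]
```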

@gcapes
Collaborator

gcapes commented Apr 23, 2025

Thanks Adam.

I made a new venv, installed hpcflow using pip install git+https://github.com/hpcflow/hpcflow-new.git@fix/shell,
then installed matflow.

On CSF3 using SGE it has worked :)

Using SLURM, it doesn't. I updated the config like this:

  manchester-CSF3-slurm:
    invocation:
      environment_setup:
      match:
        hostname: login1.csf3.man.alces.network
    config:
      machine: manchester-CSF3-new
      telemetry: true
      log_file_path: logs/<<app_name>>_v<<app_version>>.log
      environment_sources:
      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs-slurm.yaml
      task_schema_sources: []
      command_file_sources: []
      parameter_sources: []
      default_scheduler: slurm
      default_shell: bash
      schedulers:
        direct:
          defaults: {}
        slurm:
          defaults: {}
          partitions:
            serial:
              num_cores:
              - 1
              - 1
              - 1
            multicore:
              num_nodes:
              - 1
              - 1
              - 1
              num_cores:
              - 2
              - 1
              - 168
              num_cores_per_node:
              - 2
              - 1
              - 168
              parallel_modes:
              - distributed
              - shared
              - hybrid
            multinode:
              num_nodes:
              - 2
              - 1
              - 
              num_cores:
              - 80
              - 40
              - 
              num_cores_per_node:
              - 40
              - 40
              - 40
              parallel_modes:
              - distributed
              - hybrid
      shells:
        bash:
          defaults:
            executable_args: [--login]

But the workflow wouldn't submit.

$ matflow go wheresmatlab.yaml 
Traceback (most recent call last):
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/bin/matflow", line 8, in <module>
    sys.exit(cli())
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/cli.py", line 253, in make_and_submit_workflow
    out = app.make_and_submit_workflow(
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/app.py", line 1644, in <lambda>
    return lambda *args, **kwargs: func(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/app.py", line 2882, in _make_and_submit_workflow
    wk = self._make_workflow(
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/app.py", line 2762, in _make_workflow
    wk = self.Workflow.from_file(
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/log.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/core/workflow.py", line 1293, in from_file
    template = cls._app.WorkflowTemplate.from_file(
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/log.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/core/workflow.py", line 696, in from_file
    return cls.from_YAML_file(path_, variables=variables)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/log.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/core/workflow.py", line 622, in from_YAML_file
    return cls._from_data(data)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/log.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/core/workflow.py", line 539, in _from_data
    ts_dat, shared_data=cls._app._shared_data
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/app.py", line 1697, in _shared_data
    return cast("Mapping[str, Any]", self.template_components)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/app.py", line 1691, in template_components
    self._load_template_components()
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/log.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/app.py", line 1765, in _load_template_components
    for env_j in read_YAML_file(e_path):
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/log.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/core/utils.py", line 457, in read_YAML_file
    return read_YAML_str(yaml_str, typ=typ, variables=variables)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/log.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/hpcflow/sdk/core/utils.py", line 435, in read_YAML_str
    return yaml.load(yaml_str)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/ruamel/yaml/main.py", line 451, in load
    return constructor.get_single_data()
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowlocal/lib64/python3.9/site-packages/ruamel/yaml/constructor.py", line 114, in get_single_data
    node = self.composer.get_single_node()
  File "_ruamel_yaml.pyx", line 706, in ruamel.yaml.clib._ruamel_yaml.CParser.get_single_node
  File "_ruamel_yaml.pyx", line 724, in ruamel.yaml.clib._ruamel_yaml.CParser._compose_document
  File "_ruamel_yaml.pyx", line 773, in ruamel.yaml.clib._ruamel_yaml.CParser._compose_node
  File "_ruamel_yaml.pyx", line 850, in ruamel.yaml.clib._ruamel_yaml.CParser._compose_sequence_node
  File "_ruamel_yaml.pyx", line 775, in ruamel.yaml.clib._ruamel_yaml.CParser._compose_node
  File "_ruamel_yaml.pyx", line 889, in ruamel.yaml.clib._ruamel_yaml.CParser._compose_mapping_node
  File "_ruamel_yaml.pyx", line 773, in ruamel.yaml.clib._ruamel_yaml.CParser._compose_node
  File "_ruamel_yaml.pyx", line 850, in ruamel.yaml.clib._ruamel_yaml.CParser._compose_sequence_node
  File "_ruamel_yaml.pyx", line 775, in ruamel.yaml.clib._ruamel_yaml.CParser._compose_node
  File "_ruamel_yaml.pyx", line 891, in ruamel.yaml.clib._ruamel_yaml.CParser._compose_mapping_node
  File "_ruamel_yaml.pyx", line 904, in ruamel.yaml.clib._ruamel_yaml.CParser._parse_next_event
ruamel.yaml.parser.ParserError: while parsing a block mapping
  in "<unicode string>", line 27, column 5
did not find expected key
  in "<unicode string>", line 38, column 5

@aplowman
Contributor Author

> Subject to one question, this is fine.
>
> The question: what's the way of handling adding extra bits to a shebang line? The classic example of that is running a script interpreter with PATH searching using: #!/usr/bin/env interpreter though I've also seen an occasional need to pass other options there too. It's more common with Perl scripts I suppose...
>
> In any case, as long as there is a documented answer, whatever it is, I'm happy with this PR.

Thanks @dkfellows, I'll add some documentation.

@aplowman
Contributor Author

@gcapes glad this fixes the issue on SGE! Could you check your workflow template file and any task schema files with a YAML linter?

@gcapes
Collaborator

gcapes commented Apr 28, 2025

Yep, the workflow contains the task schema definition, and is valid YAML:

template_components:
  task_schemas:
  - objective: locate_matlab
    actions:
    - commands:
      - command: which matlab
      environments:
      - scope:
          type: any
        environment: matlab_env  

tasks:
- schema: locate_matlab

resources:
  any:
    scheduler_args:
      directives:
        --time: 00:30:00
        --partition: serial

@aplowman
Contributor Author

aplowman commented May 6, 2025

> Yep, the workflow contains the task schema definition, and is valid yaml: [...]

Could you also check your environment definition file(s) and any other YAML files pointed to in your config (and the built-in template component files, if you've made any changes to them)?

I will merge this PR soon. I think the issue above is unrelated, so could you open a new issue if it persists (unless you think it is related to this PR)?

@aplowman
Contributor Author

aplowman commented May 7, 2025

I'll merge this now. I have added another PR, #819, which identifies which YAML file is malformed, so if you are still having that issue @gcapes, perhaps you could try that branch to identify where the problem is.

@aplowman merged commit 3df59e1 into develop May 7, 2025
51 checks passed
@aplowman deleted the fix/shell branch May 7, 2025 13:21
@gcapes
Collaborator

gcapes commented May 7, 2025

Thanks @aplowman

I'm not really sure of the right place for this, so I'm just adding it here. I still have the problem, and this is the config error. Could you advise how to fix it?

$ matflow go wheresmatlab.yaml 
ConfigValidationError: 1 rule failed validation. 21/35 rules were tested.

Rule #33
--------
Path: ('shells', 'bash', 'defaults')
Value: {'executable_args': ['--login']}
Reasons:
 Condition callable returned False: `Value.allowed_keys('executable', 'os_args', 'WSL_executable', 'WSL_distribution', 'WSL_user')`.

config {'config_directory': PosixPath('/mnt/iusers01/support/mbexegc2/.matflow-new'), 'config_file_name': 'config.yaml', 'config_file_path': PosixPath('/mnt/iusers01/support/mbexegc2/.matflow-new/config.yaml'), 'config_file_contents': 'configs:\n  default:\n    invocation:\n      environment_setup:\n      match: {}\n    config:\n      machine: DEFAULT_MACHINE\n      telemetry: true\n      log_file_path: logs/<<app_name>>_v<<app_version>>.log\n      environment_sources: []\n      task_schema_sources: []\n      command_file_sources: []\n      parameter_sources: []\n      default_scheduler: direct\n      default_shell: bash\n      schedulers:\n        direct:\n          defaults: {}\n      shells:\n        bash:\n          defaults: {}\n      alternative_unix_runtime_dir: /tmp/mbexegc2\n  manchester-CSF3:\n    invocation:\n      environment_setup:\n      match:\n        hostname:\n        - admin01.pri.csf3.alces.network\n        - hlogin1.pri.csf3.alces.network\n        - hlogin2.pri.csf3.alces.network\n        - login1.pri.csf3.alces.network\n        - login2.pri.csf3.alces.network\n        - vlogin01.pri.csf3.alces.network\n    config:\n      machine: manchester-CSF3\n      telemetry: true\n      log_file_path: logs/<<app_name>>_v<<app_version>>.log\n      environment_sources:\n#      - /mnt/eps01-rds/jf01-home01/shared/software/matflow_envs/envs_CSF3.yaml\n#      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_laura.yaml\n#      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_patrick.yaml\n#      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_patrick_latest_versions.yaml\n#      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_default_CSF3.yaml\n#      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_rory.yaml\n#      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_test_demo_workflows.yaml\n#       - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_latest_hpcflowandmatflow.yaml\n       - 
/mnt/iusers01/support/mbexegc2/.matflow-new/envs_latest_hpcflowandmatflow_slurm.yaml\n      task_schema_sources: []\n      command_file_sources: []\n      parameter_sources: []\n      default_scheduler: sge\n      default_shell: bash\n      schedulers:\n        direct:\n          defaults: {}\n        sge:\n          defaults: {}\n            #shebang_args: --login\n          parallel_environments:\n            null:\n              num_cores:\n              - 1\n              - 1\n              - 1\n            amd.pe:\n              num_cores:\n              - 2\n              - 1\n              - 168\n              num_nodes:\n              - 1\n              - 1\n              - 1\n            smp.pe:\n              num_cores:\n              - 2\n              - 1\n              - 32\n              num_nodes:\n              - 1\n              - 1\n              - 1\n#            mpi-24-ib.pe:\n#              num_cores:\n#              - 48\n#              - 24\n#              - 120\n#              num_nodes:\n#              - 2\n#              - 1\n#              - 5\n      shells:\n        bash:\n          defaults: #{}\n            executable_args: [--login]\n      log_file_level: debug\n  manchester-CSF4:\n    invocation:\n      environment_setup:\n      match:\n        hostname: login*.csf4.local\n    config:\n      machine: manchester-CSF4\n      telemetry: true\n      log_file_path: logs/<<app_name>>_v<<app_version>>.log\n      environment_sources:\n      - /mnt/eps01-rds/jf01-home01/shared/software/matflow_envs/envs_CSF4.yaml\n      task_schema_sources: []\n      command_file_sources: []\n      parameter_sources: []\n      default_scheduler: slurm\n      default_shell: bash\n      schedulers:\n        direct:\n          defaults: {}\n        slurm:\n          defaults:\n            shebang_args: --login\n          partitions:\n            serial:\n              num_cores:\n              - 1\n              - 1\n              - 1\n            multicore:\n   
           num_nodes:\n              - 1\n              - 1\n              - 1\n              num_cores:\n              - 2\n              - 1\n              - 40\n              num_cores_per_node:\n              - 2\n              - 1\n              - 40\n              parallel_modes:\n              - distributed\n              - shared\n              - hybrid\n            multinode:\n              num_nodes:\n              - 2\n              - 1\n              - \n              num_cores:\n              - 80\n              - 40\n              - \n              num_cores_per_node:\n              - 40\n              - 40\n              - 40\n              parallel_modes:\n              - distributed\n              - hybrid\n      shells:\n        bash:\n          defaults: {}\n  manchester-CSF3-slurm:\n    invocation:\n      environment_setup:\n      match:\n        hostname: login1.csf3.man.alces.network\n    config:\n      machine: manchester-CSF3-new\n      telemetry: true\n      log_file_path: logs/<<app_name>>_v<<app_version>>.log\n      environment_sources:\n      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs-slurm.yaml\n      task_schema_sources: []\n      command_file_sources: []\n      parameter_sources: []\n      default_scheduler: slurm\n      default_shell: bash\n      schedulers:\n        direct:\n          defaults: {}\n        slurm:\n          defaults: {}\n          partitions:\n            serial:\n              num_cores:\n              - 1\n              - 1\n              - 1\n            multicore:\n              num_nodes:\n              - 1\n              - 1\n              - 1\n              num_cores:\n              - 2\n              - 1\n              - 168\n              num_cores_per_node:\n              - 2\n              - 1\n              - 168\n              parallel_modes:\n              - distributed\n              - shared\n              - hybrid\n            multinode:\n              num_nodes:\n              - 2\n         
     - 1\n              - \n              num_cores:\n              - 80\n              - 40\n              - \n              num_cores_per_node:\n              - 40\n              - 40\n              - 40\n              parallel_modes:\n              - distributed\n              - hybrid\n      shells:\n        bash:\n          defaults:\n            executable_args: [--login]\n', 'config_key': 'manchester-CSF3-slurm', 'config_schemas': [<valida.schema.Schema object at 0x7f76967ee9a0>], 'invoking_user_id': '3f1bfca3-6f35-42fc-890e-d507baa0afc7', 'host_user_id': '3f1bfca3-6f35-42fc-890e-d507baa0afc7', 'host_user_id_file_path': PosixPath('/mnt/iusers01/support/mbexegc2/.local/share/matflow/user_id.txt')}

@aplowman
Contributor Author

aplowman commented May 7, 2025

> Thanks @aplowman
>
> I'm not really sure the right place for this, so just adding it here. I still have the problem, and this is the config error. Could you advise how to fix it? [...]

No problem. The error is saying that executable_args is not an allowed key, but I made it an allowed key in this PR. So it looks like you're not using the (now deleted!) fix/shell branch, or you haven't pulled the (recently updated) develop branch.

@gcapes
Collaborator

gcapes commented May 8, 2025

That's correct. I'm using the branch from PR #819 in an attempt to find which file contains the YAML error. I'm confused about what changes would be needed to my config, so maybe we can go through this in today's meeting?

@gcapes
Collaborator

gcapes commented May 8, 2025

After a couple of hours, I finally got some output:

$ matflow go wheresmatlab.yaml 
⠋ Adding new submission: setting environments...ERROR matflow.persistence: batch update exception!
Traceback (most recent call last):
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/bin/matflow", line 8, in <module>
    sys.exit(cli())
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/hpcflow/sdk/cli.py", line 253, in make_and_submit_workflow
    out = app.make_and_submit_workflow(
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/hpcflow/sdk/app.py", line 1635, in <lambda>
    return lambda *args, **kwargs: func(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/hpcflow/sdk/app.py", line 2889, in _make_and_submit_workflow
    submitted_js = wk.submit(
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/hpcflow/sdk/core/workflow.py", line 3324, in submit
    exceptions, submitted_js = self._submit(
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/hpcflow/sdk/log.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/hpcflow/sdk/core/workflow.py", line 3220, in _submit
    sub_js_idx = sub.submit(
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/hpcflow/sdk/log.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/hpcflow/sdk/submission/submission.py", line 1162, in submit
    for js_indices, sched in self._unique_schedulers:
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/hpcflow/sdk/log.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/hpcflow/sdk/submission/submission.py", line 1025, in _unique_schedulers
    return self.get_unique_schedulers_of_jobscripts(self.jobscripts)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/hpcflow/sdk/submission/submission.py", line 1011, in get_unique_schedulers_of_jobscripts
    sched_idx := seen_schedulers.get(key := js.scheduler.unique_properties)
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/hpcflow/sdk/submission/jobscript.py", line 1108, in scheduler
    self._scheduler_obj = self._app.get_scheduler(
  File "/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/lib64/python3.9/site-packages/hpcflow/sdk/app.py", line 1953, in get_scheduler
    return scheduler_cls(**scheduler_kwargs)
TypeError: __init__() got an unexpected keyword argument 'directives'

However, my config doesn't contain this keyword. My history suggests I installed this branch, but I'm going to do a local repo install so it's easier to figure out which branch I'm using for hpcflow.
pip install git+https://github.com/hpcflow/hpcflow-new.git@feat/yaml-error-path

I have directives in the workflow file so I'll change that to options and try again.
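For reference, the rename only touches the scheduler-arguments block of a workflow template. A minimal sketch, assuming hpcflow's `resources`/`scheduler_args` layout (the `--time` directive is a hypothetical Slurm example):

```yaml
resources:
  any:
    scheduler_args:
      # `directives` is the new name introduced by this PR; the old key
      # `options` still works but prints a deprecation warning
      directives:
        --time: "01:00:00"   # hypothetical verbatim scheduler directive
```

On a version of hpcflow from before this PR, only `options` is accepted, which is why `directives` raises the `TypeError` above.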

Submits but errors.

$ cat wheresmatlab_2025-05-08_140003/artifacts/submissions/0/js_std/0/js_0_std.log 
/usr/bin/which: no matlab in (/net/scratch/mbexegc2/matflow-demo-workflows/.venv-hpcflowbranch/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/mnt/eps01-rds/jf01-home01/shared/software/matflow_exes/:/mnt/iusers01/support/mbexegc2/.local/share/matflow/links:/mnt/iusers01/support/mbexegc2/bin)

I think I need to try this again with a local install so I know for sure which branch I'm using. I guess that should just be develop/latest release? I only need feat/yaml-error-path if the main release version doesn't work for me?

@aplowman
Contributor Author

aplowman commented May 8, 2025

I would expect that error if (1) the develop branch is not up to date, or (2) you are using direct execution instead of a scheduler (for example, if you don't have a default scheduler set in the config).

I have just merged develop into feat/yaml-error-path, so I would suggest now sticking to that branch.

One thing to check is matflow config get --all to see if the correct config invocation is being used (e.g. your CSF3 slurm configuration).

@gcapes
Collaborator

gcapes commented May 8, 2025

Ok, making some progress.
I'd already started again before reading your last comment so here's where I'm up to:

  • my config only had one hostname for CSF3 (slurm), so it was trying to run directly
$ matflow config get --all
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────── Config 'default' ─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│  machine               'DEFAULT_MACHINE'                                                                                                                                                                                                   │
│  log_file_path         PosixPath('/mnt/iusers01/support/mbexegc2/.matflow-new/logs/MatFlow_v0.3.0a159.log')                                                                                                                                │
│  task_schema_sources   []                                                                                                                                                                                                                  │
│  parameter_sources     []                                                                                                                                                                                                                  │
│  command_file_sources  []                                                                                                                                                                                                                  │
│  environment_sources   []                                                                                                                                                                                                                  │
│  default_scheduler     'direct'                                                                                                                                                                                                            │
│  default_shell         'bash'                                                                                                                                                                                                              │
│  schedulers            {'direct': {'defaults': {}}}                                                                                                                                                                                        │
│  shells                {'bash': {'defaults': {}}}                                                                                                                                                                                          │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
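Assuming the config layout shown in the dump below, the fix for this part is adding the missing login-node hostname pattern to the invocation match of the Slurm config, so that profile is selected instead of falling through to the `direct` default. A sketch:

```yaml
configs:
  manchester-CSF3-slurm:
    invocation:
      match:
        # hostname pattern taken from the pasted config; without a match,
        # the 'default' config (default_scheduler: direct) is used instead
        hostname: login*.csf3.man.alces.network
```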
  • Having updated that, I still get an error that looks like this:
$ matflow go wheresmatlab.yaml 
ConfigValidationError: 1 rule failed validation. 18/35 rules were tested.

Rule #30
--------
Path: ('schedulers', 'slurm', 'defaults')
Value: {'shebang_args': '--login'}
Reasons:
Condition callable returned False: `Value.allowed_keys('shebang_executable', 'directives', 'options', 'submit_cmd', 'show_cmd', 'del_cmd', 'js_cmd', 'array_switch', 'array_item_var', 'cwd_switch')`.

config {'config_directory': PosixPath('/mnt/iusers01/support/mbexegc2/.matflow-new'), 'config_file_name': 'config.yaml', 'config_file_path': PosixPath('/mnt/iusers01/support/mbexegc2/.matflow-new/config.yaml'), 'config_file_contents': 'configs:\n  default:\n    invocation:\n      environment_setup:\n      match: {}\n    config:\n      machine: DEFAULT_MACHINE\n      telemetry: true\n      log_file_path: logs/<<app_name>>_v<<app_version>>.log\n      environment_sources: []\n      task_schema_sources: []\n      command_file_sources: []\n      parameter_sources: []\n      default_scheduler: direct\n      default_shell: bash\n      schedulers:\n        direct:\n          defaults: {}\n      shells:\n        bash:\n          defaults: {}\n      alternative_unix_runtime_dir: /tmp/mbexegc2\n  manchester-CSF3:\n    invocation:\n      environment_setup:\n      match:\n        hostname:\n        - admin01.pri.csf3.alces.network\n        - hlogin1.pri.csf3.alces.network\n        - hlogin2.pri.csf3.alces.network\n        - login1.pri.csf3.alces.network\n        - login2.pri.csf3.alces.network\n        - vlogin01.pri.csf3.alces.network\n    config:\n      machine: manchester-CSF3\n      telemetry: true\n      log_file_path: logs/<<app_name>>_v<<app_version>>.log\n      environment_sources:\n#      - /mnt/eps01-rds/jf01-home01/shared/software/matflow_envs/envs_CSF3.yaml\n#      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_laura.yaml\n#      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_patrick.yaml\n#      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_patrick_latest_versions.yaml\n#      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_default_CSF3.yaml\n#      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_rory.yaml\n#      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_test_demo_workflows.yaml\n#       - /mnt/iusers01/support/mbexegc2/.matflow-new/envs_latest_hpcflowandmatflow.yaml\n       - 
/mnt/iusers01/support/mbexegc2/.matflow-new/envs_latest_hpcflowandmatflow_slurm.yaml\n      task_schema_sources: []\n      command_file_sources: []\n      parameter_sources: []\n      default_scheduler: sge\n      default_shell: bash\n      schedulers:\n        direct:\n          defaults: {}\n        sge:\n          defaults: {}\n            #shebang_args: --login\n          parallel_environments:\n            null:\n              num_cores:\n              - 1\n              - 1\n              - 1\n            amd.pe:\n              num_cores:\n              - 2\n              - 1\n              - 168\n              num_nodes:\n              - 1\n              - 1\n              - 1\n            smp.pe:\n              num_cores:\n              - 2\n              - 1\n              - 32\n              num_nodes:\n              - 1\n              - 1\n              - 1\n#            mpi-24-ib.pe:\n#              num_cores:\n#              - 48\n#              - 24\n#              - 120\n#              num_nodes:\n#              - 2\n#              - 1\n#              - 5\n      shells:\n        bash:\n          defaults: #{}\n            executable_args: [--login]\n      log_file_level: debug\n  manchester-CSF4:\n    invocation:\n      environment_setup:\n      match:\n        hostname: login*.csf4.local\n    config:\n      machine: manchester-CSF4\n      telemetry: true\n      log_file_path: logs/<<app_name>>_v<<app_version>>.log\n      environment_sources:\n      - /mnt/eps01-rds/jf01-home01/shared/software/matflow_envs/envs_CSF4.yaml\n      task_schema_sources: []\n      command_file_sources: []\n      parameter_sources: []\n      default_scheduler: slurm\n      default_shell: bash\n      schedulers:\n        direct:\n          defaults: {}\n        slurm:\n          defaults:\n            shebang_args: --login\n          partitions:\n            serial:\n              num_cores:\n              - 1\n              - 1\n              - 1\n            multicore:\n   
           num_nodes:\n              - 1\n              - 1\n              - 1\n              num_cores:\n              - 2\n              - 1\n              - 40\n              num_cores_per_node:\n              - 2\n              - 1\n              - 40\n              parallel_modes:\n              - distributed\n              - shared\n              - hybrid\n            multinode:\n              num_nodes:\n              - 2\n              - 1\n              - \n              num_cores:\n              - 80\n              - 40\n              - \n              num_cores_per_node:\n              - 40\n              - 40\n              - 40\n              parallel_modes:\n              - distributed\n              - hybrid\n      shells:\n        bash:\n          defaults: {}\n  manchester-CSF3-slurm:\n    invocation:\n      environment_setup:\n      match:\n        hostname: login*.csf3.man.alces.network\n    config:\n      machine: manchester-CSF3-new\n      telemetry: true\n      log_file_path: logs/<<app_name>>_v<<app_version>>.log\n      environment_sources:\n      - /mnt/iusers01/support/mbexegc2/.matflow-new/envs-slurm.yaml\n      task_schema_sources: []\n      command_file_sources: []\n      parameter_sources: []\n      default_scheduler: slurm\n      default_shell: bash\n      schedulers:\n        direct:\n          defaults: {}\n        slurm:\n          defaults: #{}\n            shebang_args: "--login"\n          partitions:\n            serial:\n              num_cores:\n              - 1\n              - 1\n              - 1\n            multicore:\n              num_nodes:\n              - 1\n              - 1\n              - 1\n              num_cores:\n              - 2\n              - 1\n              - 168\n              num_cores_per_node:\n              - 2\n              - 1\n              - 168\n              parallel_modes:\n              - distributed\n              - shared\n              - hybrid\n            multinode:\n              
num_nodes:\n              - 2\n              - 1\n              - \n              num_cores:\n              - 80\n              - 40\n              - \n              num_cores_per_node:\n              - 40\n              - 40\n              - 40\n              parallel_modes:\n              - distributed\n              - hybrid\n#      shells:\n#        bash:\n#          defaults:\n#            executable_args: [--login]\n', 'config_key': 'manchester-CSF3-slurm', 'config_schemas': [<valida.schema.Schema object at 0x7fefcf72a970>], 'invoking_user_id': '3f1bfca3-6f35-42fc-890e-d507baa0afc7', 'host_user_id': '3f1bfca3-6f35-42fc-890e-d507baa0afc7', 'host_user_id_file_path': PosixPath('/mnt/iusers01/support/mbexegc2/.local/share/matflow/user_id.txt')}
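Per the allowed-keys list in the validation error, `shebang_args` has to move out of the scheduler defaults and into the shell's new `executable_args` setting, as described in the PR description. A sketch of the corrected fragment:

```yaml
schedulers:
  slurm:
    defaults: {}   # `shebang_args` is no longer an allowed key here
shells:
  bash:
    defaults:
      executable_args: [--login]   # bash runs as a login shell, as before
```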

I think I have now fixed that too:

$ matflow config get --all
╭────────────────────────────────────────────────────────────────────────────────────────────────────── Config 'manchester-CSF3-slurm' ──────────────────────────────────────────────────────────────────────────────────────────────────────╮
│  machine               'manchester-CSF3-new'                                                                                                                                                                                               │
│  log_file_path         PosixPath('/mnt/iusers01/support/mbexegc2/.matflow-new/logs/MatFlow_v0.3.0a159.log')                                                                                                                                │
│  task_schema_sources   []                                                                                                                                                                                                                  │
│  parameter_sources     []                                                                                                                                                                                                                  │
│  command_file_sources  []                                                                                                                                                                                                                  │
│  environment_sources   [PosixPath('/mnt/iusers01/support/mbexegc2/.matflow-new/envs-slurm.yaml')]                                                                                                                                          │
│  default_scheduler     'slurm'                                                                                                                                                                                                             │
│  default_shell         'bash'                                                                                                                                                                                                              │
│  schedulers            {                                                                                                                                                                                                                   │
│                            'direct': {'defaults': {}},                                                                                                                                                                                     │
│                            'slurm': {                                                                                                                                                                                                      │
│                                'defaults': {},                                                                                                                                                                                             │
│                                'partitions': {                                                                                                                                                                                             │
│                                    'serial': {'num_cores': [1, 1, 1]},                                                                                                                                                                     │
│                                    'multicore': {                                                                                                                                                                                          │
│                                        'num_nodes': [1, 1, 1],                                                                                                                                                                             │
│                                        'num_cores': [2, 1, 168],                                                                                                                                                                           │
│                                        'num_cores_per_node': [2, 1, 168],                                                                                                                                                                  │
│                                        'parallel_modes': ['distributed', 'shared', 'hybrid']                                                                                                                                               │
│                                    },                                                                                                                                                                                                      │
│                                    'multinode': {'num_nodes': [2, 1, None], 'num_cores': [80, 40, None], 'num_cores_per_node': [40, 40, 40], 'parallel_modes': ['distributed', 'hybrid']}                                                  │
│                                }                                                                                                                                                                                                           │
│                            }                                                                                                                                                                                                               │
│                        }                                                                                                                                                                                                                   │
│  shells                {'bash': {'defaults': {'executable_args': ['--login']}}}                                                                                                                                                            │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

After deleting two duplicate environment executables... SUCCESS!!

It's now my priority to write this all up in hpcflow/matflow#376

Thanks for bearing with my stream-of-consciousness-debugging spam at the end of this PR :)


Development

Successfully merging this pull request may close these issues.

Executables on the PATH are no longer found
