Releases: rackslab/FireHPC
v1.2.0
Added
- images:
- Add rocky9
- Add debian13
- Add debian14 (#42).
- Introduce `fhpc_namespace` extra variable with the name of containers namespace.
- Add bash-completion script for `firehpc` command (#12).
- Add `firehpc list` command to list clusters present in state directory (#16).
- Add `firehpc restore` command to restore a cluster after restart or IP addresses change (#11→#31).
- Save cluster settings on deployment so they can be reused automatically in subsequent runs of `firehpc conf` and `firehpc restore` (#7).
- Report cluster settings in `firehpc status`.
- Introduce `firehpc update` command to change cluster settings.
- Introduce `firehpc bootstrap` command to create deployment environments.
- Add `firehpc {conf,deploy} --ansible-opts` option to append additional options to `ansible-playbook` command (#44).
- Integrated management of virtual environments to support multiple versions of Ansible depending on targeted OS (#24).
- Add PIP requirements files to populate ansible-latest and ansible-2.16 deployment environments.
- load:
- Submit jobs randomly in existing QOS and partitions.
- Submit jobs of various sizes, with a power-of-2 number (1, 2, 4, 8…) of cores or nodes, up to the full size of the cluster. A number of nodes is selected when the Slurm SelectType plugin is linear; a number of cores is selected otherwise. Small jobs are submitted more often than big jobs.
- Select job partition randomly weighted by their number of resources to favor largest partitions.
- Make some (about 1/10th) submitted jobs randomly fail (#9).
- Submit jobs with random durations and timelimit with low probability for jobs to reach their timelimit (#10).
- Support Slurm configuration without accounting service and QOS.
- Reduce load by a factor outside of business hours to simulate humans submitting fewer jobs when not at work (#29).
- Add `--time-off-factor` option to control by how much the load is divided outside of business hours.
- Request GPUs allocations on partitions with gpu GRES.
- conf:
- Add possibility to define additional QOS and alternative partitions in Slurm.
- Add support for RHEL9 and compatible distributions.
- Add possibility to define custom site file name in nginx role.
- Introduce metrics role to deploy prometheus, alloy and grafana.
- Declare nodes in Slurm configuration with their socket/cores/memory configuration extracted from RacksDB.
- Add `params` key in `slurm_partitions` parameter to give possibility to set any arbitrary Slurm partition configuration parameter in inventory.
- Support Slurm native authentication in alternative to munge (#22).
- Add possibility to disable deployment of SlurmDBD accounting service (#20).
- Enable SlurmDBD regular archive and purge mechanism to avoid MariaDB database growing too much (#28). This can be disabled with `slurm_with_db_archive: false` in custom configuration.
- Add restore playbook to update hosts file, restart Slurm services and resume unavailable nodes.
- Add mariadb and dependencies tags on mariadb role in slurm role dependencies.
- Support possibility to change priorities of hpck.it and Rackslab development repositories derivatives with `common_hpckit_priorities` and `common_devs_priorities`.
- Add `slurm_restd_port` variable in inventory to control slurmrestd TCP/IP listening port.
- Support all Slurm-web to slurmrestd JWT authentication modes.
- Support gpu GRES in Slurm configuration (#39).
- docs:
- Add sysctl `fs.inotify.max_user_instances` value increase recommendation in README.md to avoid weird issue when launching many containers.
- Mention Metrics stack and Slurm-web optional features in README.md with URL to access Grafana and Slurm-web interfaces.
- Explain in README.md Ansible core 2.16 requirement for both rocky8 and debian13 clusters with a method to install this version from PyPI repository.
- Mention `firehpc list` command in manpage.
- Mention `firehpc load` command in manpage.
- Mention `firehpc restore` command in manpage.
- Mention `firehpc bootstrap` command in manpage.
- Mention cluster settings and `firehpc update` command in manpage.
- Update README.md to mention bootstrap step in usage guide.
- Mention `--ansible-opts` option in manpage.
- pkgs: Introduce tests extra package with dependencies required to run tests.
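The load items above (power-of-2 job sizes, partitions weighted by resources, reduced load outside business hours) can be illustrated with a minimal Python sketch. This is not FireHPC's code: the function names, the halving weights for job sizes, the inclusion of the full cluster size when it is not a power of 2, and the 9:00-18:00 weekday definition of business hours are all assumptions made for illustration.

```python
import random
from datetime import datetime


def candidate_sizes(cluster_size: int) -> list[int]:
    """Powers of 2 below cluster_size, then the full size (assumption)."""
    sizes = []
    size = 1
    while size < cluster_size:
        sizes.append(size)
        size *= 2
    sizes.append(cluster_size)
    return sizes


def pick_size(cluster_size: int, rng: random.Random) -> int:
    """Pick a job size, favouring small jobs over big ones."""
    sizes = candidate_sizes(cluster_size)
    # Hypothetical weighting: each size is half as likely as the previous one.
    weights = [1 / (2 ** i) for i in range(len(sizes))]
    return rng.choices(sizes, weights=weights, k=1)[0]


def pick_partition(partitions: dict[str, int], rng: random.Random) -> str:
    """Pick a partition, weighted by its number of resources."""
    names = list(partitions)
    weights = [partitions[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def target_jobs(base: int, when: datetime, time_off_factor: int) -> int:
    """Divide the load by time_off_factor outside assumed business hours."""
    business = when.weekday() < 5 and 9 <= when.hour < 18
    return base if business else max(1, base // time_off_factor)
```

For example, `candidate_sizes(10)` yields `[1, 2, 4, 8, 10]`, and `target_jobs(100, datetime(2024, 1, 6, 12), 5)` yields `20` because that date falls on a Saturday.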
Changed
- Replace `fhpc-emulate-slurm-usage` command by `firehpc load` (#13).
- Transform `fhpc_nodes` dictionary values from list of nodes to list of dictionaries to group nodes by type in RacksDB.
- `firehpc ssh <cluster>` now connects to admin host by default (#8).
- Rename file `images.yml` to `os/db.yml` and name of deployment environment associated to all supported OS.
- Replace section `[images]` with `[os]` in system configuration, with new `db` and `requirements` parameters.
- load:
- Change pending jobs limit formula to avoid number of jobs growing as fast as the number of nodes.
- Consider running jobs in addition to pending jobs when computing the number of new jobs to submit, in order to significantly reduce load on clusters out of working hours.
- conf:
- Install `socat` package on all nodes in common role.
- Use packages list instead of loop to install MariaDB packages.
- Enable `config_overrides` slurmd parameter in Slurm configuration to avoid compute nodes sockets/cores/memory matching configuration check.
- Move `maxtime` and `state` Slurm partitions parameters in `params` sub-dictionary.
- Rename `slurm_partitions` > `node` → `nodes` key.
- Change default Slurm authentication plugin from munge to slurm. This can be changed by setting `slurm_with_munge: true` in Ansible inventory.
- Launch `slurmrestd` with unprivileged system user when JWT authentication is enabled.
- Adapt slurm role to support Slurm upstream packages on Debian.
- docs:
- Explain in manpage ssh command considers admin container by default.
- Update documentation of `--db`, `--schema`, `--custom` and `--slurm-emulator` options of `conf` and `restore` commands with their new semantics regarding management of cluster settings.
- Use standard `tomllib` instead of `tomlib` external library in documentation Makefile.
Fixed
- core:
- Properly handle DBus error when getting containers addresses.
- Potential key conflict in dictionary of SSH clients when multiple users connect to the same host with Paramiko library.
- Set jobs time limit to partition time limit when set to avoid jobs that exceed partition time limit.
- Remove cluster state on cluster clean.
- Check ansible playbook RC code and stop execution on failure.
- lib: Fix `firehpc-storage-wrapper` start failure due to already existing cluster and home directories.
- load:
- Order of partition/qos variables in job submission informational message.
- Support of Slurm 24.05 `sacctmgr show qos --json` format to retrieve the list of defined QOS.
- Redirect jobs output to `/dev/null` to avoid filling filesystems with tons of inodes (#27).
- conf:
- Install mpi packages in parallel instead of sequential loop.
- Configure system locale to `en_US.UTF-8` on rocky8.
- Add SLURMRESTD_SECURITY=disable_user_check environment variable in slurmrestd service to allow running as slurm user.
- Containers namespace missing in Slurm-web gateway `[ui]` > `host`.
- Force creation of CA and LDAP certificates to override possibly existing certificates during bootstrap.
- Ignore cluster creation error in slurmdbd, as it is now automatically created when slurmctld registers to accounting service.
- Support Rackslab development repository derivatives on RHEL.
- Add admin hostname with namespace in addition to just the admin hostname in Slurm-web nginx site server names.
- Replace embedded templates by string concatenations.
- docs: Various formatting errors in manpage.
Removed
- conf: Drop DSA SSH host keys.
- docs: Remove `fhpc-emulate-slurm-usage` manpage.
v1.1.0
Added
- Integration with RacksDB to extract emulated cluster topology (#1).
- Support for debian12 (Debian bookworm) in OS images sources YAML file.
- Introduce `fhpc_addresses`, `fhpc_nodes`, `fhpc_emulator_mode` and `fhpc_db` extra variables. The first is a hash with containers as keys and the list of IP addresses as values. The second is also a hash with node tags as keys and the list of nodes assigned with the tag in values. The third is a boolean set to true when `--slurm-emulator` option is set on `firehpc` command line. The fourth is the local absolute path to RacksDB database.
- Possibility to run command with SSH paramiko library in addition to ssh binary executable.
- Add example RacksDB database.
- Add possibility to deploy users directory extracted from another existing cluster to have the same user accounts on multiple clusters eventually.
- Generate and manage groups tree internally. Groups definitions are exported to ansible with `fhpc_groups` extra variable and can be dumped with `firehpc status` command.
- Support containers namespace to allow multiple users to start the same virtual clusters on the same host without conflict.
- cli:
- Support for tags to filter deployed configuration tasks.
- Report cluster status in JSON format with `--json` option.
- Add `--slurm-emulator` option to deploy and configure a cluster with emulated Slurm cluster nodes (only one admin node with up to 64k virtual compute nodes).
- Add `--users` option on deploy command to extract users directory from another existing cluster.
- Introduce `fhpc-emulate-slurm-usage` command to emulate random usage of Slurm cluster.
- Add `start` and `stop` commands to respectively start and stop all containers of an emulated cluster.
- conf:
- Optional support of Rackslab development Deb and RPM repositories, disabled by default.
- Introduce racksdb role to install RacksDB and deploy database content.
- Introduce slurmweb role to install and setup Slurmweb, optional and disabled by default.
- Support multiple Slurm accounts definitions with hierarchy and control of users membership.
- Add tags on all roles.
- Add variable for slurmrestd socket path in slurm role.
- Support optional additional slurmdbd parameters.
- Deploy SSH root private and public keys on admin.
- Generate /etc/hosts with all cluster IP addresses and hostnames.
- Add `nodeset_fold` and `nodeset_expand` Jinja2 filters.
- Support Slurm emulation with fully virtual nodes (up to 64k).
- Support optional secondary groups in LDAP directory.
- Add possibility to deploy Redis server on admin host.
- Use `fhpc_groups` for default `slurm_accounts` variable value and to define LDAP groups.
- Use `fhpc_db` for default `racksdb_database` variable value and to define RacksDB database content.
- Install `bash-completion` by default on all nodes with common role.
- Install `clustershell` on all nodes by default with new clustershell role.
- Introduce nginx role.
- docs:
- Mention `conf` command `--db`, `--schema` and `--tags` options in `firehpc(1)` manpage.
- Mention `deploy` command `--db` and `--schema` options in `firehpc(1)` manpage.
- Mention `status` command `--json` option in `firehpc(1)` manpage.
- Mention new `start` and `stop` commands in `firehpc(1)` manpage.
- Add manpage for `fhpc-emulate-slurm-usage`.
- Mention `conf` and `deploy` commands `--slurm-emulator` option in `firehpc(1)` manpage.
- Mention `deploy` command `--users` option in `firehpc(1)` manpage.
Changed
- Replaced notion of zone in favor of cluster, both in CLI options and configuration variables names.
- Removed extra directory from source tree. It used to contain ansible machinectl connection plugin as Git submodule. This dependency is now injected in FireHPC as a package supplementary source in packages built by Fatbuildr.
- conf:
- Declare SSH host keys valid for both containers FQDN and short hostname in system known hosts file.
- Split ssh role in 3 steps: localkeys for local bootstrap, bootstrap to initialize files on containers with machinectl and main for normal operations with SSH (known_hosts, SSH root keys).
- Replace hardcoded admin hosts by selection of first admin group member for LDAP server hostname and Slurm server.
- Generate Slurm nodes and partitions based on RacksDB database content.
- Split playbook by sections with hosts targets to avoid many skipped tasks.
- docs: Update after zone→cluster rename in CLI options.
Fixed
- Check OS images argument in CLI against values available in OS images YAML file instead of hard-coded argparse choices.
- Storage service stop and removal.
- Start storage service with container when cluster is started.
- Retry SSH connections up to 3 times in case of failure.
- Wait some time before starting the second container to finish container private network setup and prevent the following container from erasing everything before completion.
- Handle RacksDB format and schema errors with correct error message.
- Wait for both IPv4 and IPv6 addresses when retrieving container addresses, to avoid retrieving only IPv6 before IPv4 address is finally available.
- Correctly handle and report DNS errors in SSH module.
- conf:
- Open slurmd spool directory permissions to all users for running batch jobs scripts.
- Manage home directories ownership and permissions, in addition to some of their content.
- Add missing common name in LDAP x509 TLS/SSL certificate.
- Do not use cgroups with Slurm in emulator mode.
- Force update of APT repositories metadata.
- Install `en_US.UTF-8` locale on Debian, as already done by default on RHEL.
- Set `systemd-networkd` DHCP client identifier to mac on RHEL to avoid getting a different address than those obtained by NetworkManager at boot, which eventually results in IPv4 addresses in `/etc/hosts` being removed from network interfaces when initial leases reach their timeout.
- docs: Grammatical error and typos in `firehpc(1)` manpage.
- lib: Limit network devices names to 12 characters to avoid network zone name errors with `systemd-nspawn`.