Releases: SchedMD/slurm
v26.05.1
Changes in 26.05.1
- stepmgr - Fix crash when launching async steps with a srun from a different release than the local slurmd.
- Fix slurmctld memory leak when using topology/tree and requesting multiple node sizes (-N 2:4:8).
- Fixed a crash caused by a race condition in the mysql code caused by checking the connection before it was safely locked.
- Fix DB performance issue when locating steps by SLUID in step completion.
- Fix step start and complete for jobs that survived an upgrade from Slurm <= 25.05 to Slurm 26.05.
- accounting_storage/slurmdbd - Fix slurmctld performance regression caused by unnecessary lock contention when packing a message to the slurmdbd agent.
- Fix warning in shtml2html.py when using Python 3.14+
- Block memory resize for jobs started before 26.05.
- Fix DBD state replay leaking unpacked messages on version mismatch.
- Fix runtime-added assoc/wckey uid NO_VAL under use_client_ids.
- Fix typo in _unitdir fallback path in slurm.spec.
- Fixed an incompatibility between 26.05 slurmd and 25.05+ slurmstepd when using a 24.11 sattach that would cause the sattach to hang.
- Fix parsing issue of "sacctmgr load" when trying to load a file that contains typed TRES.
- Error if salloc/sbatch/srun --requeue has an invalid option specified
- slurmctld - Fix ~37 second extra delay in retries for slurmdbd reconnection and state saves on NTP-synced systems.
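The mysql fix in the 26.05.1 list describes a race caused by checking the connection before it was safely locked. The general pattern can be sketched in Python as follows. This is only an illustration: the `DbConnection` class and its method names are hypothetical and do not come from Slurm's C sources.

```python
import threading


class DbConnection:
    """Hypothetical stand-in for a shared database handle."""

    def __init__(self):
        self._lock = threading.Lock()
        self._handle = object()  # pretend this is a live connection

    def close(self):
        # Teardown takes the same lock, so it cannot interleave with a query.
        with self._lock:
            self._handle = None

    def query_unsafe(self, sql):
        # Racy: the handle is checked before the lock is held, so another
        # thread may close it between the check and the use.
        if self._handle is None:
            return None
        with self._lock:
            return self._run(sql)

    def query_safe(self, sql):
        # Take the lock first, then check and use the handle under it.
        with self._lock:
            if self._handle is None:
                return None
            return self._run(sql)

    def _run(self, sql):
        # Caller must hold self._lock.
        if self._handle is None:
            raise RuntimeError("connection used after close")
        return f"ran: {sql}"
```

With `query_safe`, a concurrent `close()` either completes before the check or waits until the query has finished, so the handle cannot disappear between the check and the use.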
v26.05.0
Changes in 26.05.0
- slurmctld - Fix interactive jobs erroneously killed by InactivityLimit when slurmctld is congested.
- data_parser/v0.0.45 - Remove fields that were deprecated in v0.0.44
- slurmd - fix a potential crash during message forwarding
- slurmctld - Avoid possible crash under heavy load due to pointer comparisons mis-matching.
- Reject sbatch --external jobs when combined with --wrap
- Skip external nodes in _slurm_rpc_node_alias_addrs().
- Fix out-of-bounds array errors by resizing leaf_usage when tres_cnt changes.
- Add the option to set StorageHost or StorageBackupHost in slurmdbd.conf, or JobCompHost in slurm.conf, to a unix socket. To do so, prefix with "unix:", e.g., StorageHost=unix:/path/to/socket.
- Logs now better reflect mysql connection issues if connecting over a UNIX socket.
- Cache uid lookups to speed controller/dbd startup/reconfigure in some cases.
- All features will be tested before jobs are preempted.
- Improve clarity of gres/shards in sinfo GRES_USED field.
- Fix issue with SlurmctldParameters=max_powered_nodes affecting scontrol update nodename=... commands when it should not.
- slurmctld - return 303 See Other from GET /metrics/* when in backup standby, pointing to the configured primary controller
- slurmstepd - when a node fails on which the batch step is running, don't deallocate the batch step until after the job completes or is requeued.
- Reject the job if num_tasks is lower than the partition min_nodes
- Reject job if num_tasks is lower than min_nodes
- Reject num_tasks update when num_tasks < min_nodes
- Set max_nodes from num_tasks when not explicitly set
- switch/hpe_slingshot - Fix memory leak when the fabric manager responds to a job-lookup GET with HTTP 404.
- Fix node reboot with slurmd older than 26.05.
- Prevent the slurmctld's background thread from waiting on the purge files thread while holding the job and node write locks.
- Prevent slurmctld from crashing after removing all of a pending job's licenses from the configuration.
- Avoid purging reservations that reserved HRES when restarting or reconfiguring slurmctld.
- When restoring jobs from state, hold pending jobs whose license or HRES requests are no longer valid.
- Prevent slurmctld crash when a HetJob component fails to start.
- Avoid reverse DNS lookups for connection logging unless DebugFlags=conmgr is configured.
- Return a properly formatted value for DefMemPerCPU.
- srun - Reject --async outside of an existing job allocation.
- Fixed hdf5 'Malformed file' error for sh5util -I extraction.
- Fixed memory leaks in sh5util.
- scontrol show federation now brackets IPv6 control host literals so the address and port are no longer ambiguous.
- Fix torus3d placements overlap detection for torus wrap.
- Add torus3d node_count overflow guard
- topology/torus3d - Add adaptive Morton encoding for large torus dimensions.
- Fix torus3d and other topology parsers reporting DUMPING errors when parsing fails.
- Requeue --no-requeue jobs when powering-up nodes are drained and requeue_on_resume_failure SchedulerParameter is set.
- Set PrologFlags=Alloc automatically when PrologFlags=DeferBatch is set. Without Alloc, DeferBatch will have no effect.
- Improve slurmdbd hourly rollup performance on large clusters.
- configure - Rename --with-http-parser to --with-libhttp-parser.
- configure - Add --with-llhttp-parser option.
- http_parser/libhttp_parser - If rpaths are enabled when configuring slurm, add rpath to libhttp_parser plugin.
- Add new http_parser/llhttp_parser plugin.
- Add new url_parser/internal plugin.
- interfaces/http_parser - If HttpParserType is not specified, no longer default to using the http_parser/libhttp_parser plugin. Instead try to first load http_parser/libhttp_parser then http_parser/llhttp_parser.
- interfaces/url_parser - If UrlParserType is not specified, no longer default to using the url_parser/libhttp_parser plugin. Instead try to first load url_parser/libhttp_parser then url_parser/internal.
- slurmrestd - Fix pipelined HTTP/1.1 requests after the first message on a keep-alive connection.
- http_parser/libhttp_parser - Prevent memory leak if a connection ends early.
- srun - Add --parsable to emit the bare step id for easier scripting of --async steps.
- Enable case insensitive comparison to check for srun_exclusive_allocation in LaunchParameters.
- Fix JobAccountGather failing on glibc 2.43+ due to a sscanf() %Nc behavior change.
- auth/slurm - Fix missing symbol issues with libjwt 2.1 caused by importing private base64 functions.
- auth/jwt - Fix missing symbol issues with libjwt 2.1 caused by importing private base64 functions.
- task/affinity - Work on nodes with over 1024 CPUs.
- Document SlurmctldHttpAuthParameters in slurm.conf(5).
- Document SlurmdHttpAuthParameters in slurm.conf(5).
- Fix sdiag RPC-by-user and RPC-by-type output for a full user stats table.
- Fix a regression in slurm 25.05 that caused requeued jobs to lose their license/HRES requests, which results in Slurm allowing the job to run without having sufficient licenses/HRES.
- Set SLURM_JOB_SLUID environment variable.
- Fix treating "topology" in slurmd's --conf= options as case sensitive.
- Fix losing scontrol-set Extra, InstanceId, and InstanceType on nodes across subsequent slurmd registrations.
- Allow a node's topology to be updated based on the dynamic slurmd's reported topology after a reboot.
- Allow llhttp-devel as an alternative to http-parser-devel when building RPMs.
- Fix not setting an end time to steps in a resized job and properly display them under the original SLUID in sacct.
- When using stepmgr and a job is resized, avoid allocating new steps in removed nodes.
- Fix not clearing node reasons on resume when not using an accounting storage plugin.
- Enforce distribution requirements (-m/--distribution on allocation cli commands) if job requests CountOnly GRES.
- Restrict libjwt to >= 1.10.0, < 3 at build and package time.
- auth/jwt and auth/slurm - Fix JWT authentication to work around a regression in libjwt 2.1.1 (and later).
- Fix JWT authentication failures on libjwt 2.x for parse-only credential paths.
- Fix Slurm Lua string to JSON/YAML (slurm.to_json or slurm.to_yaml) rejecting empty strings.
- Fix regression in 26.05.0 that caused scrun to exit with a fatal error before starting the container.
- Fix slurmctld crash and shutdown/reconfigure deadlock caused by accounting_storage callers racing the plugin teardown.
- Fix potential deadlock when the controller is brought up when the dbd was not running on the controller's previous run and there are jobs with a new script or env that need to be sent to the dbd.
- Add ESLURM_FILE_UNREADABLE error code to distinguish "file exists but cannot be read" from ENOENT.
- Avoid logging parsing warnings in CLI when topology.yaml does not strictly conform to OpenAPI specification.
- Avoid logging parsing warnings in CLI when namespace.yaml does not strictly conform to OpenAPI specification.
- Avoid logging parsing warnings in CLI when resources.yaml does not strictly conform to OpenAPI specification.
- Added swait, a client command to block until all of a job's steps have completed.
- Deprecated options ExclusiveUser and ExclusiveTopo are now mutually exclusive.
- Add warnings when creating or updating partitions that Exclusive=[NODE|TOPO] implies Oversubscribe=NO when Oversubscribe is set to YES or FORCE.
- scontrol - 'EXCLUSIVE_USER' and 'EXCLUSIVE_TOPO' will no longer be dumped by the '.partitions[].flags' field of the following commands: 'scontrol show partition --json', 'scontrol show partition --yaml'.
- slurmrestd - No longer parse or dump 'EXCLUSIVE_USER', 'EXC_USER_CLEAR', 'EXCLUSIVE_TOPO', or 'EXC_TOPO_CLEAR' as values for the '.partitions[].flags' field of the following endpoints: 'GET /slurm/v0.0.45/partition/{partition_name}', 'GET /slurm/v0.0.45/partitions', 'POST /slurm/v0.0.45/partitions'.
- scontrol - Remove 'partitions[].maximums.oversubscribe.jobs' and 'partitions[].maximums.oversubscribe.flags' fields from the output of the following commands: 'scontrol show partition --json', 'scontrol show partition --yaml'.
- slurmrestd - Remove 'partitions[].maximums.oversubscribe.jobs' and 'partitions[].maximums.oversubscribe.flags' fields from the following endpoints: 'GET /slurm/v0.0.45/partition/{partition_name}', 'GET /slurm/v0.0.45/partitions', 'POST /slurm/v0.0.45/partitions'
- slurmrestd - Enable parsing for 'partitions[].partition.exclusive' and 'partitions[].partition.oversubscribe' fields of the following endpoints: 'GET /slurm/v0.0.45/partition/{partition_name}', 'GET /slurm/v0.0.45/partitions', 'POST /slurm/v0.0.45/partitions'.
- No longer override a partition's OverSubscribe count when updating the partition with Exclusive=[NO|USER].
- Fix regression in 26.05.0rc1 that caused slurmscriptd to crash on receiving SIGPROF.
- Properly complete Slurm <= 25.11 jobs with a 26.05 slurmdbd.
- Add missing index on the sluid column of the job_table.
- Add sluid in archive dump/load.
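The four num_tasks entries in the 26.05.0 list (reject when num_tasks is lower than min_nodes or the partition min_nodes, and derive max_nodes from num_tasks when it is not set) can be read together as one small validation rule. A minimal Python sketch of that reading follows. The `JobRequest` type and `validate` function are hypothetical names used for illustration, not Slurm structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRequest:
    """Hypothetical, simplified view of a job's size request."""
    num_tasks: int
    min_nodes: int = 1
    max_nodes: Optional[int] = None  # None means "not explicitly set"


def validate(job: JobRequest, partition_min_nodes: int = 1) -> JobRequest:
    # Mirrors "Reject job if num_tasks is lower than min_nodes".
    if job.num_tasks < job.min_nodes:
        raise ValueError("num_tasks is lower than min_nodes")
    # Mirrors "Reject the job if num_tasks is lower than the partition min_nodes".
    if job.num_tasks < partition_min_nodes:
        raise ValueError("num_tasks is lower than the partition min_nodes")
    # Mirrors "Set max_nodes from num_tasks when not explicitly set".
    if job.max_nodes is None:
        job.max_nodes = job.num_tasks
    return job
```

For example, `validate(JobRequest(num_tasks=4, min_nodes=2))` passes and fills in `max_nodes`, while a request for one task across two nodes raises `ValueError`.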
v25.11.6
Changes in 25.11.6
- scontrol - Allow updating InstanceId for batches of nodes as is possible for updating NodeAddr and NodeHosts.
- scontrol - Allow updating InstanceType for batches of nodes as is possible for updating NodeAddr and NodeHosts.
- Fix problem when using sacctmgr to remove a default account for a user when more than one is set.
- Fix sacctmgr silently ignoring trailing characters in numeric options.
- Fix sbcast with auth/slurm when user doesn't exist on slurmctld.
- Fix stepmgr crash when using sbcast with auth/slurm.
- Fix memory leak in stepmgr stepd.
- Reject untrusted REQUEST_COMPLETE_PROLOG.
- Fix jobs getting stuck in COMPLETING state when PrologFlags=RunInJob is configured by passing EpilogMsgTime to slurmstepd.
- Fix external nodes incorrectly marked as not responding after state transitions such as drain/undrain or resume.
- slurmstepd - Prevent crash when UnkillableStepTimeout is reached and Slurm is configured with --enable-memory-leak-debug.
- slurmctld - Fix possible hang during reconfigure due to slow client I/O due to timeout not being enforced.
- slurmctld - Fix possible hang during shutdown due to slow client I/O due to timeout not being enforced.
- slurmctld - Avoid race condition during shutdown that could cause a crash while attempting to read from a connection.
- Fix parsing issue for GRES resources that contain a hyphen ("-") in their name when using sacctmgr.
- Ensure that a request for zero licenses does not prevent a job from running when all licenses are in-use or reserved.
- slurmctld - Fix crash on startup due to race condition when I/O is processed before the connection (conn) plugin finishes initialization.
- slurmdbd - Fix crash from race condition during shutdown when a persistent connection closes its database connection after the accounting_storage plugin has already unloaded.
- slurmrestd - Fixed memory leak resulting from specifying an empty node_list in the request body of the following endpoints: 'POST /slurm/v0.0.4[3-5]/reservation', 'POST /slurm/v0.0.4[3-5]/reservations'.
- Prevent deadlock when replacing nodes in reservations.
- Fix slow scheduling for multi-segment jobs with topology/block when blocks have fewer available nodes than the requested segment size.
- serializer/url-encoded - Allow non-NULL terminated strings to be passed to serialize_p_string_to_data().
- serializer/yaml - Prevent fataling if the size of a yaml configuration file is a multiple of 4096 bytes.
- Fix archive dump jobs "No records archived...but some found"
- Fix gcc-16 build errors.
- Fix slurmstepd crash in jobacctinfo_aggregate() handling when SlurmctldParameters=enable_stepmgr and JobAcctGatherType=jobacct_gather/none are set.
- Fix slurmd >= 25.05 crash on HetJob step launches from srun <= 24.11.
- Set the in-memory QOS priority to 0 after INFINITY is handled by slurmdbd.
- Do not allocate maintenance nodes to new reservations.
- slurmd - fix a potential crash during message forwarding
- Fix out-of-bounds array errors by resizing leaf_usage when tres_cnt changes.
- All features will be tested before jobs are preempted.
- slurmstepd - when a node fails on which the batch step is running, don't deallocate the batch step until after the job completes or is requeued.
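The 25.11.6 entry about sacctmgr silently ignoring trailing characters in numeric options is an instance of lenient versus strict number parsing. The difference can be sketched in Python as below. Both function names are hypothetical, and the lenient variant only imitates the way C's strtol() stops at the first non-digit. It is not taken from sacctmgr.

```python
import re


def parse_lenient(text: str) -> int:
    # strtol-like behaviour: take the leading digits and ignore the rest.
    match = re.match(r"\s*([+-]?\d+)", text)
    if match is None:
        raise ValueError(f"no number in {text!r}")
    return int(match.group(1))


def parse_strict(text: str) -> int:
    # Require the whole string to be a number; trailing characters are an error.
    if re.fullmatch(r"[+-]?\d+", text.strip()) is None:
        raise ValueError(f"invalid numeric value {text!r}")
    return int(text)
```

Given an input such as "10abc", the lenient parser quietly returns 10 while the strict parser raises an error, which is the safer behaviour for a limit or priority value.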
v25.05.8
Changes in 25.05.8
- slurmctld - Correct race condition during reconfigure and creating new cluster in slurmdbd that could cause both daemons to deadlock.
- slurmctld - Reject all job submissions as reserved user or group nobody(99).
- sbatch,srun,salloc - Reject arg --uid=99.
- sbatch,srun,salloc - Reject arg --gid=99.
- slurmctld - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
- slurmd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
- slurmstepd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
- srun - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
- slurmdbd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
- slurmctld - Wait for forwarding threads to complete before shutdown to avoid crashing due to NULL dereferences or using unloaded plugins.
- cons_tres - Prevent slurmctld SIGFPE during node selection.
- slurmctld - Fix possible hang during reconfigure due to slow client I/O due to timeout not being enforced.
- slurmctld - Fix possible hang during shutdown due to slow client I/O due to timeout not being enforced.
- slurmctld - Avoid race condition during shutdown that could cause a crash while attempting to read from a connection.
- slurmctld - Fix crash on startup due to race condition when I/O is processed before the connection (conn) plugin finishes initialization.
- slurmdbd - Fix crash from race condition during shutdown when a persistent connection closes its database connection after the accounting_storage plugin has already unloaded.
- Prevent deadlock when replacing nodes in reservations.
- Fix gcc-16 build errors.
- Fix build errors with recent versions of libcurl (8.16+).
- Fix catching invalid gpu-freq numbered values.
- Fix slurmd >= 25.05 crash on HetJob step launches from srun <= 24.11.
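Several 25.05.8 entries concern shutdown ordering, in particular waiting for forwarding threads to complete before shutdown so they do not touch unloaded plugins. The ordering can be sketched in Python as follows. The `Forwarder` class and its `plugin` table are hypothetical and stand in for whatever resources a real daemon tears down.

```python
import threading


class Forwarder:
    """Hypothetical message forwarder that depends on a loaded plugin."""

    def __init__(self):
        self._threads = []
        self._lock = threading.Lock()
        self.results = []
        # Stand-in for a plugin's function table.
        self.plugin = {"send": lambda msg: f"sent {msg}"}

    def forward(self, msg):
        thread = threading.Thread(target=self._work, args=(msg,))
        self._threads.append(thread)
        thread.start()

    def _work(self, msg):
        # Would fail if the plugin had already been unloaded (set to None).
        out = self.plugin["send"](msg)
        with self._lock:
            self.results.append(out)

    def shutdown(self):
        # Join every forwarding thread first; only then drop the plugin.
        for thread in self._threads:
            thread.join()
        self.plugin = None
```

Reversing the two steps in `shutdown()` would let a still-running worker index into `None`, which is the Python analogue of the NULL dereference named in the changelog entry.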
v26.05.0rc1
Pre-release
Changes in 26.05.0rc1
- Add SLURM_JOB_QOS to Prolog/Epilog environment.
- data_parser/v0.0.45 - Prevent memory leaks when freeing parsed lists.
- Return an xstring from slurm_create_reservation() instead of one created with strdup().
- scontrol - If a step terminates while its pids are being queried, 'scontrol listpids' will now print all successfully found pids instead of only logging an error.
- Prevent stepd_connect() from overriding the connect call's errno on error.
- slurmctld - Support 'verbose' query parameter in 'GET /readyz' endpoint.
- slurmd - Support 'verbose' query parameter in 'GET /readyz' endpoint.
- sacctmgr - In interactive mode, quiet/verbose will now apply to logging messages that are printed.
- sacctmgr - Quiet (--quiet/-Q) and verbose (--verbose/-v) command line options are now mutually exclusive. sacctmgr will immediately exit if both options are specified.
- sacctmgr - Quiet option (--quiet/-Q) is now applied to all logging messages, ensuring that it is enforced in all cases (e.g. logging from 'dump' previously would not honor --quiet)
- NO_NORMAL_ALL will only be printed if all NO_NORMAL_* flags are set.
- job_submit/lua - Log Lua stacktrace on runtime errors when calling slurm_job_submit() in job_submit.lua when 'debugflags=script' is set in slurm.conf or via environment SLURM_DEBUG_FLAGS=script.
- job_submit/lua - Log Lua stacktrace on runtime errors when calling slurm_job_modify() in job_submit.lua when 'debugflags=script' is set in slurm.conf or via environment SLURM_DEBUG_FLAGS=script.
- Added error handling and logging when a malformed RESPONSE_CONFIG RPC is received.
- Reject QOS creation requests that use nonuser flags
- Do not print nonuser QOS flags as valid flags
- Add "thread" as possible flag to "debugflags=" in slurm.conf and slurmdbd.conf.
- Do not allow clearing the partition from a reservation (e.g. scontrol update ReservationName=<res_name> PartitionName=''). Attempts to clear the partition from a reservation will be rejected by slurmctld. This change also fixes several potential slurmctld crashes.
- Add DebugFlag=SelectType log for when a node is skipped during job scheduling attempts because it is in COMPLETING state.
- slurmrestd - Add POWER_DOWN_ASAP and POWER_DOWN_FORCE as valid node states in REST.
- slurmctld - Remove Slurmctld job state cache including support for SchedulerParameters=enable_job_state_cache in slurm.conf.
- slurmctld - Log error when saving to StateSaveLocation is too slow.
- slurmctld - Include StateSaveLocation statistics with /readyz endpoint.
- Fix error reading /proc/0/* when calling the api outside the step namespace.
- Alter sh5util -j to not allow array or het job ids.
- slurmctld - Improve ability to process RPCs in parallel by removing the need for the node write lock to process REQUEST_NODE_INFO, "metrics/partitions", and "metrics/nodes" requests, as well as when spawning the node health check agent.
- slurmctld - No longer acquire the job write lock when spawning the node health check agent.
- Fix long slurmd stop time when waiting on the slurmd to register.
- Fix slurmstepd memleak when initializing cgroup plugins.
- scrun - Update scrun.lua example in 'man 1 scrun' removing requirement to compile Lua with JSON support.
- Fix not applying constraints if CpuSpecList string is larger than 1024 chars.
- slurmrestd - Return 200 when querying a non existing partition. This affects the following endpoints: 'GET /slurm/v0.0.45/partition'
- slurmctld - Preserve intermediate job scheduling values to provide consistent scontrol show job output before and after reconfiguring or restarting the controller.
- Increase precision of time reported when timers issue warnings.
- scontrol - Print 'Job 12_23 not found' errors on stderr instead of stdout.
- stepmgr - handle when a step's requested ThreadsPerCore does not equal a node's configured ThreadsPerCore
- Fix bug where requests from denied uids (i.e. "Users=-") to skip, delete or view (if using PrivateData) reservations were not rejected properly. This bug only existed for clusters not using AccountingStorageEnforce=associations (including other options that imply enforcing associations)
- Fix rare potential race condition in x11 forwarding that could result in a double free.
- salloc/scrun/srun/slurmstepd - Move setting of SLURM_TASKS_PER_NODE to the controller.
- gpu/nvml - The --gpu-freq job submission options will now set the actual Memory/GPU clock frequencies rather than the "Applications clocks" frequencies if the installed version of NVML supports it. This affects CUDA 11.3+ and prevents build errors in CUDA 13.0+ where the "Applications clocks" interface has been deprecated.
- gpu/nvml - Fix bug that prevented clock frequencies being reset on all GPUs at job completion when cgroups is constraining devices and there are multiple GPUs on the node.
- gpu/nvml - Fix bug that prevented --gpu-freq from being applied to the GPU clock frequency without specifying a memory clock frequency.
- Fixed SLURM_CLUSTER_NAME to be set to correct cluster when multiple clusters are available in a batch job.
- Respect arbitrary task distribution and return ESLURM_NOT_SUPPORTED if it is set together with an incompatible setting, namely topology/block, --spread-job, CR_LLN, pack_serial_at_end or bf_busy_nodes.
- slurmctld,slurmdbd: Avoid segfault when persistent connections fail to establish fully.
- Avoid unnecessary numeric UID to user name translation when dumping information for a node with an unset reason for its current state. The following slurmrestd endpoints have changed: 'GET /slurm/v0.0.45/nodes', 'GET /slurm/v0.0.45/node/{node_name}'. The following CLI commands have changed: 'scontrol show node {node_name} (--json|--yaml)', 'scontrol show nodes (--json|--yaml)'.
- sinfo - Avoid unnecessary numeric UID to user name translation when dumping information for a node with an unset reason for its current state. The following CLI command has changed: 'sinfo (--json|--yaml)'.
- slurmrestd - Add cores_per_socket to job submission to the following endpoints: 'GET /slurm/v0.0.45/job/submit', 'GET /slurm/v0.0.45/job/allocate', 'POST /slurm/v0.0.45/job/{job_id}'.
- slurmctld - Refuse RESPONSE_PING_SLURMD from incorrect nodes
- slurmctld - Skip MODE_3 HRes specific logic in backfill for jobs that do not request MODE_3 HRes.
- select/cons_tres - fix use-after-free of node_usage[].jobs
- Add status field to 'scontrol ping --json' and 'scontrol ping --yaml'.
- Add status field to '.components.schemas."v0.0.45_controller_ping"' for the following endpoint: GET /slurm/v0.0.45/ping
- Add status field to 'sacctmgr ping --json' and 'sacctmgr ping --yaml'.
- Add status field to '.components.schemas."v0.0.45_slurmdbd_ping"' for the following endpoint: GET /slurmdb/v0.0.45/ping
- slurmctld - Require authentication for the 'GET /readyz?verbose' endpoint, restricting access to only root and SlurmUser.
- slurmctld - Add threadpool to avoid the overhead of creating new process threads, during which the kernel freezes the entire process. This can be enabled with SlurmctldParameters=threadpool=enabled.
- Fix building with --with-jwt in a non-standard location.
- sacct - Add '.jobs[].sluid' field to the following commands: 'sacct --json', 'sacct --yaml'
- slurmrestd - Add '.jobs[].sluid' field to the following endpoints: 'GET slurmdb/v0.0.45/job', 'GET slurmdb/v0.0.45/jobs'
- slurmrestd - Add 'GET /healthz', 'GET /readyz', and 'GET /livez' endpoints.
- Fix potential glibc deadlock when tearing down the extern step when x11 forwarding is enabled.
- Fix FreeBSD build for --format=binary files, which are currently used for command help and usage text.
- Packaging - MUNGE is now a weak dependency to Slurm RPM and DEB packages, and can now be optionally installed or removed (installed by default).
- Add SuspendTime as a NodeName parameter in slurm.conf, enabling per-node power save configuration.
- slurmrestd - Deprecate ignored reason_uid field from the following endpoints: 'POST /slurm/v0.0.42/nodes/', 'POST /slurm/v0.0.42/node/{node_name}'.
- slurmrestd - Deprecate ignored reason_uid field from the following endpoints: 'POST /slurm/v0.0.43/nodes/', 'POST /slurm/v0.0.43/node/{node_name}'.
- slurmrestd - Deprecate ignored reason_uid field from the following endpoints: 'POST /slurm/v0.0.44/nodes/', 'POST /slurm/v0.0.44/node/{node_name}'.
- slurmrestd - Deprecate ignored reason_uid field from the following endpoints: 'POST /slurm/v0.0.45/nodes/', 'POST /slurm/v0.0.45/node/{node_name}'.
- Adding new archive/purge options to allow for explicit archiving of job_scripts and job_env without jobs.
- When the url_parser plugin does not load, change the log from an error to a warning. This plugin is optional and may not always be built.
- Fix rpmbuild slurm.spec --with selinux.
- Use internal dependency generator in slurm.spec.
- Switch to pkgconfig detection of many packages in slurm.spec.
- Add reqTRES components to the clonensscript and clonensepilog environment variables.
- Name all process POSIX threads consistently with format "worker[{index}]" when threads are not otherwise given a special name.
- slurmctld - Fix unresponsive nodes not being marked DOWN in clusters with frequent reconfigurations, as each reconfigure was updating the SlurmdTimeout countdown.
- slurmctld - If a node is replaced in a reservation, mark the reservation state as changed. With bf_continue enabled, this fixes potentially incorrect backfill planning if a reservation node is replaced mid-cycle.
- Cover rare edge case in job queue sorting.
- Add job priority value to SLURM_RESUME_FILE.
- sbatch/srun/salloc - Make --gres=gpu:N and --gpus-per-node mutually exclusive.
- switch/hpe_slingshot - Add SwitchParameters=fm_authdir_ctld option.
- slurmd - Support POSIX signal SIGPROF to log debug state.
- slurmd - Increase default conmgr_max_connections from 50 to 512 to avoid connections being deferred on nodes with high...
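The 26.05.0rc1 list makes sacctmgr's quiet (--quiet/-Q) and verbose (--verbose/-v) options mutually exclusive, with an immediate exit when both are given. The same user-visible behaviour can be sketched with Python's argparse. This is only an analogy: the parser below is hypothetical, models nothing but those two flags, and says nothing about how sacctmgr implements the check in C.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the two flags named in the changelog are modelled.
    parser = argparse.ArgumentParser(prog="sacctmgr-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-Q", "--quiet", action="store_true")
    group.add_argument("-v", "--verbose", action="store_true")
    return parser
```

Calling `build_parser().parse_args(["-Q", "-v"])` prints a usage error and exits instead of silently preferring one of the two options.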
v25.11.5
Changes in 25.11.5
- slurmctld - Prevent crash when deleting the only node in the cluster which also belongs to an inactive reservation.
- Fix assoc corruption on account add race condition.
- slurmctld - Re-enforce accounting policy limits when updating a job's QOS/assoc/partition.
- Prevent double call to requeue logic when PrologSlurmctld fails leading to extra records in database.
- Fix backfill to honor partition OverSubscribe=EXCLUSIVE
- stepmgr - Avoid leaking MPI ports when jobs that use the stepmgr are allocated nonconsecutive ports.
- Fix always showing 0 for slurm_cpus_alloc, slurm_nodes_alloc and slurm_memory_alloc in the metrics/jobs endpoint.
- Fix BPF token support compilation on systems with glibc >= 2.36 by using <sys/mount.h> where available instead of <linux/mount.h>.
- Fix a regression in 25.11.0 that could cause bounded hang after hitting conmgr_max_connections.
- Fix Insufficient Size error in NVML library call for long gpu names.
- slurmctld - Correct race condition during reconfigure and creating new cluster in slurmdbd that could cause both daemons to deadlock.
- slurmctld - Reject all job submissions as reserved user or group nobody(99).
- sbatch,srun,salloc - Reject arg --uid=99.
- sbatch,srun,salloc - Reject arg --gid=99.
- Jobs that complete quickly will not be marked as runaway.
- Correctly identify whether a job is in the DB.
- slurmctld - Avoid possible race condition during shutdown that could cause a crash in the HTTP handling logic.
- slurmctld - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
- slurmd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
- slurmstepd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
- srun - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
- slurmdbd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
- Fix race condition with cgroups not migrating slurmd process quickly, which caused EBUSY errors on startup.
- Fix slurmd reconfigure failure with cgroup/v2.
- Fix a regression added in 25.05.0 concerning how the slurmctld inherits /run/slurmctld/sack.socket when using AuthType=auth/slurm to prevent clients that connected during a reconfigure from hanging indefinitely.
- slurmctld - Wait for forwarding threads to complete before shutdown to avoid crashing due to NULL dereferences or using unloaded plugins.
- Avoid failure for spank options that do not require arguments.
- Allow archive load of qos_usage tables
- namespace/linux - fix memory leak in slurmstepd when namespace_p_recv_stepd() fails.
- namespace/linux - Fix potential crash on failure if mmap() or sem_init() fails during namespace construction.
- namespace/linux - fix unlikely error that could cause sigkill to be sent to a job during shutdown.
- namespace/linux - fix failure to detect namespace setup problems when launching a job.
- Fix slurmctld crash when querying the metrics endpoint after a partition is deleted with finished jobs still present.
- reservations - Fix creation with NodeCnt and Flags=IGNORE_JOBS failing when partition nodes are occupied.
- cons_tres - Prevent slurmctld SIGFPE during node selection.
v25.11.4
Changes in 25.11.4
- slurmrestd - Remove ExecReload from unit file since the daemon does not handle SIGHUP (reload would terminate the process).
- Prevent "period_start should already be set" errors when purging slurmdbd data and fix file names for archives of purged slurmdbd data.
- Skip x11 shutdown when x11 functionality was not requested.
- Fix build errors with recent versions of libcurl (8.16+).
- Fix scrun segfault with step_mgr and if environment is set.
- Fix two memory leaks located in the job info struct.
- Fix sacct not accepting -R flag.
- switch/nvidia_imex - Fix parsing of --network=unique-channel-per-segment option.
- topology/block - Fix parsing of --network=unique-channel-per-segment option.
- Fix compile errors building against glibc-2.43
- Prevent potential race that could cause process/script completion to go undetected. In the case of prolog/epilog, this would leave jobs stuck in CG state on nodes running many concurrent jobs. In the case of --get-user-env, it may time out resulting in jobs being requeued and held.
- switch/nvidia_imex - fix use-after-free when switch plugin debug logging is enabled.
- Fix bad umask() if switch/nvidia_imex fails to initialize.
- switch/nvidia_imex - fix memory leak if imex_dev_major is set.
- switch/nvidia_imex - fix potential memory leaks when unpacking the jobinfo structure.
- switch/nvidia_imex - prevent job from starting when imex channel allocation fails.
- When bf_continue is set, prevent backfill from potentially ending its cycle early due to the reason "System state changed" because of a node state change.
- Fix underflow in GRES selection when RestrictedCoresPerGPU is configured and the job is exclusive.
- Fix race on reconfigure that caused slurmctld to crash.
- Docs - Update the version constraints for libjwt to reflect the fact that only 1.x may be used with Slurm.
- Fix case when using sacctmgr where user assoc failed to be removed when removing an account with parent specified.
- cgroup/v2 - Fix issue which caused memory.peak to be inconsistently used.
- Prevent flex reservations from taking nodes from other reservations if those reservations do not request full nodes.
- Fix slurmctld crash situation with srun --overcommit.
- Adding log message to notify user of queries which are too large
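The 25.11.4 list mentions an underflow in GRES selection without saying which counter was involved. As a generic illustration only, the sketch below assumes a 32-bit unsigned counter (an assumption that is not stated in the changelog) and shows how a subtraction can wrap and how a guard avoids it. Both function names are hypothetical.

```python
UINT32_MAX = 0xFFFFFFFF  # assumed counter width; the changelog does not state it


def sub_wrapping(available: int, needed: int) -> int:
    # Emulates C unsigned subtraction: a negative result wraps to a huge value.
    return (available - needed) & UINT32_MAX


def sub_guarded(available: int, needed: int) -> int:
    # Clamp at zero instead of wrapping when more is needed than available.
    return available - needed if available >= needed else 0
```

With 2 units available and 3 needed, the wrapping version reports billions of units left over, whereas the guarded version reports none.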
v25.05.7
Changes in 25.05.7
- Fix regression from af2c0bd which caused usercpu and systemcpu to be missing for job steps.
- slurmd - Fix regression that could cause thread limits to not be enforced for handling incoming RPCs.
- Fix "undefined symbol: gpu_common_underscorify_tolower" when gpu/nrt plugin in use.
- Fix CLOUD nodes infrequently becoming FUTURE on slurmctld restart.
- slurmrestd - Remove ExecReload from unit file since the daemon does not handle SIGHUP (reload would terminate the process).
- Fix compile errors building against glibc-2.43
- Fix race on reconfigure that caused slurmctld to crash
v25.11.3
Changes in 25.11.3
- Fix regression from af2c0bd which caused usercpu and systemcpu to be missing for job steps.
- Fixed issue where RestrictedCoresPerGPU with shared GRES limited the use of restricted cores to one job per sharing GRES.
- slurmd - Fix regression that could cause thread limits to not be enforced for handling incoming RPCs.
- Fix "sacctmgr show conf" to properly display CommitDelay in seconds instead of as a boolean.
- Fix cron/requeued jobs being incorrectly reported as runaway
- slurmctld - Prevent the double-removal of accounting usage for jobs being requeued that are in the COMPLETED or COMPLETING state.
- When deleting a QOS from the DB, also remove it from partition QOS, AllowQOS and DenyQOS fields.
- Fixed bug that could cause the detected CPU count to be lower than actual available CPU count. This bug could have resulted in the default value for conmgr_threads being lower than the number of available CPUs in sackd, scrun, slurmctld, slurmscriptd, slurmd, slurmstepd, slurmdbd, and slurmrestd when the assigned CPUs are not sequential.
- slurmdbd - Prevent the following slurmdbd.conf options from overriding the default values of any in the list not specified: AllowNoDefAcct, AllResourcesAbsolute, DisableCoordDBD, DisableArchiveCommands.
- salloc/sbatch - Nesting a non-stepmgr salloc or sbatch inside an existing job allocation that enabled the stepmgr will no longer result in the inner job's steps failing to launch.
- Prevent slurmd -G from initializing sack processing thread.
- Added SLURM_CLUSTER_NAME, SLURM_JOB_ACCOUNT and SLURM_JOB_GROUP environment variables when a step is launched.
- slurmctld - Prevent marking external nodes as being unresponsive when reconfiguring if SlurmctldParameters=enable_configless is used.
- Fix potential segfault when attempting to look up the controller address via DNS in configless mode.
- Fix "undefined symbol: gpu_common_underscorify_tolower" when gpu/nrt plugin in use.
- slurmrestd - Avoid memory leak on authentication failures with invalid bearer tokens.
- Fix potential deadlock in _x11_signal_handler() during stepd_cleanup().
- slurmctld - Fix reservations AllowedPartitions logic leading to incorrect purge of valid reservations in some use-cases.
- slurmctld - Avoid persistent connection hangs when enable_async_reply is configured.
- Prevent potential controller segfault when reconfiguring after gres file updates.
- Reparent slurmd to a subcgroup to avoid conflicting with systemd.
- Fix sprio regression not handling comma separated list of jobids.
- slurmctld,slurmd - Fix memory leak when container ID is populated.
- slurmd - Fix P-core detection on processors with varying P-core frequencies and in cpuset-restricted environments.
- namespace/linux - add disable_bpf_token option.
- slurmctld - Avoid expedited requeue triggering a job to requeue when job exit code was zero.
- slurmctld - Avoid expedited requeue of jobs while waiting for job epilog script to complete.
- slurmctld - Prevent removing cloud nodes from the topology when putting them in the POWERED_DOWN state if they are present in topology.conf or topology.yaml and their node configuration did not specify the Topology option.
- interfaces/topology - When modifying a node's topology with the Topology option in slurm.conf or with slurmd --conf Topology, change the topology to fully match the new topology.
- slurmctld - Allow changes to topology.conf or topology.yaml, and slurm.conf node configuration Topology option to take effect on a reconfigure or restart when power saving is enabled.
- slurmctld - Prevent backfill from combining future timeslots if they have different license reservations.
- Fix CLOUD nodes infrequently becoming FUTURE on slurmctld restart.
- slurmdbd - Avoid race condition that could cause a hang during shutdown when incoming connection fails.
- slurmdbd - Avoid crash during shutdown due to sacctmgr shutdown request.
- Fix slurmctld assertion when using "enable_async_reply" and certmgr is used for a TLS enabled cluster.
- Fix potential slurmd process leak when handling --get-user-env.
- slurmctld - Avoid race condition that could cause the StateSaveLocation updates to be missed during shutdown.
- slurmctld - Avoid race condition that could cause slurmctld to hang during shutdown before updating StateSaveLocation.
- slurmctld - Avoid race condition that could cause shutdown to wait on the wrong thread.
- Fix handling of 0 node test allocations in topology/block.
- slurmctld - In backfill, prevent unnecessarily testing jobs at future times using the select plugin if it is guaranteed to fail.
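
Illustration for the CPU count fix listed above (25.11.3): when the CPUs assigned to a process are not sequential, the available count is the number of members in the assigned set, not something derived from the lowest or highest CPU id. The sketch below is not Slurm's implementation. It only shows the counting idea on a Linux-style CPU list string, and the helper names `parse_cpu_list` and `count_assigned_cpus` are hypothetical.

```python
def parse_cpu_list(spec: str) -> set[int]:
    """Expand a Linux-style CPU list such as "0-3,8,10-11" into a set of ids.

    Hypothetical helper for illustration only; not part of Slurm.
    """
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty fields such as a trailing comma
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))  # inclusive range
        else:
            cpus.add(int(part))
    return cpus


def count_assigned_cpus(spec: str) -> int:
    """Count assigned CPUs by set membership, so gaps in the ids do not matter."""
    return len(parse_cpu_list(spec))


if __name__ == "__main__":
    # Non-sequential assignment: ids 0-3, 8, 10 and 11.
    print(count_assigned_cpus("0-3,8,10-11"))
```

On Linux, `len(os.sched_getaffinity(0))` reports the same kind of count for the calling process without any string parsing.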
v25.11.2
Changes in 25.11.2
- slurmstepd - Revert regression that would apply job environment to container runtime invocation.
- Fix issue where reservations may start while required GRES resources are still being used by jobs.
- Fix slurmctld segfault when using --consolidate-segments.
- Expose slurm.CONSOLIDATE_SEGMENTS flag in lua.
- Expose the job record's segment_size in lua.
- job_submit/lua - Expose the job_desc's segment_size in lua.
- Prevent PMIx 5.0.8 and 5.0.9 clients from hanging when connecting to the PMIx server.
- Clarify warning when BPF tokens are not supported.
- slurmctld - Ensure we close already accepted conn before RPC flush check
- slurmctld - Fix rpc_queue feature causing statesave corruption during shutdown
- slurmctld - Ensure backfill has finished before saving state.
- slurmctld - Ensure main scheduler has finished before saving state.
- slurmctld - Fix error message while shutting down and state cannot be saved.
- Fix slurmctld double free that occurs when purging array jobs from memory only when using the topology/block plugin.
- Fix steps being rejected inside a batch job when using --cpus-per-task and --mem-per-cpu, and the job was submitted to multiple partitions, but not all of them had the same MaxMemPerCPU limit in place.
- slurmctld - Fix crash after failed reconfiguration while running jobs and priority/multifactor enabled.
- slurmctld - Fix jobs' QOS/association usage leading to potential underflow errors after a failed reconfiguration attempt.
- Guess NodeName with gethostname instead of gethostname_short
- Fix allowing job submissions when EnforcePartLimits=NO and the requested minimum number of nodes exceeds the total nodes in the specified partition(s).
- Fix double unlock issue in _slurm_rpc_job_sbcast_cred()
- srun - fix bug where some input/output/error filename format identifiers were not expanded.
- Fix detecting restricted cores with SlurmdSpecOverride in nodes with more than one socket.
- slurmctld/slurmdbd - Prevent segfaulting if a persistent connection closes right before reconfiguring or shutting down.
- Fix average calculation in latency timers to show more accurate timing logs.