Releases: SchedMD/slurm

v26.05.1

09 Jun 20:46

@mcmult mcmult

slurm-26-05-1-1

d7913fa

v26.05.1 Latest

Changes in 26.05.1

stepmgr - Fix crash when launching async steps with a srun from a different release than the local slurmd.
Fix slurmctld memory leak when using topology/tree and requesting multiple node sizes (-N 2:4:8).
Fixed a crash from a race condition in the mysql code, where the connection was checked before it was safely locked.
Fix DB performance issue when locating steps by SLUID in step completion.
Fix step start and complete for jobs that survived an upgrade from Slurm <= 25.05 to Slurm 26.05.
accounting_storage/slurmdbd - Fix slurmctld performance regression caused by unnecessary lock contention when packing a message to the slurmdbd agent.
Fix warning in shtml2html.py when using Python 3.14+
Block memory resize for jobs started before 26.05.
Fix DBD state replay leaking unpacked messages on version mismatch.
Fix runtime-added assoc/wckey uid NO_VAL under use_client_ids.
Fix typo in _unitdir fallback path in slurm.spec.
Fixed an incompatibility between 26.05 slurmd and 25.05+ slurmstepd when using a 24.11 sattach that would cause the sattach to hang.
Fix parsing issue of "sacctmgr load" when trying to load a file that contains typed TRES.
Error if salloc/sbatch/srun --requeue has an invalid option specified
slurmctld - Fix ~37 second extra delay in retries for slurmdbd reconnection and state saves on NTP-synced systems.
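
The "sacctmgr load" fix above concerns files containing typed TRES. As a rough illustration only, the sketch below splits TRES entries of the form kind[/name[:subtype]]=count, on the assumption that "typed TRES" refers to entries such as gres/gpu:a100. The Tres tuple and both function names are hypothetical; this is not sacctmgr's parser.

```python
from typing import List, NamedTuple, Optional


class Tres(NamedTuple):
    kind: str               # e.g. "cpu", "mem", "gres"
    name: Optional[str]     # e.g. "gpu" in "gres/gpu"
    subtype: Optional[str]  # e.g. "a100" in "gres/gpu:a100"
    count: str              # kept as text; suffixes such as "16G" are not interpreted


def parse_tres_entry(entry: str) -> Tres:
    """Split one 'kind[/name[:subtype]]=count' entry into its parts."""
    spec, sep, count = entry.partition("=")
    if not sep or not spec or not count:
        raise ValueError(f"malformed TRES entry: {entry!r}")
    kind, _, rest = spec.partition("/")
    name, _, subtype = rest.partition(":")
    return Tres(kind, name or None, subtype or None, count)


def parse_tres_list(value: str) -> List[Tres]:
    """Parse a comma-separated TRES string, skipping empty items."""
    return [parse_tres_entry(item) for item in value.split(",") if item]


for tres in parse_tres_list("cpu=8,gres/gpu:a100=4"):
    print(tres)
```

Keeping the subtype as its own field is what lets a typed entry round-trip without being mistaken for an untyped gres/gpu entry.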

v26.05.0

26 May 21:17

@mcmult mcmult

slurm-26-05-0-1

cc8ddc5

v26.05.0

Changes in 26.05.0

slurmctld - Fix interactive jobs erroneously killed by InactivityLimit when slurmctld is congested.
data_parser/v0.0.45 - Remove fields that were deprecated in v0.0.44
slurmd - fix a potential crash during message forwarding
slurmctld - Avoid possible crash under heavy load due to pointer comparisons mis-matching.
Reject sbatch --external jobs when combined with --wrap
Skip external nodes in _slurm_rpc_node_alias_addrs().
Fix out-of-bounds array errors by resizing leaf_usage when tres_cnt changes.
Add the option to set StorageHost or StorageBackupHost in slurmdbd.conf, or JobCompHost in slurm.conf, to a unix socket. To do so, prefix with "unix:", e.g., StorageHost=unix:/path/to/socket.
Logs now better reflect mysql connection issues if connecting over a UNIX socket.
Cache uid lookups to speed controller/dbd startup/reconfigure in some cases.
All features will be tested before jobs are preempted.
Improve clarity of gres/shards in sinfo GRES_USED field.
Fix issue with SlurmctldParameters=max_powered_nodes affecting scontrol update nodename=... commands when it should not.
slurmctld - return 303 See Other from GET /metrics/* when in backup standby, pointing to the configured primary controller
slurmstepd - when a node fails on which the batch step is running, don't deallocate the batch step until after the job completes or is requeued.
Reject the job if num_tasks is lower than the partition min_nodes
Reject job if num_tasks is lower than min_nodes
Reject num_tasks update when num_tasks < min_nodes
Set max_nodes from num_tasks when not explicitly set
switch/hpe_slingshot - Fix memory leak when the fabric manager responds to a job-lookup GET with HTTP 404.
Fix node reboot with slurmd older than 26.05.
Prevent the slurmctld's background thread from waiting on the purge files thread while holding the job and node write locks.
Prevent slurmctld from crashing after removing all of a pending job's licenses from the configuration.
Avoid purging reservations that reserved HRES when restarting or reconfiguring slurmctld.
When restoring jobs from state, hold pending jobs whose license or HRES requests are no longer valid.
Prevent slurmctld crash when a HetJob component fails to start.
Avoid reverse DNS lookups for connection logging unless DebugFlags=conmgr is configured.
Return a properly formatted value for DefMemPerCPU.
srun - Reject --async outside of an existing job allocation.
Fixed hdf5 'Malformed file' error for sh5util -I extraction.
Fixed memory leaks in sh5util.
scontrol show federation now brackets IPv6 control host literals so the address and port are no longer ambiguous.
Fix torus3d placements overlap detection for torus wrap.
Add torus3d node_count overflow guard
topology/torus3d - Add adaptive Morton encoding for large torus dimensions.
Fix torus3d and other topology parsers reporting DUMPING errors when parsing fails.
Requeue --no-requeue jobs when powering-up nodes are drained and requeue_on_resume_failure SchedulerParameter is set.
Set PrologFlags=Alloc automatically when PrologFlags=DeferBatch is set. Without Alloc, DeferBatch will have no effect.
Improve slurmdbd hourly rollup performance on large clusters.
configure - Rename --with-http-parser to --with-libhttp-parser.
configure - Add --with-llhttp-parser option.
http_parser/libhttp_parser - If rpaths are enabled when configuring slurm, add rpath to libhttp_parser plugin.
Add new http_parser/llhttp_parser plugin.
Add new url_parser/internal plugin.
interfaces/http_parser - If HttpParserType is not specified, no longer default to using the http_parser/libhttp_parser plugin. Instead try to first load http_parser/libhttp_parser then http_parser/llhttp_parser.
interfaces/url_parser - If UrlParserType is not specified, no longer default to using the url_parser/libhttp_parser plugin. Instead try to first load url_parser/libhttp_parser then url_parser/internal.
slurmrestd - Fix pipelined HTTP/1.1 requests after the first message on a keep-alive connection.
http_parser/libhttp_parser - Prevent memory leak if a connection ends early.
srun - Add --parsable to emit the bare step id for easier scripting of --async steps.
Enable case insensitive comparison to check for srun_exclusive_allocation in LaunchParameters.
Fix JobAccountGather failing on glibc 2.43+ due to a sscanf() %Nc behavior change.
auth/slurm - Fix missing symbol issues with libjwt 2.1 caused by importing private base64 functions.
auth/jwt - Fix missing symbol issues with libjwt 2.1 caused by importing private base64 functions.
task/affinity - Work on nodes with over 1024 CPUs.
Document SlurmctldHttpAuthParameters in slurm.conf(5).
Document SlurmdHttpAuthParameters in slurm.conf(5).
Fix sdiag RPC-by-user and RPC-by-type output for a full user stats table.
Fix a regression in Slurm 25.05 that caused requeued jobs to lose their license/HRES requests, which resulted in Slurm allowing the job to run without sufficient licenses/HRES.
Set SLURM_JOB_SLUID environment variable.
Fix treating "topology" in slurmd's --conf= options as case sensitive.
Fix losing scontrol-set Extra, InstanceId, and InstanceType on nodes across subsequent slurmd registrations.
Allow a node's topology to be updated based on the dynamic slurmd's reported topology after a reboot.
Allow llhttp-devel as an alternative to http-parser-devel when building RPMs.
Fix not setting an end time to steps in a resized job and properly display them under the original SLUID in sacct.
When using stepmgr and a job is resized, avoid allocating new steps in removed nodes.
Fix not clearing node reasons on resume when not using an accounting storage plugin.
Enforce distribution requirements (-m/--distribution on allocation cli commands) if job requests CountOnly GRES.
Restrict libjwt to >= 1.10.0, < 3 at build and package time.
auth/jwt and auth/slurm - Fix JWT authentication to work around a regression in libjwt 2.1.1 (and later).
Fix JWT authentication failures on libjwt 2.x for parse-only credential paths.
Fix Slurm Lua string to JSON/YAML (slurm.to_json or slurm.to_yaml) rejecting empty strings.
Fix regression in 26.05.0 that caused scrun to exit with a fatal error before starting the container.
Fix slurmctld crash and shutdown/reconfigure deadlock caused by accounting_storage callers racing the plugin teardown.
Fix potential deadlock when the controller is brought up when the dbd was not running on the controller's previous run and there are jobs with a new script or env that need to be sent to the dbd.
Add ESLURM_FILE_UNREADABLE error code to distinguish "file exists but cannot be read" from ENOENT.
Avoid logging parsing warnings in CLI when topology.yaml does not strictly conform to OpenAPI specification.
Avoid logging parsing warnings in CLI when namespace.yaml does not strictly conform to OpenAPI specification.
Avoid logging parsing warnings in CLI when resources.yaml does not strictly conform to OpenAPI specification.
Added swait, a client command to block until all of a job's steps have completed.
Deprecated options ExclusiveUser and ExclusiveTopo are now mutually exclusive.
Add warnings when creating or updating partitions that Exclusive=[NODE|TOPO] implies Oversubscribe=NO when Oversubscribe is set to YES or FORCE.
scontrol - 'EXCLUSIVE_USER' and 'EXCLUSIVE_TOPO' will no longer be dumped by the '.partitions[].flags' field of the following commands: 'scontrol show partition --json', 'scontrol show partition --yaml'.
slurmrestd - No longer parse or dump 'EXCLUSIVE_USER', 'EXC_USER_CLEAR', 'EXCLUSIVE_TOPO', or 'EXC_TOPO_CLEAR' as values for the '.partitions[].flags' field of the following endpoints: 'GET /slurm/v0.0.45/partition/{partition_name}', 'GET /slurm/v0.0.45/partitions', 'POST /slurm/v0.0.45/partitions'.
scontrol - Remove 'partitions[].maximums.oversubscribe.jobs' and 'partitions[].maximums.oversubscribe.flags' fields from the output of the following commands: 'scontrol show partition --json', 'scontrol show partition --yaml'.
slurmrestd - Remove 'partitions[].maximums.oversubscribe.jobs' and 'partitions[].maximums.oversubscribe.flags' fields from the following endpoints: 'GET /slurm/v0.0.45/partition/{partition_name}', 'GET /slurm/v0.0.45/partitions', 'POST /slurm/v0.0.45/partitions'
slurmrestd - Enable parsing for 'partitions[].partition.exclusive' and 'partitions[].partition.oversubscribe' fields of the following endpoints: 'GET /slurm/v0.0.45/partition/{partition_name}', 'GET /slurm/v0.0.45/partitions', 'POST /slurm/v0.0.45/partitions'.
No longer override a partition's OverSubscribe count when updating the partition with Exclusive=[NO|USER].
Fix regression in 26.05.0rc1 that caused slurmscriptd to crash on receiving SIGPROF.
Properly complete Slurm <= 25.11 jobs with a 26.05 slurmdbd.
Add missing index on the sluid column of the job_table.
Add sluid in archive dump/load.
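
The "unix:" prefix described above for StorageHost, StorageBackupHost and JobCompHost can be pictured with a small classifier. This is a minimal sketch of the prefix convention as stated in the notes, not Slurm's configuration parser; classify_host is a hypothetical name, and treating the prefix as case-sensitive is an assumption.

```python
from typing import Tuple

UNIX_PREFIX = "unix:"  # prefix taken from the release note; case-sensitivity is assumed


def classify_host(value: str) -> Tuple[str, str]:
    """Return ("unix", path) for 'unix:'-prefixed values, else ("tcp", host)."""
    if value.startswith(UNIX_PREFIX):
        path = value[len(UNIX_PREFIX):]
        if not path:
            raise ValueError("unix: prefix given without a socket path")
        return ("unix", path)
    return ("tcp", value)


print(classify_host("unix:/path/to/socket"))  # → ('unix', '/path/to/socket')
print(classify_host("dbhost.example.com"))    # → ('tcp', 'dbhost.example.com')
```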

v25.11.6

14 May 20:01

@mcmult mcmult

slurm-25-11-6-1

276bd2f

v25.11.6

Changes in 25.11.6

scontrol - Allow updating InstanceId for batches of nodes as is possible for updating NodeAddr and NodeHosts.
scontrol - Allow updating InstanceType for batches of nodes as is possible for updating NodeAddr and NodeHosts.
Fix problem when using sacctmgr to remove a default account for a user when more than one is set.
Fix sacctmgr silently ignoring trailing characters in numeric options.
Fix sbcast with auth/slurm when user doesn't exist on slurmctld.
Fix stepmgr crash when using sbcast with auth/slurm.
Fix memory leak in stepmgr stepd.
Reject untrusted REQUEST_COMPLETE_PROLOG.
Fix jobs getting stuck in COMPLETING state when PrologFlags=RunInJob is configured by passing EpilogMsgTime to slurmstepd.
Fix external nodes incorrectly marked as not responding after state transitions such as drain/undrain or resume.
slurmstepd - Prevent crash when UnkillableStepTimeout is reached and Slurm is configured with --enable-memory-leak-debug.
slurmctld - Fix possible hang during reconfigure caused by slow client I/O when the timeout was not being enforced.
slurmctld - Fix possible hang during shutdown caused by slow client I/O when the timeout was not being enforced.
slurmctld - Avoid race condition during shutdown that could cause a crash while attempting to read from a connection.
Fix parsing issue for GRES resources that contain a hyphen ("-") in their name when using sacctmgr.
Ensure that a request for zero licenses does not prevent a job from running when all licenses are in-use or reserved.
slurmctld - Fix crash on startup due to race condition when I/O is processed before the connection (conn) plugin finishes initialization.
slurmdbd - Fix crash from race condition during shutdown when a persistent connection closes its database connection after the accounting_storage plugin has already unloaded.
slurmrestd - Fixed memory leak resulting from specifying an empty node_list in the request body of the following endpoints: 'POST /slurm/v0.0.4[3-5]/reservation' 'POST /slurm/v0.0.4[3-5]/reservations'
Prevent deadlock when replacing nodes in reservations.
Fix slow scheduling for multi-segment jobs with topology/block when blocks have fewer available nodes than the requested segment size.
serializer/url-encoded - Allow non-NULL terminated strings to be passed to serialize_p_string_to_data().
serializer/yaml - Prevent fataling if the size of a yaml configuration file is a multiple of 4096 bytes.
Fix archive dump jobs "No records archived...but some found"
Fix gcc-16 build errors.
Fix slurmstepd crash in jobacctinfo_aggregate() handling when SlurmctldParameters=enable_stepmgr and JobAcctGatherType=jobacct_gather/none are set.
Fix slurmd >= 25.05 crash on HetJob step launches from srun <= 24.11.
Set the in-memory QOS priority to 0 after INFINITY is handled by slurmdbd.
Do not allocate maintenance nodes to new reservations.
slurmd - fix a potential crash during message forwarding
Fix out-of-bounds array errors by resizing leaf_usage when tres_cnt changes.
All features will be tested before jobs are preempted.
slurmstepd - when a node fails on which the batch step is running, don't deallocate the batch step until after the job completes or is requeued.
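
The sacctmgr fix above, about trailing characters in numeric options being silently ignored, describes a common class of parsing bug. The sketch below contrasts a lenient parse that stops at the first non-digit with a strict parse that rejects the whole value. Both functions are hypothetical illustrations and say nothing about how sacctmgr itself is implemented.

```python
import re

_LEADING_INT = re.compile(r"\s*([+-]?\d+)")


def parse_int_lenient(text: str) -> int:
    """Read the leading integer and silently ignore whatever follows it."""
    match = _LEADING_INT.match(text)
    if match is None:
        raise ValueError(f"no digits in {text!r}")
    return int(match.group(1))


def parse_int_strict(text: str) -> int:
    """Accept the value only if the entire string is an integer."""
    match = _LEADING_INT.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid numeric value: {text!r}")
    return int(match.group(1))


print(parse_int_lenient("100x"))  # → 100
try:
    parse_int_strict("100x")
except ValueError as err:
    print(err)
```

With the strict variant, a typo such as "100x" surfaces as an error instead of quietly becoming 100.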

v25.05.8

14 May 20:01

@mcmult mcmult

slurm-25-05-8-1

90ff1dd

v25.05.8

Changes in 25.05.8

slurmctld - Correct race condition during reconfigure and creating new cluster in slurmdbd that could cause both daemons to deadlock.
slurmctld - Reject all job submissions as reserved user or group nobody(99).
sbatch,srun,salloc - Reject arg --uid=99.
sbatch,srun,salloc - Reject arg --gid=99.
slurmctld - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
slurmd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
slurmstepd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
srun - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
slurmdbd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
slurmctld - Wait for forwarding threads to complete before shutdown to avoid crashing due to NULL dereferences or using unloaded plugins.
cons_tres - Prevent slurmctld SIGFPE during node selection.
slurmctld - Fix possible hang during reconfigure caused by slow client I/O when the timeout was not being enforced.
slurmctld - Fix possible hang during shutdown caused by slow client I/O when the timeout was not being enforced.
slurmctld - Avoid race condition during shutdown that could cause a crash while attempting to read from a connection.
slurmctld - Fix crash on startup due to race condition when I/O is processed before the connection (conn) plugin finishes initialization.
slurmdbd - Fix crash from race condition during shutdown when a persistent connection closes its database connection after the accounting_storage plugin has already unloaded.
Prevent deadlock when replacing nodes in reservations.
Fix gcc-16 build errors.
Fix build errors with recent versions of libcurl (8.16+).
Fix catching invalid gpu-freq numbered values.
Fix slurmd >= 25.05 crash on HetJob step launches from srun <= 24.11.

v26.05.0rc1

07 May 21:24

@MarshallGarey MarshallGarey

slurm-26-05-0-0rc1

be10eaa

v26.05.0rc1 Pre-release

Changes in 26.05.0rc1

Add SLURM_JOB_QOS to Prolog/Epilog environment.
data_parser/v0.0.45 - Prevent memory leaks when freeing parsed lists.
Return an xstring from slurm_create_reservation() instead of one created with strdup().
scontrol - If a step terminates while its pids are being queried, 'scontrol listpids' will now print all successfully found pids instead of only logging an error.
Prevent stepd_connect() from overriding the connect calls errno on error.
slurmctld - Support 'verbose' query parameter in 'GET /readyz' endpoint.
slurmd - Support 'verbose' query parameter in 'GET /readyz' endpoint.
sacctmgr - In interactive mode, quiet/verbose will now apply to logging messages that are printed.
sacctmgr - Quiet (--quiet/-Q) and verbose (--verbose/-v) command line options are now mutually exclusive. sacctmgr will immediately exit if both options are specified.
sacctmgr - Quiet option (--quiet/-Q) is now applied to all logging messages, ensuring that it is enforced in all cases (e.g. logging from 'dump' previously would not honor --quiet)
NO_NORMAL_ALL will only be printed if all NO_NORMAL_* flags are set.
job_submit/lua - Log Lua stacktrace on runtime errors when calling slurm_job_submit() in job_submit.lua when 'debugflags=script' is set in slurm.conf or via environment SLURM_DEBUG_FLAGS=script.
job_submit/lua - Log Lua stacktrace on runtime errors when calling slurm_job_modify() in job_submit.lua when 'debugflags=script' is set in slurm.conf or via environment SLURM_DEBUG_FLAGS=script.
Added error handling and logging when a malformed RESPONSE_CONFIG RPC is received.
Reject QOS creation requests that use nonuser flags
Do not print nonuser QOS flags as valid flags
Add "thread" as possible flag to "debugflags=" in slurm.conf and slurmdbd.conf.
Do not allow clearing the partition from a reservation (e.g. scontrol update ReservationName=<res_name> PartitionName=''). Attempts to clear the partition from a reservation will be rejected by slurmctld. This change also fixes several potential slurmctld crashes.
Add DebugFlag=SelectType log for when a node is skipped during job scheduling attempts because it is in COMPLETING state.
slurmrestd - Add POWER_DOWN_ASAP and POWER_DOWN_FORCE as valid node states in REST.
slurmctld - Remove Slurmctld job state cache including support for SchedulerParameters=enable_job_state_cache in slurm.conf.
slurmctld - Log error when saving to StateSaveLocation is too slow.
slurmctld - Include StateSaveLocation statistics with /readyz endpoint.
Fix error reading /proc/0/* when calling the api outside the step namespace.
Alter sh5util -j to not allow array or het job ids.
slurmctld - Improve ability to process RPCs in parallel by removing the need for the node write lock to process REQUEST_NODE_INFO, "metrics/partitions", and "metrics/nodes" requests, as well as when spawning the node health check agent.
slurmctld - No longer acquire the job write lock when spawning the node health check agent.
Fix long slurmd stop time when waiting on the slurmd to register.
Fix slurmstepd memleak when initializing cgroup plugins.
scrun - Update scrun.lua example in man 1 scrun removing requirement to compile Lua with JSON support.
Fix not applying constraints if CpuSpecList string is larger than 1024 chars.
slurmrestd - Return 200 when querying a non existing partition. This affects the following endpoints: 'GET /slurm/v0.0.45/partition'
slurmctld - Preserve intermediate job scheduling values to provide consistent scontrol show job output before and after reconfiguring or restarting the controller.
Increase precision of time reported when timers issue warnings.
scontrol - Print 'Job 12_23 not found' errors on stderr instead of stdout.
stepmgr - Handle the case where a step's requested ThreadsPerCore does not equal a node's configured ThreadsPerCore.
Fix bug where requests from denied uids (i.e. "Users=-") to skip, delete or view (if using PrivateData) reservations were not rejected properly. This bug only existed for clusters not using AccountingStorageEnforce=associations (including other options that imply enforcing associations)
Fix rare potential race condition in x11 forwarding that could result in a double free.
salloc/scrun/srun/slurmstepd - Move setting of SLURM_TASKS_PER_NODE to the controller.
gpu/nvml - The --gpu-freq job submission options will now set the actual Memory/GPU clock frequencies rather than the "Applications clocks" frequencies if the installed version of NVML supports it. This affects CUDA 11.3+ and prevents build errors in CUDA 13.0+ where the "Applications clocks" interface has been deprecated.
gpu/nvml - Fix bug that prevented clock frequencies being reset on all GPUs at job completion when cgroups is constraining devices and there are multiple GPUs on the node.
gpu/nvml - Fix bug that prevented --gpu-freq from being applied to the GPU clock frequency without specifying a memory clock frequency.
Fixed SLURM_CLUSTER_NAME to be set to correct cluster when multiple clusters are available in a batch job.
Respect arbitrary task distribution and return ESLURM_NOT_SUPPORTED if it is set together with an incompatible setting, namely topology/block, --spread-job, CR_LLN, pack_serial_at_end or bf_busy_nodes.
slurmctld,slurmdbd: Avoid segfault when persistent connections fail to establish fully.
Avoid unnecessary numeric UID to user name translation when dumping information for a node that has no reason set for its current state. The following slurmrestd endpoints have changed: GET /slurm/v0.0.45/nodes, GET /slurm/v0.0.45/node/{node_name}. The following CLI commands have changed: scontrol show node {node_name} (--json|--yaml), scontrol show nodes (--json|--yaml).
sinfo - Avoid unnecessary numeric UID to user name translation when dumping information for a node that has no reason set for its current state. The following CLI command has changed: sinfo (--json|--yaml).
slurmrestd - Add cores_per_socket to job submission to the following endpoints: GET /slurm/v0.0.45/job/submit GET /slurm/v0.0.45/job/allocate POST /slurm/v0.0.45/job/{job_id}
slurmctld - Refuse RESPONSE_PING_SLURMD from incorrect nodes
slurmctld - Skip MODE_3 HRes specific logic in backfill for jobs that do not request MODE_3 HRes.
select/cons_tres - fix use-after-free of node_usage[].jobs
Add status field to scontrol ping --json and scontrol ping --yaml.
Add status field to '.components.schemas."v0.0.45_controller_ping"' for the following endpoint: GET /slurm/v0.0.45/ping
Add status field to sacctmgr ping --json and sacctmgr ping --yaml.
Add status field to '.components.schemas."v0.0.45_slurmdbd_ping"' for the following endpoint: GET /slurmdb/v0.0.45/ping
slurmctld - Require authentication for the 'GET /readyz?verbose' endpoint, restricting access to only root and SlurmUser.
slurmctld - Add a thread pool to avoid the overhead of creating new threads, during which the kernel freezes the entire process until creation completes. This can be enabled with SlurmctldParameters=threadpool=enabled.
Fix building with --with-jwt in a non-standard location.
sacct - Add '.jobs[].sluid' field to the following commands: 'sacct --json', 'sacct --yaml'
slurmrestd - Add '.jobs[].sluid' field to the following endpoints: 'GET slurmdb/v0.0.45/job', 'GET slurmdb/v0.0.45/jobs'
slurmrestd - Add 'GET /healthz', 'GET /readyz', and 'GET /livez' endpoints.
Fix potential glibc deadlock when tearing down the extern step when x11 forwarding is enabled.
Fix FreeBSD build for --format=binary files, which are currently used for command help and usage text.
Packaging - MUNGE is now a weak dependency to Slurm RPM and DEB packages, and can now be optionally installed or removed (installed by default).
Add SuspendTime as a NodeName parameter in slurm.conf, enabling per-node power save configuration.
slurmrestd - Deprecate ignored reason_uid field from the following endpoints: POST /slurm/v0.0.42/nodes/ POST /slurm/v0.0.42/node/{node_name}
slurmrestd - Deprecate ignored reason_uid field from the following endpoints: POST /slurm/v0.0.43/nodes/ POST /slurm/v0.0.43/node/{node_name}
slurmrestd - Deprecate ignored reason_uid field from the following endpoints: POST /slurm/v0.0.44/nodes/ POST /slurm/v0.0.44/node/{node_name}
slurmrestd - Deprecate ignored reason_uid field from the following endpoints: POST /slurm/v0.0.45/nodes/ POST /slurm/v0.0.45/node/{node_name}
Add new archive/purge options to allow for explicit archiving of job_scripts and job_env without jobs.
When the url_parser plugin does not load, change the log from an error to a warning. This plugin is optional and may not always be built.
Fix rpmbuild slurm.spec --with selinux.
Use internal dependency generator in slurm.spec.
Switch to pkgconfig detection of many packages in slurm.spec.
Add reqTRES components to the clonensscript and clonensepilog environment variables.
Name all process POSIX threads consistently with format "worker[{index}]" when threads are not otherwise given a special name.
slurmctld - Fix unresponsive nodes not being marked DOWN in clusters with frequent reconfigurations, as each reconfigure was updating the SlurmdTimeout countdown.
slurmctld - If a node is replaced in a reservation, mark the reservation state as changed. With bf_continue enabled, this fixes potentially incorrect backfill planning if a reservation node is replaced mid-cycle.
Cover rare edge case in job queue sorting.
Add job priority value to SLURM_RESUME_FILE.
sbatch/srun/salloc - Make --gres=gpu:N and --gpus-per-node mutually exclusive.
switch/hpe_slingshot - Add SwitchParameters=fm_authdir_ctld option.
slurmd - Support POSIX signal SIGPROF to log debug state.
slurmd - Increase default conmgr_max_connections from 50 to 512 to avoid connections being deferred on nodes with high...
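
Several entries above make pairs of options mutually exclusive, for example sacctmgr --quiet/-Q with --verbose/-v. The sketch below reproduces that behavior with Python's argparse, purely to illustrate the "exit immediately if both are given" semantics. sacctmgr is not written in Python and does not use argparse; the "demo" program name and the accepts helper are hypothetical.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser in which --quiet and --verbose cannot be combined."""
    parser = argparse.ArgumentParser(prog="demo")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-Q", "--quiet", action="store_true")
    group.add_argument("-v", "--verbose", action="store_true")
    return parser


def accepts(argv: List[str]) -> bool:
    """Return True if the arguments parse, False if argparse exits with an error."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:
        return False
    return True


print(accepts(["--quiet"]))               # → True
print(accepts(["--quiet", "--verbose"]))  # → False
```

Rejecting the combination up front avoids having to define which of the two conflicting options wins.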

v25.11.5

14 Apr 21:19

@MarshallGarey MarshallGarey

slurm-25-11-5-1

ed147cc

v25.11.5

Changes in 25.11.5

slurmctld - Prevent crash when deleting the only node in the cluster which also belongs to an inactive reservation.
Fix assoc corruption on account add race condition.
slurmctld - Re-enforce accounting policy limits when updating a job's QOS/assoc/partition.
Prevent double call to requeue logic when PrologSlurmctld fails leading to extra records in database.
Fix backfill to honor partition OverSubscribe=EXCLUSIVE
stepmgr - Avoid leaking MPI ports when jobs that use the stepmgr are allocated nonconsecutive ports.
Fix always showing 0 for slurm_cpus_alloc, slurm_nodes_alloc and slurm_memory_alloc in the metrics/jobs endpoint.
Fix BPF token support compilation on systems with glibc >= 2.36 by using <sys/mount.h> where available instead of <linux/mount.h>.
Fix a regression in 25.11.0 that could cause a bounded hang after hitting conmgr_max_connections.
Fix Insufficient Size error in NVML library call for long gpu names.
slurmctld - Correct race condition during reconfigure and creating new cluster in slurmdbd that could cause both daemons to deadlock.
slurmctld - Reject all job submissions as reserved user or group nobody(99).
sbatch,srun,salloc - Reject arg --uid=99.
sbatch,srun,salloc - Reject arg --gid=99.
Jobs that complete quickly will not be marked as runaway.
Correctly identify whether a job is in the DB.
slurmctld - Avoid possible race condition during shutdown that could cause a crash in the HTTP handling logic.
slurmctld - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
slurmd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
slurmstepd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
srun - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
slurmdbd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
Fix race condition with cgroups not migrating slurmd process quickly, which caused EBUSY errors on startup.
Fix slurmd reconfigure failure with cgroup/v2.
Fix a regression added in 25.05.0 concerning how the slurmctld inherits /run/slurmctld/sack.socket when using AuthType=auth/slurm to prevent clients that connected during a reconfigure from hanging indefinitely.
slurmctld - Wait for forwarding threads to complete before shutdown to avoid crashing due to NULL dereferences or using unloaded plugins.
Avoid failure for spank options that do not require arguments.
Allow archive load of qos_usage tables
namespace/linux - fix memory leak in slurmstepd when namespace_p_recv_stepd() fails.
namespace/linux - Fix potential crash on failure if mmap() or sem_init() fails during namespace construction.
namespace/linux - fix unlikely error that could cause sigkill to be sent to a job during shutdown.
namespace/linux - fix failure to detect namespace setup problems when launching a job.
Fix slurmctld crash when querying the metrics endpoint after a partition is deleted with finished jobs still present.
reservations - Fix creation with NodeCnt and Flags=IGNORE_JOBS failing when partition nodes are occupied.
cons_tres - Prevent slurmctld SIGFPE during node selection.

v25.11.4

12 Mar 20:59

@mcmult mcmult

slurm-25-11-4-1

bd34987

v25.11.4

Changes in 25.11.4

slurmrestd - Remove ExecReload from unit file since the daemon does not handle SIGHUP (reload would terminate the process).
Prevent "period_start should already be set" errors when purging slurmdbd data and fix file names for archives of purged slurmdbd data.
Skip x11 shutdown when x11 functionality was not requested.
Fix build errors with recent versions of libcurl (8.16+).
Fix scrun segfault with step_mgr and if environment is set.
Fix two memory leaks located in the job info struct.
Fix sacct not accepting -R flag.
switch/nvidia_imex - Fix parsing of --network=unique-channel-per-segment option.
topology/block - Fix parsing of --network=unique-channel-per-segment option.
Fix compile errors building against glibc-2.43
Prevent potential race that could cause process/script completion to go undetected. In the case of prolog/epilog, this would leave jobs stuck in CG state on nodes running many concurrent jobs. In the case of --get-user-env, it may time out resulting in jobs being requeued and held.
switch/nvidia_imex - fix use-after-free when switch plugin debug logging is enabled.
Fix bad umask() if switch/nvidia_imex fails to initialize.
switch/nvidia_imex - fix memory leak if imex_dev_major is set.
switch/nvidia_imex - fix potential memory leaks when unpacking the jobinfo structure.
switch/nvidia_imex - prevent job from starting when imex channel allocation fails.
When bf_continue is set, prevent backfill from potentially ending its cycle early due to the reason "System state changed" because of a node state change.
Fix underflow in GRES selection when RestrictedCoresPerGPU is configured and the job is exclusive.
Fix race on reconfigure that caused slurmctld to crash.
Docs - Update the version constraints for libjwt to reflect the fact that only 1.x may be used with Slurm.
Fix case when using sacctmgr where user assoc failed to be removed when removing an account with parent specified.
cgroup/v2 - Fix issue which caused memory.peak to be inconsistently used.
Prevent flex reservations from taking nodes from other reservations if those reservations do not request full nodes.
Fix slurmctld crash situation with srun --overcommit.
Add log message to notify the user of queries that are too large.

v25.05.7

12 Mar 20:58

@MarshallGarey MarshallGarey

slurm-25-05-7-1

8548115

v25.05.7

Changes in 25.05.7

Fix regression from af2c0bd which caused usercpu and systemcpu to be missing for job steps.
slurmd - Fix regression that could cause thread limits to not be enforced for handling incoming RPCs.
Fix "undefined symbol: gpu_common_underscorify_tolower" when gpu/nrt plugin in use.
Fix CLOUD nodes infrequently becoming FUTURE on slurmctld restart.
slurmrestd - Remove ExecReload from unit file since the daemon does not handle SIGHUP (reload would terminate the process).
Fix compile errors building against glibc-2.43
Fix race on reconfigure that caused slurmctld to crash

v25.11.3

19 Feb 22:13

@wickberg wickberg

slurm-25-11-3-1

c21c011

v25.11.3

Changes in 25.11.3

Fix regression from af2c0bd which caused usercpu and systemcpu to be missing for job steps.
Fixed issue where RestrictedCoresPerGPU with shared gres are limited to using restricted cores on one job per sharing gres.
slurmd - Fix regression that could cause thread limits to not be enforced for handling incoming RPCs.
Fix "sacctmgr show conf" to properly display CommitDelay in seconds instead of as a boolean.
Fix cron/requeued jobs being incorrectly reported as runaway
slurmctld - Prevent the double-removal of accounting usage for jobs being requeued that are in the COMPLETED or COMPLETING state.
When deleting a QOS from the DB, also remove it from partition QOS, AllowQOS and DenyQOS fields.
Fixed bug that could cause the detected CPU count to be lower than actual available CPU count. This bug could have resulted in the default value for conmgr_threads being lower than the number of available CPUs in sackd, scrun, slurmctld, slurmscriptd, slurmd, slurmstepd, slurmdbd, and slurmrestd when the assigned CPUs are not sequential.
slurmdbd - Prevent the following slurmdbd.conf options from overriding the default values of any in the list not specified: AllowNoDefAcct, AllResourcesAbsolute, DisableCoordDBD, DisableArchiveCommands.
salloc/sbatch - Nesting a non-stepmgr salloc or sbatch inside an existing job allocation that enabled the stepmgr will no longer result in the inner job's steps failing to launch.
Prevent slurmd -G from initializing sack processing thread.
Added SLURM_CLUSTER_NAME, SLURM_JOB_ACCOUNT and SLURM_JOB_GROUP environment variables when a step is launched.
slurmctld - Prevent marking external nodes as being unresponsive when reconfiguring if SlurmctldParameters=enable_configless is used.
Fix potential segfault when attempting to look up the controller address via DNS in configless mode.
Fix "undefined symbol: gpu_common_underscorify_tolower" when gpu/nrt plugin in use.
slurmrestd - Avoid memory leak on authentication failures with invalid bearer tokens.
Fix potential deadlock in _x11_signal_handler() during stepd_cleanup().
slurmctld - Fix reservations AllowedPartitions logic leading to incorrect purge of valid reservations in some use-cases.
slurmctld - Avoid persistent connection hangs when enable_async_reply is configured.
Prevent potential controller segfault when reconfiguring after gres file updates.
Reparent slurmd to a subcgroup to avoid conflicting with systemd.
Fix sprio regression not handling comma separated list of jobids.
slurmctld,slurmd - Fix memory leak when container ID is populated.
slurmd - Fix P-core detection on processors with varying P-core frequencies and in cpuset-restricted environments.
namespace/linux - add disable_bpf_token option.
slurmctld - Avoid expedited requeue triggering a job to requeue when job exit code was zero.
slurmctld - Avoid expedited requeue of jobs while waiting for job epilog script to complete.
slurmctld - Prevent removing cloud nodes from the topology when putting them in the POWERED_DOWN state if they are present in topology.conf or topology.yaml and their node configuration did not specify the Topology option.
interfaces/topology - When modifying a node's topology with the Topology option in slurm.conf or with slurmd --conf Topology, change the topology to fully match the new topology.
slurmctld - Allow changes to topology.conf or topology.yaml, and slurm.conf node configuration Topology option to take effect on a reconfigure or restart when power saving is enabled.
slurmctld - Prevent backfill from combining future timeslots if they have different license reservations.
Fix CLOUD nodes infrequently becoming FUTURE on slurmctld restart.
slurmdbd - Avoid race condition that could cause a hang during shutdown when incoming connection fails.
slurmdbd - Avoid crash during shutdown due to sacctmgr shutdown request.
Fix slurmctld assertion when using "enable_async_reply" and certmgr is used for a TLS enabled cluster.
Fix potential slurmd process leak when handling --get-user-env.
slurmctld - Avoid race condition that could cause the StateSaveLocation updates to be missed during shutdown.
slurmctld - Avoid race condition that could cause slurmctld to hang during shutdown before updating StateSaveLocation.
slurmctld - Avoid race condition that could cause shutdown to wait on the wrong thread.
Fix handling of 0 node test allocations in topology/block.
slurmctld - In backfill, prevent unnecessarily testing jobs at future times using the select plugin if it is guaranteed to fail.
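The SLURM_CLUSTER_NAME, SLURM_JOB_ACCOUNT and SLURM_JOB_GROUP variables added to step launches in this release can be read like any other environment variable. Below is a minimal Python sketch, assuming it runs as a task inside a step on 25.11.3 or later. The helper name read_step_context is hypothetical, and variables that are missing (older versions, or running outside a step) are reported as unset.

```python
import os
from typing import Mapping, Optional

# Variables named in the 25.11.3 changelog entry.
STEP_VARS = ("SLURM_CLUSTER_NAME", "SLURM_JOB_ACCOUNT", "SLURM_JOB_GROUP")


def read_step_context(env: Mapping[str, str] = os.environ) -> dict[str, Optional[str]]:
    """Return the step-level variables, with None for any that are not set.

    Hypothetical helper for illustration only; not part of Slurm.
    """
    return {name: env.get(name) for name in STEP_VARS}


if __name__ == "__main__":
    for name, value in read_step_context().items():
        print(f"{name}={value if value is not None else '<unset>'}")
```

Passing the environment as a parameter keeps the helper easy to exercise outside a real job allocation.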

v25.11.2

26 Jan 19:20

@wickberg wickberg

slurm-25-11-2-1

64d98c4

v25.11.2

Changes in 25.11.2

slurmstepd - Revert regression that would apply job environment to container runtime invocation.
Fix issue where reservations may start while required GRES resources are still being used by jobs.
Fix slurmctld segfault when using --consolidate-segments.
Expose slurm.CONSOLIDATE_SEGMENTS flag in lua.
Expose the job record's segment_size in lua.
job_submit/lua - Expose the job_desc's segment_size in lua.
Prevent PMIx 5.0.8 and 5.0.9 clients from hanging when connecting to the PMIx server.
Clarify warning when BPF tokens are not supported.
slurmctld - Ensure already-accepted connections are closed before the RPC flush check.
slurmctld - Fix rpc_queue feature causing state save corruption during shutdown.
slurmctld - Ensure backfill has finished before saving state.
slurmctld - Ensure main scheduler has finished before saving state.
slurmctld - Fix error message shown when shutting down and state cannot be saved.
Fix slurmctld double free that occurs when purging array jobs from memory only when using the topology/block plugin.
Fix steps being rejected inside a batch job when using --cpus-per-task and --mem-per-cpu, and the job was submitted to multiple partitions, but not all of them had the same MaxMemPerCPU limit in place.
slurmctld - Fix crash after a failed reconfiguration while jobs are running and priority/multifactor is enabled.
slurmctld - Fix jobs' QOS/association usage leading to potential underflow errors after a failed reconfiguration attempt.
Guess NodeName with gethostname instead of gethostname_short
Fix allowing job submissions when EnforcePartLimits=NO and the requested minimum number of nodes exceeds the total nodes in the specified partition(s).
Fix double unlock issue in _slurm_rpc_job_sbcast_cred()
srun - fix bug where some input/output/error filename format identifiers were not expanded.
Fix detecting restricted cores with SlurmdSpecOverride in nodes with more than one socket.
slurmctld/slurmdbd - Prevent segfaulting if a persistent connection closes right before reconfiguring or shutting down.
Fix average calculation in latency timers to show more accurate timing logs.
