Coming soon: The Renovate bot (GitHub App) will be renamed to Mend. PRs from Renovate will soon appear from 'Mend'. Learn more here.
This PR contains the following updates: torch ==2.3.0 -> ==2.8.0
Warning: Some dependencies could not be looked up. Check the Dependency Dashboard for more information.
GitHub Vulnerability Alerts
CVE-2025-2953
A vulnerability classified as problematic has been found in PyTorch 2.6.0+cu124. Affected by this issue is the function torch.mkldnn_max_pool2d. The manipulation leads to denial of service. The attack must be carried out locally. The exploit has been disclosed to the public and may be used.
CVE-2025-3730
A vulnerability classified as problematic was found in PyTorch 2.6.0. Affected is the function torch.nn.functional.ctc_loss in the file aten/src/ATen/native/LossCTC.cpp. The manipulation leads to denial of service. The attack must be carried out locally. The exploit has been disclosed to the public and may be used. The patch is identified as commit 46fc5d8e360127361211cb237d5f9eef0223e567; it is recommended to apply it to fix this issue.
CVE-2025-32434
A Remote Command Execution (RCE) vulnerability was found in PyTorch: even when a model is loaded with torch.load and weights_only=True, RCE can still be achieved. The official security guidance (https://github.com/pytorch/pytorch/security) describes torch.load() with weights_only=True as safe, and users rely on it to mitigate the known risks of weights_only=False; this finding shows that assumption does not hold.
Credit: This vulnerability was found by Ji'an Zhou.
Release Notes
pytorch/pytorch (torch)
v2.8.0: PyTorch 2.8.0 Release
Compare Source 
PyTorch 2.8.0 Release Notes
Highlights
Unstable
- torch::stable::Tensor
- High-performance quantized LLM inference on Intel CPUs with native PyTorch
- Experimental Wheel Variant Support
- Inductor CUTLASS backend support
- Inductor Graph Partition for CUDAGraph
- Control Flow Operator Library
- HuggingFace SafeTensors support in PyTorch Distributed Checkpointing
- SYCL support in PyTorch CPP Extension API
- A16W4 on XPU Device
- Hierarchical compilation with torch.compile
- Intel GPU distributed backend (XCCL) support
For more details about these highlighted features, see the release blog post.
Below are the full release notes for this release.
Tracked Regressions
Windows wheel builds with CUDA 12.9.1 stack overflow during build (#156181)
Due to a bug introduced in CUDA 12.9.1, we are unable to complete full Windows wheel builds with this
version, as compilation of torch.segment_reduce() crashes the build. Thus, we provide a wheel
without torch.segment_reduce() included in order to sidestep the issue. If you need support
for torch.segment_reduce(), please utilize a different version.
Backwards Incompatible Changes
CUDA Support
Removed support for Maxwell and Pascal architectures with CUDA 12.8 and 12.9 builds (#157517, #158478, #158744)
Due to binary size limitations, support for sm50 - sm60 architectures with CUDA 12.8 and 12.9 has
been dropped for the 2.8.0 release. If you need support for these architectures, please utilize
CUDA 12.6 instead.
Python Frontend
Calling an op with an input dtype that is unsupported now raises NotImplementedError instead of RuntimeError (#155470)
Please update exception handling logic to reflect this.
In 2.7.0
try:
 torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
except RuntimeError:
 ...
In 2.8.0
try:
 torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
except NotImplementedError:
 ...
Added missing in-place on view check to custom autograd.Function (#153094)
In 2.8.0, if a custom autograd.Function mutates a view of a leaf requiring grad,
it now properly raises an error. Previously, it would silently leak memory.
class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        inp.add_(1)
        ctx.mark_dirty(inp)
        return inp

    @staticmethod
    def backward(ctx, gO):
        pass

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a.view_as(a)
Func.apply(b)
Output:
Version 2.7.0
Runs without error, but leaks memory
Version 2.8.0
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation
An error is now properly thrown for the out variant of tensordot when called with a requires_grad=True tensor (#150270)
Please avoid passing an out tensor with requires_grad=True as gradients cannot be
computed for this tensor.
In 2.7.0
a = torch.empty((4, 2), requires_grad=True)
b = torch.empty((2, 4), requires_grad=True)
c = torch.empty((2, 2), requires_grad=True)
# does not error, but gradients for c cannot be computed
torch.tensordot(a, b, dims=([1], [0]), out=c)
In 2.8.0
a = torch.empty((4, 2), requires_grad=True)
b = torch.empty((2, 4), requires_grad=True)
c = torch.empty((2, 2), requires_grad=True)
torch.tensordot(a, b, dims=([1], [0]), out=c)
# RuntimeError: tensordot(): the 'out' tensor was specified and requires gradients, and
# its shape does not match the expected result. Either remove the 'out' argument, ensure
# it does not require gradients, or make sure its shape matches the expected output.
torch.compile
Specialization of a tensor shape with mark_dynamic applied now correctly errors (#152661)
Prior to 2.8, it was possible for a guard on a symbolic shape to be incorrectly
omitted if the symbolic shape evaluation was previously tested with guards
suppressed (this often happens within the compiler itself). This has been fixed
in 2.8 and usually will just silently "do the right thing" and add the correct
guard. However, if the new guard causes a tensor marked with mark_dynamic to become
specialized, this can result in an error. One workaround is to use
maybe_mark_dynamic instead of mark_dynamic.
See the discussion in issue #157921 for more
context.
Version 2.7.0
import torch
embed = torch.randn(2, 8192)
x = torch.zeros(8192)
torch._dynamo.mark_dynamic(x, 0)
@torch.compile
def f(embedding_indices, x):
    added_tokens_mask = torch.where(x > 10000, 1, 0)
    ei = torch.narrow(embedding_indices, 1, 0, x.size(0))
    return ei.clone()
f(embed, x)
Version 2.8.0
import torch
embed = torch.randn(2, 8192)
x = torch.zeros(8192)
torch._dynamo.maybe_mark_dynamic(x, 0)
@torch.compile
def f(embedding_indices, x):
    added_tokens_mask = torch.where(x > 10000, 1, 0)
    ei = torch.narrow(embedding_indices, 1, 0, x.size(0))
    return ei.clone()
f(embed, x)
Several config variables related to torch.compile have been renamed or removed:
- Dynamo config variable enable_cpp_framelocals_guard_eval no longer has any effect (#151008).
- Inductor config variable rocm.n_max_profiling_configs is deprecated (#152341).
  Instead, use the ck-tile based configs rocm.ck_max_profiling_configs and
  rocm.ck_tile_max_profiling_configs.
- Inductor config variable autotune_fallback_to_aten is deprecated (#154331).
  Inductor will no longer silently fall back to ATen. Please add "ATEN" to
  max_autotune_gemm_backends for the old behavior (see the sketch after this list).
- Inductor config variables use_mixed_mm and mixed_mm_choice are deprecated (#152071).
  Inductor now supports prologue fusion, so these special cases are no longer needed.
- Inductor config setting descriptive_names = False is deprecated (#151481). Please use one of the other available
  options: "torch", "original_aten", or "inductor_node".
- custom_op_default_layout_constraint has moved from the Inductor config to the functorch config (#148104). Please reference it via
  torch._functorch.config.custom_op_default_layout_constraint instead of
  torch._inductor.config.custom_op_default_layout_constraint.
- AOTI config variable emit_current_arch_binary is deprecated (#155768).
- AOTI config variable aot_inductor.embed_cubin has been renamed to aot_inductor.embed_kernel_binary (#154412).
- AOTI config variable aot_inductor.compile_wrapper_with_O0 has been renamed to compile_wrapper_opt_level (#148714).
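As a rough sketch of the autotune and layout-constraint migrations (the values shown are illustrative assumptions, not recommendations):

import torch._inductor.config as inductor_config
import torch._functorch.config as functorch_config

# 2.7.x: inductor_config.autotune_fallback_to_aten = True
# 2.8.0: request the ATen fallback explicitly via the backend list.
inductor_config.max_autotune_gemm_backends = "ATEN,TRITON"

# 2.7.x: inductor_config.custom_op_default_layout_constraint = "needs_fixed_stride_order"
# 2.8.0: the setting now lives in the functorch config.
functorch_config.custom_op_default_layout_constraint = "needs_fixed_stride_order"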
Added a stricter aliasing/mutation check for HigherOrderOperators (e.g. cond), which will explicitly error out if alias/mutation among inputs and outputs is unsupported (#148953, #146658).
For affected HigherOrderOperators, add .clone() to aliased outputs to address this.
Version 2.7.0
import torch

@torch.compile(backend="eager")
def fn(x):
    return torch.cond(x.sum() > 0, lambda x: x, lambda x: x + 1, [x])

fn(torch.ones(3))
Version 2.8.0
import torch

@torch.compile(backend="eager")
def fn(x):
    return torch.cond(x.sum() > 0, lambda x: x.clone(), lambda x: x + 1, [x])

fn(torch.ones(3))
guard_or_x and definitely_x have been consolidated (#152463)
We removed definitely_true / definitely_false and associated APIs, replacing them with
guard_or_true / guard_or_false, which offer similar functionality and can be used to
achieve the same effect. Please migrate to the latter.
Version 2.7.0
from torch.fx.experimental.symbolic_shapes import definitely_false, definitely_true
...
if definitely_true(x):
 ...
if definitely_false(y):
 ...
Version 2.8.0
from torch.fx.experimental.symbolic_shapes import guard_or_false, guard_or_true
...
if guard_or_false(x):
 ...
# alternatively: if guard_or_false(torch.sym_not(y))
if not guard_or_true(y):
 ...
torch.export
torch.export.export_for_inference has been removed in favor of torch.export.export_for_training().run_decompositions() (#149078)
Version 2.7.0
import torch
...
exported_program = torch.export.export_for_inference(mod, args, kwargs)
Version 2.8.0
import torch
...
exported_program = torch.export.export_for_training(
 mod, args, kwargs
).run_decompositions(decomp_table=decomp_table)
Switched default to strict=False in torch.export.export and export_for_training (#148790, #150941)
This differs from the previous release default of strict=True. To revert to the old default
behavior, please explicitly pass strict=True.
Version 2.7.0
import torch
# default behavior is strict=True
torch.export.export(...)
torch.export.export_for_training(...)
Version 2.8.0
import torch
# strict=True must be explicitly passed to get the old behavior
torch.export.export(..., strict=True)
torch.export.export_for_training(..., strict=True)
ONNX
Default opset in torch.onnx.export is now 18 (#156023)
When dynamo=False, the default ONNX opset version has been updated from 17 to 18. Users can set opset_version to explicitly select an opset version.
Version 2.7
# opset_version=17
torch.onnx.export(...)
Version 2.8
# To preserve the original behavior
torch.onnx.export(..., opset_version=17)
# New: opset_version=18
torch.onnx.export(...)
The JitTraceConvertStrategy has been removed (#152556)
Support for JIT traced and scripted modules in the ONNX exporter when dynamo=True has been removed. You are encouraged to export an nn.Module directly, or create an ExportedProgram using torch.export before exporting to ONNX.
onnxscript>=0.3.1 is required for the dynamo=True option (#157017)
You must upgrade onnxscript to version 0.3.1 or higher for it to be compatible with PyTorch 2.8.
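As a rough illustration of the recommended path (the model and inputs here are made-up placeholders, and onnxscript must be installed as noted above), export the nn.Module directly instead of a JIT-traced or scripted module:

import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

model = TinyModel().eval()
args = (torch.randn(2, 3),)

# Export through the torch.export-based path; an ExportedProgram produced by
# torch.export.export can be passed in place of the nn.Module.
onnx_program = torch.onnx.export(model, args, dynamo=True)
onnx_program.save("tiny_model.onnx")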
Build Frontend
Removed the torch/types.h include from Dispatcher.h (#149557)
This can cause build errors in C++ code that implicitly relies on this include (e.g. very old versions of torchvision).
Note that Dispatcher.h does not belong as an include from torch/types.h and was only present as a
short-term hack to appease torchvision. If you run into torchvision build errors, please
update to a more recent version of torchvision to resolve this.
Upgraded DLPack to 1.0 (#145000)
As part of the upgrade, some of the DLDeviceType enum values have been renamed. Please switch
to the new names.
Version 2.7.0
from torch.utils.dlpack import DLDeviceType
d1 = DLDeviceType.kDLGPU
d2 = DLDeviceType.kDLCPUPinned
...
Version 2.8.0
from torch.utils.dlpack import DLDeviceType
d1 = DLDeviceType.kDLCUDA # formerly kDLGPU
d2 = DLDeviceType.kDLCUDAHost # formerly kDLCPUPinned
...
NVTX3 code has been moved from cmake/public/cuda.cmake to cmake/Dependencies.cmake (#151583)
This is a BC-breaking change for the build system interface. Downstream projects that previously got NVTX3 through cmake/public/cuda.cmake
(i.e. by calling find_package(TORCH REQUIRED)) will now need to explicitly configure NVTX3 support in the library itself (i.e. use USE_SYSTEM_NVTX=1).
The change fixes the broken behavior where downstream projects couldn't find NVTX3 anyway due to the PROJECT_SOURCE_DIR mismatch.
Version 2.7.0:
- A downstream project using -DUSE_SYSTEM_NVTX would be able to find NVTX3 and torch::nvtx3 via PyTorch's cmake/public/cuda.cmake logic.
- A downstream project NOT using -DUSE_SYSTEM_NVTX would encounter build errors with CUDA 12.8 or above.
Version 2.8.0:
- A downstream project using -DUSE_SYSTEM_NVTX will not be able to find NVTX3 or torch::nvtx3 via PyTorch's cmake/public/cuda.cmake. The downstream project now needs to explicitly find NVTX3 and torch::nvtx3 by implementing the same logic as PyTorch's cmake/Dependencies.cmake.
- A downstream project NOT using -DUSE_SYSTEM_NVTX will proceed to build without NVTX unless another part of the build process re-enables it.
Deprecations
MPS support for MacOS Ventura will be removed in 2.9
PyTorch 2.8 is the last release that will support GPU acceleration on MacOS Ventura. In the next
release (2.9), MacOS Sonoma (released in Sept. 2023) or above will be required to use the MPS
backend.
torch.ao.quantization is deprecated and will be removed in 2.10 (#153892)
To migrate:
- Eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic):
  - For weight-only and dynamic quantization, use torchao eager mode quantize_.
  - For static quantization, use torchao PT2E quantization.
- FX graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx, torch.ao.quantization.quantize_fx.convert_fx): use torchao PT2E quantization (torchao.quantization.quantize_pt2e.prepare_pt2e, torchao.quantization.quantize_pt2e.convert_pt2e).
Note that PT2E quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e). See pytorch/ao#2259 and https://docs.pytorch.org/ao/main/quick_start.html#pytorch-2-export-quantization for more details. A rough sketch of the eager-mode migration follows below.
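A minimal sketch of the eager-mode replacement, assuming the separate torchao package is installed (the exact helper names can differ between torchao releases):

import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval()

# Replaces the torch.ao.quantization weight-only/dynamic flows:
# quantize_ rewrites eligible modules in place with int8 weight-only quantization.
quantize_(model, int8_weight_only())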
The dynamo=False (current default) option for torch.onnx.export is deprecated (#152478, #155580)
The default will be dynamo=True starting from PyTorch 2.9. You are encouraged to migrate to use the dynamo=True option in torch.onnx.export. This flag makes torch.export.export the default export path, replacing TorchScript.
To maintain the old behavior, set dynamo=False explicitly. You are encouraged to also experiment with the fallback=True option that will make the exporter fall back to the dynamo=False path if there are errors.
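As a small sketch of both options (the model and file names are placeholders):

import torch

model = torch.nn.Linear(4, 4)
args = (torch.randn(1, 4),)

# Opt in to the torch.export-based exporter now, optionally falling back to the
# legacy path if the new exporter fails.
torch.onnx.export(model, args, "model.onnx", dynamo=True, fallback=True)

# Or pin the legacy TorchScript-based behavior explicitly.
torch.onnx.export(model, args, "model_legacy.onnx", dynamo=False)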
New Features
CUDA
- Support capture of event record and wait in CUDAGraphs for timing (#155372)
torch.compile
Dynamo
- Added support for hierarchical compilation via nested_compile_region (#156449)
- Allow guards to be dropped with custom filter functions via guard_filter_fn (#150936)
- Added dont_skip_tracing decorator to skip over most Dynamo skipfiles rules (#150586)
Inductor
- Added support for mapping a Dynamo graph to multiple different Inductor graphs, which can be optimized separately (#147648, #147038)
torch.export
- Introduced draft-export, an export variant designed to consistently produce a graph and generate a debugging report of issues encountered during tracing (#152637, #153219, #149465, #153627, #154190, #155744, #150876, #150948, #151051, #151065, #150809, #151797)
Ahead-Of-Time Inductor (AOTI)
- Added support for TorchBind objects (#150196, #154265)
- Added config variable aot_inductor.model_name_for_generated_files for specifying model name (#154129)
MPS
- MPSInductor: torch.compile for Apple GPUs (#150121, #149342, #151449, #151754, #149687, #149180, #149221, #153598, #152788, #153787, #152214, #151152, #155891, #154578, #151272, #151288, #153997, #151871, #153362, #156566, #150661, #153582)
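A quick illustrative check of the new MPS compile path (guarded so it only runs on Apple-silicon machines; the compiled function is an arbitrary example):

import torch

if torch.backends.mps.is_available():
    fn = torch.compile(lambda x: torch.sin(x) + x)
    out = fn(torch.randn(128, device="mps"))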
ONNX
- Added new strategy draft_export (#147529, docs) to provide debugging information upon data-dependent / constraint errors when obtaining an ExportedProgram with torch.onnx.export
- Added support for symbolic operators in the dynamo=True export path (#148905, #149678, #150038, docs). Two operators, torch.onnx.ops.symbolic and torch.onnx.ops.symbolic_multi_out, are defined to allow you to create symbolic ONNX operators directly in your PyTorch models. You can use them in a forward method:
def forward(self, x: torch.Tensor) -> torch.Tensor:
    # Optionally use is_in_onnx_export to control the behavior during ONNX export
    if torch.onnx.is_in_onnx_export():
        # Create a symbolic ONNX operator with the name "CustomOp" in the "custom_domain" domain.
        # The output tensor will have the specified dtype and shape.
        return torch.onnx.ops.symbolic(
            "custom_domain::CustomOp",
            (x,),
            dict(attr_key="attr_value"),
            dtype=x.dtype,
            shape=x.shape,
            version=1,
        )
    else:
        return x
Python Frontend
- Added Generalized Pareto Distribution (GPD) (#135968)
Quantization
- Introduced torch.float4_e2m1fn_x2 dtype (#148791)
XPU
- Support Intel distributed backend (XCCL) (#141856)
- Support SYCL kernels through C++ extension (#132945)
Improvements
Build Frontend
Composability
C++ Frontend
- Exposed bicubic mode for torch::nn::functional::grid_sample (#150817)
CUDA
- Introduced no_implicit_headers mode for load_inline() on custom CUDA extensions (#149480)
- Support large batch sizes in SDPA memory-efficient attention backend (#154029, #154663)
- Fixed invalid indexing in SDPA memory-efficient attention backward (#155397)
- Support SDPA attention backends on sm121 (DGX Spark) (#152314)
- Added FP8 row-wise scaled-mm for sm12x (GeForce Blackwell) (#155991)
cuDNN
- Updated cuDNN frontend version to 1.12 (#153888)
Distributed
c10d
- Enhanced TCPStore with clone and queuing features (#150966, #151045, #150969, #151485)
- Added a collective time estimator for NCCL comms (#149343)
- Made getDefaultBackend more fault tolerant without relying on exceptions (#149152)
- Specified the default PyTorch Distributed backend for MPS (#149538)
- Supported masterListenFd in TCPStoreLibUvBackend (#150215)
- Used shared stores in gloo (#150230)
- Improved FR dump robustness with all watchdog broadcast wait, reduced dump timeout, and shrunk mutex range (#150652, #151329, #155949)
- Added the record of each individual collective being coalesced in FR (#151238)
- Implemented safer book-keeping of NCCL communicators (#150681)
- Clarified behavior of TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK (#150682)
- Also registered future allocations in mempool with NCCL (#150684)
- Avoided computing global_rank when group_rank is used (#151373)
- Exposed NCCL communicator from ProcessGroupNCCL via an unsafe API (#152496)
- Added split sizes info dump for uneven all2all bw calculation (#151438)
- Made FR vendor neutral so that other backends can use it, and integrated it into gloo (#152585, #152563, #154929, #152614)
- Added needs_contiguous_strides tag in functional collective (#153399, #153523)
- Allowed split_group to work with non-NCCL backends (#152175)
- Simplified new_subgroups() by using new_subgroups_by_enumeration() (#153843)
- Made only current thread allocate to pool in ProcessGroupNCCL (#153990)
- Enabled using c10::Half for gloo (#153862)
- Released GIL in PG destructor (#154976)
- Enhanced get_process_group_ranks() to accept group=None (#154902)
- Skipped updating the default device distributed backend if already registered (#155320)
- Enabled querying the build and runtime NCCL versions (#156305)
- Disabled NCCL NVLS when using deterministic mode (#156381)
- Made init_process_group support index-only device id (#156214)
- Support enabling / disabling NaN detector per ProcessGroup (#151723)
- Added support for reduce_scatter and ReduceOp::AVG in ProcessGroupGloo (#149781, #149869)
- Added FP8 support in ProcessGroupNCCL (#152706)
- Added ibverbs backend in gloo and enabled gloo CUDA when used with a backend that supports GPUDirect (#153015, #153425, #153406)
DeviceMesh
- Improved device selection logic (#150897)
DistributedDataParallel (DDP)
- Added an option to allow skipping all-reduce for unused parameters (#151503)
- Added check on received data to avoid segfault in the DDP reducer (#152143)
- Propagated use_python_reducer to C++ reducer (#152735)
DistributedStateDict (DSD)
- Supported non-tensor-data write_size in planner write items (#149699)
- Skipped popping meta device tensors (#153185)
DTensor
- Made StridedShard support uneven sharding (#150490)
- Added op support for torch.cumsum (#151071)
- Added DTensor redistribute fwd/bwd datatype conversion to enable SimpleFSDP mixed precision training (#150740)
- Added rich support to torch.distributed.tensor.debug.visualize_sharding (#152027)
FullyShardedDataParallel2 (FSDP2)
- Added PrivateUse1 backend in FSDP collectives and device type to pre-forward hook (#147260, #149487)
- Added set_reshard_after_forward (#149103)
- Allowed different dtypes for no grad model params (#154103)
- Respected reshard_after_forward=True for root model and kept root unsharded when not specifying reshard_after_forward (#154704, #155319)
- Allowed forcing FSDP2 to always use SUM reductions (#155915)
- Made assert on all_reduce_event only if it's not a CPU device (#150316)
- Enabled NCCL zero-copy (user buffer registration) for FSDP2 (#150564)
Pipeline Parallelism
- Added schedule visualizer (#150347)
- Allowed unused kwargs in ZB path (#153498)
- Added get_pipeline_order() for Gpipe and 1F1B (#155935)
ShardedTensor
- Added support for 0-size ShardedTensor and recalculated metadata from all_gather (#152583)
TensorParallel
- Added a ParallelStyle PrepareModuleInputOutput (#150372)
torchelastic
- No shutdown of rendezvous on leaving workers (#152525)
torch.compile
Dynamo
- Improved tracing support for Python sets, tensor subclasses with __torch_function__, and namedtuple subclasses (#153150, #149792, #153982)
- Eliminated all Compiled Autograd dynamic shapes recompiles for compile time reduction (#151962, #152119, #149707, #149709, #148799, #148801)
- Added reason field to torch.compiler.disable (#150341)
- Removed lru_cache warnings for functions in the top-level torch namespace (#157718)
Inductor
- Added block sparse support for FlexAttention on CPU (#147196)
- Introduced new config settings (see the sketch after this list):
  - aot_inductor.custom_ops_to_c_shims and aot_inductor.custom_op_libs: allow for specifying custom op C shims (#153968)
  - max_fusion_buffer_group_pairwise_attempts: limits fusions to specified node distance (#154688)
  - cuda.cutlass_enabled_ops: controls CUTLASS operation selection (#155770)
  - triton.cudagraph_capture_sizes: allows specifying certain shapes for which to capture CUDAGraphs; skips CUDAGraphs for other shapes (#156551)
  - use_static_cuda_launcher: enables launching compiled triton statically to improve cold start times (#148890)
  - assume_unaligned_fallback_output: allows inductor to track unaligned outputs (#150777)
  - cuda.cutlass_tma_only: controls whether or not to only use TMA-compatible kernels in CUTLASS (#152815)
  - static_launch_user_defined_triton_kernels: enables statically launching user defined triton kernels (#153725)
  - precompilation_timeout_seconds: controls the timeout on precompilation (#153788)
  - disable_decompose_k: disables the new DecomposeK GEMM kernels (#154421)
  - min_num_split: sets the minimum number of splits in a split reduction (#155941)
  - max_autotune_flex_search_space: allows specifying the size of the search space for flex attention autotuning (#156307)
- Introduced environment variable LOG_AUTOTUNE_RESULTS for autotune log (#156254)
- Improved numerical stability of CPU Welford reduction for normalizations (#145061)
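A rough sketch of toggling a few of the new Inductor settings (the values here are illustrative assumptions, not recommended defaults):

import torch._inductor.config as inductor_config

# Launch compiled Triton kernels through the static launcher to cut cold-start time.
inductor_config.use_static_cuda_launcher = True

# Bound how long max-autotune precompilation may run, in seconds.
inductor_config.precompilation_timeout_seconds = 1800

# Require at least this many splits before a split reduction is used.
inductor_config.min_num_split = 256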
torch.export
- Improved handling of builtin ops (min, max, math.pow) (#151348)
- Added min/max ranges for dim hints (#149590)
- Allow registering normal classes to pytree.register_dataclass (#147752)
- Allow specifying integer inputs as dynamic (#151842)
- Inline jit.scripted functions in export (#155180)
- Pretty printing for graph signature (#149710)
Ahead-Of-Time Inductor (AOTI)
- Support for device-side TMA (#157241)
- Added num_runners to AOTIModelPackageLoader (#149364)
FX
- Updated codegen compare op to == (#150611)
- Map names to operand indices when const folding submodules (#150692)
- Improved stacktrace when tracing (#151029, #155486)
- Support edge dialect ops in normalize_function (#143689)
- Fixed path naming in minifier (#153130)
- Added graph_code_verbose_log artifact for FX passes (#153775)
- Improved cache key graph printing performance (#151928)
- Added flag to fx.passes.split_module to normalize input names (#157793)
Linear Algebra Frontend
- Add tensor overlap check for cross (#154999)
MPS
- Added support for a number of torch.special operations as well as index_copy, hardshrink, rsub, col2im, and isin (#149174, #149203, #149123, #149368, #149378, #149563, #149687, #149705, #149783, #149407/#149680, #150279, #151754, #153786, #154326, #155304, #156263, #155382, #154010, #149816, #152282, #156090, #150060, #151600, #155002, #154671)
- Extended dtype support for:
  - index_put with half precision floats (#151869)
  - ConvTranspose3D with FP32 and complex (#154696)
  - log1p and sigmoid with int64 (#151791)
 
- Compute activation kernels at float precision (#155735)
Nested Tensor (NJT)
- Fixed contiguity in NJT string representation (#153529)
torch.nn
- Added warning for module full backward hook when no input requires gradient (#155339)
- Added Half support for weight_norm on CPU (#148878)
ONNX
- Updated ONNX to 1.18 (#152200)
- Added support for opsets 18-23 when dynamo=True (#149901, #154596)
- Added float4 support (#151069, #156353)
- Added support for ONNX operators Attention-23 and RotaryEmbedding-23 as native PyTorch ops (#156431, #156367, #154745)
- Added support for torch.scan (#154513)
- Added support for 0/1-sized example inputs on dynamic dimensions (#155717)
- Added group_norm support from opset 21 (#152138)
- Added asdict method to VerificationInfo class (#151024)
- Support running bfloat16 models with ONNX Runtime (#149646)
- Updated ONNX program doc formatting and improved robustness (#151623)
- Updated dynamic_shapes behavior to use torch.export.dim.DYNAMIC (#153065)
- Set the name of the producing node using the value name (#155413)
- Improved support for symbolic operators sym_float, sym_not, sym_min, sym_max (#153200, #152111, #152196)
Optimizer
- Added TensorLR variant for fused Adagrad on CPU (#153078)
- Converted tensor lr to 0-dim as needed for the optimizer to work normally (#145674)
- Added lr_lambda type check in MultiplicativeLR (#151973)
Profiler
- Added support for on-demand memory snapshot (#150559)
- Added PT2 compile context to visualizer (#152862)
- Added PT2 to memory snapshot (#152707)
- Added flag to toggle global and local callbacks for annotations (#154932)
- Pass overload names to Kineto (#149333)
- Set duration to -1 for unfinished CPU events (#150131)
- Start at index with most events (#154571)
Python Frontend
- Introduced torch.AcceleratorError (#152023)
- Implemented Size.__radd__() (#152554)
- Updated get_default_device() to also respect the torch.device context manager (#148621)
Quantization
- Improved x86 PT2E quantization support with new uint8 ops (pointwise mul/add/add_relu and batch_norm2d), qconv1d-relu fusion, and lowering pass (#151112, #152411, #152811, #150751, #149708)
- Support boolean tensor for torch.fused_moving_avg_obs_fake_quant on CUDA (#153699)
Release Engineering
ROCm
- Allow user to override default flags for cpp_extension (#152432)
- Enabled support for sparse compressed mm/bmm/addmm (#153262)
Sparse Frontend
- Enabled sparse compressed tensor invariant checks for PrivateUse1 extension (#149374)
torch.func
- Add batching rules for ops: torch.Tensor.scatter_add_ (#150543), torch.matrix_exp (#155202)
XPU
- Support safe softmax, GQA, fp32 causal mask for SDP and increase maximum head dim from 256 to 576 on Intel GPU (#151999, #150992, #152091)
- Add memory reporting to Memory Profiler for Intel GPU (#152842)
- Support Intel GPU profiler toggle functionality (#155135)
- Support distributed memory tracker integration for Intel GPU (#150703)
- Improved error handling and reporting in Intel GPU CMake files (#149353)
- Support embed_cubin and multi_arch_kernel_binary options in AOTI for Intel GPU (#154514, #153924)
- Added generic and Intel GPU specific Stream and Event in UserDefineClass (#155787)
- Support int4 WOQ GEMM on Intel GPU (#137566)
Bug Fixes
Build Frontend
- Support builds with CMake-4.x (#150203)
- Fixed fbgemm build with gcc-12+ (#150847)
- Force build to conform to the C++ standard on Windows by adding the /permissive- flag (#149035)
Composability
- Fixed support for 1-element tuple returns from custom ops (#155447)
- Avoid overflow in torch.norm for scalar input (#144073)
CPU (x86)
- Fixed apparent copy-paste bug in log_softmax reduced-precision fp kernel (#156379)
CUDA
- Fixed deterministic indexing with broadcast (#154296)
- Fixed torch.backends.cuda.matmul.allow_fp16_accumulation crash when using cuBLASLt (#153083)
- Enabled AsyncMM on Blackwell (#153519)
- Fixed torch.cuda.MemPool for multithreaded use-cases (#153356)
- Fix to avoid calling sum() on a default-constructed gamma / beta in layer_norm (#156600)
- Avoid hangs by erroring out for negative offsets or K=0 in grouped GEMMs (#153226)
- Don't error out in empty_cache under mempool context (#158180)
Distributed
c10d
- Fixed extra CUDA context created by barrier (#149144)
- Fixed the logic to use group rank instead of global rank when possible (#149488)
- Fixed ET trace collection of all_to_all(#149485)
- Disabled start event recording for coalesced col and improved profile title (#150863)
- Fixed connection reset in tcp store (#150987, #151052)
- Fixed unused group input argument in new_subgroups() (#152765, #153798)
- Fixed tcp init when using port 0 (#154156)
- Adopted a vector to temporarily keep the reference to future object to avoid blocking inside Flight Recorder (#156653)
Distributed Checkpointing (DCP)
- Fixed to use global coordinator rank in broadcast_object util function (#155912)
DistributedDataParallel (DDP)
- Fixed DDPOptimizer issue on static tensor index (#155746)
DTensor
- Fixed local_map with multi-threading (#149070)
- Fixed new_local_tensor in redistribute being None case (#152303)
 
Configuration
📅 Schedule: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.