TensorFloat-32
From Wikipedia, the free encyclopedia
Floating-point number format in Nvidia hardware
TensorFloat-32 (TF32) is a numeric floating-point format designed for the Tensor Cores in certain Nvidia GPUs.
Format
The binary format is:
- 1 sign bit
- 8 exponent bits
- 10 significand bits (also called mantissa, or precision bits)
The total 19-bit format fits within a double word (32 bits), and while it lacks precision compared with a normal 32-bit IEEE 754 floating-point number, it provides much faster computation: up to 8 times faster on an A100 than FP32 on a V100.[1]
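The format can be emulated in software by rounding a standard float32 value so that only 10 significand bits remain while the sign and 8-bit exponent are kept unchanged. The following is a minimal sketch rather than Nvidia's implementation: it assumes round-to-nearest-even as the rounding rule, the helper name round_to_tf32 is illustrative only, and special values such as NaN or infinity are not handled separately.

```python
import struct

def round_to_tf32(x: float) -> float:
    # Reinterpret the value as a 32-bit IEEE 754 bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - 10                    # significand bits TF32 discards
    half = 1 << (drop - 1)
    lsb = (bits >> drop) & 1          # lowest significand bit that is kept
    # Round to nearest (ties to even), then clear the dropped bits.
    bits = (bits + half - 1 + lsb) & ~((1 << drop) - 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

# Example: pi keeps only 10 significand bits, so the result
# differs from the float32 value in the third decimal place.
print(round_to_tf32(3.14159265))  # 3.140625
```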
References
- ^ https://deeprec.readthedocs.io/en/latest/NVIDIA-TF32.html. Accessed 23 May 2024.