FP8 data type - all values in a table

submitted by
Style Pass
2024-09-25 02:30:06

Floating-point numbers are a great invention. By dedicating separate bits to the sign, exponent, and mantissa (also called the significand), they can represent a wide range of numbers in a limited number of bits: positive or negative, very large or very small (close to zero), integer or fractional.
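To make the three bit fields concrete, here is a minimal Python sketch that splits a single-precision value into its sign, exponent, and mantissa. It assumes the standard IEEE 754 binary32 layout (1 sign bit, 8 exponent bits biased by 127, 23 stored mantissa bits); the helper name decompose_float32 is made up for this illustration.

```python
import struct

def decompose_float32(x):
    """Round x to IEEE 754 single precision and return (sign, exponent, mantissa)."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits, the fraction after the implicit leading 1
    return sign, exponent, mantissa

# -6.5 = -1.625 * 2**2, so the biased exponent is 127 + 2 = 129
# and the fraction 0.625 is stored as 0.625 * 2**23 = 5242880.
print(decompose_float32(-6.5))  # (1, 129, 5242880)
```

The same approach works for other widths by changing the format characters and the shift and mask constants.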

In programming, we typically use double-precision (64b) or single-precision (32b) numbers. These are the data types available in programming languages (like double and float in C/C++) and supported by processors, which can perform calculations on them efficiently. Those of you who deal with graphics programming using graphics APIs like OpenGL, DirectX, or Vulkan may know that some GPUs also support a 16-bit floating-point type, also known as half-float. For example, HLSL (the shader language used with DirectX) has defined the type min16float, with 16b of minimum precision, since Windows 8, and an explicit 16b type, float16_t, was added in Shader Model 6.2. See also "Scalar data types" in the HLSL reference documentation.

Such a 16b "half" type obviously has limited precision and range compared to the "single" or "double" version. I summarized the capabilities and limits of these 3 types in a table in my old "Floating-Point Formats Cheatsheet". Because of these limitations, using a half-float instead of the standard float may not work correctly in all cases. For example, it may be enough to represent the RGB components of a color (even when using HDR), or a normal vector that points in some direction, but it won't be sufficient to accurately represent a position in 3D space. It is also easy to exceed its maximum range, e.g. when calculating a dot product of two vectors. I planned to write an entire article about the advantages and pitfalls of using half-float numbers in shaders, but I never did.
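Both pitfalls can be reproduced on the CPU with a small Python sketch. It relies on the "e" (IEEE 754 binary16) format of the standard struct module to round values to half precision; the helpers to_half and half_dot are hypothetical names for this illustration, and rounding after every operation is only an approximation of what a GPU does with 16-bit arithmetic.

```python
import math
import struct

def to_half(x):
    """Round x to the nearest half-float value; overflow becomes +/-infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values beyond the half range,
        # where real half-float hardware would produce infinity.
        return math.copysign(math.inf, x)

def half_dot(a, b):
    """Dot product with every intermediate result rounded to half precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_half(acc + to_half(to_half(x) * to_half(y)))
    return acc

# Precision: between 512 and 1024 the spacing of half values is 0.5,
# so a coordinate like 1000.3 cannot be stored exactly.
print(to_half(1000.3))   # 1000.5

# Range: the largest finite half value is 65504. Each product below is
# 40000 and fits, but the running sum does not.
v = (200.0, 200.0, 200.0)
print(half_dot(v, v))    # inf
```

The vector components here are all small and exactly representable, which shows that the overflow comes from the accumulation, not from the inputs.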
