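Minifloat

Minifloats are floating-point values represented with very few bits. They are not well suited to general-purpose numerical calculation. They are typically used for special purposes such as computer graphics, where iterations are small and precision has aesthetic effects.[1] Machine learning also uses similar formats, such as bfloat16.

Minifloats are designed following the principles of the IEEE 754 standard. They must obey the (not explicitly written) rules for the boundary between subnormal and normal numbers, and they have special patterns for infinity and NaN. Normalized numbers are stored with a biased exponent. The 2008 revision of the standard, IEEE 754-2008, includes a 16-bit binary minifloat.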

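Notation

A minifloat is usually described using a tuple of four numbers, (S, E, M, B):

  • S is the length of the sign field. It is usually either 0 or 1.
  • E is the length of the exponent field.
  • M is the length of the mantissa (significand) field.
  • B is the exponent bias.

A minifloat format denoted by (S, E, M, B) is therefore S + E + M bits long. The (S, E, M, B) notation can be converted to a (B, P, L, U) format as (2, M + 1, B + 1, 2^S − B) (with IEEE use of exponents).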
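As a minimal sketch of this notation, the following Python function splits a raw bit pattern into its three fields. The name `split_fields` and the field order (sign, then exponent, then significand, counted from the most significant bit down) are assumptions made for illustration, not part of any standard API.

```python
def split_fields(bits: int, s: int, e: int, m: int) -> tuple[int, int, int]:
    """Split a raw bit pattern into (sign, exponent, significand) fields.

    Assumed layout: sign | exponent | significand, most significant bit first.
    """
    width = s + e + m                           # total length of the format in bits
    if not 0 <= bits < (1 << width):
        raise ValueError(f"pattern does not fit in {width} bits")
    significand = bits & ((1 << m) - 1)         # lowest M bits
    exponent = (bits >> m) & ((1 << e) - 1)     # next E bits
    sign = (bits >> (m + e)) & ((1 << s) - 1)   # top S bits (always 0 if S == 0)
    return sign, exponent, significand


# A format with S = 1, E = 4, M = 3 is 1 + 4 + 3 = 8 bits long.
print(split_fields(0b1_0111_001, 1, 4, 3))  # (1, 7, 1)
```

The bias B plays no role in splitting the fields; it only matters once the exponent field is interpreted as a power of two.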

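Example

8-bit minifloat example (1.4.3)
Sign | Exponent | Significand
0    | 0 0 0 0  | 0 0 0

This example uses a minifloat with 1 sign bit, 4 exponent bits and 3 significand bits.[2][3] For most values, the multiplier for an exponent field x is 2^(x−7). All IEEE 754 principles should be valid.[4]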


Representation of zero

0 0000 000 = 0
1 0000 000 = −0

Subnormal numbers

The significand is extended with "0.":

0 0000 001 = 0.001₂ × 2^(1−7) = 0.125 × 2^(−6) = 0.001953125 (least subnormal number)
...
0 0000 111 = 0.111₂ × 2^(1−7) = 0.875 × 2^(−6) = 0.013671875 (greatest subnormal number)

Normalized numbers

The significand is extended with "1.":

0 0001 000 = 1.000₂ × 2^(1−7) = 1 × 2^(−6) = 0.015625 (least normalized number)
0 0001 001 = 1.001₂ × 2^(1−7) = 1.125 × 2^(−6) = 0.017578125
...
0 0111 000 = 1.000₂ × 2^(7−7) = 1 × 2^0 = 1
0 0111 001 = 1.001₂ × 2^(7−7) = 1.125 × 2^0 = 1.125 (least value above 1)
...
0 1110 000 = 1.000₂ × 2^(14−7) = 1.000 × 2^7 = 128
0 1110 001 = 1.001₂ × 2^(14−7) = 1.125 × 2^7 = 144
...
0 1110 110 = 1.110₂ × 2^(14−7) = 1.750 × 2^7 = 224
0 1110 111 = 1.111₂ × 2^(14−7) = 1.875 × 2^7 = 240 (greatest normalized number)

Infinity

0 1111 000 = +∞
1 1111 000 = −∞

Not a number

s 1111 mmm = NaN (if mmm ≠ 000)
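
The cases above (zero, subnormal, normalized, infinity, NaN) can be combined into one decoder. The following is a sketch for the 1.4.3 format with bias 7 only; the function name `decode_143` is hypothetical, and the result is returned as an ordinary Python float, which can hold every value of this format exactly.

```python
import math

BIAS = 7  # exponent bias of the 1.4.3 example format


def decode_143(byte: int) -> float:
    """Decode one 8-bit pattern of the 1.4.3 example format to a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    fraction = (byte & 0x7) / 8            # the three stored bits read as 0.mmm

    if exponent == 0xF:                    # all-ones exponent: infinity or NaN
        return sign * math.inf if fraction == 0 else math.nan
    if exponent == 0:                      # zero and subnormals: 0.mmm x 2^(1-7)
        return sign * fraction * 2.0 ** (1 - BIAS)
    return sign * (1 + fraction) * 2.0 ** (exponent - BIAS)   # normals: 1.mmm


print(decode_143(0b0_0000_001))  # 0.001953125
print(decode_143(0b0_0111_001))  # 1.125
print(decode_143(0b0_1110_111))  # 240.0
print(decode_143(0b1_1111_000))  # -inf
```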

Table of values

This is a chart of all possible values for this example 8-bit float.

… 000 … 001 … 010 … 011 … 100 … 101 … 110 … 111
0 0000 … 0 0.001953125 0.00390625 0.005859375 0.0078125 0.009765625 0.01171875 0.013671875
0 0001 … 0.015625 0.017578125 0.01953125 0.021484375 0.0234375 0.025390625 0.02734375 0.029296875
0 0010 … 0.03125 0.03515625 0.0390625 0.04296875 0.046875 0.05078125 0.0546875 0.05859375
0 0011 … 0.0625 0.0703125 0.078125 0.0859375 0.09375 0.1015625 0.109375 0.1171875
0 0100 … 0.125 0.140625 0.15625 0.171875 0.1875 0.203125 0.21875 0.234375
0 0101 … 0.25 0.28125 0.3125 0.34375 0.375 0.40625 0.4375 0.46875
0 0110 … 0.5 0.5625 0.625 0.6875 0.75 0.8125 0.875 0.9375
0 0111 … 1 1.125 1.25 1.375 1.5 1.625 1.75 1.875
0 1000 … 2 2.25 2.5 2.75 3 3.25 3.5 3.75
0 1001 … 4 4.5 5 5.5 6 6.5 7 7.5
0 1010 … 8 9 10 11 12 13 14 15
0 1011 … 16 18 20 22 24 26 28 30
0 1100 … 32 36 40 44 48 52 56 60
0 1101 … 64 72 80 88 96 104 112 120
0 1110 … 128 144 160 176 192 208 224 240
0 1111 … ∞ NaN NaN NaN NaN NaN NaN NaN
1 0000 … −0 −0.001953125 −0.00390625 −0.005859375 −0.0078125 −0.009765625 −0.01171875 −0.013671875
1 0001 … −0.015625 −0.017578125 −0.01953125 −0.021484375 −0.0234375 −0.025390625 −0.02734375 −0.029296875
1 0010 … −0.03125 −0.03515625 −0.0390625 −0.04296875 −0.046875 −0.05078125 −0.0546875 −0.05859375
1 0011 … −0.0625 −0.0703125 −0.078125 −0.0859375 −0.09375 −0.1015625 −0.109375 −0.1171875
1 0100 … −0.125 −0.140625 −0.15625 −0.171875 −0.1875 −0.203125 −0.21875 −0.234375
1 0101 … −0.25 −0.28125 −0.3125 −0.34375 −0.375 −0.40625 −0.4375 −0.46875
1 0110 … −0.5 −0.5625 −0.625 −0.6875 −0.75 −0.8125 −0.875 −0.9375
1 0111 … −1 −1.125 −1.25 −1.375 −1.5 −1.625 −1.75 −1.875
1 1000 … −2 −2.25 −2.5 −2.75 −3 −3.25 −3.5 −3.75
1 1001 … −4 −4.5 −5 −5.5 −6 −6.5 −7 −7.5
1 1010 … −8 −9 −10 −11 −12 −13 −14 −15
1 1011 … −16 −18 −20 −22 −24 −26 −28 −30
1 1100 … −32 −36 −40 −44 −48 −52 −56 −60
1 1101 … −64 −72 −80 −88 −96 −104 −112 −120
1 1110 … −128 −144 −160 −176 −192 −208 −224 −240
1 1111 … −∞ NaN NaN NaN NaN NaN NaN NaN

There are only 242 different non-NaN values (if +0 and −0 are regarded as different), because 14 of the bit patterns represent NaNs.
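
The count can be checked by classifying all 256 bit patterns by their fields alone. This is a small illustrative check, not taken from the cited sources.

```python
# A pattern is a NaN when the exponent field is all ones and the
# significand field is nonzero; the sign bit is irrelevant.
nan_patterns = [
    b for b in range(256)
    if ((b >> 3) & 0xF) == 0xF and (b & 0x7) != 0
]
non_nan = 256 - len(nan_patterns)

print(len(nan_patterns), non_nan)  # 14 242
```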

Tables like the one above can be generated for any combination of SEMB values using a script in Python or in GDScript.
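
The scripts referred to here are not reproduced in this article. The following is an independent sketch of how such a generator might look; the function names `decode` and `table` are hypothetical, and the layout (one row per sign/exponent combination, one column per significand) is chosen to mirror the chart above.

```python
import math


def decode(bits: int, s: int, e: int, m: int, b: int) -> float:
    """Decode a pattern of an (S, E, M, B) minifloat following IEEE 754 principles."""
    frac = (bits & ((1 << m) - 1)) / (1 << m)      # stored bits read as 0.mmm
    exp = (bits >> m) & ((1 << e) - 1)
    sign = -1.0 if s and (bits >> (m + e)) & 1 else 1.0
    if exp == (1 << e) - 1:                        # all ones: infinity or NaN
        return sign * math.inf if frac == 0 else math.nan
    if exp == 0:                                   # zero or subnormal
        return sign * frac * 2.0 ** (1 - b)
    return sign * (1 + frac) * 2.0 ** (exp - b)    # normalized


def table(s: int, e: int, m: int, b: int) -> list[list[float]]:
    """One row per sign/exponent combination, one column per significand."""
    return [[decode((row << m) | col, s, e, m, b) for col in range(1 << m)]
            for row in range(1 << (s + e))]


# First two rows of the 1.4.3 table with bias 7.
for row in table(1, 4, 3, 7)[:2]:
    print(row)
```

Calling `table` with other parameters gives the corresponding chart for other formats, as long as the exponent field is at least 1 bit wide.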

Other bias values

At these small sizes, other bias values may be interesting. For instance, a bias of −2 makes the numbers 0 to 16 have the same bit representation as the integers 0 to 16, at the cost that no non-integer values can be represented.

0 0000 000 = 0.000₂ × 2^(1−(−2)) = 0.0 × 2^3 = 0 (subnormal number)
0 0000 001 = 0.001₂ × 2^(1−(−2)) = 0.125 × 2^3 = 1 (subnormal number)
0 0000 111 = 0.111₂ × 2^(1−(−2)) = 0.875 × 2^3 = 7 (subnormal number)
0 0001 000 = 1.000₂ × 2^(1−(−2)) = 1.000 × 2^3 = 8 (normalized number)
0 0001 111 = 1.111₂ × 2^(1−(−2)) = 1.875 × 2^3 = 15 (normalized number)
0 0010 000 = 1.000₂ × 2^(2−(−2)) = 1.000 × 2^4 = 16 (normalized number)
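
The integer property can be checked mechanically. The sketch below decodes non-negative, finite 1.4.3 patterns with an adjustable bias; the helper name `decode_finite` is hypothetical, and infinities and NaNs are deliberately left out to keep it short.

```python
def decode_finite(bits: int, bias: int) -> float:
    """Decode a non-negative, finite 1.4.3 pattern with the given exponent bias."""
    exponent = (bits >> 3) & 0xF
    fraction = (bits & 0x7) / 8
    if exponent == 0:                                   # subnormal: 0.mmm
        return fraction * 2.0 ** (1 - bias)
    return (1 + fraction) * 2.0 ** (exponent - bias)    # normalized: 1.mmm


# With bias -2, each pattern 0..16 decodes to the integer it spells in binary.
for bits in range(17):
    assert decode_finite(bits, bias=-2) == bits

print(decode_finite(0b0_0010_001, bias=-2))  # 18.0, the next value after 16
```

From 16 upward the spacing between neighbouring values is 2 or more, so the correspondence between bit pattern and integer value ends there.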

Arithmetic

Addition

 
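[Figure: Addition of (1.3.2.3)-minifloats]

The figure demonstrates the addition of even smaller (1.3.2.3)-minifloats with 6 bits.

This floating-point system follows the rules of IEEE 754 exactly.

NaN as an operand always produces a NaN result.

∞ − ∞ and (−∞) + ∞ result in NaN as well (green area). ∞ can be increased or decreased by finite values without change.

Sums of finite operands can give an infinite result (for example, 14.0 + 3.0 = +∞, shown as the cyan area; −∞ results are shown in red).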
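
These rules can be reproduced with a short simulation. The sketch below adds two values as Python floats and rounds the exact sum back to the (1.3.2.3) format. It assumes round-to-nearest with ties to even, the IEEE 754 default, which the text above does not state explicitly; the names `magnitude`, `round_to_format` and `add` are hypothetical.

```python
import math

E_BITS, M_BITS, BIAS = 3, 2, 3                  # the (1.3.2.3) format
INF_PATTERN = ((1 << E_BITS) - 1) << M_BITS     # magnitude bits of infinity


def magnitude(pattern: int) -> float:
    """Value of a magnitude pattern (sign bit stripped), read as a finite number."""
    exp = pattern >> M_BITS
    frac = (pattern & ((1 << M_BITS) - 1)) / (1 << M_BITS)
    if exp == 0:
        return frac * 2.0 ** (1 - BIAS)
    return (1 + frac) * 2.0 ** (exp - BIAS)


def round_to_format(x: float) -> float:
    """Round x to the nearest (1.3.2.3) value, ties to even, overflow to infinity."""
    if math.isnan(x) or math.isinf(x):
        return x
    sign = math.copysign(1.0, x)
    # Reading the infinity pattern as if it were finite gives 16.0, the value
    # just past the largest finite number 14.0, so one search handles overflow.
    # On a tie, the pattern with an even last bit wins.
    best = min(range(INF_PATTERN + 1),
               key=lambda p: (abs(magnitude(p) - abs(x)), p & 1))
    return sign * (math.inf if best == INF_PATTERN else magnitude(best))


def add(a: float, b: float) -> float:
    """Add two (1.3.2.3) values; the double-precision sum is exact here."""
    return round_to_format(a + b)


print(add(14.0, 3.0))            # inf
print(add(math.inf, -math.inf))  # nan
print(add(4.0, 0.5))             # 4.0 (tie between 4 and 5, rounds to even)
```

Searching all patterns is only practical because the format has so few values; a real implementation would round the significand directly.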


The other arithmetic operations can be illustrated similarly.

Other sizes

The Radeon R300 and R420 GPUs used an "fp24" floating-point format with 7 bits of exponent and 16 bits (+1 implicit) of mantissa.[5] "Full Precision" in Direct3D 9.0 is a proprietary 24-bit floating-point format. Microsoft's D3D9 (Shader Model 2.0) graphics API initially supported both FP24 (as in ATI's R300 chip) and FP32 (as in Nvidia's NV30 chip) as "Full Precision", as well as FP16 as "Partial Precision" for vertex and pixel shader calculations performed by the graphics hardware.

Khronos defines 10-bit and 11-bit float formats for use with Vulkan. Both formats have no sign bit and a 5-bit exponent. The 10-bit format has a 5-bit mantissa, and the 11-bit format has a 6-bit mantissa.[6][7]

4 bits and fewer

The smallest possible float size that follows all IEEE principles, including normalized numbers, subnormal numbers, signed zero, signed infinity, and multiple NaN values, is a 4-bit float with 1-bit sign, 2-bit exponent, and 1-bit mantissa.[8] In the table below, the columns have different values for the sign and mantissa bits, and the rows are different values for the exponent bits.

        0 … 0   0 … 1   1 … 0   1 … 1
… 00 …  0       0.5     −0      −0.5
… 01 …  1       1.5     −1      −1.5
… 10 …  2       3       −2      −3
… 11 …  ∞       NaN     −∞      NaN

If normalized numbers are not required, the size can be reduced to 3 bits by reducing the exponent field to 1 bit.

       0 … 0   0 … 1   1 … 0   1 … 1
… 0 …  0       1       −0      −1
… 1 …  ∞       NaN     −∞      NaN

In situations where the sign bit can be excluded, each of the above examples can be reduced by 1 bit further, keeping only the left half of the above tables. A 2-bit float with 1-bit exponent and 1-bit mantissa would only have 0, 1, Inf, NaN values.

If the mantissa is allowed to be 0-bit, a 1-bit float format would have a 1-bit exponent, and the only two values would be 0 and Inf. The exponent must be at least 1 bit or else it no longer makes sense as a float (it would just be a signed number).

In embedded devices

Minifloats are also commonly used in embedded devices,[citation needed] especially on microcontrollers where floating-point arithmetic has to be emulated in software. To speed up the computation, the mantissa typically occupies exactly half of the bits, so the register boundary automatically addresses the parts without shifting.

References

  1. ^ Mocerino, Luca; Calimera, Andrea. AxP: A HW-SW Co-Design Pipeline for Energy-Efficient Approximated ConvNets via Associative Matching. Applied Sciences. 24 November 2021, 11 (23): 11164. doi:10.3390/app112311164.
  2. ^ IEEE half-precision has 5 exponent bits with bias 15 (2^4 − 1), IEEE single-precision has 8 exponent bits with bias 127 (2^7 − 1), IEEE double-precision has 11 exponent bits with bias 1023 (2^10 − 1), and IEEE quadruple-precision has 15 exponent bits with bias 16383 (2^14 − 1). See the Exponent bias article for more detail.
  3. ^ O'Hallaron, David R.; Bryant, Randal E. Computer systems: a programmer's perspective 2. Boston, Massachusetts, USA: Prentice Hall. 2010. ISBN 978-0-13-610804-7.
  4. ^ Burch, Carl. Floating-point representation. Hendrix College. [2023-08-29]. (Archived from the original on 2024-11-29).
  5. ^ Buck, Ian, Chapter 32. Taking the Plunge into GPU Computing, Pharr, Matt (ed.), GPU Gems, 2005-03-13 [2018-04-05], ISBN 0-321-33559-7, (Archived from the original on 2018-06-12).
  6. ^ Garrard, Andrew. 10.3. Unsigned 10-bit floating-point numbers. Khronos Data Format Specification v1.2 rev 1. Khronos Group. [2023-08-10]. (Archived from the original on 2021-05-18).
  7. ^ Garrard, Andrew. 10.2. Unsigned 11-bit floating-point numbers. Khronos Data Format Specification v1.2 rev 1. Khronos Group. [2023-08-10]. (Archived from the original on 2021-05-18).
  8. ^ Shaneyfelt, Dr. Ted. Dr. Shaneyfelt's Floating Point Construction Gizmo. Dr. Ted Shaneyfelt. [2023-08-29]. (Archived from the original on 2023-09-22).