Let's say you need to sum a large array of floating-point numbers, maybe because you're calculating the arithmetic mean or variance. This post shows how to do so accurately and quickly.
Here's what the accuracy-naïve, performance-naïve implementation looks like (Godbolt):
When summing floating-point numbers, we need to consider the following:
- Overflow Adding two numbers that have the value
FLT_MAX(maximum value of
float, ~3.4e38) using a single-precision
floataccumulator will obviously overflow. To work around this, we can simply use a double-precision accumulator and be able to safely sum
DBL_MAX ÷ FLT_MAX ≈ 1.8e308 ÷ 3.4e38 ≈ 5.2e269floats. That's a lot—more than the max
(If you're summing a huge number of huge double-precision numbers, using
long doubleis sometimes a possibility. On x86 that usually uses the legacy 80-bit x87 float-point unit, which is incapable of SIMD operations and has half the throughput of the SSE FPU on recent Intel and AMD CPUs.)
- Floating-point error Adding two numbers of very different magnitude causes loss of significant digits. Using a double-precision accumulator helps, but we can also use compensated summation to limit the error to be essentially independent of the number of values we're adding. Kahan summation is the most famous algorithm; it's simple and effective. Neumaier's improved version ("Kahan-Babuška Algorithm") is correct in some cases where Kahan is not, but is significantly slower. Pairwise summation is also popular and requires the fewest operations, but has somewhat worse error and needs to be implemented with careful concern for data locality (cache-awareness).
Here's a (more) accurate implementation that uses Kahan summation (Godbolt):
We want to sum the numbers quickly. Here's what we'll do:
- Break the loop-carried dependency. Modern CPUs can start a new operation before the previous one completes. This is called instruction pipelining. On Intel's Skylake architecture, two additions can be started per cycle ("throughput") despite taking four cycles to complete ("latency"). The equivalent numbers for AMD's Ryzen are one and three.
This also effectively unrolls the loop, which itself can improve performance.
- Use SIMD. Kahan and Neumaier summation can be trivially parallelized to operate on four (AVX) or eight (AVX-512) doubles at a time.
-ffast-math (which allows reordering of floating-point operations) does both of these optimizations automatically, although it only uses four vectors of accumulators (not quite enough for max speed). GCC auto-vectorizes but doesn't unroll or use multiple accumulators. However,
-ffast-math is unsafe to use with compensated summation because the compiler can and does remove operations that are algebrically void.
Below are the benchmark results. We get a final speed up of ~8x over the naïve version (along with better floating-point accuracy) and a ~32x speedup over scalar by optimizing the Kahan algorithm.
Here are the final AVX(+BMI2) and AVX-512 implementations. (This code uses Intel intrinsics, which are documented here.)
The actual code used to benchmark (harder to read but easier to tinker with) is located here. It includes AVX and AVX-512 Neumaier implementations not shown above.If you have any comments, feel free to email or tweet using the links in the footer, or open an issue in my GitHub pages repo. Corrections and improvements greatly appreciated.