
Libdecimal for C++ Quants: What are my options?

What every C++ developer knows: 0.1 + 0.2 != 0.3. No surprise there. The interesting question is which decimal format to use, what it costs you, and where Boost.Decimal stands among the options.

In October 2025, Harold Bott got in touch and asked me to review the dependency-free Boost.Decimal implementation. Unfortunately, I was not in a good position to help at the time: the Intel LIBBID-based libdecimal library I maintained was in terrible shape, and I did not yet have a proper benchmarking or comparison framework ready. I felt a bit ashamed about that. I tried to scramble something together so I could respond with useful content, but their release deadline was only a few days away. It felt like a missed opportunity, and I kept thinking about it from time to time (you know, those moments between doom scrolling and being busy in general). This work is my attempt to make up for it. Boost.Decimal is excellent work that supports both the Intel BID and IBM DPD formats. This review does not repeat the internal work of the library contributors; instead, it focuses on the corners I know from writing exchange connectivity and trading systems.

1. Tools: Nanobench by Martin Leitner-Ankerl, a header-only benchmarking library for modern C++

The idea is simple: declare the types we care about, push them all through the same meat grinder, then measure the difference. It is not laboratory-perfect benchmarking, but it is close to the question production systems actually ask: given the same messy workload, which implementation behaves better?

nanobench.cpp
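#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.hpp>

#include <cstddef>      // std::size_t
#include <cstdint>      // std::uint32_t
#include <string_view>
#include <tuple>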

namespace bench {
    using import_32_t = std::tuple<
        traits_t<boost::decimal::decimal32_t>,
        bench::ieee754<float, bench::kind_t::naive>,
        bench::ieee754<float, bench::kind_t::optimised>,
        traits_t<math::decimal_t<std::uint32_t>>,
        traits_t<math::scaled::decimal_t<std::uint32_t>>,
        traits_t<math::bcd::decimal_t<std::uint32_t>>,
        bench::fixed<math::fixed::decimal_t<std::uint32_t, -4>, bench::kind_t::naive>,
        bench::fixed<math::fixed::decimal_t<std::uint32_t, -4>, bench::kind_t::optimised>
    >;
[...]
    using type_list = std::tuple<import_32_t, import_64_t>;
[...]
    template<class trait_t>
    void run_import_case(ankerl::nanobench::Bench& bench, std::size_t n) {
        using significand_t = typename trait_t::significand_t;
        const auto input = make_input<significand_t>(n);
        bench.run(trait_t::name().data(), [&] {
            for (std::size_t i = 0; i < n; ++i) {
                auto x = trait_t::make(input[i].significand, input[i].exponent);
                ankerl::nanobench::doNotOptimizeAway(x);
            }
        });
    }

    template<class group> void run_import_group(std::string_view title, std::size_t n) {
        ankerl::nanobench::Bench bench;
        bench.title(title.data()).relative(true)
             .minEpochIterations(50'000).epochs(30);
        static_for<group>([&]<class element_t>() {
            run_import_case<element_t>(bench, n);
        });
    }
} // namespace bench

Nanobench lived up to its promise: no Google-style complexity, just a small, fast library with cache warmup, useful diagnostics, and even a reminder to pin the process to a CPU to reduce jitter. If only it used snake_case and came with a built-in static_for construct. Nevertheless, I liked the library and happily recommend it!
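The static_for helper used in the listing is not shown in the excerpt. As a rough idea of what such a construct can look like, here is a minimal sketch that calls a callable once per element type of a std::tuple. The names sketch::static_for, static_for_impl and size_adder are my own placeholders for illustration, not the helper from the benchmark sources.

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <utility>

namespace sketch {
    // Calls fn.operator()<T>() once for every element type T of tuple_t, in order.
    template <class tuple_t, class fn_t, std::size_t... index>
    void static_for_impl(fn_t& fn, std::index_sequence<index...>) {
        // Fold over the comma operator: evaluation is left to right.
        (fn.template operator()<std::tuple_element_t<index, tuple_t>>(), ...);
    }

    template <class tuple_t, class fn_t>
    void static_for(fn_t&& fn) {
        static_for_impl<tuple_t>(
            fn, std::make_index_sequence<std::tuple_size_v<tuple_t>>{});
    }

    // Example callable: adds up sizeof(T) for every visited type.
    struct size_adder {
        std::size_t& total;
        template <class element_t>
        void operator()() const { total += sizeof(element_t); }
    };
}
```

With C++20 the call site can pass a templated lambda instead of a named functor, which is the shape used in the listing above: `static_for<group>([&]<class element_t>() { ... });`.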

  • How to invoke
    cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -Dlibdecimal_BUILD_BENCHMARKS=ON
    bash bench/run-all.sh           # uses ./build by default
    bash bench/run-all.sh <build-dir>   # or point at a custom build dir
    
  • How to fix the CPU frequency governor for stable numbers
    sudo cpupower frequency-set -g performance     # disable scaling (needs root)
    bash bench/run-all.sh
    sudo cpupower frequency-set -g powersave       # restore afterwards
    
  • How to change the measurement baseline
    nanobench treats the first bench.run() call in each Bench instance as the 100 % reference. To rebaseline on a different type, move its bench.run() block to the top of the group in the corresponding bench/*.cpp, then rerun. For construct-from-pair.cpp, which iterates a static_for<tuple_t>, the first type in the tuple is the baseline:
    using import_64_t = std::tuple<
        traits_t<boost::decimal::decimal64_t>,   // <-- baseline (first = 100%)
        bench::ieee754<double, bench::kind_t::naive>,
        ... >;
    
2. Decimal Number Representations: What we actually compare here
  • Boost.Decimal
    This is our baseline: a modern C++ implementation of IEEE 754 / ISO/IEC TR 24733 decimal floating-point types. The library is header-only, dependency-free, and requires C++14. Internally, BID-style decimal floating point is compact, but arithmetic and comparison are more involved than plain integer operations. The value of the library is that it hides this complexity behind a clean and standard-looking C++ interface.

  • BID — libdecimal / Intel BID
    This is also an IEEE 754 decimal floating-point implementation, based on the reworked Intel decimal library and wrapped with a modern C++ interface. Compared with the original Intel distribution, libdecimal adds cross-platform build work, 128-bit support, custom parsing, import/export helpers, and high-performance conversion routines. The storage model is the same broad family as above: decimal floating point using a BID-style representation.

  • BCD — binary-coded decimal
    BCD is the classic representation used in many legacy business systems. The idea is simple: each base-10 digit is stored in four bits, so two decimal digits can be packed into one byte. This is easy to inspect and faithful to decimal notation, but it carries a significant cost on modern binary hardware. The CPU is very good at binary integer arithmetic; BCD makes it put on gloves first.

  • Scaled decimal — coefficient/exponent pair
    The scaled representation stores a decimal as two separate fields: a significand and an exponent. In mathematical form: \(x = m \cdot 10^e\). This format appears often in internal trading systems and wire protocols, especially when different programming languages need to exchange decimal values without committing to one fixed scale. It is flexible, easy to serialize, and cheap to inspect, but addition and comparison require exponent alignment.

  • Fixed-scale decimal — scaled integer with compile-time exponent
    The fixed representation stores the value as a single integer significand, while the decimal exponent is tracked as a compile-time constant. In mathematical form: \(x = m \cdot 10^e\), but here \(e\) belongs to the type, not to the object. This is usually the fastest and simplest decimal representation when the scale is known in advance. It is well suited for fixed-scale financial values, columnar storage, market data, deterministic arithmetic, compact binary layouts, and schema-driven decimal fields. The downside is reduced generality. Fixed-scale decimal is excellent for addition, subtraction, comparison, and storage, but multiplication and division need explicit rescaling and an explicit rounding policy. BID-style decimal floating point is more robust for mixed-scale arithmetic and general-purpose decimal work.
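To make the three hand-rolled layouts above concrete, here is a minimal sketch of each: BCD packing, a scaled pair with exponent alignment on addition, and a fixed-scale type whose multiplication rescales and rounds. All names (sketch::pack_bcd, scaled, fixed, pow10_i64) are hypothetical and are not the libdecimal API; overflow handling is left out, and the rounding rule is just one possible policy.

```cpp
#include <cstdint>

namespace sketch {
    // BCD: one decimal digit per 4-bit nibble, so pack_bcd(1234) == 0x1234.
    // The caller must keep value below 10^16 (16 nibbles in 64 bits).
    constexpr std::uint64_t pack_bcd(std::uint64_t value) {
        std::uint64_t packed = 0;
        for (int shift = 0; value != 0; shift += 4, value /= 10)
            packed |= (value % 10) << shift;
        return packed;
    }

    // 10^n for n >= 0; returns 1 for n <= 0.
    constexpr std::int64_t pow10_i64(std::int32_t n) {
        std::int64_t p = 1;
        while (n-- > 0) p *= 10;
        return p;
    }

    // Scaled: value = significand * 10^exponent, both stored per object.
    struct scaled {
        std::int64_t significand;
        std::int32_t exponent;
    };

    // Addition first aligns both operands to the smaller exponent.
    constexpr scaled add(scaled a, scaled b) {
        const std::int32_t e = a.exponent < b.exponent ? a.exponent : b.exponent;
        return { a.significand * pow10_i64(a.exponent - e)
               + b.significand * pow10_i64(b.exponent - e), e };
    }

    // Fixed: the exponent lives in the type, objects hold only the integer.
    template <std::int32_t exponent>
    struct fixed {
        std::int64_t significand;
    };

    // Same scale on both sides: a plain integer add.
    template <std::int32_t e>
    constexpr fixed<e> operator+(fixed<e> a, fixed<e> b) {
        return { a.significand + b.significand };
    }

    // The raw product has scale 2e and must be brought back to e.
    // Assumes e <= 0; rounds half away from zero.
    template <std::int32_t e>
    constexpr fixed<e> operator*(fixed<e> a, fixed<e> b) {
        const std::int64_t divisor = pow10_i64(-e);
        const std::int64_t wide = a.significand * b.significand;
        const std::int64_t half = divisor / 2;
        return { (wide >= 0 ? wide + half : wide - half) / divisor };
    }
}
```

For example, adding 1.25 as (125, -2) and 0.5 as (5, -1) aligns to exponent -2 and gives (175, -2), while `fixed<-4>{12500} * fixed<-4>{20000}` (1.25 × 2.0) yields a significand of 25000, that is 2.5000.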

3. Results: What certain operations cost you relative to Boost.Decimal

Construct from (significand, exponent)

relative    ns/op       type
100.0 %     3 934       boost decimal32 (baseline)
76.5 %      5 142       float naive
509 %       772         float optimised
454 %       866         intel bid32
326 %       1 208       scaled uint32
75.2 %      5 228       bcd uint32
480 %       820         fixed32 checked
484 %       814         fixed32 unchecked

relative    ns/op       type
100.0 %     981         boost decimal64 (baseline)
8.9 %       11 047      double naive
130 %       757         double optimised
106 %       926         intel bid64
89.4 %      1 096       scaled uint64
13.7 %      7 132       bcd uint64
137 %       718         fixed64 checked
158 %       621         fixed64 unchecked

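Fixed-exponent types lead at 64-bit and sit just behind float optimised at 32-bit. BCD construction is 1.3× slower than boost at 32-bit and 7.3× slower at 64-bit; the naive double path is the slowest at 11×.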

Decompose to (significand, exponent)

relative    ns/op       type
100.0 %     61 368      boost decimal32 frexp10 (baseline)
671 %       9 152       float utils::decompose
855 %       7 181       intel bid32 bid::decompose
1781 %      3 446       scaled uint32 as_pair

relative    ns/op       type
100.0 %     24 250      boost decimal64 frexp10 (baseline)
265 %       9 147       double utils::decompose
361 %       6 727       intel bid64 bid::decompose
610 %       3 974       scaled int64 as_pair

boost::decimal::frexp10 is the most expensive decompose path. scaled::as_pair() is a direct field read — 18× faster at 32-bit, 6× faster at 64-bit. Use libdecimal types for any path that must inspect the representation (serialization, wire encoding, logging).

Compare

relative    ns/op       type
100.0 %     60 017      boost decimal64 (baseline)
385 %       15 611      double
119 %       50 563      intel bid64
276 %       21 771      fixed64 -4
294 %       20 431      scaled int64

relative    ns/op       type
100.0 %     56 215      boost decimal64 (baseline)
1240 %      4 535       double
111 %       50 457      intel bid64
454 %       12 370      fixed64 -4
237 %       23 721      scaled int64

Boost comparison is the slowest decimal option, although Intel BID64 is only 11–19 % faster. fixed64 -4 compares 2.8× faster (first table) and sorts 4.5× faster (second table).
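One plausible reason for the gap, sketched under my own assumptions rather than taken from the benchmarked implementations: a fixed-scale comparison is a single integer comparison, while a scaled pair has to align exponents first. The types and function names below are hypothetical, and overflow during alignment is ignored.

```cpp
#include <cstdint>

namespace sketch {
    struct scaled {
        std::int64_t significand;
        std::int32_t exponent;
    };

    // 10^n for n >= 0; returns 1 for n <= 0.
    constexpr std::int64_t pow10_i64(std::int32_t n) {
        std::int64_t p = 1;
        while (n-- > 0) p *= 10;
        return p;
    }

    // Three-way compare (-1, 0, +1): align to the smaller exponent, then compare.
    constexpr int compare(scaled a, scaled b) {
        const std::int32_t e = a.exponent < b.exponent ? a.exponent : b.exponent;
        const std::int64_t x = a.significand * pow10_i64(a.exponent - e);
        const std::int64_t y = b.significand * pow10_i64(b.exponent - e);
        return (x > y) - (x < y);
    }

    template <std::int32_t exponent>
    struct fixed {
        std::int64_t significand;
    };

    // Same compile-time scale on both sides: no alignment step at all.
    template <std::int32_t e>
    constexpr int compare(fixed<e> a, fixed<e> b) {
        return (a.significand > b.significand) - (a.significand < b.significand);
    }
}
```

Here (15, -1) and (150, -2) compare equal after alignment, which a field-by-field comparison of the raw pair would miss.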

Arithmetic

relative    ns/op       type
100.0 %     190 245     boost decimal64 (baseline)
2378 %      8 002       double
373 %       51 030      intel bid64
14.9 %      1 278 906   bcd64
1886 %      10 090      fixed64 -4
632 %       30 082      scaled int64

relative    ns/op       type
100.0 %     5 457       boost decimal64 (baseline)
3439 %      159         double
264 %       2 070       intel bid64
4.6 %       118 188     bcd64
654 %       834         fixed64 -4
530 %       1 030       scaled int64

Apart from BCD, Boost arithmetic is the slowest decimal path. fixed64 is 19× faster for accumulation. BCD should not be used for arithmetic: it is roughly 7–20× slower than boost.

String parse and format

relative    ns/op       type
100.0 %     44 106      boost decimal64 (baseline)
315 %       14 008      intel bid64
21.5 %      205 445     bcd64 (digit-string + exp)
94.7 %      46 559      stod (double reference)

relative    ns/op       type
100.0 %     131 586     boost decimal64 (baseline)
293 %       44 877      intel bid64 (std::format)
238 %       55 215      bcd64 (.str())
827 %       15 919      scaled int64 (cast to string)
66.3 %      198 355     double (to_string)

Intel BID64 parses 3× faster than boost. For formatting, scaled's cast-to-string is 8× faster than boost's ostringstream <<.
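Formatting a coefficient/exponent pair needs little more than integer-to-text conversion plus placing the decimal point, which may explain part of that gap. The following is a minimal sketch of the idea; sketch::to_string is a hypothetical helper, not libdecimal's cast-to-string, and it makes no attempt at locale handling or scientific notation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

namespace sketch {
    // Renders significand * 10^exponent as plain decimal text,
    // e.g. (12345, -2) -> "123.45" and (5, -3) -> "0.005".
    inline std::string to_string(std::int64_t significand, std::int32_t exponent) {
        const bool negative = significand < 0;
        // Take the magnitude in unsigned arithmetic so INT64_MIN is handled too.
        const std::uint64_t magnitude =
            negative ? 0 - static_cast<std::uint64_t>(significand)
                     : static_cast<std::uint64_t>(significand);
        std::string digits = std::to_string(magnitude);
        if (exponent >= 0) {
            // Scale up by appending zeros (pointless for a zero value).
            if (magnitude != 0)
                digits.append(static_cast<std::size_t>(exponent), '0');
        } else {
            const std::size_t frac =
                static_cast<std::size_t>(-static_cast<std::int64_t>(exponent));
            // Left-pad with zeros so at least one digit stays before the point.
            if (digits.size() <= frac)
                digits = std::string(frac - digits.size() + 1, '0') + digits;
            digits.insert(digits.size() - frac, 1, '.');
        }
        return negative ? "-" + digits : digits;
    }
}
```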

Fee calculation — notional × rate

relative    ns/op       type
100.0 %     4 607       boost decimal64 (baseline)
569 %       810         double
124 %       3 730       intel bid64
176 %       2 617       fixed64 -4
273 %       1 688       scaled int64

Multiply is one of boost's better categories: intel is only 1.24× faster, and scaled is 2.7× faster.
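A scaled pair multiplies without any rescaling step, because the significands multiply and the exponents simply add. Here is a minimal sketch of a fee calculation along those lines; the struct and function names are hypothetical, and overflow checks and rounding to a settlement precision are left out.

```cpp
#include <cstdint>

namespace sketch {
    // value = significand * 10^exponent
    struct scaled {
        std::int64_t significand;
        std::int32_t exponent;
    };

    // (m1 * 10^e1) * (m2 * 10^e2) = (m1 * m2) * 10^(e1 + e2)
    constexpr scaled multiply(scaled a, scaled b) {
        return { a.significand * b.significand, a.exponent + b.exponent };
    }
}
```

With a notional of 1 000 000.00 stored as (100000000, -2) and a rate of 0.00025 stored as (25, -5), the product is (2500000000, -7), which is 250.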

Risk limit — accumulate + clamp (10 000 values)

relative    ns/op       type
100.0 %     219 475     boost decimal64 (baseline)
1370 %      16 017      double
279 %       78 644      intel bid64
612 %       35 873      fixed64 -4
227 %       96 909      scaled int64

Under realistic mixed-op pressure (add + branch + assign per step), boost is still the slowest decimal type.
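To give a feel for the add + branch + assign pattern, here is a minimal sketch of an accumulate-and-clamp loop over fixed-scale values held as plain 64-bit integers at a scale of 10^-4. The function name and signature are my own illustration and are not taken from the benchmark sources.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

namespace sketch {
    // All values are integer counts of 10^-4 (the "fixed64 -4" scale).
    // Each step adds a delta and clamps the running exposure to [lower, upper].
    inline std::int64_t accumulate_clamped(const std::vector<std::int64_t>& deltas,
                                           std::int64_t lower,
                                           std::int64_t upper) {
        std::int64_t exposure = 0;
        for (const std::int64_t delta : deltas) {
            exposure = std::clamp(exposure + delta, lower, upper);
        }
        return exposure;
    }
}
```

With limits of ±10.0000 (±100000), the deltas 5.0, 7.0 and -3.0 give 5.0, then 12.0 clamped to 10.0, then 7.0, so the function returns 70000.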

"Environment"

  • CPU: 11th Gen Intel Core i7-11700K @ 3.60 GHz · OS: Linux 5.15 x86_64
  • Build: -O3 -march=native, C++23
  • Baseline: boost::decimal = 100 %; numbers above 100 % are faster than boost.
  • CPU frequency scaling was active (powersave governor + turbo); figures are indicative, not lab-grade.

Speedup relative to boost::decimal (>1× = faster). Intel Core i7-11700K, -O3 -march=native, C++23.

type            construct   decompose   compare     accumulate  multiply    parse       format
double          1.3×        2.7×        3.9×        23.8×       5.7×        0.95×       0.66×
intel bid64     1.1×        3.6×        1.2×        3.7×        1.2×        3.2×        2.9×
fixed64 -4      1.6×        -           2.8×        18.9×       1.8×        -           -
scaled int64    0.9×        6.1×        2.9×        6.3×        2.7×        -           8.3×
bcd64           0.1×        -           -           0.15×       0.05×       0.2×        2.4×

From the numbers, Boost.Decimal is on the money, but there is still room for improvement. To be fair, libdecimal by VargaLABS is not a copy-paste distribution of the original Intel library with a thin C++ layer glued on top. Besides the boring gymnastics required to make the Intel code work cleanly across platforms, libdecimal adds a custom parser, import/export support, and display functions, together with the slim and pleasant C++ syntax you may recognize from H5CPP.

Returning to Boost.Decimal: it is a good, modern, header-only implementation. If you already use Boost, it will not add significant friction to the pipeline, and it can definitely lower development cost.

use case                            recommended type  reason
Arithmetic (PnL, accumulation)      fixed64<-4>       7–19× faster; compile-time exponent eliminates runtime normalization
Multiply hot path (fees, notional)  scaled<int64_t>   2.7× faster; avoids forced fixed-scale rescaling on multiply
Compare / sort / rank               fixed64<-4>       2.8–4.5× faster
Wire encode                         scaled<int64_t>   field read, near-free
Wire decode                         fixed64<-4>       3× faster; avoids per-value exponent normalization
String parse                        intel bid64       3.2× faster than Boost.Decimal
String format                       scaled<int64_t>   8.3× faster when coefficient/exponent formatting is sufficient
Inspect internals / serialize       scaled<int64_t>   6–18× faster via direct as_pair() access
General purpose / mixed scale       intel bid64       competitive across most operations, without a fixed-exponent constraint
Avoid for arithmetic hot paths      bcd64             7–20× slower than Boost.Decimal across arithmetic operations

The takeaway: decimal floating point is powerful, but it is not something to spread over the whole system like Marmite on toast. Use it where mixed scale, interchange, or standards compatibility matter. For hot arithmetic paths, especially in trading-style workloads, fixed-scale and scaled-integer representations may perform better.