Libdecimal for C++ Quants: What are my options?

What every C++ developer knows: 0.1 + 0.2 != 0.3. No surprise there. The interesting question is which decimal format to use, what it costs you, and where Boost.Decimal stands among the options.

In October 2025, Harold Bott got in touch and asked me to review the dependency-free Boost.Decimal implementation. Unfortunately, I was not in a good position to help at the time: the Intel LIBBID-based libdecimal library I maintained was in terrible shape, and I did not yet have a proper benchmarking or comparison framework ready. I felt a bit ashamed about that. I tried to scramble something together so I could respond with useful content, but their release deadline was only a few days away. It felt like a missed opportunity, and I kept thinking about it from time to time (you know, those moments between doom scrolling and being busy in general). This work is my attempt to make up for it. Boost.Decimal is excellent work and supports both the Intel BID and IBM DPD formats. This review does not repeat the internal work of the library contributors; instead, it focuses on the corners I know from writing exchange connectivity and trading systems.

1. Tools: nanobench by Martin Leitner-Ankerl, a header-only benchmarking library for modern C++

The idea is simple: declare the types we care about, push them all through the same meat grinder, then measure the difference. It is not laboratory-perfect benchmarking, but it is close to the question production systems actually ask: given the same messy workload, which implementation behaves better?

nanobench.cpp
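#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.hpp>

#include <cstddef>      // std::size_t
#include <cstdint>      // std::uint32_t
#include <string_view>  // std::string_view
#include <tuple>        // std::tuple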

namespace bench {
    using import_32_t = std::tuple<
        traits_t<boost::decimal::decimal32_t>,
        bench::ieee754<float, bench::kind_t::naive>,
        bench::ieee754<float, bench::kind_t::optimised>,
        traits_t<math::decimal_t<std::uint32_t>>,
        traits_t<math::scaled::decimal_t<std::uint32_t>>,
        traits_t<math::bcd::decimal_t<std::uint32_t>>,
        bench::fixed<math::fixed::decimal_t<std::uint32_t, -4>, bench::kind_t::naive>,
        bench::fixed<math::fixed::decimal_t<std::uint32_t, -4>, bench::kind_t::optimised>
    >;
[...]
    using type_list = std::tuple<import_32_t, import_64_t>;
[...]
    template<class trait_t>
    void run_import_case(ankerl::nanobench::Bench& bench, std::size_t n) {
        using significand_t = typename trait_t::significand_t;
        const auto input = make_input<significand_t>(n);
        bench.run(trait_t::name().data(), [&] {
            for (std::size_t i = 0; i < n; ++i) {
                auto x = trait_t::make(input[i].significand, input[i].exponent);
                ankerl::nanobench::doNotOptimizeAway(x);
            }
        });
    }

    template<class group> void run_import_group(std::string_view title, std::size_t n) {
        ankerl::nanobench::Bench bench;
        // copy into std::string: string_view::data() is not guaranteed to be null-terminated
        bench.title(std::string{title}).relative(true)
             .minEpochIterations(50'000).epochs(30);
        static_for<group>([&]<class element_t>() {
            run_import_case<element_t>(bench, n);
        });
    }

Nanobench lived up to its promise: no Google-style complexity, just a small, fast library with cache warmup, useful diagnostics, and even a reminder to pin the process to a CPU to reduce jitter. If only it used snake_case and came with a built-in static_for construct. Nevertheless, I liked the library and happily recommend it!
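Since static_for comes from neither nanobench nor the standard library, here is a minimal sketch of how such a helper could look in C++20. This is my illustration under stated assumptions, not necessarily the version in the benchmark sources: the names total_size and list_t are made up for the example, and the only requirement on the callable is a templated call operator that takes no arguments.

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <utility>

// Hypothetical helper: calls fn.template operator()<T>() once for every
// element type T of tuple_t, in declaration order.
template <class tuple_t, class fn_t>
constexpr void static_for(fn_t&& fn) {
    [&]<std::size_t... i>(std::index_sequence<i...>) {
        (fn.template operator()<std::tuple_element_t<i, tuple_t>>(), ...);
    }(std::make_index_sequence<std::tuple_size_v<tuple_t>>{});
}

// Usage example: add up the sizes of all types in a type list.
inline std::size_t total_size() {
    using list_t = std::tuple<std::uint8_t, std::uint16_t, std::uint32_t>;
    std::size_t total = 0;
    static_for<list_t>([&]<class element_t>() { total += sizeof(element_t); });
    return total;
}
```

The fold over the comma operator keeps the visiting order equal to the tuple order, which matters here because nanobench takes the first run as its reference.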

How to invoke

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -Dlibdecimal_BUILD_BENCHMARKS=ON
cmake --build build             # compile the benchmarks (skip if run-all.sh builds them for you)
bash bench/run-all.sh           # uses ./build by default
bash bench/run-all.sh <build-dir>   # or point at a custom build dir

How to pin the CPU frequency for stable numbers

sudo cpupower frequency-set -g performance     # disable scaling (needs root)
bash bench/run-all.sh
sudo cpupower frequency-set -g powersave       # restore afterwards

How to change the measurement baseline

nanobench treats the first bench.run() call in each Bench instance as the 100 % reference. To rebaseline on a different type, move its bench.run() block to the top of the group in the corresponding bench/*.cpp, then rerun. For construct-from-pair.cpp, which iterates a static_for<tuple_t>, the first type in the tuple is the baseline:
```
using import_64_t = std::tuple<
    traits_t<boost::decimal::decimal64_t>,   // <-- baseline (first = 100%)
    bench::ieee754<double, bench::kind_t::naive>,
    ... >;
```

2. Decimal Number Representations: What we actually compare here

Boost.Decimal
This is our baseline: a modern C++ implementation of IEEE 754 / ISO/IEC TR 24733 decimal floating-point types. The library is header-only, dependency-free, and requires C++14. Internally, BID-style decimal floating point is compact, but arithmetic and comparison are more involved than plain integer operations. The value of the library is that it hides this complexity behind a clean and standard-looking C++ interface.
BID — libdecimal / Intel BID
This is also an IEEE 754 decimal floating-point implementation, based on the reworked Intel decimal library and wrapped with a modern C++ interface. Compared with the original Intel distribution, libdecimal adds cross-platform build work, 128-bit support, custom parsing, import/export helpers, and high-performance conversion routines. The storage model is the same broad family as above: decimal floating point using a BID-style representation.
BCD — binary-coded decimal
BCD is the classic representation used in many legacy business systems. The idea is simple: each base-10 digit is stored in four bits, so two decimal digits can be packed into one byte. This is easy to inspect and faithful to decimal notation, but it carries a significant cost on modern binary hardware. The CPU is very good at binary integer arithmetic; BCD makes it put on gloves first.
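To make the four-bits-per-digit idea concrete, here is a small sketch of packing and unpacking eight decimal digits in a 32-bit word. It is only an illustration: the function names are mine, and the actual layout of libdecimal's bcd::decimal_t may well differ.

```cpp
#include <cstdint>

// Pack up to 8 decimal digits into a 32-bit word, one digit per 4-bit nibble.
// Digits beyond the eighth do not fit and are silently dropped in this sketch.
constexpr std::uint32_t to_packed_bcd(std::uint32_t value) {
    std::uint32_t packed = 0;
    for (int shift = 0; value != 0 && shift < 32; shift += 4) {
        packed |= (value % 10) << shift;   // lowest digit goes into the lowest nibble
        value /= 10;
    }
    return packed;
}

// Unpack: walk the nibbles from most to least significant.
constexpr std::uint32_t from_packed_bcd(std::uint32_t packed) {
    std::uint32_t value = 0;
    for (int shift = 28; shift >= 0; shift -= 4)
        value = value * 10 + ((packed >> shift) & 0xF);
    return value;
}
```

The hex dump reads like the decimal number (1234 becomes 0x1234), which is the inspectability argument; the division and modulo per digit is the cost.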
Scaled decimal — coefficient/exponent pair
The scaled representation stores a decimal as two separate fields: a significand and an exponent. In mathematical form: \(x = m \cdot 10^e\). This format appears often in internal trading systems and wire protocols, especially when different programming languages need to exchange decimal values without committing to one fixed scale. It is flexible, easy to serialize, and cheap to inspect, but addition and comparison require exponent alignment.
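The exponent alignment mentioned above can be sketched as follows. The struct and function are hypothetical stand-ins, not libdecimal's scaled::decimal_t, and the sketch assumes the exponent gap is small enough that the rescaled significand still fits in 64 bits.

```cpp
#include <cstdint>

// Hypothetical coefficient/exponent pair: value = m * 10^e.
struct scaled_t {
    std::int64_t m;
    std::int32_t e;
};

// Three-way compare: bring both operands to the smaller exponent, then
// compare significands. Returns -1, 0 or +1.
// Assumption: no overflow check on the rescaled significand.
constexpr int compare(scaled_t a, scaled_t b) {
    while (a.e > b.e) { a.m *= 10; --a.e; }
    while (b.e > a.e) { b.m *= 10; --b.e; }
    return (a.m > b.m) - (a.m < b.m);
}
```

For example, 1.5 stored as (15, -1) and 1.50 stored as (150, -2) compare equal only after the first operand is rescaled to (150, -2); that loop is the work a fixed-scale type never has to do.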
Fixed-scale decimal — scaled integer with compile-time exponent
The fixed representation stores the value as a single integer significand, while the decimal exponent is tracked as a compile-time constant. In mathematical form: \(x = m \cdot 10^e\), but here \(e\) belongs to the type, not to the object. This is usually the fastest and simplest decimal representation when the scale is known in advance. It is well suited for fixed-scale financial values, columnar storage, market data, deterministic arithmetic, compact binary layouts, and schema-driven decimal fields. The downside is reduced generality. Fixed-scale decimal is excellent for addition, subtraction, comparison, and storage, but multiplication and division need explicit rescaling and a rounding policy. BID-style decimal floating point is more robust for mixed-scale arithmetic and general-purpose decimal work.
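A minimal sketch of why addition is cheap and multiplication is not in the fixed-scale model. The type fixed_t, the helper ten_to, and the round-half-away-from-zero policy are all my choices for illustration, not libdecimal's fixed::decimal_t; overflow of the intermediate product is not handled.

```cpp
#include <cstdint>

// Hypothetical fixed-scale decimal: value = m * 10^exponent,
// where the exponent is part of the type.
template <int exponent>
struct fixed_t {
    std::int64_t m;
};

// Same scale on both sides, so addition is a single integer add.
template <int e>
constexpr fixed_t<e> operator+(fixed_t<e> a, fixed_t<e> b) {
    return {a.m + b.m};
}

constexpr std::int64_t ten_to(int n) {
    std::int64_t p = 1;
    while (n-- > 0) p *= 10;
    return p;
}

// The raw product carries exponent 2*e and has to be scaled back to e.
// Rounding: half away from zero (an assumption made for this sketch).
template <int e>
constexpr fixed_t<e> multiply(fixed_t<e> a, fixed_t<e> b) {
    static_assert(e <= 0, "sketch covers fractional scales only");
    const std::int64_t divisor = ten_to(-e);
    const std::int64_t half = divisor / 2;
    const std::int64_t product = a.m * b.m;
    return {(product >= 0 ? product + half : product - half) / divisor};
}
```

With a scale of -4, 1.2500 + 2.0000 is just 12500 + 20000, while 1.2500 × 2.0000 needs the division by 10 000 and a decision about what to do with the digits that fall off.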

3. Results: What certain operations cost you relative to Boost.Decimal

Construct from (significand, exponent)

First table: 32-bit. Second table: 64-bit.

relative	ns/op	type
100.0 %	3 934	`boost decimal32` — baseline
76.5 %	5 142	`float` naive
509 %	772	`float` optimised
454 %	866	`intel bid32`
326 %	1 208	`scaled uint32`
75.2 %	5 228	`bcd uint32`
480 %	820	`fixed32` checked
484 %	814	`fixed32` unchecked

relative	ns/op	type
100.0 %	981	`boost decimal64` — baseline
8.9 %	11 047	`double` naive
130 %	757	`double` optimised
106 %	926	`intel bid64`
89.4 %	1 096	`scaled uint64`
13.7 %	7 132	`bcd uint64`
137 %	718	`fixed64` checked
158 %	621	`fixed64` unchecked

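Fixed-exponent types are at or near the top at both widths; only optimised float edges them out at 32-bit. BCD construction is 1.3× slower than Boost.Decimal at 32-bit and 7.3× slower at 64-bit.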

Decompose to (significand, exponent)

First table: 32-bit. Second table: 64-bit.

relative	ns/op	type
100.0 %	61 368	`boost decimal32 frexp10` — baseline
671 %	9 152	`float utils::decompose`
855 %	7 181	`intel bid32 bid::decompose`
1781 %	3 446	`scaled uint32 as_pair`

relative	ns/op	type
100.0 %	24 250	`boost decimal64 frexp10` — baseline
265 %	9 147	`double utils::decompose`
361 %	6 727	`intel bid64 bid::decompose`
610 %	3 974	`scaled int64 as_pair`

boost::decimal::frexp10 is the most expensive decompose path. scaled::as_pair() is a direct field read — 18× faster at 32-bit, 6× faster at 64-bit. Use libdecimal types for any path that must inspect the representation (serialization, wire encoding, logging).

Compare

First table: micro (10 000 adjacent pairs). Second table: sort (1 000 values).

relative	ns/op	type
100.0 %	60 017	`boost decimal64` — baseline
385 %	15 611	`double`
119 %	50 563	`intel bid64`
276 %	21 771	`fixed64 -4`
294 %	20 431	`scaled int64`

relative	ns/op	type
100.0 %	56 215	`boost decimal64` — baseline
1240 %	4 535	`double`
111 %	50 457	`intel bid64`
454 %	12 370	`fixed64 -4`
237 %	23 721	`scaled int64`

Boost.Decimal comparison is the slowest decimal option, although Intel BID64 is only 11–19 % ahead. fixed64 -4 sorts 4.5× faster.

Arithmetic

First table: accumulate (10 000 values). Second table: dot product (256 pairs).

relative	ns/op	type
100.0 %	190 245	`boost decimal64` — baseline
2378 %	8 002	`double`
373 %	51 030	`intel bid64`
14.9 %	1 278 906	`bcd64`
1886 %	10 090	`fixed64 -4`
632 %	30 082	`scaled int64`

relative	ns/op	type
100.0 %	5 457	`boost decimal64` — baseline
3439 %	159	`double`
264 %	2 070	`intel bid64`
4.6 %	118 188	`bcd64`
654 %	834	`fixed64 -4`
530 %	1 030	`scaled int64`

Apart from BCD, Boost.Decimal arithmetic is the slowest decimal path. fixed64 is 19× faster for accumulation. BCD should not be used for arithmetic: it is roughly 7–22× slower than Boost.Decimal.

String parse and format

First table: parse (1 000 values). Second table: format (1 000 values).

relative	ns/op	type
100.0 %	44 106	`boost decimal64` — baseline
315 %	14 008	`intel bid64`
21.5 %	205 445	`bcd64` (digit-string + exp)
94.7 %	46 559	`stod` (double reference)

relative	ns/op	type
100.0 %	131 586	`boost decimal64` — baseline
293 %	44 877	`intel bid64` (`std::format`)
238 %	55 215	`bcd64` (`.str()`)
827 %	15 919	`scaled int64` (cast to string)
66.3 %	198 355	`double` (`to_string`)

Intel BID64 parses 3× faster than boost. For formatting, scaled's cast-to-string is 8× faster than boost's ostringstream <<.

Fee calculation — `notional × rate`

relative	ns/op	type
100.0 %	4 607	`boost decimal64` — baseline
569 %	810	`double`
124 %	3 730	`intel bid64`
176 %	2 617	`fixed64 -4`
273 %	1 688	`scaled int64`

Multiply is one of Boost.Decimal's better categories: Intel BID64 is only 1.24× faster, while scaled is 2.7× faster.
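One plausible reason the scaled pair does well here is that a product needs no rescaling at all: multiply the significands, add the exponents. The sketch below uses a hypothetical scaled_t and fee function, not the benchmarked code, and ignores significand overflow and any rounding of the result.

```cpp
#include <cstdint>

// Hypothetical coefficient/exponent pair: value = m * 10^e.
struct scaled_t {
    std::int64_t m;
    std::int32_t e;
};

// Product of two scaled decimals: multiply significands, add exponents.
// Assumption: the significand product fits into int64_t.
constexpr scaled_t multiply(scaled_t a, scaled_t b) {
    return {a.m * b.m, a.e + b.e};
}

// fee = notional * rate
constexpr scaled_t fee(scaled_t notional, scaled_t rate) {
    return multiply(notional, rate);
}
```

A notional of 1 000 000.00 stored as (100000000, -2) times a rate of 0.00025 stored as (25, -5) gives (2500000000, -7), that is 250.0000000. The exponent simply grows; bringing it back to a display scale can be deferred until the value leaves the hot path.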

Risk limit — accumulate + clamp (10 000 values)

relative	ns/op	type
100.0 %	219 475	`boost decimal64` — baseline
1370 %	16 017	`double`
279 %	78 644	`intel bid64`
612 %	35 873	`fixed64 -4`
227 %	96 909	`scaled int64`

Under realistic mixed-op pressure (add + branch + assign per step), boost is still the slowest decimal type.
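For readers who want to picture the workload, here is a sketch of the add + branch + assign loop on plain fixed-scale significands. The function name and the symmetric limit are my assumptions; the benchmark's exact clamp bounds are not reproduced here.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Running exposure, clamped to [-limit, +limit] after every step.
// All values are significands at one shared fixed scale (for example 10^-4).
inline std::int64_t accumulate_clamped(const std::vector<std::int64_t>& deltas,
                                       std::int64_t limit) {
    std::int64_t exposure = 0;
    for (std::int64_t delta : deltas)
        exposure = std::clamp(exposure + delta, -limit, limit);  // add, branch, assign
    return exposure;
}
```

With a decimal floating-point type, each of those three steps goes through decode and normalisation logic; with a fixed scale they are ordinary integer instructions.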

Environment

CPU: 11th Gen Intel Core i7-11700K @ 3.60 GHz · OS: Linux 5.15 x86_64
Build: -O3 -march=native, C++23
Baseline: boost::decimal = 100 %; numbers above 100 % are faster than boost.
CPU frequency scaling was active (powersave governor + turbo); figures are indicative, not lab-grade.

Speedup relative to boost::decimal (>1× = faster). Intel Core i7-11700K, -O3 -march=native, C++23.

type	construct	decompose	compare	accumulate	multiply	parse	format
`double`	1.3×	2.7×	3.9×	23.8×	5.7×	0.95×	0.66×
`intel bid64`	1.1×	3.6×	1.2×	3.7×	1.2×	3.2×	2.9×
`fixed64 -4`	1.6×	—	2.8×	18.9×	1.8×	—	—
`scaled int64`	0.9×	6.1×	2.9×	6.3×	2.7×	—	8.3×
`bcd64`	0.1×	—	—	0.15×	0.05×	0.2×	2.4×

From the numbers, Boost.Decimal is on the money, but there is still room for improvement. To be fair, libdecimal by VargaLABS is not a copy-paste distribution of the original Intel library with a thin C++ layer glued on top. Besides the boring gymnastics required to make the Intel code work cleanly across platforms, libdecimal adds a custom parser, import/export support, and display functions, together with the slim and pleasant C++ syntax you may recognize from H5CPP.

Returning to Boost.Decimal: it is a good, modern, header-only implementation. If you already use Boost, it will not add significant friction to the pipeline, and it can definitely lower development cost.

use case	recommended type	reason
Arithmetic — PnL, accumulation	`fixed64<-4>`	7–19× faster; compile-time exponent eliminates runtime normalization
Multiply hot path — fees, notional	`scaled<int64_t>`	2.7× faster; avoids forced fixed-scale rescaling on multiply
Compare / sort / rank	`fixed64<-4>`	2.8–4.5× faster
Wire encode	`scaled<int64_t>`	field read, near-free
Wire decode	`fixed64<-4>`	3× faster; avoids per-value exponent normalization
String parse	`intel bid64`	3.2× faster than Boost.Decimal
String format	`scaled<int64_t>`	8.3× faster when coefficient/exponent formatting is sufficient
Inspect internals / serialize	`scaled<int64_t>`	6–18× faster via direct `as_pair()` access
General purpose / mixed scale	`intel bid64`	competitive across most operations, without a fixed-exponent constraint
Avoid for arithmetic hot paths	`bcd64`	roughly 7–22× slower than Boost.Decimal across arithmetic operations

The takeaway: decimal floating point is powerful, but it is not something to spread over the whole system like Marmite on toast. Use it where mixed scale, interchange, or standards compatibility matter. For hot arithmetic paths, especially in trading-style workloads, fixed-scale and scaled-integer representations may perform better.