HDF5 Group Overhead: Measuring the Cost of 100,000 Groups
📦 The Problem
On the [HDF5 forum][1], a user posed a question many developers eventually run into:
"We want to serialize an object tree into HDF5, where every object becomes a group. But group overhead seems large. Why is my 5 MB dataset now 120 MB?"
That’s a fair question, so we set out to quantify it.
⚙️ Experimental Setup
We generated 100,000 unique group names with `h5::utils::get_test_data<std::string>(N)`, then created one HDF5 group per name at the root level of an HDF5 file using H5CPP:
```cpp
for (size_t n = 0; n < N; n++)  // one empty group per name; h5::gr_t closes the hid_t via RAII
    h5::gr_t{H5Gcreate(root, names[n].data(), H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT)};
```
We measured three quantities (a self-contained sketch of the whole benchmark follows this list):
- Total HDF5 file size on disk
- Total size of the strings used
- Net overhead from metadata alone
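For reference, an end-to-end version of the benchmark might look like the sketch below. It uses the plain HDF5 C API plus `std::filesystem` rather than the H5CPP handles of the actual run, and the file name `groups.h5` and the locally generated `obj-<n>` names are illustrative stand-ins for `h5::utils::get_test_data`.

```cpp
#include <hdf5.h>
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <numeric>
#include <string>
#include <vector>

int main() {
    constexpr size_t N = 100'000;

    // Generate unique names locally (stand-in for h5::utils::get_test_data).
    std::vector<std::string> names;
    names.reserve(N);
    for (size_t n = 0; n < N; n++)
        names.push_back("obj-" + std::to_string(n));

    // Create one empty group per name, directly under the root group.
    hid_t fd = H5Fcreate("groups.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    for (const auto& name : names) {
        hid_t g = H5Gcreate(fd, name.c_str(), H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Gclose(g);
    }
    H5Fclose(fd);

    // Payload = total bytes of the name strings; everything else is metadata.
    const size_t payload = std::accumulate(names.begin(), names.end(), size_t{0},
        [](size_t sum, const std::string& s) { return sum + s.size(); });
    const std::uintmax_t file_size = std::filesystem::file_size("groups.h5");

    std::printf("file size : %ju bytes\n", file_size);
    std::printf("payload   : %zu bytes\n", payload);
    std::printf("overhead  : %ju bytes (~%.0f B/group)\n",
                file_size - payload, double(file_size - payload) / N);
}
```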
📊 Results
```
Total file size      : 79,447,104 bytes  (~79.4 MB)
Total string payload :  1,749,862 bytes  (~1.7 MB)
----------------------------------------------------
Net metadata overhead: 77,697,242 bytes  (~77.7 MB)
Average per group    :       ~776 bytes
```
Yep. Each group costs roughly 776 bytes, even when it contains no datasets or attributes.
📈 Visual Summary
| Entry Count | File Size | Payload Size | Overhead | Avg/Group |
|---|---|---|---|---|
| 100,000 | 79.45 MB | 1.75 MB | ~77.7 MB | ~776 B |
🧠 Why So Expensive?
HDF5 groups are not just simple folders—they are implemented using B-trees and heaps. Each group object has:
- A header
- Link messages
- Heap storage for names
- Possibly indexed storage for lookup
This structure scales well for lookup, but it incurs a fixed cost for every group created. The sketch below shows how to ask the library which link-storage scheme a group actually uses.
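A minimal sketch using `H5Gget_info`, assuming the `groups.h5` file produced by the benchmark above:

```cpp
#include <hdf5.h>
#include <cstdio>

int main() {
    hid_t fd = H5Fopen("groups.h5", H5F_ACC_RDONLY, H5P_DEFAULT);

    H5G_info_t info;
    H5Gget_info(fd, &info);   // the file handle doubles as the root group location

    std::printf("links in root : %llu\n", (unsigned long long)info.nlinks);
    std::printf("link storage  : %s\n",
        info.storage_type == H5G_STORAGE_TYPE_SYMBOL_TABLE ? "symbol table (B-tree + local heap)"
      : info.storage_type == H5G_STORAGE_TYPE_COMPACT      ? "compact (in object header)"
      :                                                      "dense (fractal heap)");
    H5Fclose(fd);
}
```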
🛠 Can Compression Help?
No. Compression filters apply to chunked dataset data, not to metadata; group structures (object headers, B-trees, heaps) are always stored uncompressed. The sketch below shows where compression does take effect.
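To make the contrast concrete, here is a minimal sketch of applying a deflate filter to a chunked dataset; the file and dataset names are illustrative.

```cpp
#include <hdf5.h>

int main() {
    hsize_t dims[1]  = {1000};
    hsize_t chunk[1] = {100};

    hid_t fd = H5Fcreate("compressed.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t sp = H5Screate_simple(1, dims, nullptr);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);   // filters require a chunked layout
    H5Pset_deflate(dcpl, 6);        // gzip level 6, applied to dataset chunks only

    hid_t ds = H5Dcreate(fd, "data", H5T_NATIVE_DOUBLE, sp,
                         H5P_DEFAULT, dcpl, H5P_DEFAULT);
    // ...write with H5Dwrite as usual; the group/link metadata around this
    // dataset remains uncompressed.

    H5Dclose(ds); H5Pclose(dcpl); H5Sclose(sp); H5Fclose(fd);
}
```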
💡 Recommendations
- Avoid deep or wide group hierarchies with many small entries
- If representing an object tree:
  - Consider flat structures with table-like metadata
  - Store object metadata as compound datasets or attributes (see the sketch below)
- If you're tracking time-series or per-sample metadata:
  - Store it as datasets with indexing, not groups
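As an illustration of the compound-dataset route, the sketch below stores per-object records as rows of a single table instead of one group per object. The record layout and field names are hypothetical.

```cpp
#include <hdf5.h>
#include <cstring>

// Hypothetical per-object record: one row replaces one group.
struct record_t { char name[32]; double value; unsigned flags; };

int main() {
    // Describe the in-memory struct layout to HDF5 as a compound type.
    hid_t rec = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
    hid_t str = H5Tcopy(H5T_C_S1);
    H5Tset_size(str, 32);
    H5Tinsert(rec, "name",  HOFFSET(record_t, name),  str);
    H5Tinsert(rec, "value", HOFFSET(record_t, value), H5T_NATIVE_DOUBLE);
    H5Tinsert(rec, "flags", HOFFSET(record_t, flags), H5T_NATIVE_UINT);

    record_t rows[2] = {};
    std::strcpy(rows[0].name, "obj-0"); rows[0].value = 1.0; rows[0].flags = 0;
    std::strcpy(rows[1].name, "obj-1"); rows[1].value = 2.0; rows[1].flags = 1;

    // One dataset holds all object records.
    hsize_t dims[1] = {2};
    hid_t fd = H5Fcreate("flat.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t sp = H5Screate_simple(1, dims, nullptr);
    hid_t ds = H5Dcreate(fd, "objects", rec, sp, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(ds, rec, H5S_ALL, H5S_ALL, H5P_DEFAULT, rows);

    H5Dclose(ds); H5Sclose(sp); H5Fclose(fd);
    H5Tclose(str); H5Tclose(rec);
}
```

On a typical 64-bit ABI each record here is 48 bytes, so an object costs tens of bytes of payload instead of roughly 776 bytes of pure group metadata.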
🔚 Final Thoughts
HDF5 is flexible—but that flexibility has a price when misapplied. Using groups to represent every atomic item or configuration object results in significant metadata bloat.
Use them judiciously. Use datasets liberally.