HDF5 Group Overhead: Measuring the Cost of 100,000 Groups
📦 The Problem
On the [HDF5 forum][1], a user posed a question many developers eventually run into:
"We want to serialize an object tree into HDF5, where every object becomes a group. But group overhead seems large. Why is my 5 MB dataset now 120 MB?"
That’s a fair question, so we set out to quantify it.
⚙️ Experimental Setup
We generated 100,000 unique group names with `h5::utils::get_test_data<std::string>(N)`, then created one HDF5 group per name at the root level of an HDF5 file using H5CPP:
```cpp
for (size_t n = 0; n < N; n++)  // one empty group per name; h5::gr_t closes the hid_t via RAII
    h5::gr_t{H5Gcreate(root, names[n].data(), H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT)};
```
We measured three quantities (a self-contained sketch of the whole benchmark follows this list):
- Total HDF5 file size on disk
- Total size of the strings used
- Net overhead from metadata alone
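For reference, an end-to-end version of the benchmark might look like the sketch below. It uses the plain HDF5 C API plus `std::filesystem` rather than the H5CPP handles of the actual run, and the file name `groups.h5` and the locally generated `obj-<n>` names are illustrative stand-ins for `h5::utils::get_test_data`.

```cpp
#include <hdf5.h>
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <numeric>
#include <string>
#include <vector>

int main() {
    constexpr size_t N = 100'000;

    // Generate unique names locally (stand-in for h5::utils::get_test_data).
    std::vector<std::string> names;
    names.reserve(N);
    for (size_t n = 0; n < N; n++)
        names.push_back("obj-" + std::to_string(n));

    // Create one empty group per name, directly under the root group.
    hid_t fd = H5Fcreate("groups.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    for (const auto& name : names) {
        hid_t g = H5Gcreate(fd, name.c_str(), H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Gclose(g);
    }
    H5Fclose(fd);

    // Payload = total bytes of the name strings; everything else is metadata.
    const size_t payload = std::accumulate(names.begin(), names.end(), size_t{0},
        [](size_t sum, const std::string& s) { return sum + s.size(); });
    const std::uintmax_t file_size = std::filesystem::file_size("groups.h5");

    std::printf("file size : %ju bytes\n", file_size);
    std::printf("payload   : %zu bytes\n", payload);
    std::printf("overhead  : %ju bytes (~%.0f B/group)\n",
                file_size - payload, double(file_size - payload) / N);
}
```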
📊 Results
```
Total file size      : 79,447,104 bytes  (~79.4 MB)
Total string payload :  1,749,862 bytes  (~1.7 MB)
----------------------------------------------------
Net metadata overhead: 77,697,242 bytes  (~77.7 MB)
Average per group    :       ~776 bytes
```
Yep. Each group costs roughly 776 bytes, even when it contains no datasets or attributes.
📈 Visual Summary
| Entry Count | File Size | Payload Size | Overhead | Avg/Group |
|---|---|---|---|---|
| 100,000 | 79.45 MB | 1.75 MB | ~77.7 MB | ~776 B |
🧠 Why So Expensive?
HDF5 groups are not just simple folders—they are implemented using B-trees and heaps. Each group object has:
- A header
- Link messages
- Heap storage for names
- Possibly indexed storage for lookup
This structure scales well for lookup, but it incurs a fixed cost for every group created. The sketch below shows how to ask the library which link-storage scheme a group actually uses.
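A minimal sketch using `H5Gget_info`, assuming the `groups.h5` file produced by the benchmark above:

```cpp
#include <hdf5.h>
#include <cstdio>

int main() {
    hid_t fd = H5Fopen("groups.h5", H5F_ACC_RDONLY, H5P_DEFAULT);

    H5G_info_t info;
    H5Gget_info(fd, &info);   // the file handle doubles as the root group location

    std::printf("links in root : %llu\n", (unsigned long long)info.nlinks);
    std::printf("link storage  : %s\n",
        info.storage_type == H5G_STORAGE_TYPE_SYMBOL_TABLE ? "symbol table (B-tree + local heap)"
      : info.storage_type == H5G_STORAGE_TYPE_COMPACT      ? "compact (in object header)"
      :                                                      "dense (fractal heap)");
    H5Fclose(fd);
}
```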
🛠 Can Compression Help?
No. Compression filters apply to chunked dataset data, not to metadata; group structures (object headers, B-trees, heaps) are always stored uncompressed. The sketch below shows where compression does take effect.
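To make the contrast concrete, here is a minimal sketch of applying a deflate filter to a chunked dataset; the file and dataset names are illustrative.

```cpp
#include <hdf5.h>

int main() {
    hsize_t dims[1]  = {1000};
    hsize_t chunk[1] = {100};

    hid_t fd = H5Fcreate("compressed.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t sp = H5Screate_simple(1, dims, nullptr);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);   // filters require a chunked layout
    H5Pset_deflate(dcpl, 6);        // gzip level 6, applied to dataset chunks only

    hid_t ds = H5Dcreate(fd, "data", H5T_NATIVE_DOUBLE, sp,
                         H5P_DEFAULT, dcpl, H5P_DEFAULT);
    // ...write with H5Dwrite as usual; the group/link metadata around this
    // dataset remains uncompressed.

    H5Dclose(ds); H5Pclose(dcpl); H5Sclose(sp); H5Fclose(fd);
}
```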
💡 Recommendations
- Avoid deep or wide group hierarchies with many small entries
- If representing an object tree:
  - Consider flat structures with table-like metadata
  - Store object metadata as compound datasets or attributes (see the sketch below)
- If you're tracking time-series or per-sample metadata:
  - Store it as datasets with indexing, not groups
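As an illustration of the compound-dataset route, the sketch below stores per-object records as rows of a single table instead of one group per object. The record layout and field names are hypothetical.

```cpp
#include <hdf5.h>
#include <cstring>

// Hypothetical per-object record: one row replaces one group.
struct record_t { char name[32]; double value; unsigned flags; };

int main() {
    // Describe the in-memory struct layout to HDF5 as a compound type.
    hid_t rec = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
    hid_t str = H5Tcopy(H5T_C_S1);
    H5Tset_size(str, 32);
    H5Tinsert(rec, "name",  HOFFSET(record_t, name),  str);
    H5Tinsert(rec, "value", HOFFSET(record_t, value), H5T_NATIVE_DOUBLE);
    H5Tinsert(rec, "flags", HOFFSET(record_t, flags), H5T_NATIVE_UINT);

    record_t rows[2] = {};
    std::strcpy(rows[0].name, "obj-0"); rows[0].value = 1.0; rows[0].flags = 0;
    std::strcpy(rows[1].name, "obj-1"); rows[1].value = 2.0; rows[1].flags = 1;

    // One dataset holds all object records.
    hsize_t dims[1] = {2};
    hid_t fd = H5Fcreate("flat.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t sp = H5Screate_simple(1, dims, nullptr);
    hid_t ds = H5Dcreate(fd, "objects", rec, sp, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(ds, rec, H5S_ALL, H5S_ALL, H5P_DEFAULT, rows);

    H5Dclose(ds); H5Sclose(sp); H5Fclose(fd);
    H5Tclose(str); H5Tclose(rec);
}
```

On a typical 64-bit ABI each record here is 48 bytes, so an object costs tens of bytes of payload instead of roughly 776 bytes of pure group metadata.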
🔚 Final Thoughts
HDF5 is flexible—but that flexibility has a price when misapplied. Using groups to represent every atomic item or configuration object results in significant metadata bloat.
Use them judiciously. Use datasets liberally.