Optimizing Reads of Very-Wide Compound HDF5 Types
The User’s Problem (Matt, Apr 2023)
Matt is struggling with reading HDF5 datasets built from extremely wide compound types (think up to ~460 fields):
- Data is stored as an Nx1 array of compound structs with hundreds of fields, compressed with gzip.
- Reads dominate runtime (90%), mostly in Python/h5py.
- In practice, they need all rows but only a small subset of fields. Yet reading fields one at a time is slow, and full-row reads are even slower (both sketched below).
- Splitting each field into its own dataset killed write performance; the current design isn't cutting it.
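To make the baseline concrete, here is a minimal sketch of that read pattern in h5py; the file name, dataset name ("table"), and field names are hypothetical stand-ins:

```python
import h5py

# Baseline sketch: "table" (assumed name) is an Nx1 compound dataset with
# hundreds of fields, stored in gzip-compressed chunks.
with h5py.File("data.h5", "r") as f:
    ds = f["table"]

    # Field-at-a-time reads: HDF5 still has to read and decompress whole
    # chunks (all fields) on every pass, so each extra field repeats the work.
    energy = ds["energy"]    # hypothetical field name
    status = ds["status"]    # hypothetical field name

    # Full-row read: materializes all ~460 fields just to keep a few.
    rows = ds[:]
    subset = rows[["energy", "status"]]
```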
My Take: Swap Width for Depth With Smart Packing
Here’s how I’d rethink the layout from an HDF5/H5CPP standpoint:
1. Chunking + Field Grouping
Pack related fields together into chunked arrays, based on how they’re queried, not just what’s easiest to write.
- Define field groups, each stored as its own chunked dataset holding a subset of fields that you often read together.
- That way, each read aligns with your real access pattern, boosting I/O efficiency.
- Yes, some fields may be stored twice in different groups, but that is a conscious space-for-speed trade-off. A write-side sketch of this layout follows the list.
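A minimal write-side sketch of that layout in h5py (used here because the reads happen in Python); the group boundaries, field names, row count, and chunk size are illustrative assumptions, not a prescription:

```python
import h5py
import numpy as np

N = 1_000_000          # total rows (assumed)
CHUNK_ROWS = 16_384    # rows per chunk; tune against row size and query span (assumed)

# Two field groups derived from query patterns (hypothetical names): fields
# that are usually read together live in the same narrow compound dataset.
kinematics_dt = np.dtype([("px", "f8"), ("py", "f8"), ("pz", "f8")])
bookkeeping_dt = np.dtype([("run", "i4"), ("event", "i8"), ("status", "i2")])

with h5py.File("grouped.h5", "w") as f:
    for name, dt in [("kinematics", kinematics_dt),
                     ("bookkeeping", bookkeeping_dt)]:
        f.create_dataset(
            name,
            shape=(N,),
            dtype=dt,
            chunks=(CHUNK_ROWS,),      # chunk along rows so scans touch few chunks
            compression="gzip",
            compression_opts=4,
        )
```

Writers still append row blocks, only into a handful of narrow datasets rather than hundreds of per-field ones, which is what hurt the earlier split.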
2. Optimize Based on Queries
Design your dataset structure backward, starting from the query patterns rather than from upstream modeling considerations. Pre-encode the access patterns in your layout; the read-side sketch below shows the payoff.
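On the read side, a query then touches only the group it needs, and can narrow further to individual fields; another hedged sketch against the hypothetical layout above:

```python
import h5py

with h5py.File("grouped.h5", "r") as f:
    # Reads only the "kinematics" chunks; unrelated fields are never decompressed.
    kin = f["kinematics"][:]                        # structured array: px, py, pz

    # Narrow to selected fields without materializing the whole group
    # (Dataset.fields is available in recent h5py releases).
    px_py = f["kinematics"].fields(["px", "py"])[:]
```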
Summary Table
| Strategy | Pros | Trade-Offs |
|---|---|---|
| Wide compound structs | Simple to write | Slow for selective reads |
| Per-field datasets | Very selective reads | Sluggish writes, many datasets |
| Chunked field grouping (recommended) | Fast reads for grouped fields, aligned I/O | Slight redundancy, more planning |
TL;DR
Reading “all rows but a few fields” from compound datasets that are hundreds of fields wide can crush performance, especially from Python/h5py.
The fix? Restructure your HDF5 layout to align with real query behavior: use chunked field groups that keep frequently accessed fields together. H5CPP supports this cleanly with a modern template-based API.