Optimizing Reads of Very-Wide Compound HDF5 Types
The User’s Problem (Matt, Apr 2023)
Matt is struggling with reading HDF5 datasets built from extremely wide compound types (think up to ~460 fields):
- Data is stored as an Nx1 array of compound structs with hundreds of fields, compressed with gzip.
- Reads dominate runtime (90%), mostly in Python/h5py.
- In practice, they need all rows but only a small subset of fields. Yet reading fields one at a time is slow, and full-row reads are even slower (both sketched below).
- Splitting each field into its own dataset killed write performance; the current design isn't cutting it.
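To make the baseline concrete, here is a minimal sketch of that read pattern in h5py; the file name, dataset name ("table"), and field names are hypothetical stand-ins:

```python
import h5py

# Baseline sketch: "table" (assumed name) is an Nx1 compound dataset with
# hundreds of fields, stored in gzip-compressed chunks.
with h5py.File("data.h5", "r") as f:
    ds = f["table"]

    # Field-at-a-time reads: HDF5 still has to read and decompress whole
    # chunks (all fields) on every pass, so each extra field repeats the work.
    energy = ds["energy"]    # hypothetical field name
    status = ds["status"]    # hypothetical field name

    # Full-row read: materializes all ~460 fields just to keep a few.
    rows = ds[:]
    subset = rows[["energy", "status"]]
```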
My Take: Swap Width for Depth With Smart Packing
Here’s how I’d rethink the layout from an HDF5/H5CPP standpoint:
1. Chunking + Field Grouping
Pack related fields together into chunked arrays, based on how they’re queried, not just what’s easiest to write.
- Define field groups, each stored as its own chunked dataset holding a subset of fields that you often read together.
- That way, each read aligns with your real access pattern, boosting I/O efficiency.
- Yes, some fields may be stored twice in different groups, but that is a conscious space-for-speed trade-off. A write-side sketch of this layout follows the list.
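A minimal write-side sketch of that layout in h5py (used here because the reads happen in Python); the group boundaries, field names, row count, and chunk size are illustrative assumptions, not a prescription:

```python
import h5py
import numpy as np

N = 1_000_000          # total rows (assumed)
CHUNK_ROWS = 16_384    # rows per chunk; tune against row size and query span (assumed)

# Two field groups derived from query patterns (hypothetical names): fields
# that are usually read together live in the same narrow compound dataset.
kinematics_dt = np.dtype([("px", "f8"), ("py", "f8"), ("pz", "f8")])
bookkeeping_dt = np.dtype([("run", "i4"), ("event", "i8"), ("status", "i2")])

with h5py.File("grouped.h5", "w") as f:
    for name, dt in [("kinematics", kinematics_dt),
                     ("bookkeeping", bookkeeping_dt)]:
        f.create_dataset(
            name,
            shape=(N,),
            dtype=dt,
            chunks=(CHUNK_ROWS,),      # chunk along rows so scans touch few chunks
            compression="gzip",
            compression_opts=4,
        )
```

Writers still append row blocks, only into a handful of narrow datasets rather than hundreds of per-field ones, which is what hurt the earlier split.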
2. Optimize Based on Queries
Design your dataset structure backward, starting from the query patterns rather than from upstream modeling considerations. Pre-encode the access patterns in your layout; the read-side sketch below shows the payoff.
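On the read side, a query then touches only the group it needs, and can narrow further to individual fields; another hedged sketch against the hypothetical layout above:

```python
import h5py

with h5py.File("grouped.h5", "r") as f:
    # Reads only the "kinematics" chunks; unrelated fields are never decompressed.
    kin = f["kinematics"][:]                        # structured array: px, py, pz

    # Narrow to selected fields without materializing the whole group
    # (Dataset.fields is available in recent h5py releases).
    px_py = f["kinematics"].fields(["px", "py"])[:]
```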
Summary Table
| Strategy | Pros | Trade-Offs |
|---|---|---|
| Wide compound structs | Simple to write | Slow for selective reads |
| Per-field datasets | Very selective reads | Sluggish writes, many datasets |
| Chunked field grouping (recommended) | Fast reads for grouped fields, aligned I/O | Slight redundancy, more planning |
TL;DR
Reading “all rows but a few fields” from compound datasets that are hundreds of fields wide can crush performance, especially from Python/h5py.
The fix? Restructure your HDF5 layout to align with real query behavior: use chunked field groups that keep frequently accessed fields together. H5CPP supports this cleanly with a modern template-based API.