Independent Dataset Extension in Parallel HDF5?
When mixing multiple data sources in a parallel application—say, one process streaming oscilloscope traces while another dumps camera frames—you’d like each to append to its own dataset independently. Unfortunately, in Parallel HDF5 this remains a collective operation.
The Experiment
Using H5CPP and phdf5-1.10.6, we created an MPI program (mpi-extend.cpp) in which each rank writes to its own dataset. The minimal working example confirms:
- H5Dset_extent is collective: every rank must participate.
- If one process attempts to extend while others do not, the program hangs indefinitely.
Output with 4 ranks:
[rank] 2 [total elements] 0
[dimensions] current: {346,0} maximum: {346,inf}
[selection] start: {0,0} end: {345,inf}
...
h5ls -r mpi-extend.h5
/io-00 Dataset {346, 400/Inf}
/io-01 Dataset {465, 0/Inf}
/io-02 Dataset {136, 0/Inf}
/io-03 Dataset {661, 0/Inf}
All four datasets exist, but their extents change only when every rank joins the call.
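To make the collective requirement concrete, here is a minimal sketch of the pattern every rank has to follow. This is not the original mpi-extend.cpp; it uses the plain HDF5 C API rather than H5CPP to keep the calls explicit, the datasets are 1-D rather than 2-D for brevity, and all names and sizes are illustrative. The point is that every rank creates and keeps open every per-rank dataset, and every rank joins every H5Dset_extent call, even when it has nothing of its own to append.

```cpp
// collective-extend-sketch.cpp -- illustrative only, not the original mpi-extend.cpp
// Build (assumed): mpic++ -std=c++17 collective-extend-sketch.cpp -lhdf5
#include <hdf5.h>
#include <mpi.h>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // one shared file through the MPI-IO driver
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    // collective: every rank creates, and keeps open, every per-rank dataset
    // (chunked, with an unlimited dimension so it can grow later)
    hsize_t dims[1] = {0}, maxdims[1] = {H5S_UNLIMITED}, chunk[1] = {1024};
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    std::vector<hid_t> dsets(size);
    for (int i = 0; i < size; ++i) {
        std::string name = "io-0" + std::to_string(i);   // loosely mirrors /io-00 ...
        hid_t space = H5Screate_simple(1, dims, maxdims);
        dsets[i] = H5Dcreate2(file, name.c_str(), H5T_NATIVE_DOUBLE, space,
                              H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Sclose(space);
    }

    // collective: to grow ANY dataset, ALL ranks must call H5Dset_extent on it
    // with the same new size -- hence the Allgather of the per-rank sizes first
    unsigned long long my_size = 100 + 10ull * rank;     // illustrative element counts
    std::vector<unsigned long long> new_sizes(size);
    MPI_Allgather(&my_size, 1, MPI_UNSIGNED_LONG_LONG,
                  new_sizes.data(), 1, MPI_UNSIGNED_LONG_LONG, MPI_COMM_WORLD);
    for (int i = 0; i < size; ++i) {
        hsize_t extent[1] = {new_sizes[i]};
        H5Dset_extent(dsets[i], extent);                 // every rank, every dataset
    }

    // ...each rank would now select its own hyperslab and write its data...

    for (hid_t d : dsets) H5Dclose(d);
    H5Pclose(dcpl); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```

If any rank skips the H5Dset_extent loop, the remaining ranks block inside the call, which is exactly the hang described above.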
Why This Matters
Independent I/O would let unrelated processes progress without waiting on each other. In mixed workloads (fast sensors vs. slow imagers), the collective barrier becomes a bottleneck. A 2011 forum post suggested future support—but as of today, the situation is unchanged.
Workarounds
- Dedicated writer: funnel data from all producers into a single process or thread (e.g., via ZeroMQ), which alone performs the H5Dset_extent and write calls.
- Multiple files: let each process own its own file and merge them later.
- Virtual Datasets (VDS): stitch the independent files/datasets into a logical whole after the fact (a merge sketch follows below).
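For the VDS route, the merge step can look roughly like this. It is a sketch, not tested against the setup above: the file names rank-<i>.h5, the source dataset name data, and the output names view.h5 and /all are assumptions, and the per-source lengths are queried at merge time rather than known in advance.

```cpp
// vds-merge-sketch.cpp -- stitch per-process files into one virtual dataset (illustrative)
#include <hdf5.h>
#include <string>
#include <vector>

int main() {
    const int nfiles = 4;                                // number of per-process files (assumed)
    std::vector<hsize_t> lengths(nfiles);

    // query the current extent of each 1-D source dataset
    for (int i = 0; i < nfiles; ++i) {
        std::string fname = "rank-" + std::to_string(i) + ".h5";
        hid_t f = H5Fopen(fname.c_str(), H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t d = H5Dopen2(f, "data", H5P_DEFAULT);
        hid_t s = H5Dget_space(d);
        H5Sget_simple_extent_dims(s, &lengths[i], nullptr);
        H5Sclose(s); H5Dclose(d); H5Fclose(f);
    }

    hsize_t total = 0;
    for (hsize_t n : lengths) total += n;

    // build the virtual layout: consecutive blocks, one mapping per source file
    hsize_t vdims[1] = {total};
    hid_t vspace = H5Screate_simple(1, vdims, nullptr);
    hid_t dcpl   = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t offset = 0;
    for (int i = 0; i < nfiles; ++i) {
        hsize_t start[1] = {offset}, count[1] = {lengths[i]};
        H5Sselect_hyperslab(vspace, H5S_SELECT_SET, start, nullptr, count, nullptr);
        hsize_t sdims[1] = {lengths[i]};
        hid_t src_space = H5Screate_simple(1, sdims, nullptr);
        std::string fname = "rank-" + std::to_string(i) + ".h5";
        H5Pset_virtual(dcpl, vspace, fname.c_str(), "data", src_space);
        H5Sclose(src_space);
        offset += lengths[i];
    }

    // create the virtual dataset /all in a small stand-alone view file
    hid_t vfile = H5Fcreate("view.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t vdset = H5Dcreate2(vfile, "all", H5T_NATIVE_DOUBLE, vspace,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dclose(vdset); H5Pclose(dcpl); H5Sclose(vspace); H5Fclose(vfile);
    return 0;
}
```

The virtual dataset /all then reads as one contiguous dataset while the per-process files stay untouched, which is what makes VDS attractive as an after-the-fact merge.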
Requirements
- HDF5 built with MPI (PHDF5)
- H5CPP v1.10.6-1
- A C++17 (or later) compiler
Takeaway
- 🚫 Independent dataset extension is still not supported in Parallel HDF5.
- ✅ Collective calls remain mandatory for H5Dset_extent.
- ⚙️ Workarounds like dedicated writers or VDS can help in practice.
The bottom line: if your MPI processes produce data at different rates, plan your workflow around the collective nature of HDF5 dataset extension.