Independent Dataset Extension in Parallel HDF5?

When mixing multiple data sources in a parallel application, say one process streaming oscilloscope traces while another dumps camera frames, you'd like each to append to its own dataset independently. Unfortunately, in Parallel HDF5 extending a dataset (H5Dset_extent) remains a collective operation.

The Experiment

Using H5CPP and phdf5-1.10.6, we created an MPI program (mpi-extend.cpp) in which each rank writes to its own dataset. The minimal working example (sketched below) confirms:

  • H5Dset_extent is collective: every rank must participate.
  • If one process attempts to extend while others do not, the program hangs indefinitely.
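For reference, a minimal sketch of the pattern follows. It is not the original mpi-extend.cpp: it uses the plain HDF5 C API instead of H5CPP, and the one-dimensional layout, chunk size, and extent values are assumptions made for illustration.

// Minimal sketch (not the original mpi-extend.cpp): each rank owns one
// chunked, unlimited dataset, yet H5Dset_extent must still be called by
// every rank -- calling it from a single rank deadlocks.
#include <hdf5.h>
#include <mpi.h>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // open the file for parallel access
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("mpi-extend.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    // dataset creation is collective: every rank creates every dataset
    hsize_t dims[1] = {0}, maxdims[1] = {H5S_UNLIMITED}, chunk[1] = {1024};
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    std::vector<hid_t> dsets(size);
    for (int i = 0; i < size; ++i) {
        hid_t space = H5Screate_simple(1, dims, maxdims);
        std::string name = "io-0" + std::to_string(i);
        dsets[i] = H5Dcreate(file, name.c_str(), H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Sclose(space);
    }

    // H5Dset_extent is collective too: all ranks must call it with the same
    // new size for the same dataset, even when only one rank has data for it.
    hsize_t new_size[1] = {400};
    for (int i = 0; i < size; ++i)
        H5Dset_extent(dsets[i], new_size);   // every rank, every dataset

    // If only the owning rank extended its dataset, e.g.
    //   if (i == rank) H5Dset_extent(dsets[i], new_size);
    // the ranks would wait on each other and the program would hang.

    for (int i = 0; i < size; ++i) H5Dclose(dsets[i]);
    H5Pclose(dcpl); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}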

Output with 4 ranks:

[rank] 2  [total elements] 0
[dimensions] current: {346,0}  maximum: {346,inf}
[selection]  start: {0,0}     end:{345,inf}
...
h5ls -r mpi-extend.h5
/io-00   Dataset {346, 400/Inf}
/io-01   Dataset {465, 0/Inf}
/io-02   Dataset {136, 0/Inf}
/io-03   Dataset {661, 0/Inf}

All datasets exist, but an extent changes only when every rank participates in the same H5Dset_extent call.
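One way to live with the constraint is to coordinate the collective calls explicitly: each rank announces how many elements it wants to append, then every rank walks the whole dataset list and extends together. The helper below is a sketch under that assumption; extend_all is a hypothetical name, and dsets/current stand for the per-rank dataset handles and current extents from the sketch above.

#include <hdf5.h>
#include <mpi.h>
#include <vector>

// Grow each rank's dataset by that rank's requested element count.
// Collective: every rank must reach this function before anyone returns.
void extend_all(const std::vector<hid_t>& dsets,
                std::vector<hsize_t>& current,       // current extent per dataset
                hsize_t my_request, MPI_Comm comm) {
    int size = 0;
    MPI_Comm_size(comm, &size);

    // everyone learns everyone else's append request
    std::vector<unsigned long long> requests(size);
    unsigned long long mine = my_request;
    MPI_Allgather(&mine, 1, MPI_UNSIGNED_LONG_LONG,
                  requests.data(), 1, MPI_UNSIGNED_LONG_LONG, comm);

    for (int i = 0; i < size; ++i) {
        if (requests[i] == 0) continue;              // nobody extends this one
        current[i] += requests[i];
        hsize_t new_size[1] = {current[i]};
        H5Dset_extent(dsets[i], new_size);           // collective on all ranks
    }
}

Ranks with nothing to append pass my_request = 0 and still enter the call; that forced synchronization is exactly the cost discussed in the next section.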

Why This Matters

Independent I/O would let unrelated processes progress without waiting on each other. In mixed workloads (fast sensors vs. slow imagers), the collective barrier becomes a bottleneck. A 2011 forum post suggested future support—but as of today, the situation is unchanged.

Workarounds

  • Dedicated writer: funnel data from all producers into a single process (e.g., via ZeroMQ) that alone performs the H5Dset_extent and write calls.
  • Multiple files: let each process own its file, merging later.
  • Virtual Datasets (VDS): stitch independent files/datasets into a logical whole after the fact (a sketch follows this list).
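To make the VDS route concrete, here is a sketch that stitches per-process files into one logical 1-D dataset by concatenation after the run. The file pattern io-0<rank>.h5, the source dataset name data, and the element counts are assumptions for illustration, not taken from the original program.

#include <hdf5.h>
#include <string>
#include <vector>

int main() {
    // per-file element counts; in practice query them with
    // H5Dget_space / H5Sget_simple_extent_dims on each source file
    std::vector<hsize_t> counts = {346, 465, 136, 661};   // hypothetical sizes
    hsize_t total = 0;
    for (hsize_t c : counts) total += c;

    hsize_t vdims[1] = {total};
    hid_t vspace = H5Screate_simple(1, vdims, nullptr);
    hid_t dcpl   = H5Pcreate(H5P_DATASET_CREATE);

    // map each source dataset onto its slice of the virtual dataset
    hsize_t offset[1] = {0};
    for (size_t i = 0; i < counts.size(); ++i) {
        hsize_t count[1] = {counts[i]};
        hid_t src_space = H5Screate_simple(1, count, nullptr);
        H5Sselect_hyperslab(vspace, H5S_SELECT_SET, offset, nullptr, count, nullptr);
        std::string src_file = "io-0" + std::to_string(i) + ".h5";
        H5Pset_virtual(dcpl, vspace, src_file.c_str(), "data", src_space);
        H5Sclose(src_space);
        offset[0] += counts[i];
    }

    hid_t file = H5Fcreate("merged.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t vds  = H5Dcreate(file, "all-traces", H5T_NATIVE_DOUBLE, vspace,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(vds); H5Pclose(dcpl); H5Sclose(vspace); H5Fclose(file);
    return 0;
}

Readers then see one contiguous dataset, while each producer kept appending to its own file at its own pace.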

Requirements

  • HDF5 built with MPI (PHDF5)
  • H5CPP v1.10.6-1
  • C++17 (or later) capable compiler

Takeaway

  • 🚫 Independent dataset extension is still not supported in Parallel HDF5.
  • ✅ Collective calls remain mandatory for H5Dset_extent.
  • ⚙️ Workarounds like dedicated writers or VDS can help in practice.

The bottom line: if your MPI processes produce data at different rates, plan your workflow around the collective nature of HDF5 dataset extension.