HDF5 Write Speeds: Matching Underlying Raw I/O

The Question (OP Concern)

A user observed that writing huge blocks (≥10 GB) with HDF5 was noticeably slower than equivalent raw writes. Understandably so: they were comparing structured I/O against unadorned write() calls. The question: is this performance gap unavoidable, or can HDF5 be tuned to close it?

What I Found (Steven Varga, Mar 28, 2023)

Here’s what I discovered, and benchmarked, on my Lenovo X1 running a recent Linux setup:

  • Running a cross‑product H5CPP benchmark, I consistently hit about 70–95% of the raw file system’s throughput when writing large blocks, which is clear evidence that HDF5 is not inherently slow at scale.
  • With tiny or fragmented writes, overhead becomes the dominant cost. As Gerd already pointed out, direct chunk I/O is the key to performance there: build your own packer or write path around full chunks.
  • Or, if you want simplicity with speed, use H5CPP’s h5::append(). It handles buffered I/O and chunk alignment for you, delivering high throughput without manual hackery.
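The chunk-packing idea from the second bullet can be sketched without any HDF5 calls at all: stage small records in a chunk-sized buffer and hand data to the sink only in full chunks. This is a minimal illustration of the technique, not H5CPP's actual implementation; `ChunkPacker` and its `Sink` callback are hypothetical names, and in real code the sink would be a direct chunk write such as H5Dwrite_chunk().

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch of chunk-aligned buffering: records of arbitrary size
// are staged in memory and emitted only as full, chunk-sized blocks, so the
// underlying storage layer sees large aligned writes instead of fragments.
class ChunkPacker {
public:
    using Sink = std::function<void(const char*, std::size_t)>;

    ChunkPacker(std::size_t chunk_bytes, Sink sink)
        : chunk_bytes_(chunk_bytes), sink_(std::move(sink)) {
        buffer_.reserve(chunk_bytes_);
    }

    // Append one record; flush whenever a full chunk has accumulated.
    void append(const char* data, std::size_t len) {
        std::size_t offset = 0;
        while (offset < len) {
            std::size_t room = chunk_bytes_ - buffer_.size();
            std::size_t take = std::min(room, len - offset);
            buffer_.insert(buffer_.end(), data + offset, data + offset + take);
            offset += take;
            if (buffer_.size() == chunk_bytes_) flush();
        }
    }

    // Emit any trailing partial chunk (call once, e.g. when closing the file).
    void flush() {
        if (!buffer_.empty()) {
            sink_(buffer_.data(), buffer_.size());
            buffer_.clear();
        }
    }

private:
    std::size_t chunk_bytes_;
    Sink sink_;
    std::vector<char> buffer_;
};
```

With, say, 5-byte records and an 8-byte chunk, twenty bytes of input reach the sink as two full 8-byte chunks plus one trailing 4-byte flush, regardless of how the records straddle chunk boundaries.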

Here’s a snippet from my test run:

steven@io:~/projects/h5bench/examples/capi$ make
g++ -I/usr/local/include -I/usr/include -I../../include -o capi-test.o   -std=c++17 -Wno-attributes -c capi-test.cpp
g++ capi-test.o -lhdf5  -lz -ldl -lm -o capi-test
taskset 0x1 ./capi-test
[name                                              ][total events][Mi events/s] [ms runtime / stddev] [    MiB/s / stddev ]
fixed length string CAPI                                    10000     625.0000         0.02     0.000   24461.70     256.9
fixed length string CAPI                                   100000     122.7898         0.81     0.038    4917.70     213.3
fixed length string CAPI                                  1000000      80.4531        12.43     0.217    3218.60      56.6
fixed length string CAPI                                 10000000      79.7568       125.38     0.140    3189.80       3.6
rm capi-test.o
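For readers decoding the columns above: the MiB/s figure follows from total bytes moved over the mean runtime. The helper below is my own reconstruction of that relationship, not code from the h5bench harness; `record_bytes` stands in for sizeof(fl_string_t), whose actual value the excerpt does not show.

```cpp
#include <cstddef>

// Hypothetical reconstruction of the benchmark's MiB/s column:
// total bytes transferred, divided by the mean runtime, scaled to mebibytes.
double mib_per_sec(std::size_t events, std::size_t record_bytes, double runtime_ms) {
    double bytes   = static_cast<double>(events) * static_cast<double>(record_bytes);
    double seconds = runtime_ms / 1000.0;
    return bytes / seconds / (1024.0 * 1024.0);
}
```

For example, one million 64-byte records written in 20 ms works out to roughly 3052 MiB/s, the same order as the mid-size rows in the table.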

int main(int argc, const char **argv){
  size_t max_size = *std::max_element(record_size.begin(), record_size.end());

  h5::fd_t fd = h5::create("h5cpp.h5", H5F_ACC_TRUNC);
  auto strings = h5::utils::get_test_data<std::string>(max_size, 10, sizeof(fl_string_t));
  // pack the test strings into fixed-length slots (not null-terminated at capacity)
  std::vector<char[sizeof(fl_string_t)]> data(strings.size());
  for (size_t i = 0; i < data.size(); i++)
      strncpy(data[i], strings[i].data(), sizeof(fl_string_t));

  // set the transfer size for each batch
  std::vector<size_t> transfer_size;
  for (auto i : record_size)
      transfer_size.push_back(i * sizeof(fl_string_t));

  // use H5CPP: convert the variable-length string type to fixed length
  h5::dt_t<fl_string_t> dt{H5Tcreate(H5T_STRING, sizeof(fl_string_t))};
  H5Tset_cset(dt, H5T_CSET_UTF8);

  std::vector<h5::ds_t> ds;
  // create separate dataset for each batch
  for(auto size: record_size) ds.push_back(
    h5::create<fl_string_t>(fd, fmt::format("fixed length string CAPI-{:010d}", size), 
    chunk_size, h5::current_dims{size}, dt));

  // EXPERIMENT: arguments, including the lambda function, may be passed in arbitrary order
  bh::throughput(
    bh::name{"fixed length string CAPI"}, record_size, warmup, sample,
    [&](size_t idx, size_t size_) -> double {
        hsize_t size = size_;
        // memory space
        h5::sp_t mem_space{H5Screate_simple(1, &size, nullptr )};
        H5Sselect_all(mem_space);
        // file space
        h5::sp_t file_space{H5Dget_space(ds[idx])};
        H5Sselect_all(file_space);
        // IO call
        H5Dwrite( ds[idx], dt, mem_space, file_space, H5P_DEFAULT, data.data());
        return transfer_size[idx];
    });
}

So, large writes are nearly as fast as raw file writes. The performance dip is most pronounced with smaller payloads.

What This Means In Practice

Write Size                HDF5 Performance       Takeaway
Large contiguous          ~70–95% of raw I/O     HDF5 is performant at scale
Small fragments           Lower efficiency       Use chunk-based buffering
Need simplicity + speed   Use h5::append()       Combines clarity and performance

TL;DR

  • HDF5 is fast: large-block writes approach raw I/O speeds.
  • Inefficiency creeps in with tiny or non-aligned writes.
  • Solution? Use direct chunk I/O, or H5CPP’s h5::append() for elegant, high-throughput data streaming.