April 2021

Fixed-Length vs. Variable-Length Storage in HDF5

HDF5 gives you two ways to store “string-like” or array-like data: fixed-length and variable-length. Each comes with trade-offs, and we benchmarked them head-to-head.

The Setup

We compared writing large arrays of simple POD records, stored either as:

  • Fixed-length fields: every record has the same size.
  • Variable-length fields: each record may grow or shrink.

The benchmark (hdf5-fixed-length-bench.cpp) measures throughput for millions of writes, simulating common HPC/quant workloads.
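
At the HDF5 type level, the difference is easy to see. Below is a minimal sketch (plain C API; the 32-byte size is an arbitrary choice for illustration): a fixed-length string element occupies exactly N bytes in the dataset, while a variable-length element is a small descriptor pointing at characters stored on the file's global heap.

```cpp
#include <hdf5.h>

// fixed-length: every element occupies exactly 32 bytes in the dataset
hid_t make_fixed_string_type() {
    hid_t dt = H5Tcreate(H5T_STRING, 32);
    H5Tset_cset(dt, H5T_CSET_UTF8);
    return dt;
}

// variable-length: elements are {length, pointer} descriptors; the actual
// characters live on the dataset's global heap, one indirection away
hid_t make_variable_string_type() {
    hid_t dt = H5Tcopy(H5T_C_S1);
    H5Tset_size(dt, H5T_VARIABLE);
    H5Tset_cset(dt, H5T_CSET_UTF8);
    return dt;
}
```

With that distinction in mind, here is the full benchmark: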

```cpp
#include <iostream>
#include <vector>
#include <algorithm>
#include <h5bench>
#include <h5cpp/core>
#include "non-pod-struct.hpp"
#include <h5cpp/io>
#include <fmt/core.h>
#include <fstream>

namespace bh = h5::bench;
bh::arg_x record_size{10'000}; //, 100'000, 1'000'000};
bh::warmup warmup{3};
bh::sample sample{10};
h5::dcpl_t chunk_size = h5::chunk{4096};

// cumulative payload bytes at each record_size boundary; used below to compute throughput
std::vector<size_t> get_transfer_size(const std::vector<std::string>& strings ){
    std::vector<size_t> transfer_size;
    for (size_t i =0, j=0, N = 0; i < strings.size(); i++){
        N += strings[i].length();
        if( i == record_size[j] - 1) j++, transfer_size.push_back(N);
    }
    return transfer_size;
}

template<class T> std::vector<T> convert(const std::vector<std::string>& strings){
    return std::vector<T>();
}
template <> std::vector<char[shim::pod_t::max_lenght::value]> convert(const std::vector<std::string>& strings){
    std::vector<char[shim::pod_t::max_lenght::value]> out(strings.size());
    for (size_t i = 0; i < out.size(); i++)
        strncpy(out[i], strings[i].data(), shim::pod_t::max_lenght::value);
    return out;
}

std::vector<const char*> get_data(const std::vector<std::string>& strings){
    std::vector<const char*> data(strings.size());
    // build an array of pointers to VL strings: one level of indirection
    for (size_t i = 0; i < data.size(); i++)
        data[i] = (char*) strings[i].data();
    return data;
}

std::vector<h5::ds_t> get_datasets(const h5::fd_t& fd, const std::string& name, h5::bench::arg_x& rs){
    std::vector<h5::ds_t> ds;

    for(size_t i=0; i< rs.rank; i++)
        ds.push_back( h5::create<std::string>(fd, fmt::format(name + "-{:010d}", rs[i]), h5::current_dims{rs[i]}, chunk_size));

    return ds;
}

int main(int argc, const char **argv){
    size_t max_size = *std::max_element(record_size.begin(), record_size.end());

    h5::fd_t fd = h5::create("h5cpp.h5", H5F_ACC_TRUNC);
    auto strings = h5::utils::get_test_data<std::string>(max_size, 10, shim::pod_t::max_lenght::value);

    // print a few sample strings to give a picture of the data
    fmt::print("[{:>5}] [{:^30}] [{:>6}]\n", "#", "value", "length");
    for(size_t i=0; i<10; i++) fmt::print("{:>2d}  {:>30}  {:>8d}\n", i, strings[i], strings[i].length());
    fmt::print("\n\n");

    { // POD: FIXED LENGTH STRING + ID
        h5::pt_t ds = h5::create<shim::pod_t>(fd, "FLstring h5::append<pod_t>", h5::max_dims{H5S_UNLIMITED}, chunk_size);
        std::vector<shim::pod_t> data(max_size);
        // we have to copy each string into the POD struct
        for (size_t i = 0; i < data.size(); i++)
            data[i].id = i, strncpy(data[i].name, strings[i].data(), shim::pod_t::max_lenght::value);

        // compute data transfer size, we will be using this to measure throughput:
        std::vector<size_t> transfer_size;
        for (auto i : record_size)
            transfer_size.push_back(i * sizeof(shim::pod_t));

        // actual measurement with burn in phase
        bh::throughput(
            bh::name{"FLstring h5::append<pod_t>"}, record_size, warmup, sample, ds,
            [&](hsize_t idx, hsize_t size) -> double {
                for (hsize_t k = 0; k < size; k++)
                    h5::append(ds, data[k]);
                return transfer_size[idx];
            });
    }

    { // VL STRING, INDEXED BY HDF5 B+TREE, h5::append<std::string>
        h5::pt_t ds = h5::create<std::string>(fd, "VLstring h5::append<std::vector<std::string>> ", h5::max_dims{H5S_UNLIMITED}, chunk_size);
        std::vector<size_t> transfer_size = get_transfer_size(strings);
        // actual measurement with burn in phase
        bh::throughput(
            bh::name{"VLstring h5::append<std::vector<std::string>>"}, record_size, warmup, sample,
            [&](hsize_t idx, hsize_t size) -> double {
                for (hsize_t i = 0; i < size; i++)
                    h5::append(ds, strings[i]);
                return transfer_size[idx];
            });
    }
    { // VL STRING, INDEXED BY HDF5 B+TREE std::vector<std::string>
        auto ds = get_datasets(fd, "VLstring h5::write<std::vector<const char*>> ", record_size);
        std::vector<const char*> data = get_data(strings);
        std::vector<size_t> transfer_size = get_transfer_size(strings);

        // actual measurement with burn in phase
        bh::throughput(
            bh::name{"VLstring h5::write<std::vector<const char*>>"}, record_size, warmup, sample,
            [&](hsize_t idx, hsize_t size) -> double {
                h5::write(ds[idx], data.data(), h5::count{size});
                return transfer_size[idx];
            });
    }

    { // VL STRING, INDEXED BY HDF5 B+TREE std::vector<std::string>
        auto ds = get_datasets(fd, "VLstring std::vector<std::string> ", record_size);
        std::vector<size_t> transfer_size = get_transfer_size(strings);
        // actual measurement with burn in phase
        bh::throughput(
            bh::name{"VLstring std::vector<std::string>"}, record_size, warmup, sample,
            [&](hsize_t idx, hsize_t size) -> double {
                h5::write(ds[idx], strings, h5::count{size});
                return transfer_size[idx];
            });
    }

    { // FL STRING, INDEXED BY HDF5 B+TREE std::vector<std::string>
        using fixed_t = char[shim::pod_t::max_lenght::value]; // type alias

        std::vector<size_t> transfer_size;
        for (auto i : record_size)
            transfer_size.push_back(i * sizeof(fixed_t));
        std::vector<fixed_t> data = convert<fixed_t>(strings);

        // modify VL type to fixed length
        h5::dt_t<fixed_t> dt{H5Tcreate(H5T_STRING, sizeof(fixed_t))};
        H5Tset_cset(dt, H5T_CSET_UTF8); 

        std::vector<h5::ds_t> ds;
        for(auto size: record_size) ds.push_back(
                h5::create<fixed_t>(fd, fmt::format("FLstring CAPI-{:010d}", size), 
                chunk_size, h5::current_dims{size}, dt));

        // actual measurement
        bh::throughput(
            bh::name{"FLstring CAPI"}, record_size, warmup, sample,
            [&](hsize_t idx, hsize_t size) -> double {
                // memory space
                h5::sp_t mem_space{H5Screate_simple(1, &size, nullptr )};
                H5Sselect_all(mem_space);
                // file space
                h5::sp_t file_space{H5Dget_space(ds[idx])};
                H5Sselect_all(file_space);

                H5Dwrite( ds[idx], dt, mem_space, file_space, H5P_DEFAULT, data.data());
                return transfer_size[idx];
            });
    }

    { // Variable Length STRING with CAPI IO calls
        std::vector<size_t> transfer_size = get_transfer_size(strings);
        std::vector<const char*> data = get_data(strings);

        h5::dt_t<char*> dt;
        std::vector<h5::ds_t> ds;

        for(auto size: record_size) ds.push_back(
            h5::create<char*>(fd, fmt::format("VLstring CAPI-{:010d}", size), 
            chunk_size, h5::current_dims{size}));

        // actual measurement
        bh::throughput(
            bh::name{"VLstring CAPI"}, record_size, warmup, sample,
            [&](hsize_t idx, hsize_t size) -> double {
                // memory space
                h5::sp_t mem_space{H5Screate_simple(1, &size, nullptr )};
                H5Sselect_all(mem_space);
                // file space
                h5::sp_t file_space{H5Dget_space(ds[idx])};
                H5Sselect_all(file_space);

                H5Dwrite( ds[idx], dt, mem_space, file_space, H5P_DEFAULT, data.data());
                return transfer_size[idx];
            });
    }

    { // C++ IO stream
        std::vector<size_t> transfer_size = get_transfer_size(strings);
        std::ofstream stream;
        stream.open("somefile.txt", std::ios::out);

        // actual measurement
        bh::throughput(
            bh::name{"C++ IOstream "}, record_size, warmup, sample,
            [&](hsize_t idx, hsize_t size) -> double {
                for (hsize_t k = 0; k < size; k++)
                    stream << strings[k] << std::endl;
                return transfer_size[idx];
            });
        stream.close();
    }
}
```

Results

  • Fixed-length outperforms variable-length by a wide margin.
  • Predictable size means HDF5 can lay out data contiguously and stream it efficiently.
  • Variable-length introduces extra indirection and heap management, slowing things down.

In our runs, fixed-length writes achieved 70–95% of raw I/O speed, while variable-length lagged substantially behind.

Why It Matters

  • If your schema permits it, prefer fixed-length types.
  • Use variable-length only when data sizes truly vary (e.g., ragged arrays, free-form strings).
  • For high-frequency trading, sensor arrays, or scientific simulations, fixed-length layouts maximize throughput.

POD Check

We also verified which record types qualify as POD (Plain Old Data) via a small utility (is-pod-test.cpp). Only POD-eligible types map safely and efficiently into HDF5 compound layouts.

```cpp
static_assert(std::is_trivial_v<shim::pod_t>);
static_assert(std::is_standard_layout_v<shim::pod_t>);
```

This ensures compatibility with direct binary writes—no surprises from constructors, vtables, or hidden padding.
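
For reference, a record along the following lines passes both checks. This is a hypothetical reconstruction of shim::pod_t from non-pod-struct.hpp, inferred from how the benchmark fills `id` and `name` (the 64-byte bound is a guess; the `max_lenght` spelling follows the benchmark's usage):

```cpp
#include <cstddef>
#include <type_traits>

namespace shim {
    struct pod_t {   // hypothetical reconstruction, not the shipped header
        using max_lenght = std::integral_constant<std::size_t, 64>; // spelling as used in the benchmark
        std::size_t id;
        char name[max_lenght::value];
    };
}

static_assert(std::is_trivial_v<shim::pod_t>);
static_assert(std::is_standard_layout_v<shim::pod_t>);
```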

Takeaway

  • ✅ Fixed-length fields: fast, predictable, near raw I/O.
  • ⚠️ Variable-length fields: flexible, but slower.
  • 🔧 Use POD records to unlock HDF5’s full performance potential.

If performance is paramount, lock in fixed sizes and let your data pipeline fly.

Bridging HPC Structs and HDF5 COMPOUNDs with H5CPP

## 🚀 The Problem

You’re running simulations or doing scientific computing. You model data like this:

```cpp
struct record_t {
    double temp;
    double density;
    double B[3];
    double V[3];
    double dm[20];
    double jkq[9];
};
```

Now you want to persist these structs into an HDF5 file using the COMPOUND datatype. With the standard C API? That means 20+ lines of verbose, error-prone setup. With H5CPP? Just include `struct.h` and let the tools handle the rest.


## 🔧 Step-by-Step with H5CPP
### 1. Define Your POD Struct

```cpp
namespace sn {
    struct record_t {
        double temp;
        double density;
        double B[3];
        double V[3];
        double dm[20];
        double jkq[9];
    };
}
```

### 2. Generate Type Descriptors

Invoke the H5CPP LLVM-based code generator:

```
h5cpp struct.cpp -- -std=c++17 -I. -Dgenerated.h
```

It will emit a generated.h file that defines a specialization for:

```cpp
h5::register_struct<sn::record_t>()
```

This registers an HDF5 compound type at runtime, automatically.

## 🧪 Example Usage

Here’s how you write/read a compound dataset with zero HDF5 ceremony:

#include "struct.h"
#include <h5cpp/core>
#include "generated.h"
#include <h5cpp/io>

int main(){
    h5::fd_t fd = h5::create("test.h5", H5F_ACC_TRUNC);

    // Create dataset with shape (70, 3, 3)
    h5::create<sn::record_t>(fd, "/Module/g_data", h5::max_dims{70, 3, 3});

    // Read it back
    auto records = h5::read<std::vector<sn::record_t>>(fd, "/Module/g_data");
    for (auto rec : records)
        std::cerr << rec.temp << " ";
    std::cerr << std::endl;
}
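
Writing is just as terse. A minimal sketch, assuming the dataset created above and the same generated.h in scope (h5::write with a standard container is the pattern; the fill values are placeholders):

```cpp
std::vector<sn::record_t> records(70 * 3 * 3);   // one record per dataset cell
// ... fill records ...
h5::write(fd, "/Module/g_data", records);        // compound type resolved via generated.h
```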

## 🔍 What generated.h Looks Like

The generated descriptor maps your struct fields to HDF5 types:

```cpp
template<> hid_t inline register_struct<sn::record_t>() {
    hid_t ct = H5Tcreate(H5T_COMPOUND, sizeof(sn::record_t));
    H5Tinsert(ct, "temp", HOFFSET(sn::record_t, temp), H5T_NATIVE_DOUBLE);
    H5Tinsert(ct, "density", HOFFSET(sn::record_t, density), H5T_NATIVE_DOUBLE);
    ...
    return ct;
}
```

Nested arrays (like B[3]) are flattened using H5Tarray_create, and all internal hid_t handles are cleaned up.
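
For the B[3] field, for instance, the emitted code looks roughly like this (a sketch of the pattern, not the literal generator output):

```cpp
hsize_t dims_B[1] = {3};
hid_t at_00 = H5Tarray_create(H5T_NATIVE_DOUBLE, 1, dims_B); // 1-D array of 3 doubles
H5Tinsert(ct, "B", HOFFSET(sn::record_t, B), at_00);
H5Tclose(at_00); // the compound keeps its own copy of the inserted type
```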

## 🧵 Thread-safe and Leak-free

Generated code avoids resource leaks by closing array types after insertion, keeping everything safe and clean:

```cpp
H5Tclose(at_00); H5Tclose(at_01); H5Tclose(at_02);
```

## 🧠 Why This Matters

HDF5 is excellent for structured scientific data. But the C API is boilerplate-heavy and distracts from the real logic. H5CPP eliminates this:

  • Describe once, reuse everywhere
  • Autogenerate glue code
  • Zero-copy semantics, modern C++17 syntax
  • Support for nested arrays and multidimensional shapes

## ✅ Conclusion

If you're working with scientific data in C++, H5CPP gives you the power of HDF5 with the simplicity of a header file. Skip the boilerplate. Focus on science.

Independent Dataset Extension in Parallel HDF5?

When mixing multiple data sources in a parallel application—say, one process streaming oscilloscope traces while another dumps camera frames—you’d like each to append to its own dataset independently. Unfortunately, in Parallel HDF5 this remains a collective operation.

The Experiment

Using H5CPP and phdf5-1.10.6, we created an MPI program (mpi-extend.cpp) where each rank writes to its own dataset. The minimum working example confirms:

  • H5Dset_extent is collective: every rank must participate.
  • If one process attempts to extend while others do not, the program hangs indefinitely.

Output with 4 ranks:

```
[rank] 2  [total elements] 0
[dimensions] current: {346,0}  maximum: {346,inf}
[selection]  start: {0,0}     end: {345,inf}
...
h5ls -r mpi-extend.h5
/io-00   Dataset {346, 400/Inf}
/io-01   Dataset {465, 0/Inf}
/io-02   Dataset {136, 0/Inf}
/io-03   Dataset {661, 0/Inf}
```

All datasets exist, but their extents only change collectively.
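
In practice every rank has to join every extension, even for datasets it does not own. A minimal sketch of the required pattern, assuming each rank has all datasets ds[0..nranks-1] open (names are illustrative, not from mpi-extend.cpp):

```cpp
#include <hdf5.h>
#include <mpi.h>
#include <vector>

// every rank extends every dataset: H5Dset_extent is collective, and the
// program hangs indefinitely if any rank skips a call
void extend_all(const std::vector<hid_t>& ds, unsigned long long my_new_rows,
                hsize_t ncols, MPI_Comm comm) {
    int nranks; MPI_Comm_size(comm, &nranks);
    // first agree on the new extents, e.g. with an allgather ...
    std::vector<unsigned long long> rows(nranks);
    MPI_Allgather(&my_new_rows, 1, MPI_UNSIGNED_LONG_LONG,
                  rows.data(), 1, MPI_UNSIGNED_LONG_LONG, comm);
    // ... then all ranks make the same collective H5Dset_extent calls
    for (int r = 0; r < nranks; r++) {
        hsize_t dims[2] = {static_cast<hsize_t>(rows[r]), ncols};
        H5Dset_extent(ds[r], dims);
    }
    // only the owning rank subsequently selects its hyperslab and writes
}
```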

Why This Matters

Independent I/O would let unrelated processes progress without waiting on each other. In mixed workloads (fast sensors vs. slow imagers), the collective barrier becomes a bottleneck. A 2011 forum post suggested future support, but as of today the situation is unchanged.

Workarounds

  • Dedicated writer thread: funnel data from all producers into a single process (e.g., via ZeroMQ), which alone performs the H5Dset_extent and write operations.
  • Multiple files: let each process own its file, merging later.
  • Virtual Datasets (VDS): stitch independent files/datasets into a logical whole after the fact (see the sketch below).
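
To make the VDS option concrete, here is a minimal sketch that merges per-process files into one logical dataset. It assumes each rank instead wrote its stream to its own file io-00.h5 ... io-03.h5 holding a 1-D dataset /data of doubles (hypothetical names and element type); the row counts are taken from the h5ls listing above:

```cpp
#include <hdf5.h>
#include <cstdio>

int main() {
    hsize_t n[4]  = {346, 465, 136, 661};        // per-file extents from the run above
    hsize_t total = n[0] + n[1] + n[2] + n[3];
    hid_t vspace  = H5Screate_simple(1, &total, nullptr);
    hid_t dcpl    = H5Pcreate(H5P_DATASET_CREATE);

    hsize_t offset = 0;
    for (int r = 0; r < 4; r++) {
        hid_t src = H5Screate_simple(1, &n[r], nullptr);
        // map source dataset r onto its slice of the virtual dataset
        H5Sselect_hyperslab(vspace, H5S_SELECT_SET, &offset, nullptr, &n[r], nullptr);
        char file[16];
        std::snprintf(file, sizeof file, "io-%02d.h5", r);
        H5Pset_virtual(dcpl, vspace, file, "/data", src);
        H5Sclose(src);
        offset += n[r];
    }
    hid_t fd = H5Fcreate("merged.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t ds = H5Dcreate(fd, "all", H5T_NATIVE_DOUBLE, vspace,
                         H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dclose(ds); H5Pclose(dcpl); H5Sclose(vspace); H5Fclose(fd);
}
```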

Requirements

  • HDF5 built with MPI (PHDF5)
  • H5CPP v1.10.6-1
  • C++17 or higher compiler

Takeaway

  • 🚫 Independent dataset extension is still not supported in Parallel HDF5.
  • ✅ Collective calls remain mandatory for H5Dset_extent.
  • ⚙️ Workarounds like dedicated writers or VDS can help in practice.

The bottom line: if your MPI processes produce data at different rates, plan your workflow around the collective nature of HDF5 dataset extension.