hdf5

Best Way to Store Tick-by-Tick Stock Data in HDF5

The Problem (OP asks)

I'm storing stock tick-by-tick data—what’s the best structure in HDF5 to handle this efficiently and flexibly?

Perspective from H5CPP (Steven Varga, May 19, 2018)

If you’re working with high-frequency time series (HFT-style), H5CPP is tailor-made for event-based datasets. It plays well with std::vector<> of your POD structs and most linear algebra stacks (e.g., Armadillo).

Two Design Patterns to Consider

Extendable Multi-Dimensional Cube

You can model your tick data as an extendable cube (or even higher dimensions), then slice what you need:

h5::create(fd, {n_minutes_in_day, n_instruments, H5S_UNLIMITED}); // creates extendable dataset (H5S_UNLIMITED is HDF5's unlimited-extent constant)

Armadillo can pull in cubes directly. H5CPP supports partial read/write, so you can map arbitrary hyperslabs into:

  • arma::cube (3D)
  • arma::mat (2D)
  • std::vector<YourPOD> (1D)

This gives you powerful flexibility if you want to access time-instrument windows efficiently. A single multi-dimensional dataset can model all your data, but it's harder to manage day boundaries or rolling windows.

So instead, keep things simple:

  • Use one daily stream for your high-frequency irregular time series (IRTS)—packed as a 1D sequence.
  • Use daily matrices for regular time series (RTS) when appropriate.
  • Then stitch these together using a circular buffer to model your clock window (e.g., last N days).

That gives you clean separation for day-level operations, easy rolling windows across days, and better control inside your streaming logic.
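The day-stitching idea can be sketched in plain C++. This is a minimal illustration and not H5CPP code: the `Tick` fields, the `DayStream` wrapper, and the `RollingWindow` class are all hypothetical names, and in practice each day's `ticks` vector would be filled by reading that day's dataset.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical POD tick record; the field set is an assumption.
struct Tick {
    std::int64_t timestamp_ns;  // event time
    std::int32_t instrument;    // numeric instrument id
    double price;
    double size;
};

// One trading day: an irregular time series packed as a 1D sequence.
struct DayStream {
    std::string date;         // e.g. "2018-05-18"
    std::vector<Tick> ticks;  // would be read from that day's dataset
};

// Keeps the most recent `capacity` days; pushing a new day evicts the oldest.
class RollingWindow {
public:
    explicit RollingWindow(std::size_t capacity) : capacity_(capacity) {}

    void push_day(DayStream day) {
        if (capacity_ == 0) return;
        if (days_.size() == capacity_) days_.pop_front();
        days_.push_back(std::move(day));
    }

    std::size_t day_count() const { return days_.size(); }

    std::size_t tick_count() const {
        std::size_t n = 0;
        for (const auto& d : days_) n += d.ticks.size();
        return n;
    }

    // Visit every tick in the window, oldest day first.
    template <typename Fn>
    void for_each_tick(Fn&& fn) const {
        for (const auto& d : days_)
            for (const auto& t : d.ticks) fn(t);
    }

    // Precondition: at least one day has been pushed.
    const std::string& oldest_date() const { return days_.front().date; }

private:
    std::size_t capacity_;
    std::deque<DayStream> days_;
};
```

A caller would construct `RollingWindow window(N)` for the last N days, call `push_day` once per session, and run window-level computations through `for_each_tick`. Day boundaries stay explicit because each `DayStream` remains a separate unit.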

Avoid Tables for Tick Data

HDF5 tables (i.e., compound heterogeneous datasets) are cumbersome:

  • Hard to use in linear algebra contexts
  • Not easily mapped to vectors or matrices
  • Often less performant for numerical processing

Stick with floats or doubles in a numeric dataset, and keep your tick schema POD-friendly.

Summary of Approaches

| Pattern | Use Case | H5CPP-Enabled Benefits |
|---|---|---|
| Extendable Cube | Multi-dimensional, flexible slicing | Easy high-D slicing, integrates with arma/STL |
| Daily Stream + Circular | Day-partitioned, rolling windows | Simpler structure, rolling window stitching |
| HDF5 Tables (compound) | Rich schema, less focus on math | Clunky for numeric analysis, avoid if possible |

TL;DR

H5CPP shines for stock tick data. Model your data as:

  • A hybrid of daily streams and rolling buffers, or
  • A flexible extendable cube if you want high-D slices.

Either way, avoid compound table datasets for numeric speed and maintain clean interop with your analysis stacks.

Fixing Hyperslab Slices: Simplify with H5CPP Column Reads

The Issue (Bert Bril, Apr 26, 2018)

Bert was trying to extract an entire column from a really large 3D dataset—perhaps terabyte scale—into a 1D array using neat hyperslab coordinate selection. But the results were all wrong: Two out of every three values ended up zero, even though his start/count looked correct. He wasn’t even requesting broad slicing—just first column, full depth—yet the read was skipping data.

Steven Varga’s Insight (Apr 27, 2018)

Steve led with a reality check and a practical recommendation: reduce it to a simpler problem. If the dataset fits in memory, load the whole cube into an arma::cube and slice from there. If it doesn’t fit, switch to a chunked column reader with H5CPP:

// Each read below fetches COL_SIZE values, so size the buffer to match
double* pd = static_cast<double*>(calloc(COL_SIZE, sizeof(double)));
hid_t fd = h5::open("your_datafile.h5", H5F_ACC_RDONLY);
hid_t ds = h5::open(fd, "your_dataset");

for (int i = 0; i < data_rows; ++i) {
    h5::read(ds, pd, {0, i, 0}, {COL_SIZE, 1, 1});
    // 'start': {row, col, slice}; 'count': {row_count, 1, 1}
    // Apply your own stride/sieve logic here if needed
}

H5Dclose(ds);
H5Fclose(fd);
free(pd); // release the column buffer
He noted that stride support is coming—but often manual subsetting is easier and just as fast. In short: don’t fight the API; control your data fetch with a clean loop and let H5CPP handle the heavy lifting.
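The manual subsetting mentioned above could look like the following sketch. It works on a column that has already been read into memory, uses only the standard library, and `sieve` is a hypothetical helper name rather than anything H5CPP provides.

```cpp
#include <cstddef>
#include <vector>

// Keep every `stride`-th element of `column`, starting at `offset`.
std::vector<double> sieve(const std::vector<double>& column,
                          std::size_t offset, std::size_t stride) {
    std::vector<double> out;
    if (stride == 0) return out;  // a zero stride would never advance
    for (std::size_t i = offset; i < column.size(); i += stride)
        out.push_back(column[i]);
    return out;
}
```

Calling `sieve(column, 1, 3)` on a seven-element column keeps the elements at positions 1 and 4. Because the read itself stays contiguous, the selection logic is easy to check in isolation.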

Why This Fixes It

| Step | Reason |
|---|---|
| Simplify the problem | Verifies hyperslab logic without third-party complexity |
| Use H5CPP for reads | Offers clear, chunked access that's easy to reason about and debug |
| Manual sieve control | Keeps logic explicit: no hidden behavior or unexpected flattening |

Simplifying Hyperslab Reads with H5CPP

The Problem (Bert Bril, Apr 26, 2018)

Bert was trying to read a single column from a massive 3D dataset:

short data[251];
hsize_t count[3] = {1, 1, 251};
hsize_t offset[3] = {0, 0, 0};
hsize_t stride[3] = {1, 1, 1};

filedataset.selectHyperslab(H5S_SELECT_SET, count, offset, stride);
H5::DataSpace input = filedataset.getSpace();
H5::DataSpace output(1, &nrtoread);
dataset_->read(data, H5::PredType::STD_I16LE, output, input);
But weirdly, the data read back had two out of every three values missing: 3786 0 0 3555 0 0 3820 … Even simpler 1D reads using a 3D hyperslab weren’t behaving as expected—and Bert couldn’t figure out why.

Steven Varga to the Rescue (Apr 26, 2018)

Steven’s suggestion? Toss the boilerplate and go with H5CPP—a clean, MIT-licensed, header-only C++ template library for HDF5 that makes multidimensional, Armadillo-style reads so much simpler:

“Reading a cube into an arma::cube is one-step—see the Armadillo example—and use arma::Cube instead of arma::Mat.” H5CPP is well-documented, functional, with examples, and integrates with major linear-algebra libraries. — Steve

H5CPP lets you sidestep manual DataSpace manipulations by offering high-level read() calls that feel like native matrix slicing.

Why This Helps

| Approach | Complexity | Behavior Clarity | Extensibility |
|---|---|---|---|
| Manual Hyperslab (HDF5 C++) | High | Error-prone and subtle | Hard to debug |
| H5CPP (template-based) | Low | Clear intent and results | Easy integration with arma, eigen, etc. |

What to Do Next

  1. Try H5CPP—especially for reading cubes, columns, or slices with Armadillo support.

  2. Skip the complex hyperslab boilerplate and reach for a higher-level read() that feels like:

arma::cube C;
h5::read(ds, C); // simplified multi-dimensional read
  3. If you’re still seeing missing elements, compare dimension ordering and endianness assumptions between the file and your memory buffer; the simpler H5CPP interface makes such mismatches easier to spot.

Fixing h5repack Test Failures During make check

The Problem (by leow155, March 19, 2018)

leow155 ran into a snag when building HDF5 (v1.10.1) on Ubuntu 16.04:

./configure --prefix=/usr/local/hdf5 --enable-build-mode=production
make
make check

Everything builds fine, but make check throws an error during the h5repack tests:

h5repack tests failed with 1 errors.  
Command exited with non-zero status 1  
...
Makefile:1449: recipe for target ‘h5repack.sh.chkexe_’ failed
...

Clearly, h5repack.sh.chkexe_ fails and bubbles up through the test suite.

Steven Varga’s Reply (March 19, 2018)

Greetings!

I’m running the same HDF5 version (1.10.1) on Linux Mint 18 (Ubuntu 16.04 under the hood) with zero errors—and I didn’t even specify --enable-build-mode=production, because it’s already the default.

Did a quick make check moments ago, and it all passed cleanly on my setup.

Best, Steve

What to Make of This

| Issue | Steven’s Observation | Suggested Action |
|---|---|---|
| h5repack test failure | Works fine on his Ubuntu 16.04 setup | Retry clean build without production flag |
| Possible environment mismatch | Steven's env passed with an identical build config | Align compiler, tool versions, locale, etc. |

Next Steps to Try

  1. Drop the --enable-build-mode=production flag to mirror Steven's configure line. By his account production is already the default, so this mainly rules the flag out as a factor.
  2. Re-clone and rebuild cleanly—use Git tag hdf5-1_10_1 to match his exact reference (avoid tarballs that may introduce unintended patches).
  3. Audit your toolchain—ensure your gcc, automake, make versions align with Steven’s Ubuntu/Mint environment, as small differences can trip specific tests.

TL;DR

You’re not hallucinating: h5repack should pass make check. Steven Varga verified it does, on a near-identical system (Ubuntu 16.04 via Linux Mint 18). Your best bet: align your build environment and try a clean rebuild without forcing production mode.

Parallelization Patterns for HDF5 I/O in C++

The Question (Stefano Salvadè, Feb 19, 2018)

Goal: write analysis results in parallel to multiple HDF5 files—one per stream/process. The application is in C#, calling into a C/C++ API with HDF5 and MPI.

Current thought: typically one uses mpiexec -n x Program.exe, but spawning processes at runtime via MPI_Comm_spawn() seems clunky.

Is there a more elegant way to spawn parallel I/O functions within the same program? Also, do I need one process per write action (whether to separate files or a single shared file)?


Steven Varga’s Take—Less PHDF5, More Pragmatism

Parallel HDF5 (PHDF5) shines in setups with parallel file systems and true distributed environments: think HPC clusters with coordinated I/O capabilities.

But in simpler contexts, e.g. a single machine or cloud instance, PHDF5 often imposes unnecessary complexity and file-system limitations (filters unsupported, extra boilerplate, etc.).

Instead, Stefano could:

  1. Use separate HDF5 files per process, even in RAM or temp storage
  2. Aggregate later, e.g. via:
       • copying into one file, or
       • using HDF5’s external file driver to compose them into a single logical container

The aggregation step could run as a separate batch job after the main MPI job finishes.

If you do have real parallel I/O infrastructure, then yes, PHDF5 gives benefits. But often, simple is better.

Steve


Summary Table

| Scenario | Recommended Approach | Reasoning |
|---|---|---|
| N streams → N separate files (no shared file) | Serial HDF5 per process | Simplicity, no PHDF5 overhead, independent files |
| Need to combine results later | Aggregate post-run (external file driver or merge scripts) | Keeps initial write simple; flexible downstream processing |
| True parallel I/O on a parallel file system | Use PHDF5 with MPI | Efficient coordinated I/O, but more complexity and system requirements |

When to Use What?

  • Use PHDF5 when:
      • You're in a high-performance cluster environment
      • The file system supports parallel write throughput
      • You benefit from collective operations and synchronized metadata handling

  • Stick with serial HDF5 per process when:
      • You're on a single system or cloud VM
      • You want to avoid complexity in your write path
      • You can afford a merge or collector step after the run

Wrap-Up Thoughts

Stefano’s “elegant parallel output within a single program” goal doesn’t necessarily require MPI-spawned processes or PHDF5. Often the simplest is best: spawn OS-level processes writing to their own files, then merge or link them later.
This keeps performance high, complexity low, and coordination overhead manageable.

Templatized Mapping of HDF5 Native Types in Modern C++

The Problem (Posted by Mark C. Miller, Feb 5, 2018)

Mark needed a simple way to map std::vector<T> to the correct H5T_NATIVE_* type in a templated dataset writer:

```cpp
template <class T>
static void WriteVecToHDF5(hid_t fid, const char* name,
                           const std::vector<T>& vec, int d2size)
{
    // Cast once so the brace initializer below has no narrowing conversion
    const hsize_t d2 = static_cast<hsize_t>(d2size);
    hsize_t dims[2] = {vec.size() / d2, d2};
    hid_t spid = H5Screate_simple(d2size > 1 ? 2 : 1, dims, nullptr);
    hid_t dsid = H5Dcreate(fid, name, HDF5Type<T>().Type(), spid,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    // Write with the mapped type of T instead of a hard-coded H5T_NATIVE_INT
    H5Dwrite(dsid, HDF5Type<T>().Type(), H5S_ALL, H5S_ALL, H5P_DEFAULT, vec.data());
    H5Dclose(dsid);
    H5Sclose(spid);
}
```

He implemented a hierarchy using `HDF5Type<T>()`, specialized per type to return the corresponding native constant (`H5T_NATIVE_INT`, `H5T_NATIVE_FLOAT`, etc.).

But... inheritance, virtuals, and boilerplate made him wonder: *Is there a cleaner way?*


## The H5CPP Way: Keep It Simple and Modern (Steven Varga)

Steven Varga, author of **H5CPP**, weighed in with a minimalist approach inspired by modern C++ design: ditch virtual dispatch, avoid inheritance, and leverage function overloads and template inference.

### ✅ Solution

```cpp
inline hid_t h5type(const char&)                { return H5T_NATIVE_CHAR; }
inline hid_t h5type(const signed char&)         { return H5T_NATIVE_SCHAR; }
inline hid_t h5type(const unsigned char&)       { return H5T_NATIVE_UCHAR; }
inline hid_t h5type(const short&)               { return H5T_NATIVE_SHORT; }
inline hid_t h5type(const unsigned short&)      { return H5T_NATIVE_USHORT; }
inline hid_t h5type(const int&)                 { return H5T_NATIVE_INT; }
inline hid_t h5type(const unsigned int&)        { return H5T_NATIVE_UINT; }
inline hid_t h5type(const long&)                { return H5T_NATIVE_LONG; }
inline hid_t h5type(const unsigned long&)       { return H5T_NATIVE_ULONG; }
inline hid_t h5type(const long long&)           { return H5T_NATIVE_LLONG; }
inline hid_t h5type(const unsigned long long&)  { return H5T_NATIVE_ULLONG; }
inline hid_t h5type(const float&)               { return H5T_NATIVE_FLOAT; }
inline hid_t h5type(const double&)              { return H5T_NATIVE_DOUBLE; }
inline hid_t h5type(const long double&)         { return H5T_NATIVE_LDOUBLE; }

template<typename T>
inline hid_t h5type() { return h5type(T()); }
```

### ✅ Usage

```cpp
hid_t type_id = h5type<T>();
```

Or, more directly:

```cpp
hid_t dsid = H5Dcreate(fid, name, h5type<T>(), spid,
                       H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
```
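To show how the overloads combine with template deduction for a `std::vector<T>` argument, here is a small sketch that builds without the HDF5 headers. The `NativeType` enum is a stand-in for the `hid_t` constants and `element_type` is a hypothetical helper; neither is part of HDF5 or H5CPP.

```cpp
#include <vector>

// Stand-in for the H5T_NATIVE_* constants so the sketch builds without HDF5.
enum class NativeType { Int, UInt, Float, Double };

inline NativeType h5type(const int&)          { return NativeType::Int; }
inline NativeType h5type(const unsigned int&) { return NativeType::UInt; }
inline NativeType h5type(const float&)        { return NativeType::Float; }
inline NativeType h5type(const double&)       { return NativeType::Double; }

// Zero-argument form: pick the overload from a value-initialized T.
template <typename T>
inline NativeType h5type() { return h5type(T()); }

// T is deduced from the vector, so callers never spell out the element type.
template <typename T>
NativeType element_type(const std::vector<T>&) { return h5type<T>(); }
```

With this in place, `element_type(std::vector<float>{...})` resolves to `NativeType::Float` at compile time. One caveat with a reduced overload set like this one: a `short` argument would be promoted and silently match the `int` overload, which is a good reason to keep the full list of overloads from the solution.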

Benefits of This Pattern

| Feature | Virtual Class Approach | Overload-Based Approach |
|---|---|---|
| Compile-time resolution | ❌ | ✅ |
| Easy to extend for new types | ⚠️ (needs specialization) | ✅ (just another overload) |
| Runtime dispatch | ✅ (virtual calls) | ❌ (none needed) |
| Readability | ⚠️ verbose | ✅ concise |
| Performance (no virtual calls) | ❌ | ✅ |

Conclusion

The overload-based mapping pattern presented by Steven Varga in response to the HDF forum thread reflects a clean, idiomatic modern C++ style:

  • Clear separation of logic
  • No need for inheritance boilerplate
  • Efficient and extensible

If you're designing a lightweight HDF5 backend—or building C++ infrastructure like H5CPP—this pattern is robust and production-ready.

Optimizing HDF5 for Event-Based Time Series Access

Original Question (Tamas, Mar 30, 2017)

I am currently using HDF5 to store my data and I have a structure that is too slow for reading single events.
My first attempt was to have a table of all hits, including the event_id as a field, so a simple table of:

hit_id | event_id | dom_id | time | tot | triggered | pmt_id

This is extremely slow for reading single events, as I have to go through all hits to extract one event.
My second attempt was to create groups for each event, and place the hits in a dataset inside that group. This also turned out to be slow.
So far the only solution I have found is to go back to ROOT, which is a bit of a shame.
Has anyone worked with similar data structures in HDF5? What is the fastest way of reading a single event?

Response (Steven Varga, Mar 31, 2017)

Hello Tamas,

I use HDF5 to store irregular time series (IRTS) data in the financial domain, where performance and storage density both matter. Here’s the approach I’ve been using successfully for the past 5 years:

🧱 Dataset Structure

  • Data is partitioned per day, each stored as a variable-length dataset (VLEN) containing a stream of events.
  • The dataset holds a custom compound datatype, tailored for compactness and compatibility.

For example, the record type might look like:

[event_id, asset, price, timestamp, flags, ...]

This layout reduces overhead and improves both sequential read throughput and storage density.

🔧 Access Pattern

  • The system is designed for write-once, read-many, with sequential reads being the dominant access mode.
  • For processing, I use C++ iterators to walk through the event stream efficiently.
  • This low-level backend is then exposed to Julia, R, and Python bindings for analysis pipelines.

Because the datatype is stored directly in the file, it’s also viewable with HDFView for inspection or debugging.

🚀 Performance Context

  • Runs in an MPI cluster using C++, Julia, and Rcpp.
  • Delivers excellent performance under both full-stream and filtered-access scenarios.
  • Chunk size does affect access time significantly—tuning that is essential depending on your use case.

🧠 Optimization Trade-Off

To your question—"how to access all or only some events efficiently"—that’s inherently a bi-objective optimization:

  • You can gain speed by trading off space, e.g., pre-sorting, duplicating metadata, or chunking strategically.
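One way to read the pre-sorting trade-off is sketched below, as an assumption rather than the layout described in the reply: keep the hits grouped by event_id and store a small side index from event_id to a position range. The `Hit` fields and `build_event_index` are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical hit record, loosely following the fields in the question.
struct Hit {
    std::uint32_t event_id;
    std::uint32_t dom_id;
    double time;
};

// Half-open range [first, last) of positions in the hit stream.
using Range = std::pair<std::size_t, std::size_t>;

// Build event_id -> range. Assumes `hits` is already grouped by event_id.
std::unordered_map<std::uint32_t, Range>
build_event_index(const std::vector<Hit>& hits) {
    std::unordered_map<std::uint32_t, Range> index;
    std::size_t begin = 0;
    for (std::size_t i = 1; i <= hits.size(); ++i) {
        // Close the current run at the end of input or when the id changes.
        if (i == hits.size() || hits[i].event_id != hits[begin].event_id) {
            index[hits[begin].event_id] = Range{begin, i};
            begin = i;
        }
    }
    return index;
}
```

Each range translates directly into the offset and count of a contiguous 1D read, so fetching one event no longer requires scanning all hits. The cost is the extra index, which is the space side of the trade-off.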

So while HDF5 might need more careful tuning than ROOT out of the box, it gives you portability and schema clarity across toolchains.

Hope this helps, Steve

HDF5 for In-Memory Circular Buffers: Reuse Schema in RAM

The Original Ask (David Schneider, Dec 10 2016)

David from SLAC/LCLS posed this neat challenge:

“Is it possible to implement an in‑memory circular buffer using HDF5? We'd like both offline (on‑disk append) and online (shared‑memory overwrite) access via the same HDF5 schema and API, possibly using SWMR for shared-memory consumption.”

Basically, a single schema that works for both archival and real-time consumption.

1. Werner's Insight: The Virtual File Driver

Werner dropped the elegant solution:

“There is a virtual file driver that operates on a memory image of an HDF5 file. It should be no problem to have this one also operate on shared memory.”

That’s referencing HDF5’s core VFD—you can treat a pointer to memory (including shared memory via mmap or shm_open) as if it were an HDF5 file. The same dataset API (H5Dcreate, H5Dwrite, etc.) applies, so you can reuse your schema seamlessly.

2. Steve’s Real‑World Twist (HFT-inspired)

Steve Varga chimed in with a production-grade twist:

“Boost’s circular/ring buffer handles one-writer-many-readers; tail flushing can be channeled to the writer or fault‑tolerant hosts. Combine with ZeroMQ + Protocol Buffers or Thrift.”
“For experiments—where failure isn't critical—you can just access HDF5 locally on cluster nodes using MPI + Grid Engine + serial HDF5.”

So if you need industrial-strength durability, go with a ring buffer plus messaging middleware. For HPC experiments where speed and simplicity triumph, stick with HDF5 + MPI.

Quick API Sketch (Julia‑Flavored)

```julia
# Pseudocode: SharedMemory and h5open_sharedmem are hypothetical, not real packages/functions
using HDF5, SharedMemory  # hypothetical module?

fid = h5open_sharedmem(shm_address, mode="r+")

# Use HDF5 API as if working on a real file
dset = d_create(fid, "/buffer", datatype=Float64, dims=(N,), maxdims=(HDF5.UNLIMITED,))
write(dset, new_chunk)
close(fid)
```

Summary Table

| Scenario | Approach | Notes |
|---|---|---|
| On-disk appendable buffer | HDF5 datasets (append mode) | Standard functionality |
| In-memory circular buffer | HDF5 via core VFD over a memory region | Shared schema/API in RAM |
| High‑throughput, production-grade | Boost ring buffer + messaging (ZeroMQ, ProtoBuf) | More robust, fault-tolerant |
| Experimental/distributed HPC | HDF5 per node + MPI/Scheduler (serial HDF5) | Simple, performance-focused |

Using HDF5 as an In-Memory Circular Buffer

Context

The HDF5 library is a powerful solution for structured data storage, but its default usage assumes durable file-backed I/O. What if we want to use the same layout and tooling for in-memory circular buffers, especially across multiple processes?

This idea came up in a mailing list thread posted by David Schneider in 2016. The question was simple but practical:

“Can I use HDF5 like a circular buffer in memory, with live updates and multiple consumers, using the same schema we already use for on-disk archival?”

That hit close to home—we’d solved a similar problem in our trading systems.

Perspective

We faced the same challenges in building real-time market data pipelines. We needed:

  • A buffer of recent events in memory (circular structure)
  • Multi-process access (writer + readers)
  • A clean way to flush or archive data to disk
  • The same schema shared across both memory and persistent storage

Instead of twisting HDF5 into a fully-fledged circular buffer, we used HDF5’s virtual file driver (VFD) to great effect.

Solution

1. Boost Ring Buffer + IPC

In systems where latency and determinism matter, we use:

  • Boost's circular_buffer for the in-memory structure
  • ZeroMQ + Protocol Buffers (or Thrift) for pub/sub messaging
  • A fallback mechanism that flushes the buffer’s tail into disk-backed HDF5 for audit or recovery

This gives us:

  • One writer, many readers (process-safe)
  • Fault tolerance (via tail-dump or WAL-like shadowing)
  • Interop with Python, Julia, R via schema-consistent I/O

2. Experimental Mode: One HDF5 File Per Node

In distributed computation (e.g., HPC or large-scale simulations), I skip shared memory entirely:

  • Run each task independently using serial HDF5
  • Use MPI + grid engine orchestration
  • Merge or reduce results later

The simplicity here avoids shared memory complexity and works well for experimental setups or large batch jobs.

Could You Do It In Pure HDF5?

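Yes, in principle. H5Pset_fapl_core() selects the core driver, which keeps the whole file in memory, and the file image API (H5Pset_file_image(), H5LTopen_file_image()) lets HDF5 open a buffer you supply. In theory, you could: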

  1. mmap or shm_open() a fixed-size region
  2. Initialize an HDF5 file layout in that region
  3. Write with wrap-around logic (circular overwrite)
  4. Map other readers to the same memory block

But beware:

  • HDF5 won’t enforce concurrency guarantees
  • You must handle locking or versioning externally
  • Reader/writer separation needs care
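The wrap-around logic from step 3 can be sketched independently of HDF5. The `RingBuffer` class below is a hypothetical, single-threaded illustration: it shows circular overwrite plus sequence numbers that let a reader detect overwritten data, and it deliberately leaves out the locking that real multi-process use needs.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-capacity buffer with circular overwrite. Single-threaded sketch:
// shared-memory use needs external locking or versioning on top of this.
class RingBuffer {
public:
    // A capacity of 0 is bumped to 1 so the modulo below stays defined.
    explicit RingBuffer(std::size_t capacity)
        : slots_(capacity ? capacity : 1), written_(0) {}

    // Append one value, overwriting the oldest once the buffer is full.
    void push(double value) {
        slots_[written_ % slots_.size()] = value;
        ++written_;
    }

    // Total number of values ever written (monotonically increasing).
    std::uint64_t written() const { return written_; }

    // Sequence number of the oldest value still available.
    std::uint64_t oldest() const {
        return written_ > slots_.size() ? written_ - slots_.size() : 0;
    }

    // Read by sequence number; false if overwritten or not yet written.
    bool read(std::uint64_t seq, double& out) const {
        if (seq < oldest() || seq >= written_) return false;
        out = slots_[seq % slots_.size()];
        return true;
    }

private:
    std::vector<double> slots_;
    std::uint64_t written_;
};
```

A reader that remembers the last sequence number it consumed can tell from a failed `read` that it fell behind the writer. In a shared-memory setting the same indexing would apply to a fixed-size dataset, with the counter published under whatever synchronization scheme you choose.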

Summary

You can use HDF5 as part of a circular buffer system, especially by leveraging the core virtual file driver with a shared memory mapping. But in practice:

| Feature | Viable with HDF5? |
|---|---|
| In-memory datasets | ✅ (core VFD) |
| Shared memory usage | ✅ with mmap/IPC |
| Circular overwrite | ❌ manual logic |
| Multi-process safety | ⚠️ external sync |
| Schema reuse | ✅ seamless |

For production pipelines, I prefer:

  • Boost + ZMQ + Protobuf/Thrift for live data
  • HDF5 for archival and structured persistence

The two worlds meet cleanly if you manage the boundary carefully.

— Steven Varga

Structuring Historical Options Data in HDF5—H5CPP Tips

The Problem (Dan E, Oct 27, 2014)

I have equity options historical data in daily CSVs—one file per day with around 700k rows for ~4k symbols. Each row includes a dozen fields like symbol, strike, maturity, open/high/low/close, volume, etc. I want fast access for both:

  1. All options on a given day for a symbol or list of symbols
  2. Time series for a specific option across multiple dates

My current approach builds a Pandas DataFrame for each day and stores it under an 'OPTIONS' group in HDF5. But accessing a few symbols loads the entire day’s worth of data—huge overhead. And fetching a specific contract across many days means loading many files.

How should I structure this? Use one big file or many? Hierarchy? And any recommendations for Python access (like Pandas)?

H5CPP Wisdom (Steven Varga, Oct 27, 2014)

Steve offers a templated, high-performance structure—tailored for daily partitioning and fast indexing:

  1. Hash symbols to numeric indices, and use those for indexing instead of strings.
  2. Name data blocks by date, e.g. 2014-01-01.mat, so each day's data is self-contained.
  3. Enable chunking and high compression to balance I/O throughput and file size.
  4. Treat irregular time series (e.g., tick-by-tick events) differently from regular ones:
       • Use HDF5 custom datatypes with packet tables for compact, sequential access to irregular data.
       • For regular time series (e.g., OHLC candles), dense N-dimensional slabs of floats (e.g. [instrument, time, OHLC]) are efficient.
  5. If you’re running this on a parallel filesystem with MPI and PHDF5, you can achieve throughput and storage efficiency that rivals, and may surpass, SQL systems.
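The first point, replacing symbol strings with numeric indices, might be sketched as follows. The source does not specify a hashing scheme, so this is one assumed variant: a hash map assigns dense ids in order of first appearance, and `SymbolTable` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Maps symbol strings to dense numeric ids usable as array or dataset indices.
class SymbolTable {
public:
    // Return the id for `symbol`, assigning the next free id on first sight.
    std::uint32_t intern(const std::string& symbol) {
        auto it = ids_.find(symbol);
        if (it != ids_.end()) return it->second;
        const auto id = static_cast<std::uint32_t>(names_.size());
        ids_.emplace(symbol, id);
        names_.push_back(symbol);
        return id;
    }

    // Reverse lookup; throws std::out_of_range for an unknown id.
    const std::string& name(std::uint32_t id) const { return names_.at(id); }

    std::size_t size() const { return names_.size(); }

private:
    std::unordered_map<std::string, std::uint32_t> ids_;
    std::vector<std::string> names_;
};
```

Dense ids can index directly into the instrument dimension of a slab, and the `names_` vector is small enough to store alongside the data so the mapping stays reproducible across days.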

H5CPP Approach in Practice

Layout Strategy

/options/
    2014-01-01.mat       ← daily file
    ...
    <Date>.mat           ← parameterized by date

Inside a daily file:

  • Datasets keyed by hashed symbol IDs or structured arrays.
  • Regular series: stored as compact multidimensional slabs.
  • Irregular data: structured as compact packet tables with custom datatypes.

Python Access Pattern

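```python
# Sketch using h5py; assumes one dataset per symbol under /options,
# named by the numeric symbol id (symbol_id is defined by the caller)
import h5py

with h5py.File("2014-01-01.mat", "r") as f:
    data = f["options"][str(symbol_id)][...]  # reads only this symbol's data
```

For a time-series across dates:

```python
import h5py
import numpy as np

df_series = []
for date in dates:  # dates: iterable of "YYYY-MM-DD" strings
    with h5py.File(f"{date}.mat", "r") as f:
        df_series.append(f["options"][str(symbol_id)][...])

# Combine into one timeline
series = np.concatenate(df_series)
```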

Benefits

  • Selective reads—fetch only what's needed, e.g. a single symbol per day.
  • Efficient storage—chunked and compressed format minimizes disk footprint.
  • Scalable throughput—especially when using MPI + PHDF5 on parallel filesystems.
  • Language-agnostic—H5CPP’s type mappings and structuring make it accessible from C++, Python, Julia, etc.