Using HDF5 as an In-Memory Circular Buffer

Context

The HDF5 library is a powerful solution for structured data storage, but its default usage assumes durable file-backed I/O. What if we want to use the same layout and tooling for in-memory circular buffers, especially across multiple processes?

This idea came up in a mailing list thread posted by David Schneider in 2016. The question was simple but practical:

“Can I use HDF5 like a circular buffer in memory, with live updates and multiple consumers, using the same schema we already use for on-disk archival?”

That hit close to home—we’d solved a similar problem in our trading systems.

Perspective

We faced the same challenges in building real-time market data pipelines. We needed:

  • A buffer of recent events in memory (circular structure)
  • Multi-process access (writer + readers)
  • A clean way to flush or archive data to disk
  • The same schema shared across both memory and persistent storage

Instead of twisting HDF5 into a fully-fledged circular buffer, we used HDF5’s virtual file driver (VFD) to great effect.

Solution

1. Boost Ring Buffer + IPC

In systems where latency and determinism matter, we use:

  • Boost's circular_buffer for the in-memory structure
  • ZeroMQ + Protocol Buffers (or Thrift) for pub/sub messaging
  • A fallback mechanism that flushes the buffer’s tail into disk-backed HDF5 for audit or recovery

This gives us:

  • One writer, many readers (process-safe)
  • Fault tolerance (via tail-dump or WAL-like shadowing)
  • Interop with Python, Julia, R via schema-consistent I/O
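The ring-plus-flush pattern above can be sketched in a few lines. This is a minimal single-process illustration in Python, not our production code: `collections.deque` stands in for Boost's circular_buffer, a plain list stands in for the disk-backed HDF5 archive, and `TailBuffer` and `flush_tail` are hypothetical names. The ZeroMQ transport is left out entirely.

```python
from collections import deque


class TailBuffer:
    """Fixed-capacity ring of recent events; the oldest event is overwritten when full."""

    def __init__(self, capacity):
        self._ring = deque(maxlen=capacity)  # deque drops the oldest item on overflow
        self._written = 0                    # total events ever pushed
        self._flushed = 0                    # total events already offered to the archive

    def push(self, event):
        self._ring.append(event)
        self._written += 1

    def flush_tail(self, archive):
        """Append not-yet-archived events to `archive`; return how many were lost to overwrite."""
        oldest_held = self._written - len(self._ring)  # sequence number of ring[0]
        lost = max(0, oldest_held - self._flushed)
        start = max(self._flushed, oldest_held) - oldest_held
        archive.extend(list(self._ring)[start:])       # stand-in for an HDF5 append
        self._flushed = self._written
        return lost


buf = TailBuffer(capacity=4)
archive = []
for i in range(3):
    buf.push(i)
lost_first = buf.flush_tail(archive)    # archives events 0, 1, 2
for i in range(3, 9):
    buf.push(i)                         # six pushes into four slots: 3 and 4 are overwritten
lost_second = buf.flush_tail(archive)   # archives events 5, 6, 7, 8
```

The return value of `flush_tail` makes the trade-off visible: if the flush runs less often than the ring wraps, events are dropped, so the flush interval and capacity have to be sized together.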

2. Experimental Mode: One HDF5 File Per Node

In distributed computation (e.g., HPC or large-scale simulations), I skip shared memory entirely:

  • Run each task independently using serial HDF5
  • Use MPI + grid engine orchestration
  • Merge or reduce results later

The simplicity here avoids shared memory complexity and works well for experimental setups or large batch jobs.

Could You Do It In Pure HDF5?

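Yes, in principle. H5Pset_fapl_core() selects the core virtual file driver, which keeps the whole file image in memory. On its own the driver allocates that memory itself, so backing it with a region you supply also involves the file image API (for example H5Pset_file_image() or H5LTopen_file_image()). In theory, you could: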

  1. mmap or shm_open() a fixed-size region
  2. Initialize an HDF5 file layout in that region
  3. Write with wrap-around logic (circular overwrite)
  4. Map other readers to the same memory block

But beware:

  • HDF5 won’t enforce concurrency guarantees
  • You must handle locking or versioning externally
  • Reader/writer separation needs care
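To make the wrap-around and external versioning concrete, here is a minimal Python sketch of steps 1, 3, and 4. It does not use HDF5 at all: raw fixed-size records stand in for the HDF5 layout of step 2, and the record format, `CAPACITY`, and the function names are assumptions for illustration. A single write counter serves as the version; the code has no memory barriers or locks, so treat it as a picture of the bookkeeping rather than a safe multi-process implementation.

```python
import struct
from multiprocessing import shared_memory

HEADER = struct.Struct("<Q")    # total number of records ever written
RECORD = struct.Struct("<qd")   # hypothetical record: (timestamp, price)
CAPACITY = 4                    # number of slots in the ring


def slot_offset(seq):
    """Byte offset of the slot holding record number `seq` (wraps around)."""
    return HEADER.size + (seq % CAPACITY) * RECORD.size


def write_record(buf, timestamp, price):
    (count,) = HEADER.unpack_from(buf, 0)
    RECORD.pack_into(buf, slot_offset(count), timestamp, price)
    HEADER.pack_into(buf, 0, count + 1)   # publish only after the payload is in place


def read_recent(buf):
    """Return the newest CAPACITY - 1 records, oldest first, or None if overwritten mid-read."""
    (before,) = HEADER.unpack_from(buf, 0)
    # Skip the oldest slot: the writer may be filling it with the next record right now.
    first = max(0, before - (CAPACITY - 1))
    rows = [RECORD.unpack_from(buf, slot_offset(s)) for s in range(first, before)]
    (after,) = HEADER.unpack_from(buf, 0)
    if after - CAPACITY >= first:         # writer lapped us; caller should retry
        return None
    return rows


shm = shared_memory.SharedMemory(create=True, size=HEADER.size + CAPACITY * RECORD.size)
try:
    HEADER.pack_into(shm.buf, 0, 0)
    for t in range(6):
        write_record(shm.buf, t, 100.0 + t)
    reader = shared_memory.SharedMemory(name=shm.name)  # second mapping of the same block
    recent = read_recent(reader.buf)
    reader.close()
finally:
    shm.close()
    shm.unlink()
```

In a real deployment the reader would attach from another process using the segment name, and the counter check would need atomic loads and stores (or a lock) to be trustworthy.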

Summary

You can use HDF5 as part of a circular buffer system, especially by leveraging the core virtual file driver with a shared memory mapping. But in practice:

| Feature | Viable with HDF5? |
|---|---|
| In-memory datasets | ✅ (core VFD) |
| Shared memory usage | ✅ with mmap/IPC |
| Circular overwrite | ❌ manual logic |
| Multi-process safety | ⚠️ external sync |
| Schema reuse | ✅ seamless |

For production pipelines, I prefer:

  • Boost + ZMQ + Protobuf/Thrift for live data
  • HDF5 for archival and structured persistence

The two worlds meet cleanly if you manage the boundary carefully.

— Steven Varga

Structuring Historical Options Data in HDF5—H5CPP Tips

The Problem (Dan E, Oct 27, 2014)

I have equity options historical data in daily CSVs—one file per day with around 700k rows for ~4k symbols. Each row includes a dozen fields like symbol, strike, maturity, open/high/low/close, volume, etc. I want fast access for both:

  1. All options on a given day for a symbol or list of symbols
  2. Time series for a specific option across multiple dates

My current approach builds a Pandas DataFrame for each day and stores it under an 'OPTIONS' group in HDF5. But accessing a few symbols loads the entire day’s worth of data—huge overhead. And fetching a specific contract across many days means loading many files.

How should I structure this? Use one big file or many? Hierarchy? And any recommendations for Python access (like Pandas)?

H5CPP Wisdom (Steven Varga, Oct 27, 2014)

Steve offers a templated, high-performance structure—tailored for daily partitioning and fast indexing:

  1. Hash symbols to numeric indices, and use those for indexing instead of strings.
  2. Name data blocks by date (e.g., 2014-01-01.mat) so each day’s data is self-contained.
  3. Enable chunking and high compression to balance I/O throughput and file size.
  4. Treat irregular time series (e.g., tick-by-tick events) differently than regular ones:
       • For irregular data, use HDF5 custom datatypes with “packet tables” for compact, sequential access.
       • For regular time series (e.g., OHLC candles), use dense N-dimensional slabs (e.g. [instrument, time, OHLC]) of floats, which are efficient to store and slice.
  5. If you’re running this on a parallel filesystem with MPI and PHDF5, you can achieve throughput and storage efficiency that rivals, and may surpass, SQL systems.
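The first tip and the dense-slab layout fit together: once symbols map to integers, a position in an [instrument, time, OHLC] slab is plain arithmetic. The sketch below is a Python illustration under assumptions of my own: it assigns dense indices in first-seen order rather than applying a hash function (which could collide), and `build_symbol_index`, `slab_offset`, and the sample tickers are hypothetical.

```python
def build_symbol_index(symbols, index=None):
    """Assign each new symbol the next free integer; existing entries keep their index."""
    index = {} if index is None else index
    for sym in symbols:
        index.setdefault(sym, len(index))
    return index


def slab_offset(index, instrument, t, field, n_times, n_fields=4):
    """Row-major offset into a dense [instrument, time, OHLC] slab stored as a flat array."""
    return (index[instrument] * n_times + t) * n_fields + field


# Day one: three distinct symbols, one of them repeated in the input.
index = build_symbol_index(["MSFT", "AAPL", "IBM", "AAPL"])

# Day two: a new symbol is appended without disturbing earlier indices.
index = build_symbol_index(["GOOG", "MSFT"], index)

# Close price (field 3) of AAPL at time step 2, with 5 time steps per instrument.
offset = slab_offset(index, "AAPL", t=2, field=3, n_times=5)
```

Keeping the table append-only means an index written into one daily file still refers to the same symbol in every later file, which is what makes cross-date lookups cheap.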

H5CPP Approach in Practice

Layout Strategy

/options/
  2014-01-01.mat       ← daily file
  ...
  <Date>.mat           ← parameterized by date

Inside a daily file:

  • Datasets keyed by hashed symbol IDs or structured arrays.
  • Regular series: stored as compact multidimensional slabs.
  • Irregular data: structured as compact “packet tables” with custom datatypes.

Python Access Pattern

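```python
import h5py

symbol_id = "42"  # hypothetical hashed symbol ID, used here as the dataset name

# Sketch using h5py; assumes each symbol is stored as its own dataset under /options
with h5py.File("2014-01-01.mat", "r") as f:
    data = f["options"][symbol_id][...]  # read this symbol only, before the file closes
```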

For a time-series across dates:

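```python
import h5py
import numpy as np

# `dates` is a list of date strings matching the file names; `symbol_id` as above
series = []
for date in dates:
    with h5py.File(f"{date}.mat", "r") as f:
        # [...] copies the data into memory so it survives closing the file
        series.append(f["options"][symbol_id][...])

# Combine into one timeline
timeline = np.concatenate(series)
```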

Benefits

  • Selective reads—fetch only what's needed, e.g. a single symbol per day.
  • Efficient storage—chunked and compressed format minimizes disk footprint.
  • Scalable throughput—especially when using MPI + PHDF5 on parallel filesystems.
  • Language-agnostic—H5CPP’s type mappings and structuring make it accessible from C++, Python, Julia, etc.