Optimizing HDF5 for Event-Based Time Series Access
Original Question (Tamas, Mar 30, 2017)
I am currently using HDF5 to store my data, but the layouts I have tried so far are too slow for reading single events.
My first attempt was a single flat table of all hits, with the event_id stored as a field on every record.
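Something along these lines (the field names here are hypothetical, just to illustrate the shape of the record):

```cpp
#include <cstdint>

// Illustrative flat hit record: every hit carries its event_id.
struct Hit {
    uint64_t event_id;   // event this hit belongs to
    int32_t  channel_id; // detector channel (hypothetical field)
    double   time;       // hit time (hypothetical field)
    float    charge;     // measured charge (hypothetical field)
};
```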
This layout is extremely slow for reading single events, since extracting one event means scanning through all hits.
My second attempt was to create groups for each event, and place the hits in a dataset inside that group. This also turned out to be slow.
So far the only solution I have found is to go back to ROOT, which is a bit of a shame.
Has anyone worked with similar data structures in HDF5? What is the fastest way of reading a single event?
Response (Steven Varga, Mar 31, 2017)
Hello Tamas,
I use HDF5 to store irregular time series (IRTS) data in the financial domain, where performance and storage density both matter. Here’s the approach I’ve been using successfully for the past 5 years:
🧱 Dataset Structure
- Data is partitioned by day, with each day stored as a variable-length (VLEN) dataset containing a stream of events.
- The dataset holds a custom compound datatype, tailored for compactness and compatibility.
For example, the record type might look something like the sketch below.
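A minimal sketch, assuming the plain HDF5 C API; the field names and widths are illustrative, not the actual production schema:

```cpp
#include <hdf5.h>
#include <cstdint>

// One tick of an irregular time series within a day partition.
struct Tick {
    uint64_t time_ns;  // time within the day, in nanoseconds
    double   price;    // observed value
    uint32_t size;     // e.g. traded quantity
};

// Build the matching HDF5 compound datatype.
hid_t make_tick_type() {
    hid_t t = H5Tcreate(H5T_COMPOUND, sizeof(Tick));
    H5Tinsert(t, "time_ns", HOFFSET(Tick, time_ns), H5T_NATIVE_UINT64);
    H5Tinsert(t, "price",   HOFFSET(Tick, price),   H5T_NATIVE_DOUBLE);
    H5Tinsert(t, "size",    HOFFSET(Tick, size),    H5T_NATIVE_UINT32);
    return t;  // caller releases with H5Tclose
}
```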
This layout reduces overhead and improves both sequential read throughput and storage density.
🔧 Access Pattern
- The system is designed for write-once, read-many, with sequential reads being the dominant access mode.
- For processing, I use C++ iterators to walk through the event stream efficiently (a block-wise reading sketch follows this list).
- This low-level backend is then exposed to Julia, R, and Python bindings for analysis pipelines.
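A sketch of that sequential access pattern, reusing the hypothetical Tick record from above; this shows the shape of the loop, not the actual iterator implementation:

```cpp
#include <hdf5.h>
#include <algorithm>
#include <vector>

// Visit every record of one day's stream, reading in fixed-size blocks.
template <typename F>
void for_each_tick(hid_t dset, hid_t tick_type, F&& fn, hsize_t block = 65536) {
    hid_t fspace = H5Dget_space(dset);
    hsize_t total = 0;
    H5Sget_simple_extent_dims(fspace, &total, nullptr);  // 1-D extent

    std::vector<Tick> buf;
    for (hsize_t pos = 0; pos < total; pos += block) {
        hsize_t offset[1] = {pos};
        hsize_t count[1]  = {std::min(block, total - pos)};
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, offset, nullptr, count, nullptr);
        hid_t mspace = H5Screate_simple(1, count, nullptr);
        buf.resize(count[0]);
        H5Dread(dset, tick_type, mspace, fspace, H5P_DEFAULT, buf.data());
        H5Sclose(mspace);
        for (const Tick& t : buf) fn(t);  // hand each record to the caller
    }
    H5Sclose(fspace);
}
```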
Because the datatype is stored directly in the file, it’s also viewable with HDFView for inspection or debugging.
🚀 Performance Context
- Runs in an MPI cluster using C++, Julia, and Rcpp.
- Delivers excellent performance under both full-stream and filtered-access scenarios.
- Chunk size significantly affects access time; tuning it to your access pattern is essential (a creation sketch follows this list).
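For concreteness, a minimal sketch of creating such a chunked, extendible dataset; the 65536-record chunk is an assumed starting point to benchmark, not a recommendation:

```cpp
#include <hdf5.h>

// Create an empty, extendible 1-D dataset of compound records for one day.
hid_t create_day_dataset(hid_t file, const char* name, hid_t tick_type) {
    hsize_t dims[1]    = {0};
    hsize_t maxdims[1] = {H5S_UNLIMITED};
    hsize_t chunk[1]   = {65536};  // tune per workload: this drives I/O granularity

    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);  // chunking is required for unlimited extents

    hid_t dset = H5Dcreate2(file, name, tick_type, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;  // caller closes with H5Dclose
}
```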
🧠 Optimization Trade-Off
To your question, "how to access all or only some events efficiently": that is inherently a bi-objective optimization.
- You can gain speed by trading off space, e.g., pre-sorting, duplicating metadata (such as a per-event index, sketched below), or chunking strategically.
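One concrete version of that trade, as a hedged sketch: pre-sort the hits so each event's records are contiguous, and store a small (event_id, start, count) index dataset alongside the hit stream. Reading a single event then becomes one hyperslab read instead of a full scan. The EventIndex record and read_event helper are hypothetical, reusing the Hit record from the question:

```cpp
#include <hdf5.h>
#include <cstdint>
#include <vector>

struct EventIndex {
    uint64_t event_id;  // event identifier
    uint64_t start;     // offset of the event's first hit in the hits dataset
    uint64_t count;     // number of hits in the event
};

// Read exactly one event's hits via a single hyperslab selection.
std::vector<Hit> read_event(hid_t hits, hid_t hit_type, const EventIndex& ix) {
    hsize_t offset[1] = {ix.start};
    hsize_t count[1]  = {ix.count};

    hid_t fspace = H5Dget_space(hits);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, offset, nullptr, count, nullptr);
    hid_t mspace = H5Screate_simple(1, count, nullptr);

    std::vector<Hit> out(ix.count);
    H5Dread(hits, hit_type, mspace, fspace, H5P_DEFAULT, out.data());

    H5Sclose(mspace);
    H5Sclose(fspace);
    return out;  // no scan over unrelated hits
}
```

The index costs a few bytes per event, but single-event reads then scale with the event size rather than the file size.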
So while HDF5 might need more careful tuning than ROOT out of the box, it gives you portability and schema clarity across toolchains.
Hope this helps,
Steve