Best Way to Store Tick-by-Tick Stock Data in HDF5

The Problem (OP asks)

I'm storing stock tick-by-tick data—what’s the best structure in HDF5 to handle this efficiently and flexibly?

Perspective from H5CPP (Steven Varga, May 19, 2018)

If you’re working with high-frequency time series (HFT-style), H5CPP is tailor-made for event-based datasets. It plays well with std::vector<> of your POD structs and most linear algebra stacks (e.g., Armadillo).

Two Design Patterns to Consider

Extendable Multi-Dimensional Cube

You can model your tick data as an extendable cube (or even higher dimensions), then slice what you need:

h5::create(fd, {n_minutes_in_day, n_instruments, H5S_UNLIMITED}); // shorthand: dataset extendable along the last dimension

Armadillo can pull in cubes directly. H5CPP supports partial read/write, so you can map arbitrary hyperslabs into:

  • arma::cube (3D)
  • arma::mat (2D)
  • std::vector<YourPOD> (1D)
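To make the hyperslab idea concrete, here is a minimal sketch using only the standard library. It is not H5CPP code: `Cube`, `Dims`, and `read_hyperslab` are hypothetical names, and the cube is an in-memory, row-major stand-in for a {minutes, instruments, days} dataset. It only illustrates how an (offset, count) pair selects a sub-block that lands in a flat buffer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical in-memory stand-in for a {minutes, instruments, days} dataset,
// stored row-major (the last index varies fastest).
struct Cube {
    std::size_t n_min, n_inst, n_day;
    std::vector<double> data;

    Cube(std::size_t m, std::size_t i, std::size_t d)
        : n_min(m), n_inst(i), n_day(d), data(m * i * d, 0.0) {}

    // Linear position of (minute, instrument, day) in the flat buffer.
    std::size_t index(std::size_t m, std::size_t i, std::size_t d) const {
        assert(m < n_min && i < n_inst && d < n_day);
        return (m * n_inst + i) * n_day + d;
    }

    double& at(std::size_t m, std::size_t i, std::size_t d) {
        return data[index(m, i, d)];
    }
};

using Dims = std::array<std::size_t, 3>;

// Copy the block that starts at `offset` and spans `count` elements per
// dimension into a flat row-major buffer, similar in spirit to a partial read.
std::vector<double> read_hyperslab(const Cube& c, const Dims& offset,
                                   const Dims& count) {
    if (offset[0] + count[0] > c.n_min || offset[1] + count[1] > c.n_inst ||
        offset[2] + count[2] > c.n_day) {
        throw std::out_of_range("hyperslab exceeds cube extent");
    }
    std::vector<double> out;
    out.reserve(count[0] * count[1] * count[2]);
    for (std::size_t m = 0; m < count[0]; ++m)
        for (std::size_t i = 0; i < count[1]; ++i)
            for (std::size_t d = 0; d < count[2]; ++d)
                out.push_back(
                    c.data[c.index(offset[0] + m, offset[1] + i, offset[2] + d)]);
    return out;
}
```

With offset {1, 0, 1} and count {2, 3, 1}, the result is a 2 x 3 block: minutes 1 and 2, all three instruments, day 1. Note that this sketch is row-major while Armadillo stores data column-major, so a real mapping into arma::mat has to account for the layout.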

This gives you a lot of flexibility when you need to access time-instrument windows efficiently. A single multi-dimensional dataset can model all your data, but it's harder to manage day boundaries or rolling windows.

Daily Streams with a Circular Buffer

The simpler alternative is to partition by day:

  • Use one daily stream for your high-frequency irregular time series (IRTS)—packed as a 1D sequence.
  • Use daily matrices for regular time series (RTS) when appropriate.
  • Then stitch these together using a circular buffer to model your clock window (e.g., last N days).

That gives you clean separation for day-level operations, easy rolling windows across days, and better control inside your streaming logic.
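The rolling window over daily streams could look roughly like the sketch below. It uses only the standard library and leaves out the HDF5 I/O. `Tick`, `DayStream`, and `DayWindow` are hypothetical names chosen for illustration, and the three tick fields are assumptions rather than a recommended schema.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical tick record; the fields are illustrative only.
struct Tick {
    double time;   // e.g. seconds since midnight
    double price;
    double size;
};

using DayStream = std::vector<Tick>;  // one day's irregular time series, 1D

// Fixed-capacity ring holding the most recent N daily streams.
class DayWindow {
public:
    explicit DayWindow(std::size_t capacity) : days_(capacity) {
        if (capacity == 0) throw std::invalid_argument("capacity must be > 0");
    }

    // Append a day; once the ring is full the oldest day is overwritten.
    void push(DayStream day) {
        assert(!days_.empty());
        days_[head_] = std::move(day);
        head_ = (head_ + 1) % days_.size();
        if (count_ < days_.size()) ++count_;
    }

    std::size_t size() const { return count_; }

    // age 0 is the most recent day, age size() - 1 the oldest in the window.
    const DayStream& day(std::size_t age) const {
        if (age >= count_) throw std::out_of_range("day not in window");
        return days_[(head_ + days_.size() - 1 - age) % days_.size()];
    }

    // Total number of ticks currently covered by the window.
    std::size_t tick_count() const {
        std::size_t n = 0;
        for (std::size_t age = 0; age < count_; ++age) n += day(age).size();
        return n;
    }

private:
    std::vector<DayStream> days_;
    std::size_t head_ = 0;   // slot the next push will write to
    std::size_t count_ = 0;  // number of valid days, at most days_.size()
};
```

Each slot would be filled from one daily dataset, so rolling the window forward is a single `push` and day-level operations never cross a slot boundary.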

Avoid Tables for Tick Data

HDF5 tables (i.e., compound heterogeneous datasets) are cumbersome:

  • Hard to use in linear algebra contexts
  • Not easily mapped to vectors or matrices
  • Often less performant for numerical processing

Stick with floats or doubles in a numeric dataset, and keep your tick schema POD-friendly.
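As a rough illustration of a POD-friendly schema feeding a plain numeric dataset, the sketch below checks the struct's layout at compile time and flattens a vector of ticks into an n x 3 row-major buffer of doubles. The `Tick` fields and the helpers `to_matrix` and `column` are hypothetical names, not part of H5CPP.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical tick schema: every field is a double, so a tick is
// effectively one row of a numeric matrix.
struct Tick {
    double time;
    double price;
    double size;
};

constexpr std::size_t kFields = 3;

// Compile-time checks that the struct really is plain data with no padding.
static_assert(std::is_trivially_copyable<Tick>::value, "Tick must be trivially copyable");
static_assert(std::is_standard_layout<Tick>::value, "Tick must be standard layout");
static_assert(sizeof(Tick) == kFields * sizeof(double), "Tick must have no padding");

// Flatten ticks into an n x kFields row-major buffer of doubles.
std::vector<double> to_matrix(const std::vector<Tick>& ticks) {
    std::vector<double> m;
    m.reserve(ticks.size() * kFields);
    for (const Tick& t : ticks) {
        m.push_back(t.time);
        m.push_back(t.price);
        m.push_back(t.size);
    }
    assert(m.size() == ticks.size() * kFields);
    return m;
}

// Pull one column (0 = time, 1 = price, 2 = size) out of the flat buffer.
std::vector<double> column(const std::vector<double>& m, std::size_t col) {
    assert(col < kFields && m.size() % kFields == 0);
    std::vector<double> out;
    out.reserve(m.size() / kFields);
    for (std::size_t r = 0; r * kFields < m.size(); ++r)
        out.push_back(m[r * kFields + col]);
    return out;
}
```

Because every field has the same type, the same bytes can be treated either as a sequence of records or as a numeric matrix, which is what keeps the data usable from linear algebra code.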

Summary of Approaches

| Pattern                 | Use Case                            | H5CPP-Enabled Benefits                        |
|-------------------------|-------------------------------------|-----------------------------------------------|
| Extendable Cube         | Multi-dimensional, flexible slicing | Easy high-D slicing, integrates with arma/STL |
| Daily Stream + Circular | Day-partitioned, rolling windows    | Simpler structure, rolling window stitching   |
| HDF5 Tables (compound)  | Rich schema, less focus on math     | Clunky for numeric analysis, avoid if possible |

TL;DR

H5CPP shines for stock tick data. Model your data as:

  • A hybrid of daily streams and rolling buffers, or
  • A flexible extendable cube if you want high-D slices.

Either way, avoid compound table datasets: plain numeric datasets are faster to process and interoperate cleanly with your analysis stack.