Best Way to Store Tick-by-Tick Stock Data in HDF5
The Problem (OP asks)
I'm storing stock tick-by-tick data—what’s the best structure in HDF5 to handle this efficiently and flexibly?
Perspective from H5CPP (Steven Varga, May 19, 2018)
If you’re working with high-frequency time series (HFT-style), H5CPP is tailor-made for event-based datasets. It plays well with std::vector<> of your POD structs and most linear algebra stacks (e.g., Armadillo).
Two Design Patterns to Consider
Extendable Multi-Dimensional Cube
You can model your tick data as an extendable cube (or a dataset with even more dimensions), then slice out what you need.
Armadillo can pull in cubes directly. H5CPP supports partial read/write, so you can map arbitrary hyperslabs into:
- arma::cube (3D)
- arma::mat (2D)
- std::vector<YourPOD> (1D)
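To make the hyperslab idea concrete, here is a minimal in-memory sketch in plain C++. It does not call H5CPP or HDF5; the `Cube` type, the `read_slab` helper, and the time × instrument × field axis order are all assumptions made for illustration. It only shows how an offset/count pair picks a rectangular block out of a row-major cube, which is the same selection a partial read describes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Index3 = std::array<std::size_t, 3>;

// Hypothetical in-memory stand-in for a 3D dataset stored in row-major order.
// Assumed axis order: {time, instrument, field}.
struct Cube {
    Index3 dims;               // extent of each axis
    std::vector<double> data;  // dims[0] * dims[1] * dims[2] values

    explicit Cube(const Index3& d) : dims(d), data(d[0] * d[1] * d[2], 0.0) {}

    double& at(std::size_t t, std::size_t i, std::size_t f) {
        return data[(t * dims[1] + i) * dims[2] + f];
    }
    double at(std::size_t t, std::size_t i, std::size_t f) const {
        return data[(t * dims[1] + i) * dims[2] + f];
    }
};

// Copy the block that starts at `offset` and spans `count` elements per axis.
// The result is itself row-major with shape `count`.
std::vector<double> read_slab(const Cube& cube, const Index3& offset,
                              const Index3& count) {
    for (std::size_t k = 0; k < 3; ++k) {
        if (offset[k] + count[k] > cube.dims[k]) {
            throw std::out_of_range("slab exceeds cube extent");
        }
    }
    std::vector<double> out;
    out.reserve(count[0] * count[1] * count[2]);
    for (std::size_t t = 0; t < count[0]; ++t)
        for (std::size_t i = 0; i < count[1]; ++i)
            for (std::size_t f = 0; f < count[2]; ++f)
                out.push_back(cube.at(offset[0] + t, offset[1] + i, offset[2] + f));
    return out;
}
```

A slab with a count of 1 along one axis is effectively a matrix, and a count of 1 along two axes is a vector, which is why the same selection mechanism can feed 3D, 2D, or 1D containers.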
This gives you a lot of flexibility when you need efficient access to time-instrument windows. A single multi-dimensional dataset can model all your data, but it's harder to manage day boundaries or rolling windows that way.
So instead, keep things simple:
- Use one daily stream for your high-frequency irregular time series (IRTS)—packed as a 1D sequence.
- Use daily matrices for regular time series (RTS) when appropriate.
- Then stitch these together using a circular buffer to model your clock window (e.g., last N days).
That gives you clean separation for day-level operations, easy rolling windows across days, and better control inside your streaming logic.
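The stitching step can be sketched as a small ring of per-day vectors. This is an in-memory illustration only, under the assumption that each day's stream has already been read into a `std::vector`; the `Tick` fields and the `RollingWindow` class are hypothetical names, not part of H5CPP.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical tick record; the field names are assumptions.
struct Tick {
    double time;
    double price;
    double size;
};

using DayStream = std::vector<Tick>;  // one day's irregular time series

// Keeps the last N daily streams in a circular buffer.
class RollingWindow {
public:
    explicit RollingWindow(std::size_t n_days) : slots_(n_days) {
        if (n_days == 0) throw std::invalid_argument("window needs at least one day");
    }

    // Store a new day, overwriting the oldest one once the window is full.
    void push_day(DayStream day) {
        slots_[head_] = std::move(day);
        head_ = (head_ + 1) % slots_.size();
        if (filled_ < slots_.size()) ++filled_;
    }

    std::size_t days() const { return filled_; }

    // Concatenate the stored days from oldest to newest.
    std::vector<Tick> stitch() const {
        std::vector<Tick> out;
        const std::size_t n = slots_.size();
        const std::size_t start = (head_ + n - filled_) % n;  // index of the oldest day
        for (std::size_t k = 0; k < filled_; ++k) {
            const DayStream& day = slots_[(start + k) % n];
            out.insert(out.end(), day.begin(), day.end());
        }
        return out;
    }

private:
    std::vector<DayStream> slots_;
    std::size_t head_ = 0;    // slot the next day will be written to
    std::size_t filled_ = 0;  // number of days currently held
};
```

Because each slot is a whole day, dropping the oldest day is a single overwrite, and day-level operations never have to search for a boundary inside a larger dataset.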
Avoid Tables for Tick Data
HDF5 tables (i.e., compound heterogeneous datasets) are cumbersome:
- Hard to use in linear algebra contexts
- Not easily mapped to vectors or matrices
- Often less performant for numerical processing
Stick with floats or doubles in a numeric dataset, and keep your tick schema POD-friendly.
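As a rough illustration of what "POD-friendly" can mean in practice, the sketch below defines an all-double tick record, checks its layout at compile time, and flattens a vector of ticks into a row-major n × 3 block of doubles. The `Tick` fields and the `to_matrix` helper are hypothetical, and nothing here calls H5CPP; it only shows why a homogeneous record maps cleanly onto a numeric dataset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical tick record made of doubles only; the field names are assumptions.
struct Tick {
    double time;
    double price;
    double size;
};

// Compile-time checks that the record can be treated as a flat run of doubles.
static_assert(std::is_trivially_copyable<Tick>::value, "Tick must be trivially copyable");
static_assert(std::is_standard_layout<Tick>::value, "Tick must have standard layout");
static_assert(sizeof(Tick) == 3 * sizeof(double), "Tick must not contain padding");

// Flatten ticks into a row-major n x 3 block: one row per tick,
// columns in declaration order (time, price, size).
std::vector<double> to_matrix(const std::vector<Tick>& ticks) {
    std::vector<double> m(ticks.size() * 3);
    if (!ticks.empty()) {
        std::memcpy(m.data(), ticks.data(), m.size() * sizeof(double));
    }
    return m;
}
```

If a field of a different type (say, an integer flag) were added, the padding check would likely fail or the copy would no longer be a plain block of doubles, which is the kind of friction the advice above is trying to avoid.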
Summary of Approaches
Pattern | Use Case | Notes with H5CPP |
---|---|---|
Extendable Cube | Multi-dimensional, flexible slicing | Easy high-dimensional slicing, integrates with Armadillo/STL |
Daily Stream + Circular Buffer | Day-partitioned, rolling windows | Simpler structure, rolling window stitching |
HDF5 Tables (compound) | Rich schema, less focus on math | Clunky for numeric analysis, avoid if possible |
TL;DR
H5CPP shines for stock tick data. Model your data as:
- A hybrid of daily streams and rolling buffers, or
- A flexible extendable cube if you want high-dimensional slices.
Either way, avoid compound table datasets: plain numeric datasets are faster to process and keep interop with your analysis stack clean.