Structuring Historical Options Data in HDF5—H5CPP Tips
The Problem (Dan E, Oct 27, 2014)
I have historical equity options data in daily CSVs: one file per day, with around 700k rows covering ~4k symbols. Each row has about a dozen fields (symbol, strike, maturity, open/high/low/close, volume, etc.). I want fast access for both:
- All options on a given day for a symbol or list of symbols
- Time series for a specific option across multiple dates
My current approach builds a Pandas DataFrame for each day and stores it under an 'OPTIONS' group in HDF5. But reading a few symbols pulls in the entire day's data, which is a huge overhead, and fetching a specific contract across many dates means loading many files.
How should I structure this? Use one big file or many? Hierarchy? And any recommendations for Python access (like Pandas)?
H5CPP Wisdom (Steven Varga, Oct 27, 2014)
Steve offers a templated, high-performance structure tailored to daily partitioning and fast indexing:
- Hash symbols to numeric indices, and use those for indexing instead of strings.
- Name data blocks by date, e.g. `2014-01-01.mat`, so each day's data is self-contained.
- Enable chunking and high compression to balance I/O throughput and file size.
- Treat irregular time series (e.g., tick-by-tick events) differently from regular ones:
  - Use HDF5 custom datatypes with "pocket tables" for compact, sequential access of irregular data.
  - For regular time series (e.g., OHLC candles), dense N-dimensional slabs (e.g. [instrument, time, OHLC]) of `float` values are efficient. A minimal sketch of the hashing and slab layout follows this list.
- If you're running this on a parallel filesystem with MPI and PHDF5, you can achieve throughput and storage efficiency that rivals, and may surpass, SQL systems.
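As a rough illustration of the hashing, chunking, and compression advice above, here is a minimal h5py sketch (H5CPP itself is a C++ library; the dataset names `ohlc` and `symbols`, the sorted-enumeration mapping used in place of a hash function, and the bar count are assumptions for this example, not part of H5CPP):

```python
import h5py
import numpy as np

symbols = ["AAPL", "MSFT", "SPY"]                             # ~4k in practice
symbol_index = {s: i for i, s in enumerate(sorted(symbols))}  # string -> numeric index

# Dense regular series: one [instrument, time, OHLC] slab per day,
# chunked per instrument and gzip-compressed.
bars_per_day = 390                                            # e.g. 1-minute bars
ohlc = np.zeros((len(symbols), bars_per_day, 4), dtype=np.float32)

with h5py.File("2014-01-01.mat", "w") as f:
    f.create_dataset(
        "ohlc",
        data=ohlc,
        chunks=(1, bars_per_day, 4),   # one chunk per instrument -> selective reads
        compression="gzip",
        compression_opts=9,            # trade CPU for a small on-disk footprint
    )
    # Keep the symbol -> numeric-index mapping alongside the data for later lookups.
    f.create_dataset("symbols", data=[s.encode() for s in sorted(symbols)])
```

Chunking along the instrument axis is what makes per-symbol reads cheap later on: only that instrument's chunks need to be read and decompressed.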
H5CPP Approach in Practice
Layout Strategy
Inside a daily file:
- Datasets keyed by hashed symbol IDs or structured arrays.
- Regular series: stored as compact multidimensional slabs.
- Irregular data: structured as compact “pocket tables” with custom datatypes (a pocket-table sketch follows this list).
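For the irregular side, a "pocket table" maps naturally onto an appendable, chunked dataset with a compound datatype. A minimal h5py sketch under that assumption; the field names and widths are illustrative only, not taken from H5CPP:

```python
import h5py
import numpy as np

# Compound (structured) datatype for irregular tick/event rows;
# field names and widths here are illustrative only.
tick_dtype = np.dtype([
    ("symbol_id", np.uint32),   # hashed/numeric symbol index
    ("timestamp", np.int64),    # e.g. nanoseconds since midnight
    ("strike",    np.float32),
    ("price",     np.float32),
    ("volume",    np.uint32),
])

with h5py.File("2014-01-01.mat", "a") as f:
    # Appendable, chunked, compressed 1-D record table: roughly what the
    # "pocket table" idea looks like in h5py terms.
    tbl = f.create_dataset(
        "ticks",
        shape=(0,),
        maxshape=(None,),        # unlimited rows, so batches can be appended
        dtype=tick_dtype,
        chunks=(4096,),
        compression="gzip",
    )
    batch = np.zeros(1000, dtype=tick_dtype)   # placeholder rows
    tbl.resize(tbl.shape[0] + len(batch), axis=0)
    tbl[-len(batch):] = batch
```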
Python Access Pattern
```python
# Pseudocode using h5py or other Python HDF5 bindings
import h5py

with h5py.File("2014-01-01.mat", "r") as f:
    # 'options' is assumed to be a dataset indexed by numeric symbol ID
    data = f['options'][symbol_id]  # fast, direct index into one symbol's rows
```
For a time series across dates:
```python
import h5py

df_series = []
for date in dates:
    with h5py.File(f"{date}.mat", "r") as f:
        df_series.append(f['options'][symbol_id])
# Combine into one timeline (see the pandas sketch below)
```
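Since the original question asks about Pandas, one hedged way to stitch the per-day arrays into a single timeline, assuming each read returns a structured (record) array:

```python
import pandas as pd

# 'df_series' and 'dates' pair up one-to-one; each element of df_series is
# assumed to be a structured array read from that day's file.
timeline = pd.concat(
    [pd.DataFrame(day).assign(date=d) for day, d in zip(df_series, dates)],
    ignore_index=True,
).sort_values("date")
```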
Benefits
- Selective reads: fetch only what's needed, e.g. a single symbol per day (see the slicing sketch after this list).
- Efficient storage: the chunked, compressed format minimizes the disk footprint.
- Scalable throughput, especially when using MPI + PHDF5 on parallel filesystems.
- Language-agnostic: H5CPP's type mappings and structuring make the data accessible from C++, Python, Julia, etc.
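To make the selective-read point concrete: with the per-instrument chunking assumed in the earlier slab sketch, one symbol's day can be pulled out as a hyperslab without touching the rest of the file (the `symbols` and `ohlc` dataset names follow that assumed layout):

```python
import h5py

sym = "MSFT"  # hypothetical symbol of interest

with h5py.File("2014-01-01.mat", "r") as f:
    symbols = [s.decode() for s in f["symbols"][...]]
    idx = symbols.index(sym)            # string -> numeric index lookup
    day_bars = f["ohlc"][idx, :, :]     # hyperslab read: only this symbol's chunks
```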