Structuring Historical Options Data in HDF5—H5CPP Tips
The Problem (Dan E, Oct 27, 2014)
I have historical equity options data in daily CSVs: one file per day, with around 700k rows covering ~4k symbols. Each row has about a dozen fields (symbol, strike, maturity, open/high/low/close, volume, etc.). I want fast access for both:
- All options on a given day for a symbol or list of symbols
- Time series for a specific option across multiple dates
My current approach builds a Pandas DataFrame for each day and stores it under an 'OPTIONS' group in HDF5. But reading a few symbols pulls in the entire day's data, which is a huge overhead, and fetching a specific contract across many dates means loading many files.
How should I structure this? Use one big file or many? Hierarchy? And any recommendations for Python access (like Pandas)?
H5CPP Wisdom (Steven Varga, Oct 27, 2014)
Steve offers a templated, high-performance structure tailored to daily partitioning and fast indexing:
- Hash symbols to numeric indices, and use those for indexing instead of strings.
- Name data blocks by date, e.g. `2014-01-01.mat`, so each day's data is self-contained.
- Enable chunking and high compression to balance I/O throughput and file size.
- Treat irregular time series (e.g., tick-by-tick events) differently from regular ones:
  - Use HDF5 custom datatypes with "pocket tables" for compact, sequential access of irregular data.
  - For regular time series (e.g., OHLC candles), dense N-dimensional slabs (e.g. [instrument, time, OHLC]) of `float` values are efficient. A minimal sketch of the hashing and slab layout follows this list.
- If you're running this on a parallel filesystem with MPI and PHDF5, you can achieve throughput and storage efficiency that rivals, and may surpass, SQL systems.
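As a rough illustration of the hashing, chunking, and compression advice above, here is a minimal h5py sketch (H5CPP itself is a C++ library; the dataset names `ohlc` and `symbols`, the sorted-enumeration mapping used in place of a hash function, and the bar count are assumptions for this example, not part of H5CPP):

```python
import h5py
import numpy as np

symbols = ["AAPL", "MSFT", "SPY"]                             # ~4k in practice
symbol_index = {s: i for i, s in enumerate(sorted(symbols))}  # string -> numeric index

# Dense regular series: one [instrument, time, OHLC] slab per day,
# chunked per instrument and gzip-compressed.
bars_per_day = 390                                            # e.g. 1-minute bars
ohlc = np.zeros((len(symbols), bars_per_day, 4), dtype=np.float32)

with h5py.File("2014-01-01.mat", "w") as f:
    f.create_dataset(
        "ohlc",
        data=ohlc,
        chunks=(1, bars_per_day, 4),   # one chunk per instrument -> selective reads
        compression="gzip",
        compression_opts=9,            # trade CPU for a small on-disk footprint
    )
    # Keep the symbol -> numeric-index mapping alongside the data for later lookups.
    f.create_dataset("symbols", data=[s.encode() for s in sorted(symbols)])
```

Chunking along the instrument axis is what makes per-symbol reads cheap later on: only that instrument's chunks need to be read and decompressed.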
H5CPP Approach in Practice
Layout Strategy
Inside a daily file:
- Datasets keyed by hashed symbol IDs or structured arrays.
- Regular series: stored as compact multidimensional slabs.
- Irregular data: structured as compact “pocket tables” with custom datatypes (a pocket-table sketch follows this list).
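For the irregular side, a "pocket table" maps naturally onto an appendable, chunked dataset with a compound datatype. A minimal h5py sketch under that assumption; the field names and widths are illustrative only, not taken from H5CPP:

```python
import h5py
import numpy as np

# Compound (structured) datatype for irregular tick/event rows;
# field names and widths here are illustrative only.
tick_dtype = np.dtype([
    ("symbol_id", np.uint32),   # hashed/numeric symbol index
    ("timestamp", np.int64),    # e.g. nanoseconds since midnight
    ("strike",    np.float32),
    ("price",     np.float32),
    ("volume",    np.uint32),
])

with h5py.File("2014-01-01.mat", "a") as f:
    # Appendable, chunked, compressed 1-D record table: roughly what the
    # "pocket table" idea looks like in h5py terms.
    tbl = f.create_dataset(
        "ticks",
        shape=(0,),
        maxshape=(None,),        # unlimited rows, so batches can be appended
        dtype=tick_dtype,
        chunks=(4096,),
        compression="gzip",
    )
    batch = np.zeros(1000, dtype=tick_dtype)   # placeholder rows
    tbl.resize(tbl.shape[0] + len(batch), axis=0)
    tbl[-len(batch):] = batch
```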
Python Access Pattern
```python
# Pseudocode using h5py or other Python HDF5 bindings
import h5py

with h5py.File("2014-01-01.mat", "r") as f:
    # 'options' is assumed to be a dataset indexed by numeric symbol ID
    data = f['options'][symbol_id]  # fast, direct index into one symbol's rows
```
For a time series across dates:
```python
import h5py

df_series = []
for date in dates:
    with h5py.File(f"{date}.mat", "r") as f:
        df_series.append(f['options'][symbol_id])
# Combine into one timeline (see the pandas sketch below)
```
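Since the original question asks about Pandas, one hedged way to stitch the per-day arrays into a single timeline, assuming each read returns a structured (record) array:

```python
import pandas as pd

# 'df_series' and 'dates' pair up one-to-one; each element of df_series is
# assumed to be a structured array read from that day's file.
timeline = pd.concat(
    [pd.DataFrame(day).assign(date=d) for day, d in zip(df_series, dates)],
    ignore_index=True,
).sort_values("date")
```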
Benefits
- Selective reads: fetch only what's needed, e.g. a single symbol per day (see the slicing sketch after this list).
- Efficient storage: the chunked, compressed format minimizes the disk footprint.
- Scalable throughput, especially when using MPI + PHDF5 on parallel filesystems.
- Language-agnostic: H5CPP's type mappings and structuring make the data accessible from C++, Python, Julia, etc.
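To make the selective-read point concrete: with the per-instrument chunking assumed in the earlier slab sketch, one symbol's day can be pulled out as a hyperslab without touching the rest of the file (the `symbols` and `ohlc` dataset names follow that assumed layout):

```python
import h5py

sym = "MSFT"  # hypothetical symbol of interest

with h5py.File("2014-01-01.mat", "r") as f:
    symbols = [s.decode() for s in f["symbols"][...]]
    idx = symbols.index(sym)            # string -> numeric index lookup
    day_bars = f["ohlc"][idx, :, :]     # hyperslab read: only this symbol's chunks
```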