cgarryZA/lead_lag_structure_analysis

Lead–Lag Structure Analysis

This repository provides a pure, research-grade diagnostic for identifying and characterising lead–lag structure between two pre-processed time series.

It answers questions of the form:

“Does variation in series A tend to precede variation in series B, and over what lag window?”

The output is descriptive, not predictive. No trading assumptions, feasibility constraints, or execution logic are included.

Scope and Philosophy

This codebase is intentionally narrow in scope.

It is designed to:

Reveal temporal association structure

Characterise directionality (A → B vs B → A)

Describe where in lag space the association lives

Summarise shape, spread, and decay of the effect

It is not designed to:

Construct or validate trading strategies

Choose holding periods

Model execution, fees, or liquidity

Produce forecasts or P&L

Claim causality

Those questions belong downstream, in separate systems.

What This Repository Does

Given four pre-processed series:

signal_ab: signal driving A → B

target_ab: target for A → B

signal_ba: signal driving B → A

target_ba: target for B → A

(all aligned, cleaned, and transformed upstream)

the repository:

Computes Information Coefficients (ICs) across a lag range

Runs the analysis in both directions:

A → B

B → A

Supports two complementary modes:

Exploratory (shape inspection)

HAC-adjusted (inference under serial dependence)

Extracts lag-structure diagnostics:

Peak lag

Peak lag region

Integrated IC

Normalised IC mass

IC centroid

Decay half-life

Optionally produces:

IC plots

Structured JSON output

What This Repository Does Not Do

This repository does not:

Align time series

Handle missing data

Compute returns

Define forward horizons

Decide holding periods

Apply filters or transforms

Optimise parameters

Backtest strategies

All inputs must be prepared upstream.

This is a structural diagnostic, not a pipeline.

Core Methodology

Information Coefficient (IC)

For each lag k:

IC(k) = corr(signal_{t−k}, target_t)

Supported correlation types:

Spearman (rank-based)

Pearson (level-based)

Lag units are assumed to be uniform (e.g. days).
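The definition above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the repository's implementation: the names `pearson`, `ranks`, and `lagged_ic` are hypothetical, and the inputs are assumed to be equal-length lists of floats that are already aligned and cleaned.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Average ranks, 1-based; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for m in range(i, j + 1):
            out[order[m]] = (i + j) / 2 + 1
        i = j + 1
    return out

def lagged_ic(signal, target, max_lag, method="spearman"):
    """Return {k: IC(k)} with IC(k) = corr(signal[t-k], target[t])."""
    ics = {}
    for k in range(max_lag + 1):
        x = signal[: len(signal) - k]  # signal_{t-k}
        y = target[k:]                 # target_t
        if method == "spearman":
            x, y = ranks(x), ranks(y)  # Spearman = Pearson on ranks
        ics[k] = pearson(x, y)
    return ics
```

Each lag uses only the overlapping part of the two series, so the sample shrinks by k observations at lag k.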

Lag 0 Semantics

Lag 0 is treated explicitly as a contemporaneous baseline:

It represents synchronous movement

It is not interpreted as lead–lag causality

All structural metrics exclude lag 0 by design

This avoids accidental misuse of synchronous correlation as directional evidence.

Exploratory vs HAC Modes

Two modes are intentionally supported:

Exploratory

Computes IC values only

Fast and lightweight

Intended for shape inspection

HAC (Newey–West)

Computes standard errors and confidence intervals

Accounts for serial dependence

Intended for honest inference

Both modes operate on the same inputs and lag grid.
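The README does not spell out how the HAC standard errors are constructed. One common approach, shown here purely as an assumption, is to standardise both series so that the regression slope equals the Pearson IC, then apply a Newey–West (Bartlett-kernel) estimator to the score terms. The function name `hac_ic` and the parameter `nw_lags` are hypothetical, and the sketch ignores the estimation error in the sample means and variances.

```python
import math
import random

def hac_ic(signal, target, lag, nw_lags):
    """Return (ic, se): Pearson IC at one lag with a Newey-West standard error."""
    x = signal[: len(signal) - lag]  # signal_{t-lag}
    y = target[lag:]                 # target_t
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    zx = [(a - mx) / sx for a in x]
    zy = [(b - my) / sy for b in y]
    # With standardised inputs, the OLS slope of zy on zx is the Pearson correlation.
    ic = sum(a * b for a, b in zip(zx, zy)) / n
    # Score terms: regressor times residual. Their sample mean is zero.
    g = [a * (b - ic * a) for a, b in zip(zx, zy)]

    def autocov(j):
        return sum(g[t] * g[t - j] for t in range(j, n)) / n

    # Long-run variance with Bartlett weights 1 - j / (nw_lags + 1).
    lrv = autocov(0)
    for j in range(1, nw_lags + 1):
        lrv += 2 * (1 - j / (nw_lags + 1)) * autocov(j)
    se = math.sqrt(max(lrv, 0.0) / n)
    return ic, se

# Synthetic example: target follows signal by one step, plus noise.
rng = random.Random(0)
signal = [rng.gauss(0, 1) for _ in range(500)]
target = [0.0] + [0.5 * s + rng.gauss(0, 1) for s in signal[:-1]]
ic, se = hac_ic(signal, target, lag=1, nw_lags=5)
ci = (ic - 1.96 * se, ic + 1.96 * se)  # approximate 95% interval
```

Exploratory mode corresponds to keeping only `ic`; the HAC mode adds `se` and the interval, which widen when the score terms are positively autocorrelated.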

Peak Region Detection

Rather than selecting a single lag via argmax, the code identifies contiguous lag regions that satisfy criteria such as:

Fraction of peak IC

Statistical significance

Minimum width

This avoids over-interpreting noisy point estimates and highlights whether the effect is:

Sharp and localised, or

Broad and distributed
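A minimal sketch of region detection, assuming a uniform lag grid and using only the fraction-of-peak and minimum-width criteria (the significance criterion is omitted). The function `peak_region` and its parameters `frac` and `min_width` are hypothetical names, not the repository's API.

```python
def peak_region(ics, frac=0.5, min_width=1):
    """Return (start_lag, end_lag) of the contiguous region around the peak, or None.

    ics maps lag -> IC. Lag 0 is excluded, matching the lag-0 semantics above.
    """
    lags = sorted(k for k in ics if k != 0)
    peak = max(lags, key=lambda k: abs(ics[k]))
    sign = 1 if ics[peak] >= 0 else -1
    threshold = frac * abs(ics[peak])

    def qualifies(k):
        # Same sign as the peak and at least `frac` of its magnitude.
        return sign * ics[k] >= threshold

    lo = hi = lags.index(peak)
    while lo > 0 and qualifies(lags[lo - 1]):
        lo -= 1
    while hi < len(lags) - 1 and qualifies(lags[hi + 1]):
        hi += 1
    if hi - lo + 1 < min_width:
        return None
    return lags[lo], lags[hi]
```

The region grows outward from the peak and stops at the first lag that fails the threshold, so an isolated spike elsewhere on the grid is not merged into it.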

Integrated IC

Integrated IC measures the total IC mass over a lag window:

Raw integrated IC: sum of ICs

Normalised integrated IC: fraction of total absolute IC mass

This answers:

How much of the overall lead–lag structure lives in the peak region?
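As a sketch, both quantities can be computed from a lag-to-IC mapping. The name `integrated_ic` is hypothetical, and the normalisation shown (absolute IC mass inside the window divided by absolute IC mass over all non-zero lags) is one reading of the description above, not a confirmed detail of the code.

```python
def integrated_ic(ics, start, end):
    """Return (raw, normalised) integrated IC over lags start..end, excluding lag 0."""
    window = [v for k, v in ics.items() if start <= k <= end and k != 0]
    raw = sum(window)
    total_abs = sum(abs(v) for k, v in ics.items() if k != 0)
    if total_abs == 0:
        return raw, float("nan")  # no IC mass anywhere, so the fraction is undefined
    normalised = sum(abs(v) for v in window) / total_abs
    return raw, normalised
```

For example, with ICs of 0.1, 0.2, 0.4, 0.3 at lags 1 to 4, the window 2..4 carries a raw integrated IC of about 0.9 and about 90% of the absolute mass.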

IC Centroid

The IC centroid provides a centre-of-mass estimate in lag space:

centroid = Σ(k · IC(k)) / Σ(IC(k))

It represents the effective timing of information transmission.
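The formula translates directly into code. This sketch uses a hypothetical function name, `ic_centroid`, and excludes lag 0 in line with the lag-0 semantics above. It returns NaN when the ICs sum to zero, since the ratio is then undefined; mixed-sign ICs can also push the centroid outside the lag range, which is worth checking before interpreting it.

```python
def ic_centroid(ics):
    """Centre of mass in lag space: sum(k * IC(k)) / sum(IC(k)), lag 0 excluded."""
    pairs = [(k, v) for k, v in ics.items() if k != 0]
    denom = sum(v for _, v in pairs)
    if denom == 0:
        return float("nan")  # undefined when positive and negative ICs cancel
    return sum(k * v for k, v in pairs) / denom
```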

Decay Metrics

IC is treated as a decay curve over lag:

Peak lag

Peak IC

Half-life after the peak

These are descriptive diagnostics only, not forecasts.
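The README does not define the half-life precisely. One simple reading, assumed here, is the number of lags after the peak until the IC first falls to half of its peak value or below. The name `decay_metrics` and the returned keys are hypothetical.

```python
def decay_metrics(ics):
    """Return peak lag, peak IC, and a half-life in lag units (None if never reached)."""
    lags = sorted(k for k in ics if k != 0)
    peak_lag = max(lags, key=lambda k: abs(ics[k]))
    peak_ic = ics[peak_lag]
    sign = 1 if peak_ic >= 0 else -1
    half = abs(peak_ic) / 2
    half_life = None
    for k in lags:
        # First lag after the peak where the signed IC has decayed to half or less.
        if k > peak_lag and sign * ics[k] <= half:
            half_life = k - peak_lag
            break
    return {"peak_lag": peak_lag, "peak_ic": peak_ic, "half_life": half_life}
```

A `None` half-life means the IC had not decayed to half its peak by the end of the lag grid, which usually suggests the grid is too short for the effect being measured.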

Directional Asymmetry

The repository explicitly compares:

A → B

B → A

over the same lag window.

Directional asymmetry is summarised as:

ΔIC = IntegratedIC(A → B) − IntegratedIC(B → A)

This highlights dominant directionality without claiming causation.
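A short sketch of the asymmetry summary, assuming raw integrated IC (a plain sum over the window, lag 0 excluded) in both directions. The function name `directional_asymmetry` is hypothetical.

```python
def directional_asymmetry(ics_ab, ics_ba, start, end):
    """Delta IC = IntegratedIC(A -> B) - IntegratedIC(B -> A) over lags start..end."""
    def integrated(ics):
        return sum(v for k, v in ics.items() if start <= k <= end and k != 0)
    return integrated(ics_ab) - integrated(ics_ba)
```

A positive value indicates that A → B carries more IC mass than B → A over the window, and a negative value the reverse.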

Outputs

If an output_dir is provided, the analysis produces:

output_dir/
├── A_to_B_exploratory.png
├── A_to_B_hac.png
├── B_to_A_exploratory.png
├── B_to_A_hac.png
└── results.json

Plots are optional and purely illustrative

results.json is structured and machine-readable

The main function always returns results as a Python dict

File output is optional, enabling clean in-memory pipelines.
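The return-always, write-optionally pattern described above might look like the following sketch. The helper `finalize_results` and the keys in the example dict are hypothetical; only the `results.json` filename and the `output_dir` parameter come from this README.

```python
import json
import tempfile
from pathlib import Path

def finalize_results(results, output_dir=None):
    """Always return the results dict; write results.json only if output_dir is given."""
    if output_dir is not None:
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        (out / "results.json").write_text(json.dumps(results, indent=2))
    return results

# Illustrative payload; the real schema is defined by the repository.
results = {"A_to_B": {"peak_lag": 3}, "B_to_A": {"peak_lag": 1}}

# In-memory use: nothing is written to disk.
in_memory = finalize_results(results)

# File output: the same dict is also serialised to output_dir/results.json.
with tempfile.TemporaryDirectory() as tmp:
    finalize_results(results, output_dir=tmp)
    reloaded = json.loads((Path(tmp) / "results.json").read_text())
```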

Intended Usage Pattern

This repository is designed to sit between:

an upstream data / transform layer, and

a downstream modelling or strategy layer

Typical flow:

Raw Data
↓
Question-specific preprocessing
↓
Lead–Lag Structure Analysis ← (this repository)
↓
Interpretation / modelling / trading logic

Summary

This repository answers one precise question:

“Is there evidence of a lead–lag relationship, and what does its structure look like?”

It deliberately avoids answering:

“Can this be traded?”

That separation is intentional and enforced in code.

About

Statistical lead–lag analysis for paired time series with overlapping returns using IC curves and HAC inference.
