evanpeikon/CausalEdge

🧬 Abstract

Network inference from time-series proteomics data remains a challenge in systems biology, with most approaches unable to distinguish direct regulatory relationships from indirect associations mediated by intermediate proteins. CausalEdge addresses this challenge by combining Granger Causality testing with bootstrapped mediation analysis to construct directed causal networks representing direct regulatory relationships. The model operates through a four-stage pipeline:

  • First, CausalEdge performs initial filtering to find protein pairs that vary together, which narrows down the set of candidate pairs for testing.
  • Second, it performs Granger Causality tests on each candidate pair to determine whether past values of protein $A$ help predict future values of protein $B$ better than protein $B$'s history alone, establishing temporal precedence with false discovery rate correction.
  • Third, it builds a directed network graph in which proteins are represented as nodes and directed edges represent causal relationships, with each edge indicating whether protein $A$ positively or negatively influences changes in protein $B$.
  • Fourth, it performs mediation analysis to determine whether observed causal relationships are mediated by an intermediate variable. For example, if protein $A$ causally influences both protein $B$ and protein $C$, and protein $B$ causally influences protein $C$, it is possible that protein $A$'s effect on protein $C$ is indirect and mediated by protein $B$. In such cases, conditional Granger Causality tests assess whether the $A$→$C$ relationship becomes non-significant when controlling for $B$; if so, the edge between nodes $A$ and $C$ is pruned.

By quantifying the proportion of causal effect mediated using bootstrap confidence intervals, CausalEdge provides statistically rigorous identification of spurious edges, yielding cleaner, high-confidence networks that more accurately reflect direct regulatory relationships. The model is applicable to time-series proteomics data from traditional high-throughput platforms and emerging wearable molecular biosensors, enabling more interpretable systems-level analysis of protein regulatory dynamics. An important application of CausalEdge is generating condition-specific or subject-specific causal relationship data that can be integrated into biological knowledge graphs. This capability helps address the challenge of molecular moonlighting, where a given protein's function depends on context, by explicitly capturing how protein regulatory roles vary across different biological states.

🧬 Introduction

1.1 The Challenge of Network Inference in Proteomics

Understanding causal and regulatory relationships among proteins is fundamental to systems biology, with applications ranging from disease mechanism elucidation to therapeutic target identification. While high-throughput proteomics technologies have enabled comprehensive measurement of protein abundance across conditions and time points, inferring the underlying regulatory network structure remains challenging. Traditional approaches based on correlation or differential expression analysis can identify co-varying proteins but cannot distinguish direct regulatory relationships from indirect associations [1,2].

The distinction between direct and indirect relationships has important implications for biological interpretation. Consider a scenario where protein A regulates protein B, and protein B in turn regulates protein C (A→B→C). Standard network inference methods would typically report edges for A→B, B→C, and A→C. However, the A→C relationship is indirect, mediated entirely through B. Including such indirect edges in network representations obscures the true regulatory architecture, complicates hub identification, and can mislead downstream pathway analysis and therapeutic targeting efforts [3,4].

A related challenge in proteomics is molecular moonlighting: the phenomenon where individual proteins perform different functions depending on cellular context, localization, post-translational modifications, or disease state [24,25]. Static protein interaction databases cannot capture this context-dependence, making it difficult to understand what a specific protein is actually doing in a particular biological condition. Condition-specific causal networks that explicitly represent how protein regulatory relationships vary across contexts provide a solution to this problem, enabling more accurate functional annotation and better integration with biological knowledge graphs.

1.2 Limitations of Existing Approaches

The landscape of network inference methods reveals a persistent gap between what we can measure and what we need to understand. Static correlation-based methods, including partial correlation networks and Gaussian graphical models, excel at identifying conditional independence relationships but lack the temporal information necessary to infer directionality [5,6]. Without knowing which protein's change precedes the other, these methods cannot distinguish regulatory drivers from their downstream targets, nor can they identify mediated relationships that require temporal ordering to detect.

Differential network analysis approaches have emerged to compare networks across conditions, offering insights into context-specific associations [7]. However, these methods inherit the same causal inference limitations as their static counterparts. While they can identify which associations differ between conditions, they cannot determine whether observed changes represent direct regulatory mechanisms or merely downstream consequences of upstream perturbations. This ambiguity becomes particularly problematic when attempting to identify therapeutic targets, where distinguishing between drivers and passengers is critical.

Bayesian network approaches offer a more sophisticated framework, capable of inferring directed acyclic graphs that have been successfully applied to gene expression data [8,9]. These methods can model probabilistic dependencies and infer directionality under certain assumptions. However, most implementations do not explicitly test for or remove mediated edges, treating all inferred dependencies as equally direct. Additionally, the computational complexity of Bayesian network inference scales poorly beyond a few hundred variables, making application to proteome-scale datasets, which may include thousands of proteins, challenging. Dynamic Bayesian networks extend this framework to temporal data but rarely incorporate explicit mediation testing [10], leaving the indirect relationship problem unresolved.

Information-theoretic methods, such as mutual information and transfer entropy, provide another avenue for network inference [11,12]. These approaches can capture nonlinear relationships and infer directionality without assuming specific functional forms, making them attractive for biological systems where regulatory mechanisms may be complex. However, reliable estimation of information-theoretic quantities requires substantial amounts of data—often more than typical proteomics experiments provide. Furthermore, these methods lack straightforward frameworks for statistical significance testing and multiple comparison correction, making it difficult to distinguish true regulatory relationships from noise in finite samples. The absence of established mediation analysis procedures within the information-theoretic framework further limits their ability to identify direct relationships.

Granger causality, originally developed in econometrics and now widely used in neuroscience, offers a complementary approach based on temporal precedence and predictive power [13,14]. The core idea is as follows: if past values of variable X improve prediction of variable Y beyond what Y's own history provides, then X can be said to "Granger-cause" Y. This operational definition of causality aligns well with regulatory biology, where upstream proteins influence the abundance or activity of downstream targets over time. Despite these advantages, applications of Granger causality to proteomics have been limited [15,16], and existing implementations have not addressed the fundamental problem of indirect relationships. When protein A regulates B, and B regulates C, standard Granger causality will detect both the direct relationships (A→B, B→C) and the indirect relationship (A→C), leaving researchers with cluttered networks that obscure the true regulatory architecture.

1.3 Novel Contributions

The persistent challenge of distinguishing direct from indirect relationships in protein regulatory networks motivated the development of CausalEdge, a framework that integrates Granger causality testing with bootstrapped mediation analysis to systematically identify and remove spurious edges. This integration addresses a critical gap: while Granger causality establishes temporal precedence and predictive relationships, it does not inherently distinguish whether A→C represents a direct regulatory mechanism or an indirect effect mediated through intermediate proteins. By explicitly testing for mediation and quantifying the proportion of each relationship that operates through intermediates, CausalEdge enables data-driven decisions about which edges represent direct regulatory mechanisms.

The framework makes several advances beyond existing approaches. Rather than relying on arbitrary topology-based heuristics to prune networks (such as removing edges below a certain correlation threshold or keeping only the top-k strongest relationships), CausalEdge employs statistical hypothesis testing at every stage. Initial Granger causality tests are corrected for multiple comparisons using false discovery rate control, ensuring that the expected proportion of false positives is kept below a specified threshold [19]. Subsequently, for each edge that passes this initial filter, bootstrapped mediation analysis quantifies what proportion of the relationship operates through measured intermediate proteins, complete with confidence intervals that account for the uncertainty in these estimates [22,23]. Edges are pruned only when a substantial and statistically significant proportion of the effect is mediated, with the threshold under user control to match the stringency requirements of different applications.

Additionally, by applying CausalEdge separately to time-series data from different biological conditions such as disease versus healthy states, different drug treatments, individual subjects, or distinct cell types, researchers can generate condition-specific causal networks that explicitly capture how protein regulatory roles change with context. This capability directly addresses the challenge of molecular moonlighting, where the same protein may act as a master regulator in one condition while having minimal regulatory activity in another [24,25]. Rather than averaging across contexts to produce a single "consensus" network, CausalEdge preserves these distinctions, generating structured data suitable for integration into biological knowledge graphs. Each inferred edge carries metadata indicating not just the direction and strength of the relationship, but also the specific condition under which it was observed, the temporal lag at which the effect operates, and quantified statistical confidence.
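
To make the knowledge-graph integration concrete, an edge record of this kind might look like the following (a hypothetical sketch; the field names are illustrative, not CausalEdge's actual schema):

```python
# hypothetical edge record; field names and values are illustrative only
edge = {
    "source": "ProteinA",
    "target": "ProteinC",
    "sign": "positive",          # direction of influence
    "condition": "disease",      # biological context the network was fit on
    "lag": 2,                    # temporal lag at which the effect is strongest
    "f_stat": 14.3,
    "p_adjusted": 0.004,         # FDR-corrected p-value
}
print(edge["source"], "->", edge["target"], "under", edge["condition"])
```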

The resulting networks more accurately represent direct regulatory mechanisms, enabling improved identification of regulatory hubs—proteins that directly control many downstream targets—and regulated integration points where multiple pathways converge. By removing transitive edges that pass through intermediate proteins, network centrality measures and other topological metrics more faithfully reflect the underlying regulatory architecture rather than being inflated by indirect effects. For systems biology applications ranging from disease mechanism elucidation to therapeutic target identification, this distinction between direct and indirect relationships is not just academically interesting but practically essential. Targeting a protein with many direct regulatory connections is fundamentally different from targeting one whose apparent centrality derives primarily from being downstream of a long cascade.

🧬 Methods

2.1 Mathematical Framework

The CausalEdge framework rests on two complementary statistical approaches: Granger causality for establishing temporal precedence and directional influence, and mediation analysis for distinguishing direct effects from those that operate through intermediate variables. Together, these methods enable inference of regulatory network structure from time-series proteomics data.

2.1.1 Granger Causality Testing

The fundamental question in causal inference is whether one variable influences another. For protein time series, this translates to asking whether changes in protein A precede and help predict changes in protein B. Granger causality formalizes this intuition through a comparison of predictive models. Given time series $x_t$ and $y_t$ measured at time points $t=1,2,...T,$ we test whether past values of $x$ improve prediction of current values of $y$ beyond what $y$'s own history provides. This test proceeds by fitting two autoregressive models. The restricted model uses only the history of $y$ to predict its current value:

$y_t = α_0 + \sum_{i=1}^p α_iy_{t-i}+ϵ_t$

The unrestricted model adds the history of $x$ as additional predictors:

$y_t = α_0 + \sum_{i=1}^p α_iy_{t-i} + \sum_{i=1}^p β_ix_{t-i}+η_t$

where $p$ is the maximum lag order tested and $ϵ_t$ and $η_t$ are residual error terms. The null hypothesis that $x$ does not Granger-cause $y$ corresponds to $H_0: β_1 = β_2 = ...= β_p=0$, which is tested via an F-statistic that compares the residual variance of the two models:

$F = \frac{(RSS_r-RSS_u)/p}{RSS_u/(T-2p-1)}$

where $RSS_r$ and $RSS_u$ are the residual sums of squares for the restricted and unrestricted models, respectively. Under the null hypothesis, this statistic follows an F-distribution with degrees of freedom $(p,T-2p-1)$. A large F-statistic indicates that including $x$'s history substantially improves the prediction of $y$, suggesting a causal relationship.

The Granger causality test is implemented in the check_granger_causality() function:

import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

def check_granger_causality(x, y, max_lag=2, f_stat_threshold=10, p_value_threshold=0.05):
    """Test if x Granger-causes y."""
    try:
        xy_data = np.column_stack([y, x]) # stack arrays for bivariate analysis
        mask = ~np.isnan(xy_data).any(axis=1) # remove rows with nan
        xy_data = xy_data[mask]        
        test_result = grangercausalitytests(xy_data, maxlag=max_lag, verbose=False) # run GC across multiple lags
        f_stats = [test_result[lag][0]['ssr_ftest'][0] for lag in range(1, max_lag + 1)] # extract f-stat for each lag
        p_values = [test_result[lag][0]['ssr_ftest'][1] for lag in range(1, max_lag + 1)] # extract p-val for each lag
        
        # select the lag w/ the highest f-stat (strongest effect)
        max_f_index = np.argmax(f_stats)
        best_f = f_stats[max_f_index]
        best_p = p_values[max_f_index]
        best_lag = max_f_index + 1
        is_causal = (best_f > f_stat_threshold) and (best_p < p_value_threshold)
        return is_causal, best_f, best_p, best_lag
    except Exception:
        return False, 0, 1, 0

The choice of lag order $p$ is critical, as it determines how far back in time we look for predictive relationships. Too small a lag may miss regulatory effects that operate over longer timescales, while too large a lag reduces statistical power and may introduce spurious relationships. In practice, CausalEdge tests multiple lags $p∈\{1,2,...,p_{max}\}$ and selects the optimal lag $p^*$ as the one that yields the highest F-statistic, representing the timescale at which the regulatory effect is strongest. This data-driven approach allows the temporal dynamics of each protein pair to emerge from the data rather than being imposed a priori.

Finally, we declare that protein $X$ Granger-causes protein $Y$ if the F-statistic exceeds a threshold and the associated p-value falls below a significance level after correction for multiple testing (described in Section 2.1.2). This establishes both temporal precedence ($X$'s past predicts $Y$'s future) and predictive power (the improvement is statistically significant), providing a foundation for inferring directional regulatory relationships.

2.1.2 False Discovery Rate Correction

Proteomics experiments typically measure hundreds to thousands of proteins, leading to tens of thousands or even millions of potential protein pairs to test. With a standard significance threshold of $p < 0.05$, we would expect 5% of tests to yield false positives purely by chance, a problem that compounds dramatically with the number of tests performed. Without correction, networks inferred from proteomics data would be heavily contaminated with spurious edges.

To address this challenge, the Benjamini-Hochberg procedure for controlling the false discovery rate (FDR) is employed. Unlike family-wise error rate corrections such as Bonferroni, which control the probability of making any false positives, FDR correction controls the expected proportion of false positives among all rejected hypotheses. This distinction provides substantially greater statistical power while still ensuring that the majority of reported relationships are true positives.

Given $N$ candidate protein pairs tested for Granger causality, we obtain p-values $p_1, p_2 ,...,p_N$. The Benjamini-Hochberg procedure operates as follows:

  • First, sort the p-values in ascending order $p_{(1)} ≤ p_{(2)} ≤ ... ≤ p_{(N)}$.
  • Second, find the largest index $k$ such that $p_{(k)} ≤ \frac{k}{N}α$, where $α$ is the desired FDR level (0.05 by default).
  • Third, reject all null hypotheses corresponding to tests $1,2,...,k$. This ensures that the expected proportion of false positives among all rejected hypotheses is at most $α$, providing formal control over the false discovery rate while maintaining reasonable power to detect true relationships.
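
The three steps above can be written out directly. The sketch below is a plain-numpy implementation of the procedure (for production use, statsmodels' multipletests with method='fdr_bh' performs the same computation):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values)
    N = len(p)
    order = np.argsort(p)                         # step 1: sort ascending
    thresholds = alpha * np.arange(1, N + 1) / N  # (k / N) * alpha
    below = p[order] <= thresholds
    reject = np.zeros(N, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])            # step 2: largest k meeting the bound
        reject[order[:k + 1]] = True              # step 3: reject tests 1..k
    return reject

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(p_vals))  # only the two smallest p-values survive here
```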

FDR correction is applied using the multipletests() function from statsmodels after collecting all Granger causality test results:

from statsmodels.stats.multitest import multipletests

p_values = [r['p_value'] for r in test_results] # collect p-vals after performing all GC tests
reject, p_adjusted, _, _ = multipletests(p_values, alpha=p_value_threshold, method=fdr_method) # FDR correction (fdr_method='fdr_bh' for Benjamini-Hochberg)

# add corrected p-val to results
for i, result in enumerate(test_results):
    result['p_adjusted'] = p_adjusted[i]
    result['significant'] = reject[i]

# add only significant edges to graph
for result in test_results:
    if result['significant']:
        G.add_edge(result['cause'], result['effect'], weight=abs(result['correlation']),
            f_stat=result['f_stat'], p_value=result['p_value'], p_adjusted=result['p_adjusted'], lag=result['lag'])

The importance of this correction cannot be overstated. In preliminary analyses without FDR correction, networks inferred from time-series proteomics data contained 2-10x as many edges as networks inferred with proper correction, with the additional edges showing substantially weaker effect sizes and less biological coherence. By applying FDR correction, we ensure that the networks entering the mediation analysis stage represent genuine temporal relationships rather than statistical noise.

2.1.3 Mediation Analysis

Establishing that protein A Granger-causes protein C tells us that A's past helps predict C's future, but it does not tell us whether this effect is direct or mediated through intermediate proteins. This distinction is important for understanding regulatory architecture. Consider a pathway where A regulates B, and B in turn regulates C. Standard Granger causality will detect A→C (since A's past predicts C's future), but this edge is spurious in the sense that A affects C only indirectly through B. Including such indirect edges obscures the true regulatory structure and inflates centrality metrics for upstream proteins.

To address this, causal mediation analysis is performed, which decomposes the total effect of A on C into direct and indirect components [20,21]. The total effect, denoted $c$, represents the overall influence of A on C. This can be partitioned into a direct effect $c^{'}$ (the influence of A on C that does not pass through any measured mediator B) and an indirect effect $ab$ (the influence that operates through B). The indirect effect is itself a product of two paths: the effect of A on B (the "a-path") and the effect of B on C controlling for A (the "b-path"). These quantities are related by the mediation equation:

$c=c^{'}+ab$

The proportion of the total effect that is mediated through B is:

$ρ = \frac{ab}{c}$

This proportion serves as our key decision variable: if a large fraction of A's effect on C operates through B, then the A→C edge should be considered indirect and removed from the network. To estimate these quantities from time-series data, three regression models are fit for each potential mediation triple (A, B, C). The first model estimates the a-path by regressing B on A and lagged values of B:

$B_t = γ_0 + \sum_{i=1}^p γ_iB_{t-i} + \sum_{i=1}^p a_iA_{t-i} + ϵ_{t}^{(1)}$

The second model estimates both the b-path and the direct effect (c'-path) by regressing C on both A and B, along with lagged values of C:

$C_t = δ_0 + \sum_{i=1}^p δ_iC_{t-i} + \sum_{i=1}^p c_{i}^{'} A_{t-i} + \sum_{i=1}^p b_iB_{t-i} + ϵ_{t}^{(2)}$

The third model estimates the total effect (c-path) by regressing C on A alone:

$C_t = θ_0 + \sum_{i=1}^p θ_iC_{t-i} + \sum_{i=1}^p c_iA_{t-i} + ϵ_{t}^{(3)}$

Each path is then estimated as the mean effect across lags: $a=\frac{1}{p}\sum_{i=1}^p a_i$, $b=\frac{1}{p}\sum_{i=1}^p b_i$, and similarly for $c$ and $c^{'}$. This averaging accounts for the fact that regulatory effects may be distributed across multiple time lags rather than concentrated at a single lag.

Mediation analysis is implemented in the bootstrap_mediation_analysis() function, which fits three OLS regression models to estimate the a-path, b-path, and total effect:

import numpy as np
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS

def bootstrap_mediation_analysis(x, y, z, max_lag=2, n_bootstrap=1000, confidence_level=0.95):
    """Perform bootstrapped mediation analysis to quantify indirect effects"""
    x_vals, y_vals, z_vals = np.asarray(x), np.asarray(y), np.asarray(z)
    n = len(y_vals)

    # Create lagged variables
    y_current = y_vals[max_lag:]
    x_lagged = []
    z_lagged = []
    y_lagged = []
    
    for lag in range(1, max_lag + 1):
        x_lagged.append(x_vals[max_lag - lag:n - lag])
        z_lagged.append(z_vals[max_lag - lag:n - lag])
        y_lagged.append(y_vals[max_lag - lag:n - lag])
    
    # Path a: X → Z (effect of X on the mediator Z)
    X_x = sm.add_constant(np.column_stack(x_lagged + y_lagged))
    model_a = OLS(z_vals[max_lag:], X_x).fit()
    a_coef = np.mean(model_a.params[1:max_lag+1])
    
    # Path b: Z → Y (effect of the mediator Z on Y, controlling for X)
    X_xz = sm.add_constant(np.column_stack(x_lagged + z_lagged + y_lagged))
    model_b = OLS(y_current, X_xz).fit()
    b_coef = np.mean(model_b.params[max_lag+1:2*max_lag+1])
    
    # Path c: X → Y (total effect)
    X_total = sm.add_constant(np.column_stack(x_lagged + y_lagged))
    model_c = OLS(y_current, X_total).fit()
    c_coef = np.mean(model_c.params[1:max_lag+1])
    
    # Path c': X → Y controlling for M (direct effect)
    c_prime_coef = np.mean(model_b.params[1:max_lag+1])
    
    # Mediation effects
    indirect_effect = a_coef * b_coef
    direct_effect = c_prime_coef
    total_effect = c_coef

The key quantity for determining whether an edge should be pruned is the indirect effect $ab$. However, because this is a product of two estimated coefficients, its sampling distribution is complex and typically non-normal, making standard error propagation unreliable. To obtain valid statistical inference, CausalEdge utilizes bootstrapping.

2.1.4 Bootstrap Confidence Intervals

Bootstrapping provides a nonparametric approach to quantifying uncertainty in complex statistics like the indirect effect [22,23]. The basic idea is to resample the data many times, recompute the statistic of interest for each resample, and use the resulting distribution to construct confidence intervals. For mediation analysis, this approach has been shown to provide better coverage properties than normal-theory methods, particularly for the indirect effect which follows a skewed distribution [23].

For each mediation test, the bootstrap procedure proceeds as follows. First, the time series is resampled with replacement $B$ times (default $B=1000$), preserving the temporal structure by sampling entire time points together. For each bootstrap sample $b=1,2,...,B$, CausalEdge fits Models 1 and 2 from Section 2.1.3 to obtain estimates $\hat{a}^{(b)}$ and $\hat{b}^{(b)}$, then calculates the bootstrap replicate of the indirect effect $\hat{ab}^{(b)} = \hat{a}^{(b)}\hat{b}^{(b)}$. After obtaining $B$ such replicates, CausalEdge constructs the 95% confidence interval using the percentile method. Mediation is considered statistically significant if this confidence interval does not contain zero, ensuring that we have strong evidence for an indirect effect before using it as grounds to remove an edge from the network.

Bootstrap confidence intervals are computed within the bootstrap_mediation_analysis() function by resampling the data 1000 times:

# bootstrap confidence intervals for indirect effect
bootstrap_indirect = []

for _ in range(n_bootstrap):
    indices = np.random.choice(len(y_current), size=len(y_current), replace=True) # resample w/ replacement (preserves temporal structure)
    
    # resample all data
    y_boot = y_vals[indices]
    x_boot = x_vals[indices]
    z_boot = z_vals[indices]
    
    # recreate lagged variables for bootstrap sample
    # note: [lagged variable creation code omitted for brevity]
    
    # fit models on bootstrap sample
    model_a_boot = OLS(z_boot[max_lag:], X_x_boot).fit()
    a_boot = np.mean(model_a_boot.params[1:max_lag+1])
    model_b_boot = OLS(y_curr_boot, X_xz_boot).fit()
    b_boot = np.mean(model_b_boot.params[max_lag+1:2*max_lag+1])
    
    # store the indirect effect for this bootstrap sample
    bootstrap_indirect.append(a_boot * b_boot)

bootstrap_indirect = np.array(bootstrap_indirect)

# calculate 95% confidence intervals using percentile method
alpha = 1 - confidence_level
ci_lower = np.percentile(bootstrap_indirect, 100 * alpha / 2)
ci_upper = np.percentile(bootstrap_indirect, 100 * (1 - alpha / 2))

# mediation is significant if CI doesn't include 0
is_significant = not (ci_lower <= 0 <= ci_upper)

# calculate proportion mediated
if abs(total_effect) > 1e-10:
    proportion_mediated = abs(indirect_effect / total_effect)
else:
    proportion_mediated = 0

The bootstrap distribution also provides valuable information about the magnitude and variability of mediation effects. A narrow confidence interval indicates that the proportion mediated is precisely estimated, while a wide interval suggests greater uncertainty. This information can inform decisions about edge pruning: edges with highly significant mediation (narrow CI far from zero) can be removed with confidence, while edges with marginally significant mediation warrant more careful consideration.

2.1.5 Edge Pruning Criterion

The final decision about whether to remove an edge from the network integrates the statistical evidence from mediation analysis with user-specified thresholds that reflect the application's stringency requirements. An edge A→C is marked for removal if three conditions are jointly satisfied:

  • First, there exists at least one potential mediator B such that both A→B and B→C are present in the network after Granger causality testing and FDR correction. This ensures that we only consider mediation through proteins that themselves have established temporal relationships with both the cause and effect.
  • Second, the indirect effect through B must be statistically significant, as determined by the bootstrap confidence interval not containing zero. This requirement prevents removal of edges based on mediation estimates that could plausibly be zero due to sampling variability.
  • Third, the proportion of the total effect that is mediated must exceed a user-specified threshold $ρ > θ$, where $θ∈[0,1]$. The default value of $θ$ is 0.5, which means that edges are removed only if more than half of their effect operates indirectly through a mediator. This threshold balances network pruning against retention of partially mediated relationships that may reflect parallel direct and indirect mechanisms. Users can adjust this threshold based on their application: conservative analyses aiming to retain most edges might use $θ=0.7$ or higher, while aggressive pruning for hub identification might employ $θ=0.3$ or lower.

The edge pruning logic is implemented in the detect_and_remove_mediated_edges() function, which identifies potential mediators using network topology:

def detect_and_remove_mediated_edges(G, data_transposed, max_lag=2, p_value_threshold=0.05, mediation_threshold=0.5, n_bootstrap=1000):
    """Detect mediated relationships using bootstrapped mediation analysis"""
    data_dict = {col: np.asarray(data_transposed[col]) for col in data_transposed.columns} # map protein name -> time series
    edges_to_remove = []
    mediation_info = {}
    for cause, effect in G.edges():
        mediators = set(G.successors(cause)) & set(G.predecessors(effect)) #  find potential mediators
        if not mediators:
            continue
        # test ea/ potential mediator w/ bootstrapping
        best_mediator = None
        best_proportion = 0
        best_result = None

        for mediator in mediators:
            result = bootstrap_mediation_analysis(data_dict[cause],data_dict[effect],
                data_dict[mediator], max_lag=max_lag, n_bootstrap=n_bootstrap)
            
            # check if this mediator explains more than previous ones
            if result and result['is_significant'] and result['proportion_mediated'] > best_proportion:
                best_proportion = result['proportion_mediated']
                best_mediator = mediator
                best_result = result
        
        # remove edge if mediation threshold exceeded
        if best_result and best_result['is_significant'] and best_proportion > mediation_threshold:
            edges_to_remove.append((cause, effect))
            mediation_info[(cause, effect)] = { 'is_mediated': True,'mediator': best_mediator, 'proportion_mediated': best_proportion,
            'indirect_effect': best_result['indirect_effect'], 'ci_lower': best_result['ci_lower'], 'ci_upper': best_result['ci_upper']}
    
    # remove mediated edges from graph
    G.remove_edges_from(edges_to_remove)
    return G, mediation_info

This multi-stage criterion ensures that edges are only removed when there is strong statistical evidence for substantial mediation, preventing over-pruning while still effectively identifying and removing spurious indirect relationships. The result is a cleaner network that more accurately represents direct regulatory mechanisms.

2.2 Algorithmic Pipeline

The complete CausalEdge framework translates the mathematical principles described above into a practical computational pipeline consisting of five stages. Each stage builds on the previous one, progressively refining the network from initial correlation-based filtering through final mediation-pruned output.

Stage 1: Data Preprocessing and Correlation Filtering

The pipeline begins with a time-series proteomics data matrix $X∈\mathbb{R}^{T×P}$, where $T$ is the number of time points and $P$ is the number of proteins. Real proteomics data typically contain missing values due to detection limits, sample preparation variability, or stochastic sampling in mass spectrometry. To handle these, CausalEdge first removes proteins with more than 50% missing observations, as these provide insufficient information for reliable time-series modeling. For the remaining proteins, CausalEdge imputes missing values using linear interpolation, which assumes that protein abundance changes smoothly between observed time points—a reasonable approximation for most biological processes measured at appropriate temporal resolution.

Data preprocessing and missing value handling are implemented in the main CausalEdge() function:

data_transposed = data.set_index(id_column).T # transpose data so rows are time points and columns are proteins (note: can comment out)

# handle missing values
missing_pct = data_transposed.isnull().mean()
biomarkers_to_keep = missing_pct[missing_pct < 0.5].index
data_transposed = data_transposed[biomarkers_to_keep]

# interpolate remaining missing values
data_transposed = data_transposed.interpolate(method='linear', axis=0, limit_direction='both')

With a complete data matrix in hand, the next step is to identify candidate protein pairs that warrant detailed causality testing. Testing all $P(P-1)$ possible directed pairs would be computationally expensive and unnecessary, as many protein pairs show no co-variation and are unlikely to have causal relationships. To reduce this burden, CausalEdge calculates pairwise Pearson correlations $r_{ij} = \mathrm{cor}(x_i, x_j)$ for all protein pairs and selects only those with absolute correlation above a pre-defined threshold (default = 0.75).

Correlation-based filtering efficiently reduces the candidate pair space using NumPy operations:

corr_array = np.corrcoef(data_array.T) # calculate correlation matrix

# pre filter pairs by correlation threshold
candidate_pairs = []
abs_corr = np.abs(corr_array)
high_corr_mask = abs_corr >= corr_threshold

# exclude self-correlations
np.fill_diagonal(high_corr_mask, False)

# extract high correlation pairs
i_indices, j_indices = np.where(high_corr_mask)

for idx in range(len(i_indices)):
    i, j = i_indices[idx], j_indices[idx]
    candidate_pairs.append((i, j, corr_array[i, j]))

This correlation-based filtering dramatically reduces the number of candidate pairs while retaining those most likely to have genuine regulatory relationships. It is important to emphasize that high correlation does not imply causation; it serves purely as a pre-filter to focus computational effort on promising candidates. The subsequent Granger causality tests will determine directionality and temporal precedence, distinguishing regulatory relationships from co-variation.

Stage 2: Granger Causality Testing with FDR Correction

For each candidate pair $(i,j)$ identified in Stage 1, CausalEdge performs Granger causality testing as described in Section 2.1.1. This involves fitting autoregressive models with and without the potential cause's history, computing an F-statistic that quantifies the improvement in prediction, and calculating the associated p-value. Multiple lag orders are tested, and the lag order that yields the highest F-statistic is recorded as optimal. This optimal lag represents the timescale at which protein $i$'s influence on protein $j$ is strongest, providing information about the dynamics of the regulatory relationship.
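
The core of such a test can be sketched directly with least squares. The helper below is an illustrative stand-in for the `check_granger_causality()` routine described in Section 2.1.1; its name, structure, and variable names are assumptions for demonstration, not the exact CausalEdge implementation:

```python
import numpy as np
from scipy import stats

def granger_f_test(x, y, lag):
    """F-test for whether past values of x improve prediction of y.
    Illustrative sketch, not the exact CausalEdge implementation."""
    T = len(y)
    Y = y[lag:]
    # lagged design matrices: column k holds the series shifted by k steps
    y_lags = np.column_stack([y[lag - k:T - k] for k in range(1, lag + 1)])
    x_lags = np.column_stack([x[lag - k:T - k] for k in range(1, lag + 1)])
    ones = np.ones((T - lag, 1))

    X_restricted = np.hstack([ones, y_lags])    # y's own history only
    X_full = np.hstack([ones, y_lags, x_lags])  # plus x's history

    # residual sums of squares for both models
    rss_r = np.sum((Y - X_restricted @ np.linalg.lstsq(X_restricted, Y, rcond=None)[0]) ** 2)
    rss_f = np.sum((Y - X_full @ np.linalg.lstsq(X_full, Y, rcond=None)[0]) ** 2)

    df1 = lag                              # number of restrictions tested
    df2 = (T - lag) - X_full.shape[1]      # residual degrees of freedom
    f_stat = ((rss_r - rss_f) / df1) / (rss_f / df2)
    p_value = 1 - stats.f.cdf(f_stat, df1, df2)
    return f_stat, p_value
```

On a synthetic series where y is driven by lagged x, this recovers the x→y direction with a far larger F-statistic than the reverse test.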

For efficiency, protein data arrays are pre-extracted before iterating through candidate pairs:

data_arrays = {i: data_transposed.iloc[:, i].values for i in range(n_biomarkers)} # pre-extract data arrays to avoid repeated DataFrame indexing

# test each candidate pair using the check_granger_causality() function (see Section 2.1.1)
for test_idx, (i, j, corr_value) in enumerate(candidate_pairs):
    is_causal, f_stat, p_value, lag = check_granger_causality(
        data_arrays[i], data_arrays[j], max_lag=max_lag)
    # store results for subsequent FDR correction

CausalEdge also records the sign of the correlation between the protein pair, which is later used to annotate edges as activating (positive sign) or inhibitory (negative sign). While Granger causality itself is agnostic about the sign of the effect, the correlation sign provides a simple heuristic for distinguishing these regulatory modes in most cases.

After testing all candidate pairs, CausalEdge applies Benjamini-Hochberg FDR correction at level α (typically 0.05) as described in Section 2.1.2. This adjusts the p-values to control the expected proportion of false positives among all discoveries. Only protein pairs with FDR-adjusted p-values below α are retained for network construction. This filtering step greatly reduces the number of edges compared to using uncorrected p-values, substantially improving the specificity of the inferred network while maintaining reasonable sensitivity to true relationships.
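
The adjustment step maps directly onto statsmodels' multipletests. A minimal sketch, with made-up p-values standing in for the Granger test results:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# raw p-values from the Granger tests (illustrative values only)
p_values = np.array([0.001, 0.008, 0.020, 0.041, 0.300, 0.900])

# Benjamini-Hochberg adjustment at alpha = 0.05
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.4f}  keep edge: {sig}")
```

Note that the fourth pair (raw p = 0.041) would pass an uncorrected 0.05 cutoff but is rejected after adjustment, which is exactly the specificity gain described above.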

Stage 3: Network Construction

After compiling a filtered set of Granger-causal relationships, CausalEdge constructs a directed graph $G=(V,E)$ where nodes $V$ represent proteins and directed edges $E$ represent established causal relationships. Specifically, a directed edge $(i,j)∈E$ is created if protein $i$ Granger-causes protein $j$ with FDR-adjusted statistical significance.

Each edge is annotated with attributes that capture important aspects of the relationship. The weight is set to $|r_{ij}|$, the absolute correlation strength, while the sign of $r_{ij}$ indicates whether the relationship is positive (both proteins increase/decrease together, suggesting activation) or negative (proteins move in opposite directions, suggesting inhibition). The lag is $p_{ij}^{*}$, the optimal temporal lag at which the causal effect is strongest. Finally, CausalEdge stores the statistical test results, including the F-statistic, the original p-value, and the FDR-adjusted p-value, enabling downstream assessment of relationship strength and statistical confidence.

Network construction uses NetworkX to create a directed graph with rich edge annotations:

# initialize directed graph
G = nx.DiGraph()
G.add_nodes_from(biomarker_list)

# after FDR correction, add only significant edges
for result in test_results:
    if result['significant']:  # Based on FDR-adjusted p-value
        G.add_edge(result['cause'], result['effect'],
            weight=abs(result['correlation']), correlation=float(result['correlation']),
            color='red' if result['correlation'] < 0 else 'blue',  # red=inhibitory, blue=activating
            f_stat=result['f_stat'], p_value=result['p_value'],
            p_adjusted=result['p_adjusted'], lag=result['lag'])

# remove isolated nodes (no edges)
G.remove_nodes_from(list(nx.isolates(G)))

At this stage, the network represents all significant temporal predictive relationships, but it still contains indirect edges where protein A influences protein C only through intermediate protein B. The next stage addresses this problem through systematic mediation testing.

Stage 4: Mediation Analysis and Edge Pruning

For each edge $(i,k)∈E$, CausalEdge identifies potential mediators using network topology: $M_{ik}=\{j : (i,j)∈E \text{ and } (j,k)∈E\}$. These are proteins $j$ such that there exists a two-step path from $i$ to $k$ through $j$, making $j$ a candidate for mediating the $i→k$ relationship.

For each potential mediator $j∈M_{ik}$, CausalEdge performs bootstrapped mediation analysis as described in Sections 2.1.3-2.1.4. This involves fitting regression models to estimate the a-path (effect of $i$ on $j$), the b-path (effect of $j$ on $k$ controlling for $i$), and the total effect (effect of $i$ on $k$), then using bootstrap resampling to construct confidence intervals for the indirect effect $ab$. CausalEdge then calculates the proportion mediated $p_{ijk} = ab/c$ and determines whether mediation is statistically significant based on whether the confidence interval excludes zero.
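
A minimal, self-contained sketch of this procedure is shown below; the function and variable names are illustrative stand-ins, not CausalEdge's exact implementation:

```python
import numpy as np

def bootstrap_mediation(x, m, y, n_bootstrap=1000, seed=0):
    """Illustrative sketch of bootstrapped mediation analysis: estimates
    the indirect effect a*b, its percentile CI, and proportion mediated."""
    rng = np.random.default_rng(seed)
    n = len(x)

    def paths(xi, mi, yi):
        a = np.polyfit(xi, mi, 1)[0]                 # a-path: m ~ x
        X = np.column_stack([np.ones(len(xi)), mi, xi])
        b = np.linalg.lstsq(X, yi, rcond=None)[0][1] # b-path: y ~ m + x (coef on m)
        c = np.polyfit(xi, yi, 1)[0]                 # total effect: y ~ x
        return a * b, c

    ab, c = paths(x, m, y)
    boot = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        idx = rng.integers(0, n, n)                  # resample with replacement
        boot[i], _ = paths(x[idx], m[idx], y[idx])
    ci_lower, ci_upper = np.percentile(boot, [2.5, 97.5])
    proportion = ab / c if c != 0 else 0.0
    return ab, (ci_lower, ci_upper), proportion
```

On a simulated chain where $i$ affects $k$ only through $j$, the CI for $ab$ excludes zero and the proportion mediated approaches 1, so the direct edge would be pruned.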

If any mediator $j$ satisfies the pruning criteria specified in Section 2.1.5 (significant indirect effect and proportion mediated exceeding threshold θ), CausalEdge marks the edge $(i,k)$ for removal and records the mediation statistics for downstream analysis and interpretation. This procedure is repeated for all edges in the network. Importantly, CausalEdge uses an "any mediator" criterion, meaning that if any single mediator explains a sufficient proportion of the relationship, the edge is removed.

The entire mediation testing and pruning workflow is encapsulated in a single function call (full implementation shown in Section 2.1.5):

# perform mediation analysis if requested
if remove_mediated and G.number_of_edges() > 0:
    G, mediation_info = detect_and_remove_mediated_edges(G, data_transposed, max_lag=max_lag,
        p_value_threshold=p_value_threshold, mediation_threshold=mediation_threshold,  # default 0.5
        n_bootstrap=n_bootstrap)

After testing all edges, those marked for mediation are removed, yielding the pruned network $G^{'}=(V^{'},E^{'})$. Finally, CausalEdge removes isolated nodes—proteins that no longer have any incoming or outgoing edges after pruning—as these provide no information about regulatory relationships. The result is a final network $G^{'}$ that represents direct regulatory relationships with high statistical confidence.

Stage 5: Network Analysis

After obtaining a pruned network, CausalEdge computes standard network metrics that characterize its structure and identifies biologically important proteins. The pipeline automatically calculates and reports key network statistics:

# calculate in and out degree
out_degree = dict(G.out_degree())
in_degree = dict(G.in_degree())

print("\nTop 5 proteins by out-degree (causes many others):")
sorted_out = sorted(out_degree.items(), key=lambda x: x[1], reverse=True)
for node, degree in sorted_out[:5]:
    if degree > 0:
        print(f"  {node}: {degree} outgoing edges")

print("\nTop 5 proteins by in-degree (influenced by many others):")
sorted_in = sorted(in_degree.items(), key=lambda x: x[1], reverse=True)
for node, degree in sorted_in[:5]:
    if degree > 0:
        print(f"  {node}: {degree} incoming edges")

# identify strongest causal relationships
edges_list = [(u, v, data['f_stat'], data['correlation'], data['p_value'],
               data.get('p_adjusted', data['p_value']), data['lag'])
              for u, v, data in G.edges(data=True)]
edges_list.sort(key=lambda x: x[2], reverse=True)

print("\nTop 10 strongest causal relationships:")
for u, v, f_stat, corr, p_val, p_adj, lag in edges_list[:min(10, len(edges_list))]:
    direction = "→" if corr > 0 else "⊣"
    print(f"  {u} {direction} {v}: F={f_stat:.2f}, p_adj={p_adj:.5f}, lag={lag}")

Additionally, strongly connected components (subgraphs where every node can reach every other node) are identified, as they may represent tightly coupled regulatory modules. Path analysis is also performed to trace regulatory cascades between proteins of interest, providing systems-level insights that complement the individual edge-level information captured during network inference.
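
Both analyses map onto standard NetworkX calls. A minimal sketch on a toy network (the node names and edges are invented for illustration):

```python
import networkx as nx

# toy pruned network: a feedback loop feeding a downstream cascade
G = nx.DiGraph([('A', 'B'), ('B', 'C'), ('C', 'A'),
                ('C', 'D'), ('D', 'E')])

# strongly connected components with >1 node may be regulatory modules
modules = [c for c in nx.strongly_connected_components(G) if len(c) > 1]
print("modules:", modules)

# trace a regulatory cascade between two proteins of interest
path = nx.shortest_path(G, source='A', target='E')
print("cascade:", " -> ".join(path))
```

Here the A/B/C cycle is recovered as a single module, and the cascade from A to E is traced through it.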

2.3 Implementation Details

CausalEdge is implemented in Python 3.8+ using widely-adopted scientific computing libraries. Data manipulation relies on numpy for efficient array operations and pandas for structured data handling. Statistical modeling employs statsmodels for Granger causality testing and regression models, with scipy providing additional statistical functions including bootstrap sampling and percentile calculations. Graph construction and analysis utilize networkx, a library offering comprehensive network algorithms and visualization capabilities.

The implementation uses the following core libraries imported at the top of the script:

import pandas as pd
import numpy as np
import networkx as nx
from statsmodels.tsa.stattools import grangercausalitytests
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.multitest import multipletests

Additionally, the main analysis function provides a user-friendly interface with preset default parameters:

G, causal_df, data_transposed, mediation_info, test_results = CausalEdge(
    data=data,
    id_column='Protein.Names',
    corr_threshold=0.75,           # Correlation pre-filter
    f_stat_threshold=10,           # F-statistic threshold
    p_value_threshold=0.05,        # FDR significance level
    max_lag=2,                     # Maximum temporal lag to test
    remove_mediated=True,          # Enable mediation pruning
    mediation_threshold=0.5,       # Remove if >50% mediated
    n_bootstrap=1000,              # Bootstrap samples for CI
    fdr_method='fdr_bh'            # Benjamini-Hochberg FDR
)

🧬 Discussion

3.1 Advantages of the CausalEdge Framework

The CausalEdge framework addresses several challenges in network inference from time-series proteomics data. First, Granger causality provides temporal precedence, establishing that changes in protein A precede changes in protein B rather than merely correlating with them. This temporal ordering is important for distinguishing regulatory drivers from their downstream targets, a distinction that static correlation-based methods cannot make. When protein A's abundance rises before protein B's, and A's past values improve prediction of B's future values, we have much stronger grounds for inferring that A regulates B than if we merely observe that their abundances correlate across samples.

The explicit removal of mediated edges through bootstrapped mediation analysis represents CausalEdge's core innovation. Existing network inference methods, whether based on correlation, Bayesian networks, or information theory, typically report all significant relationships without distinguishing direct from indirect effects. The result is cluttered networks where upstream proteins appear highly central simply because they sit at the top of long regulatory cascades. By systematically identifying edges whose effects operate primarily through intermediate proteins and removing them, CausalEdge produces networks where centrality metrics genuinely reflect direct regulatory influence.

Rather than making binary decisions about whether to keep or remove edges, CausalEdge quantifies the proportion of each relationship that operates through mediators. This quantification provides valuable biological information. A partially mediated edge, say, 30% indirect, may represent a regulatory relationship where both direct and indirect mechanisms operate in parallel. Perhaps protein A both directly influences C and also influences C indirectly through B, with both mechanisms contributing to the observed correlation. Removing such an edge would discard genuine biological signal. Conversely, a fully mediated edge where 90% or more of the effect operates indirectly is most likely spurious and should be removed. By computing mediation proportions with bootstrap confidence intervals, CausalEdge enables informed decisions that respect biological complexity while still achieving network simplification. Additionally, the combination of FDR correction for multiple testing and bootstrap confidence intervals for mediation effects provides statistical control throughout the inference process.

For systems biology, the ability to generate condition-specific or subject-specific causal networks opens new possibilities for understanding context-dependent regulation. By applying CausalEdge separately to time-series data from different conditions, researchers can ask not just "what regulates what" but "what regulates what in this specific context?" This addresses the challenge of molecular moonlighting directly, revealing that the same protein may function as a master regulator in disease but have minimal regulatory activity in health, or vice versa. Such context-specific networks can be integrated into biological knowledge graphs as condition-annotated edges, enabling sophisticated queries about how protein function varies across biological states and supporting precision medicine approaches where therapeutic strategies must be tailored to individual regulatory architectures.
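
As a minimal illustration of this kind of cross-condition comparison (the networks and protein names below are invented for the example, not CausalEdge output):

```python
import networkx as nx

# networks inferred separately per condition (edges are illustrative)
G_disease = nx.DiGraph([('TP53', 'MDM2'), ('TP53', 'CDKN1A'), ('MYC', 'CCND1')])
G_healthy = nx.DiGraph([('TP53', 'MDM2')])

# edges whose presence is condition-specific
disease_only = set(G_disease.edges()) - set(G_healthy.edges())
print("disease-specific regulation:", sorted(disease_only))

# context-dependent regulatory role: out-degree per condition
for node in ['TP53', 'MYC']:
    d_out = G_disease.out_degree(node) if node in G_disease else 0
    h_out = G_healthy.out_degree(node) if node in G_healthy else 0
    print(f"{node}: out-degree {d_out} in disease vs {h_out} in health")
```

Condition-specific edge sets like `disease_only` are exactly what would be written into a knowledge graph as condition-annotated relationships.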

3.2 Limitations and Assumptions

Like all statistical methods, CausalEdge rests on assumptions that may be violated in real biological systems. The Granger causality framework as implemented here assumes linear autoregressive relationships between protein abundances. While this assumption is computationally tractable and often reasonable as a first-order approximation, true biological regulation can be highly nonlinear. Cooperative binding, threshold effects, and switch-like responses mediated by positive feedback are common in regulatory networks and may not be well-captured by linear models. The result would be missed edges where genuine nonlinear regulatory relationships exist but cannot be detected by linear Granger tests. Extensions using nonlinear prediction methods—neural networks, Gaussian processes, or nonparametric approaches—could address this limitation, though at the cost of increased computational complexity, reduced interpretability, and potentially greater susceptibility to overfitting in the moderate sample sizes typical of proteomics experiments.

CausalEdge assumes that time series are stationary, meaning their statistical properties remain constant over time. This assumption can be violated in experiments with strong trends or oscillations. For example, if protein abundances gradually increase throughout an experiment due to cell growth rather than regulation, or if circadian rhythms introduce periodic dynamics, standard Granger tests may produce spurious causality. Preprocessing steps such as detrending, differencing, or explicitly modeling periodic components can mitigate these issues, though such preprocessing itself introduces assumptions and can alter the interpretation of results.

The framework requires sufficient temporal resolution to capture regulatory dynamics. If sampling intervals are too coarse relative to the speed of protein regulation, true causal relationships may not be detectable because the predictor and outcome appear nearly simultaneous. This is a fundamental limitation of any temporal method: one cannot infer the order of events that occur between measurements. I recommend at least 20 time points for reliable Granger testing, with sampling intervals matched to the timescale of the biological process under study.

Unmeasured confounders represent another challenge for all observational causal inference. If an unmeasured protein Z causes both A and B, Granger causality may incorrectly infer that A causes B (or vice versa) because their correlated changes actually reflect their shared dependence on Z. Mediation analysis partially mitigates this by testing whether relationships can be explained by measured intermediate proteins, but it cannot account for proteins not included in the dataset. This highlights the importance of comprehensive protein coverage: the more complete the proteomics measurements, the less likely that important mediators or confounders are missing. It also suggests that CausalEdge results should be viewed as hypotheses requiring experimental validation rather than definitive proof of causality.

Computational cost can become prohibitive for very large protein panels. Testing all ordered pairs among 1000 proteins yields roughly a million directed pairs ($P(P-1) = 999{,}000$) to test, each requiring regression model fitting across multiple lags. Mediation analysis for each surviving edge then requires bootstrapping, multiplying the computational burden. While correlation-based filtering substantially reduces the number of pairs that reach Granger testing, and parallel processing can provide linear speedup, applying CausalEdge to datasets with thousands of proteins and hundreds of time points pushes current computational limits. Future work on more efficient screening procedures or approximations could extend scalability.

Finally, the mediation analysis assumes that effects can be decomposed into separate direct and indirect components without strong interactions between the cause and mediator. If protein A and mediator B have synergistic effects on target C—such that their combined effect is much larger than the sum of their individual effects—the standard mediation model may not accurately represent the biology. More complex mediation models with interaction terms could address this, though estimating interaction effects reliably requires larger sample sizes than typical proteomics experiments provide.

3.3 Applications and Extensions

The immediate application domain for CausalEdge is time-series proteomics measuring hundreds to thousands of proteins across experimental time courses. Drug treatment responses provide a natural use case: by profiling protein dynamics after administering a compound, researchers can map how therapeutic interventions rewire regulatory networks and identify potential mechanisms of action or resistance. Disease progression studies, tracking patients longitudinally as conditions evolve, can reveal regulatory changes that drive pathological transitions, potentially identifying early warning signals or intervention points.

As wearable molecular biosensors mature to enable continuous protein monitoring in vivo, the temporal resolution available for CausalEdge analysis will increase dramatically. Current proteomics experiments typically sample every few hours to days; future biosensors may enable minute-to-minute measurements of key proteins in blood or interstitial fluid. This high-temporal-resolution data will be particularly well-suited for Granger causality analysis, enabling real-time inference of regulatory networks as they respond to meals, exercise, stress, infections, and other perturbations in daily life. The result could be personalized regulatory network maps that capture how an individual's biology responds to their specific environment and lifestyle.

The validation case study highlighted an application with particularly high impact potential: generating condition-specific and subject-specific networks that reveal context-dependent protein function. By applying CausalEdge separately to data from different biological contexts—healthy versus diseased tissue, before versus after treatment, different genetic backgrounds, distinct developmental stages—researchers can build libraries of context-annotated regulatory networks. These networks can be integrated into biological knowledge graphs where proteins are nodes, regulatory relationships are edges, and edge metadata specifies the conditions under which each relationship was observed. This enables queries impossible with static databases: "What does protein X regulate in condition Y?" "Which proteins have the most dramatically different regulatory roles across contexts?" "What regulatory rewiring distinguishes responders from non-responders to therapy?"

Such condition-specific networks directly address molecular moonlighting by making context-dependence explicit rather than averaging it away. A protein might appear as a master regulator with high out-degree in disease networks but have zero out-degree in healthy networks, revealing that its regulatory function is disease-specific. This information could guide therapeutic strategies: targeting such a protein might modulate disease processes while sparing normal physiology, reducing side effects. Conversely, a protein with constitutively high out-degree across all contexts might be a poor therapeutic target if inhibition would disrupt normal function as much as pathological function.

The framework can be extended to multi-omics integration by testing Granger causality across data types. Combining transcriptomics, proteomics, and metabolomics time series would enable inference of multi-layer regulatory networks capturing how changes in mRNA abundance drive changes in protein levels, which in turn affect metabolite concentrations. Such multi-scale networks would reveal regulatory cascades spanning molecular layers, providing a more complete picture of how cells respond to perturbations. The mediation analysis component becomes even more crucial in this setting, as indirect effects propagating across multiple molecular layers can create many spurious edges if not properly pruned.

While the current implementation uses linear autoregressive models for computational tractability, the mediation testing framework is general and could accommodate nonlinear prediction methods. Neural Granger causality, where neural networks replace linear regressions in the Granger test, has shown promise for capturing complex nonlinear dynamics. Gaussian process models offer another route to nonlinearity while quantifying uncertainty. Adapting these methods within the CausalEdge framework would extend applicability to systems with strong nonlinearities, though careful attention to overfitting and model selection would be required given the moderate sample sizes of typical proteomics datasets.

Integration with prior biological knowledge offers another avenue for enhancement. Protein-protein interaction databases, pathway databases, and literature-derived regulatory relationships could be incorporated as priors or constraints. Edges contradicted by known biology—such as regulatory relationships between proteins in different cellular compartments with no known trafficking mechanism—could be flagged for scrutiny or filtered out. Conversely, edges supported by multiple lines of evidence (detected by CausalEdge and present in literature-curated databases) could receive higher confidence scores. This integration would help distinguish genuine novel discoveries from potential artifacts.

🧬 Contributing and Support

CausalEdge is an open-source project and welcomes contributions from the community. If you encounter issues, have suggestions for improvements, or would like to contribute to the project, feel free to reach out: evanpeikon@gmail.com.

About

A Granger Causality and Mediation Framework for Inferring Direct Causal Networks in Time-Series Proteomics
