Network inference from time-series proteomics data remains a challenge in systems biology, with most approaches unable to distinguish direct regulatory relationships from indirect associations mediated by intermediate proteins. CausalEdge addresses this challenge by combining Granger Causality testing with bootstrapped mediation analysis to construct directed causal networks representing direct regulatory relationships. This model operates though a four-stage pipeline:
- First, CausalEdge performs initial filtering to find protein pairs that vary together, which narrows down candidate pairs for testing.
- Second, it performs Granger Causality tests for each candidate pair to determine if past values of Protein
$A$ help predict future values of Protein$B$ better than Protein$B$ 's history alone, establishing temporal precedence with false discovery rate correction. - Third, it builds a directed network graph where proteins are represented as nodes and directed edges represent causal relationships, with edges indicating where protein A positively or negatively causally influences changes in protein
$B$ . - Fourth, it performs mediation analysis to determine is observed causal relationships are mediated by another intermediate variable. For example, if protein
$A$ causally influecned both protein$B$ and$C$ , and protein$B$ causally influences protein$C$ , it's possible that protein$A$ 's effect on protein$C$ is indirect and mediated by protein$B$ . In these cases, conditional Granger Causality tests are performed to assess whether the$A$ →$C$ relationship becomes non-signficant when controlling for$B$ , and if so, the edge between nodes$A$ and$C$ is pruned.
By quantifying the proportion of causal effect mediatd using bootstrap confidence intervals, CausalEdge provides statistically rigorous identification of spurious edges, yielding clearner, high confidence, networks that more accurately reflect direct regulatory relationships. This model is applicable to time-series proteomics data from from traditional high-throughput platforms and emerging wearable molecular biosensors, enabling more interpretable systems-level analysis of protein regulatory dynamics. An important application of CausalEdge is generating condition-specific or subject-specific causal relationship data that can be integrated into biological knowledge graphs. This capability helps address the challenge of molecular moonlighting, where a given protein's function depends on context, by explicitly capturing how protein regulatory roles vary across different biological states.
Understanding causal and regulatory relationships among proteins is fundamental to systems biology, with applications ranging from disease mechanism elucidation to therapeutic target identification. While high-throughput proteomics technologies have enabled comprehensive measurement of protein abundance across conditions and time points, inferring the underlying regulatory network structure remains challenging. Traditional approaches based on correlation or differential expression analysis can identify co-varying proteins but cannot establish distinguish direct regulatory relationships from indirect associations [1,2].
The distinction between direct and indirect relationships has important implications for biological interpretation. Consider a scenario where protein A regulates protein B, and protein B in turn regulates protein C (A→B→C). Standard network inference methods would typically report edges for A→B, B→C, and A→C. However, the A→C relationship is indirect, mediated entirely through B. Including such indirect edges in network representations obscures the true regulatory architecture, complicates hub identification, and can mislead downstream pathway analysis and therapeutic targeting efforts [3,4].
A related challenge in proteomics is molecular moonlighting; the phenomenon where individual proteins perform different functions depending on cellular context, localization, post-translational modifications, or disease state [24,25]. Static protein interaction databases cannot capture this context-dependence, making it difficult to understand what a specific protein is actually doing in a particular biological condition. Condition-specific causal networks that explicitly represent how protein regulatory relationships vary across contexts provide a solution to this problem, enabling more accurate functional annotation and better integration with biological knowledge graphs.
The landscape of network inference methods reveals a persistent gap between what we can measure and what we need to understand. Static correlation-based methods, including partial correlation networks and Gaussian graphical models, excel at identifying conditional independence relationships but lack the temporal information necessary to infer directionality [5,6]. Without knowing which protein's change precedes the other, these methods cannot distinguish regulatory drivers from their downstream targets, nor can they identify mediated relationships that require temporal ordering to detect.
Differential network analysis approaches have emerged to compare networks across conditions, offering insights into context-specific associations [7]. However, these methods inherit the same causal inference limitations as their static counterparts. While they can identify which associations differ between conditions, they cannot determine whether observed changes represent direct regulatory mechanisms or merely downstream consequences of upstream perturbations. This ambiguity becomes particularly problematic when attempting to identify therapeutic targets, where distinguishing between drivers and passengers is critical.
Bayesian network approaches offer a more sophisticated framework, capable of inferring directed acyclic graphs that have been successfully applied to gene expression data [8,9]. These methods can model probabilistic dependencies and infer directionality under certain assumptions. However, most implementations do not explicitly test for or remove mediated edges, treating all inferred dependencies as equally direct. Additionally, the computational complexity of Bayesian network inference scales poorly beyond a few hundred variables, making application to proteome-scale datasets, which may include thousands of proteins, challenging. Dynamic Bayesian networks extend this framework to temporal data but rarely incorporate explicit mediation testing [10], leaving the indirect relationship problem unresolved.
Information-theoretic methods, such as mutual information and transfer entropy, provide another avenue for network inference [11,12]. These approaches can capture nonlinear relationships and infer directionality without assuming specific functional forms, making them attractive for biological systems where regulatory mechanisms may be complex. However, reliable estimation of information-theoretic quantities requires substantial amounts of data—often more than typical proteomics experiments provide. Furthermore, these methods lack straightforward frameworks for statistical significance testing and multiple comparison correction, making it difficult to distinguish true regulatory relationships from noise in finite samples. The absence of established mediation analysis procedures within the information-theoretic framework further limits their ability to identify direct relationships.
Granger causality, originally developed in econometrics and now widely used in neuroscience, offers a complementary approach based on temporal precedence and predictive power [13,14]. The core idea is as follows: if past values of variable X improve prediction of variable Y beyond what Y's own history provides, then X can be said to "Granger-cause" Y. This operational definition of causality aligns well with regulatory biology, where upstream proteins influence the abundance or activity of downstream targets over time. Despite these advantages, applications of Granger causality to proteomics have been limited [15,16], and existing implementations have not addressed the fundamental problem of indirect relationships. When protein A regulates B, and B regulates C, standard Granger causality will detect both the direct relationships (A→B, B→C) and the indirect relationship (A→C), leaving researchers with cluttered networks that obscure the true regulatory architecture.
The persistent challenge of distinguishing direct from indirect relationships in protein regulatory networks motivated the development of CausalEdge, a framework that integrates Granger causality testing with bootstrapped mediation analysis to systematically identify and remove spurious edges. This integration addresses a critical gap: while Granger causality establishes temporal precedence and predictive relationships, it does not inherently distinguish whether A→C represents a direct regulatory mechanism or an indirect effect mediated through intermediate proteins. By explicitly testing for mediation and quantifying the proportion of each relationship that operates through intermediates, CausalEdge enables data-driven decisions about which edges represent direct regulatory mechanisms.
The framework makes several advances beyond existing approaches. Rather than relying on arbitrary topology-based heuristics to prune networks (such as removing edges below a certain correlation threshold or keeping only the top-k strongest relationships), CausalEdge employs statistical hypothesis testing at every stage. Initial Granger causality tests are corrected for multiple comparisons using false discovery rate control, ensuring that the expected proportion of false positives is kept below a specified threshold [19]. Subsequently, for each edge that passes this initial filter, bootstrapped mediation analysis quantifies what proportion of the relationship operates through measured intermediate proteins, complete with confidence intervals that account for the uncertainty in these estimates [22,23]. Edges are pruned only when a substantial and statistically significant proportion of the effect is mediated, with the threshold under user control to match the stringency requirements of different applications.
Additionally, by applying CausalEdge separately to time-series data from different biological conditions such as disease versus healthy states, different drug treatments, individual subjects, or distinct cell types, researchers can generate condition-specific causal networks that explicitly capture how protein regulatory roles change with context. This capability directly addresses the challenge of molecular moonlighting, where the same protein may act as a master regulator in one condition while having minimal regulatory activity in another [24,25]. Rather than averaging across contexts to produce a single "consensus" network, CausalEdge preserves these distinctions, generating structured data suitable for integration into biological knowledge graphs. Each inferred edge carries metadata indicating not just the direction and strength of the relationship, but also the specific condition under which it was observed, the temporal lag at which the effect operates, and quantified statistical confidence.
The resulting networks more accurately represent direct regulatory mechanisms, enabling improved identification of regulatory hubs—proteins that directly control many downstream targets—and regulated integration points where multiple pathways converge. By removing transitive edges that pass through intermediate proteins, network centrality measures and other topological metrics more faithfully reflect the underlying regulatory architecture rather than being inflated by indirect effects. For systems biology applications ranging from disease mechanism elucidation to therapeutic target identification, this distinction between direct and indirect relationships is not just academically interesting but practically essential. Targeting a protein with many direct regulatory connections is fundamentally different from targeting one whose apparent centrality derives primarily from being downstream of a long cascade.
The CausalEdge framework rests on two complementary statistical approaches: Granger causality for establishing temporal precedence and directional influence, and mediation analysis for distinguishing direct effects from those that operate through intermediate variables. Together, these methods enable inference of regulatory network structure from time-series proteomics data.
The fundamental question in causal inference is whether one variable influences another. For protein time series, this translates to asking whether changes in protein A precede and help predict changes in protein B. Granger causality formalizes this intuition through a comparison of predictive models. Given time series
The unrestricted model adds the history of
where
where
The Granger causality test is implemented in the check_granger_causality() function:
def check_granger_causality(x, y, max_lag=2, f_stat_threshold=10, p_value_threshold=0.05):
"""Test if x Granger-causes y."""
try:
xy_data = np.column_stack([y, x]) # stack arrays for bivariate analysis
mask = ~np.isnan(xy_data).any(axis=1) # remove rows with nan
xy_data = xy_data[mask]
test_result = grangercausalitytests(xy_data, maxlag=max_lag, verbose=False) # run GC across multiple lags
f_stats = [test_result[lag][0]['ssr_ftest'][0] for lag in range(1, max_lag + 1)] # extract f-stat for each lag
p_values = [test_result[lag][0]['ssr_ftest'][1] for lag in range(1, max_lag + 1)] # extract p-val for each lag
# select the lag w/ the highest f-stat (strongest effect)
max_f_index = np.argmax(f_stats)
best_f = f_stats[max_f_index]
best_p = p_values[max_f_index]
best_lag = max_f_index + 1
is_causal = (best_f > f_stat_threshold) and (best_p < p_value_threshold)
return is_causal, best_f, best_p, best_lag
except Exception:
return False, 0, 1, 0The choice of lag order
Finally, we declare that protein X Granger-causes protein Y if the F-statistic exceeds a threshold and the associated p-value is below a significance level after correction for multiple testing (described in Section 2.1.2), establishing both temporal precedence (X's past predicts Y's future) and predictive power (the prediction is statistically significant), providing a foundation for inferring directional regulatory relationships.
Proteomics experiments typically measure hundreds to thousands of proteins, leading to tens of thousands or even millions of potential protein pairs to test. With a standard significance threshold of
To address this challenge, the Benjamini-Hochberg procedure for controlling the false discovery rate (FDR) is employed. Unlike family-wise error rate corrections such as Bonferroni, which control the probability of making any false positives, FDR correction controls the expected proportion of false positives among all rejected hypotheses. This distinction provides substantially greater statistical power while still ensuring that the majority of reported relationships are true positives.
Given
- First, sort the p-values in ascending order
$p_{(1)}≤p_{(2)}≤...≤p_{(n)}$ . - Second, find the largest index
$k$ such thagt$p_{(k)} ≤\frac{k}{N}*α$ , where$α$ is the desired FDR level (0.05 by default). - Third, reject all null hypotheses corresponding to tests
$1,2,...k$ , ensuring that the expected proportion of false positives among all rejected hypotheses is at most$α$ , providing formal control over the false discovery rate while maintaining reasonable power to detect true relationships.
FDR correction is applied using the multipletests() function from statsmodels after collecting all Granger causality test results:
p_values = [r['p_value'] for r in test_results] # collect p-vals after performing all GC tests
reject, p_adjusted, _, _ = multipletests(p_values, alpha=p_value_threshold, method=fdr_method) # perform FDR correction using BH method
# add corrected p-val to results
for i, result in enumerate(test_results):
result['p_adjusted'] = p_adjusted[i]
result['significant'] = reject[i]
# add only significant edges to graph
for result in test_results:
if result['significant']:
G.add_edge(result['cause'], result['effect'], weight=abs(result['correlation']),
f_stat=result['f_stat'], p_value=result['p_value'], p_adjusted=result['p_adjusted'], lag=result['lag'])The importance of this correction cannot be overstated. In preliminary analyses without FDR correction, networks inferred from time-series proteomics data contained 2-10x as many edges as networks inferred with proper correction, with the additional edges showing substantially weaker effect sizes and less biological coherence. By applying FDR correction, we ensure that the networks entering the mediation analysis stage represent genuine temporal relationships rather than statistical noise.
Establishing that protein A Granger-causes protein C tells us that A's past helps predict C's future, but it does not tell us whether this effect is direct or mediated through intermediate proteins. This distinction is important for understanding regulatory architecture. Consider a pathway where A regulates B, and B in turn regulates C. Standard Granger causality will detect A→C (since A's past predicts C's future), but this edge is spurious in the sense that A affects C only indirectly through B. Including such indirect edges obscures the true regulatory structure and inflates centrality metrics for upstream proteins.
To address this, causal mediation analuysis is performed, which decomposes the total effect of A on C into direct and indirect components [20,21]. The total effect, denoted c
c, represents the overall influence of A on C. This can be partitioned into a direct effect
The proportion of the total effect that is mediated through B is:
This proportion serves as our key decision variable: if a large fraction of A's effect on C operates through B, then the A→C edge should be considered indirect and removed from the network. To estimate these quantities from time-series data, three regression models are fit for each potential mediation triple (A, B, C). The first model estimates the a-path by regressing B on A and lagged values of B:
The second model estimates both the b-path and the direct effect (c'-path) by regressing C on both A and B, along with lagged values of C:
The third model estimates the total effect (c-path) by regressing C on A alone:
Each path is then estiamted as the mean effect across lags
Mediation analysis is implemented in the bootstrap_mediation_analysis() function, which fits three OLS regression models to estimate the a-path, b-path, and total effect:
def bootstrap_mediation_analysis(x, y, z, max_lag=2, n_bootstrap=1000, confidence_level=0.95):
"""Perform bootstrapped mediation analysis to quantify indirect effects"""
# Create lagged variables
y_current = y_vals[max_lag:]
x_lagged = []
z_lagged = []
y_lagged = []
for lag in range(1, max_lag + 1):
x_lagged.append(x_vals[max_lag - lag:n - lag])
z_lagged.append(z_vals[max_lag - lag:n - lag])
y_lagged.append(y_vals[max_lag - lag:n - lag])
# Path a: X → M (effect of X on mediator Z)
X_x = sm.add_constant(np.column_stack(x_lagged + y_lagged))
model_a = OLS(z_vals[max_lag:], X_x).fit()
a_coef = np.mean(model_a.params[1:max_lag+1])
# Path b: M → Y (effect of mediator Z on Y, controlling for X)
X_xz = sm.add_constant(np.column_stack(x_lagged + z_lagged + y_lagged))
model_b = OLS(y_current, X_xz).fit()
b_coef = np.mean(model_b.params[max_lag+1:2*max_lag+1])
# Path c: X → Y (total effect)
X_total = sm.add_constant(np.column_stack(x_lagged + y_lagged))
model_c = OLS(y_current, X_total).fit()
c_coef = np.mean(model_c.params[1:max_lag+1])
# Path c': X → Y controlling for M (direct effect)
c_prime_coef = np.mean(model_b.params[1:max_lag+1])
# Mediation effects
indirect_effect = a_coef * b_coef
direct_effect = c_prime_coef
total_effect = c_coefThe key quantity for determining whether an edge should be pruned is the indirect effect
Bootstrapping provides a nonparametric approach to quantifying uncertainty in complex statistics like the indirect effect [22,23]. The basic idea is to resample the data many times, recompute the statistic of interest for each resample, and use the resulting distribution to construct confidence intervals. For mediation analysis, this approach has been shown to provide better coverage properties than normal-theory methods, particularly for the indirect effect which follows a skewed distribution [23].
For each mediation test, the bootstrap procedure proceeds as follows. First, the time-series is sampled with replacement
Bootstrap confidence intervals are computed within the bootstrap_mediation_analysis() function by resampling the data 1000 times:
# bootstrap confidence intervals for indirect effect
bootstrap_indirect = []
for _ in range(n_bootstrap):
indices = np.random.choice(len(y_current), size=len(y_current), replace=True) # resample w/ replacement (preserves temporal structure)
# resample all data
y_boot = y_vals[indices]
x_boot = x_vals[indices]
z_boot = z_vals[indices]
# recreate lagged variables for bootstrap sample
# note: [lagged variable creation code omitted for brevity]
# fit models on bootstrap sample
model_a_boot = OLS(z_boot[max_lag:], X_x_boot).fit()
a_boot = np.mean(model_a_boot.params[1:max_lag+1])
model_b_boot = OLS(y_curr_boot, X_xz_boot).fit()
b_boot = np.mean(model_b_boot.params[max_lag+1:2*max_lag+1])
# stores indirect effect for bootstrap sample
bootstrap_indirect.append(a_boot * b_boot)
bootstrap_indirect = np.array(bootstrap_indirect)
# calculate 95% confidence intervals using percentile method
alpha = 1 - confidence_level
ci_lower = np.percentile(bootstrap_indirect, 100 * alpha / 2)
ci_upper = np.percentile(bootstrap_indirect, 100 * (1 - alpha / 2))
# mediation is significant if CI doesn't include 0
is_significant = not (ci_lower <= 0 <= ci_upper)
# calculate proportion mediated
if abs(total_effect) > 1e-10:
proportion_mediated = abs(indirect_effect / total_effect)
else:
proportion_mediated = 0The bootstrap distribution also provides valuable information about the magnitude and variability of mediation effects. A narrow confidence interval indicates that the proportion mediated is precisely estimated, while a wide interval suggests greater uncertainty. This information can inform decisions about edge pruning: edges with highly significant mediation (narrow CI far from zero) can be removed with confidence, while edges with marginally significant mediation warrant more careful consideration.
The final decision about whether to remove an edge from the network integrates the statistical evidence from mediation analysis with user-specified thresholds that reflect the application's stringency requirements. An edge A→C is marked for removal if three conditions are jointly satisfied:
- First, there exists at least one potential mediator B such that both A→B and B→C are present in the network after Granger causality testing and FDR correction. This ensures that we only consider mediation through proteins that themselves have established temporal relationships with both the cause and effect.
- Second, the indirect effect through B must be statistically significant, as determined by the bootstrap confidence interval not containing zero. This requirement prevents removal of edges based on mediation estimates that could plausibly be zero due to sampling variability.
- Third, the proportion of the total effect that is mediated must exceed a user-specified threshold
$p>θ$ , where$θ∈[0,1]$ . The default value of$θ$ is 0.5, which means that edges are removed only if more than half of their effect operates indirectly through a mediator. This threshold provides a balance between network pruning and retention of partially mediated relationships that may reflect parallel direct and indirect mechanisms. Users can adjust this threshold based on their application: conservative analyses aiming to retain most edges might use$θ=0.7$ or higher, while aggressive pruning for hub identification might employ$θ=0.3$ or lower.
The edge pruning logic is implemented in the detect_and_remove_mediated_edges()function, which identifies potential mediators using network topology:
def detect_and_remove_mediated_edges(G, data_transposed, max_lag=2, p_value_threshold=0.05, mediation_threshold=0.5, n_bootstrap=1000):
"""Detect mediated relationships using bootstrapped mediation analysis"""
edges_to_remove = []
for cause, effect in G.edges():
mediators = set(G.successors(cause)) & set(G.predecessors(effect)) # find potential mediators
if not mediators:
continue
# test ea/ potential mediator w/ bootstrapping
best_mediator = None
best_proportion = 0
best_result = None
for mediator in mediators:
result = bootstrap_mediation_analysis(data_dict[cause],data_dict[effect],
data_dict[mediator], max_lag=max_lag, n_bootstrap=n_bootstrap)
# check if this mediator explains more than previous ones
if result and result['is_significant'] and result['proportion_mediated'] > best_proportion:
best_proportion = result['proportion_mediated']
best_mediator = mediator
best_result = result
# remove edge if mediation threshold exceeded
if best_result and best_result['is_significant'] and best_proportion > mediation_threshold:
edges_to_remove.append((cause, effect))
mediation_info[(cause, effect)] = { 'is_mediated': True,'mediator': best_mediator, 'proportion_mediated': best_proportion,
'indirect_effect': best_result['indirect_effect'], 'ci_lower': best_result['ci_lower'], 'ci_upper': best_result['ci_upper']}
# remove mediated edges from graph
G.remove_edges_from(edges_to_remove)
return G, mediation_infoThis multi-stage criterion ensures that edges are only removed when there is strong statistical evidence for substantial mediation, preventing over-pruning while still effectively identifying and removing spurious indirect relationships. The result is a cleaner network that more accurately represents direct regulatory mechanisms.
The complete CausalEdge framework translates the mathematical principles described above into a practical computational pipeline consisting of five stages. Each stage builds on the previous one, progressively refining the network from initial correlation-based filtering through final mediation-pruned output.
The pipeline begins with a time-series proteomics data matrix
Data preprocessing and missing value handling are implemented in the main CausalEdge() function:
data_transposed = data.set_index(id_column).T # transpose data so rows are time points and columns are proteins (note: can comment out)
# handle missing values
missing_pct = data_transposed.isnull().mean()
biomarkers_to_keep = missing_pct[missing_pct < 0.5].index
data_transposed = data_transposed[biomarkers_to_keep]
# interpolate remaining missing values
data_transposed = data_transposed.interpolate(method='linear', axis=0, limit_direction='both')With a complete data matrix in hand, the next step is to identify candidate protein pairs that warrant detailed causality testing. Testing all
Correlation-based filtering efficiently reduces the candidate pair space using NumPy operations:
corr_array = np.corrcoef(data_array.T) # calculate correlation matrix
# pre filter pairs by correlation threshold
candidate_pairs = []
abs_corr = np.abs(corr_array)
high_corr_mask = abs_corr >= corr_threshold
# exclude self-correlations
for i in range(n_biomarkers):
high_corr_mask[i, i] = False
# extract high correlation pairs
i_indices, j_indices = np.where(high_corr_mask)
for idx in range(len(i_indices)):
i, j = i_indices[idx], j_indices[idx]
candidate_pairs.append((i, j, corr_array[i, j]))This correlation-based filtering dramatically reduces the number of candidate pairs while retaining those most likely to have genuine regulatory relationships. It is important to emphasize that high correlation does not imply causation; it serves purely as a pre-filter to focus computational effort on promising candidates. The subsequent Granger causality tests will determine directionality and temporal precedence, distinguishing regulatory relationships from co-variation.
For each candidate pair
For efficiency, protein data arrays are pre-extracted before iterating through candidate pairs:
data_arrays = {i: data_transposed.iloc[:, i].values for i in range(n_biomarkers)} # pre-extract data arrays to avoid repeated DataFrame indexing
# test ea/ candidate pair using the check_granger_causality() function (see Section 2.1.1)
for test_idx, (i, j, corr_value) in enumerate(candidate_pairs):
is_causal, f_stat, p_value, lag = check_granger_causality(
data_arrays[i], data_arrays[j], max_lag=max_lag)
# store results for subsequent FDR correctionCausalEdge also records the sig of the correlation between the protein pair, which is later used to annotate edges as activating (positive sign) or inhibitory (negative sign). While Granger causality itself is agnostic about the sign of the effect, the correlation sign provides a simple heuristic for distinguishing these regulatory modes in most cases.
After testing all candidate pairs, CausalEdge applys Benjamini-Hochberg FDR correction at level α (typically 0.05) as described in Section 2.1.2. his adjusts the p-values to control the expected proportion of false positives among all discoveries. Only protein pairs with FDR-adjusted p-values below α are retained for network construction. This filtering step greatly reduces the number of edges compared to using uncorrected p-values, substantially improving the specificity of the inferred network while maintaining reasonable sensitivity to true relationships.
After compiling a filtered set of Granger-causal relationships, CausalEdge constructs a directed graph
Each edge is annotated with attributes that capture important aspects of the relationship. The weight is set to
Network construction uses NetworkX to create a directed graph with rich edge annotations:
# initialize directed graph
G = nx.DiGraph()
G.add_nodes_from(biomarker_list)
# after FDR correction, add only significant edges
for result in test_results:
if result['significant']: # Based on FDR-adjusted p-value
G.add_edge(result['cause'], result['effect'], weight=abs(result['correlation']), correlation=float(result['correlation']),
color='red' if result['correlation'] < 0 else 'blue', # Red=inhibitory, Blue=activating f_stat=result['f_stat'],
p_value=result['p_value'], p_adjusted=result['p_adjusted'], lag=result['lag'])
# remove isolated nodes (no edges)
nodes_with_edges = set()
for u, v in G.edges():
nodes_with_edges.add(u)
nodes_with_edges.add(v)
nodes_to_remove = set(G.nodes()) - nodes_with_edges
G.remove_nodes_from(nodes_to_remove)At this stage, the network represents all significant temporal predictive relationships, but it still contains indirect edges where protein A influences protein C only through intermediate protein B. The next stage addresses this problem through systematic mediation testing.
For each edge
For each potential mediator
If any mediator
The entire mediation testing and pruning workflow is encapsulated in a single function call (full implementation shown in Section 2.1.5):
# perform mediation analysis if requested
if remove_mediated and G.number_of_edges() > 0:
G, mediation_info = detect_and_remove_mediated_edges( G, data_transposed, max_lag=max_lag,
p_value_threshold=p_value_threshold, mediation_threshold=mediation_threshold, # Default 0.5 n_bootstrap=n_bootstrap) After testing all edges, those marked for mediation are removed, resulting in the pruned network
After obtaining a pruned network, CausalEdge computes standard network metrics that characterize its structure and identifies biologically important proteins. The pipeline automatically calculates and reports key network statistics:
# calculate in and out degree
out_degree = dict(G.out_degree())
in_degree = dict(G.in_degree())
print("\nTop 5 proteins by out-degree (causes many others):")
sorted_out = sorted(out_degree.items(), key=lambda x: x[1], reverse=True)
for node, degree in sorted_out[:5]:
if degree > 0:
print(f" {node}: {degree} outgoing edges")
print("\nTop 5 proteins by in-degree (influenced by many others):")
sorted_in = sorted(in_degree.items(), key=lambda x: x[1], reverse=True)
for node, degree in sorted_in[:5]:
if degree > 0:
print(f" {node}: {degree} incoming edges")
# identify strongest causal relationships
edges_list = [(u, v, data['f_stat'], data['correlation'], data['p_value'], data.get('p_adjusted', data['p_value']), data['lag']) for u, v, data in G.edges(data=True)]
edges_list.sort(key=lambda x: x[2], reverse=True)
print("\nTop 10 strongest causal relationships:")
for u, v, f_stat, corr, p_val, p_adj, lag in edges_list[:min(10, len(edges_list))]:
direction = "→" if corr > 0 else "⊣"
print(f" {u} {direction} {v}: F={f_stat:.2f}, p_adj={p_adj:.5f}, lag={lag}")Additionally, strongly connected components (subgraphs where every node can reach every other node) are identified, which may represent tightly coupled regulatory modules and path analysis is performed to trace regulatory cascades between proteins of interest, providing systems-level insights that complement the individual edge-level information captured during network inference.
CausalEdge is implemented in Python 3.8+ using widely-adopted scientific computing libraries. Data manipulation relies on numpy for efficient array operations and pandas for structured data handling. Statistical modeling employs statsmodels for Granger causality testing and regression models, with scipy providing additional statistical functions including bootstrap sampling and percentile calculations. Graph construction and analysis utilize networkx, a library offering comprehensive network algorithms and visualization capabilities.
The implementation uses the following core libraries imported at the top of the script:
import pandas as pd
import numpy as np
import networkx as nx
from statsmodels.tsa.stattools import grangercausalitytests
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.multitest import multipletestsAdditionally, the main analysis function provides a user-friendly interface with preset default parameters:
G, causal_df, data_transposed, mediation_info, test_results = CausalEdge(
data=data,
id_column='Protein.Names',
corr_threshold=0.75, # Correlation pre-filter
f_stat_threshold=10, # F-statistic threshold
p_value_threshold=0.05, # FDR significance level
max_lag=2, # Maximum temporal lag to test
remove_mediated=True, # Enable mediation pruning
mediation_threshold=0.5, # Remove if >50% mediated
n_bootstrap=1000, # Bootstrap samples for CI
fdr_method='fdr_bh' # Benjamini-Hochberg FDR
)The CausalEdge framework addresses several challenges in network inference from time-series proteomics data. First, Granger causality provides temporal precedence, establishing that changes in protein A precede changes in protein B rather than merely correlating with them. This temporal ordering is important for distinguishing regulatory drivers from their downstream targets, a distinction that static correlation-based methods cannot make. When protein A's abundance rises before protein B's, and A's past values improve prediction of B's future values, we have much stronger grounds for inferring that A regulates B than if we merely observe that their abundances correlate across samples.
The explicit removal of mediated edges through bootstrapped mediation analysis represents CausalEdge's core innovation. Existing network inference methods, whether based on correlation, Bayesian networks, or information theory, typically report all significant relationships without distinguishing direct from indirect effects. The result is cluttered networks where upstream proteins appear highly central simply because they sit at the top of long regulatory cascades. By systematically identifying edges whose effects operate primarily through intermediate proteins and removing them, CausalEdge produces networks where centrality metrics genuinely reflect direct regulatory influence.
Rather than making binary decisions about whether to keep or remove edges, CausalEdge quantifies the proportion of each relationship that operates through mediators. This quantification provides valuable biological information. A partially mediated edge, say, 30% indirect, may represent a regulatory relationship where both direct and indirect mechanisms operate in parallel. Perhaps protein A both directly influences C and also influences C indirectly through B, with both mechanisms contributing to the observed correlation. Removing such an edge would discard genuine biological signal. Conversely, a fully mediated edge where 90% or more of the effect operates indirectly is most likely spurious and should be removed. By computing mediation proportions with bootstrap confidence intervals, CausalEdge enables informed decisions that respect biological complexity while still achieving network simplification. Additionally, the combination of FDR correction for multiple testing and bootstrap confidence intervals for mediation effects provides statistical control throughout the inference process.
For systems biology, the ability to generate condition-specific or subject-specific causal networks opens new possibilities for understanding context-dependent regulation. By applying CausalEdge separately to time-series data from different conditions, researchers can ask not just "what regulates what" but "what regulates what in this specific context?" This addresses the challenge of molecular moonlighting directly, revealing that the same protein may function as a master regulator in disease but have minimal regulatory activity in health, or vice versa. Such context-specific networks can be integrated into biological knowledge graphs as condition-annotated edges, enabling sophisticated queries about how protein function varies across biological states and supporting precision medicine approaches where therapeutic strategies must be tailored to individual regulatory architectures.
Like all statistical methods, CausalEdge rests on assumptions that may be violated in real biological systems. The Granger causality framework as implemented here assumes linear autoregressive relationships between protein abundances. While this assumption is computationally tractable and often reasonable as a first-order approximation, true biological regulation can be highly nonlinear. Cooperative binding, threshold effects, and switch-like responses mediated by positive feedback are common in regulatory networks and may not be well-captured by linear models. The result would be missed edges where genuine nonlinear regulatory relationships exist but cannot be detected by linear Granger tests. Extensions using nonlinear prediction methods—neural networks, Gaussian processes, or nonparametric approaches—could address this limitation, though at the cost of increased computational complexity, reduced interpretability, and potentially greater susceptibility to overfitting in the moderate sample sizes typical of proteomics experiments.
CausalEdge assumes that time series are stationary, meaning their statistical properties remain constant over time. This assumption can be violated in experiments with strong trends or oscillations. For example, if protein abundances gradually increase throughout an experiment due to cell growth rather than regulation, or if circadian rhythms introduce periodic dynamics, standard Granger tests may produce spurious causality. Preprocessing steps such as detrending, differencing, or explicitly modeling periodic components can mitigate these issues, though such preprocessing itself introduces assumptions and can alter the interpretation of results.
The framework requires sufficient temporal resolution to capture regulatory dynamics. If sampling intervals are too coarse relative to the speed of protein regulation, true causal relationships may not be detectable because the predictor and outcome appear nearly simultaneous. This is a fundamental limitation of any temporal method: one cannot infer the order of events that occur between measurements. I recommend at least 20 time points for reliable Granger testing, with sampling intervals matched to the timescale of the biological process under study.
Unmeasured confounders represent another challenge for all observational causal inference. If an unmeasured protein Z causes both A and B, Granger causality may incorrectly infer that A causes B (or vice versa) because their correlated changes actually reflect their shared dependence on Z. Mediation analysis partially mitigates this by testing whether relationships can be explained by measured intermediate proteins, but it cannot account for proteins not included in the dataset. This highlights the importance of comprehensive protein coverage: the more complete the proteomics measurements, the less likely that important mediators or confounders are missing. It also suggests that CausalEdge results should be viewed as hypotheses requiring experimental validation rather than definitive proof of causality.
Computational cost can become prohibitive for very large protein panels. Testing all pairs among 1000 proteins yields roughly half a million directed pairs to test, each requiring regression model fitting across multiple lags. Mediation analysis for each surviving edge then requires bootstrapping, multiplying the computational burden. While correlation-based filtering substantially reduces the number of pairs that reach Granger testing, and parallel processing can provide linear speedup, applying CausalEdge to datasets with thousands of proteins and hundreds of time points pushes current computational limits. Future work on more efficient screening procedures or approximations could extend scalability.
Finally, the mediation analysis assumes that effects can be decomposed into separate direct and indirect components without strong interactions between the cause and mediator. If protein A and mediator B have synergistic effects on target C—such that their combined effect is much larger than the sum of their individual effects—the standard mediation model may not accurately represent the biology. More complex mediation models with interaction terms could address this, though estimating interaction effects reliably requires larger sample sizes than typical proteomics experiments provide.
The immediate application domain for CausalEdge is time-series proteomics measuring hundreds to thousands of proteins across experimental time courses. Drug treatment responses provide a natural use case: by profiling protein dynamics after administering a compound, researchers can map how therapeutic interventions rewire regulatory networks and identify potential mechanisms of action or resistance. Disease progression studies, tracking patients longitudinally as conditions evolve, can reveal regulatory changes that drive pathological transitions, potentially identifying early warning signals or intervention points.
As wearable molecular biosensors mature to enable continuous protein monitoring in vivo, the temporal resolution available for CausalEdge analysis will increase dramatically. Current proteomics experiments typically sample every few hours to days; future biosensors may enable minute-to-minute measurements of key proteins in blood or interstitial fluid. This high-temporal-resolution data will be particularly well-suited for Granger causality analysis, enabling real-time inference of regulatory networks as they respond to meals, exercise, stress, infections, and other perturbations in daily life. The result could be personalized regulatory network maps that capture how an individual's biology responds to their specific environment and lifestyle.
The validation case study highlighted an application with particularly high impact potential: generating condition-specific and subject-specific networks that reveal context-dependent protein function. By applying CausalEdge separately to data from different biological contexts—healthy versus diseased tissue, before versus after treatment, different genetic backgrounds, distinct developmental stages—researchers can build libraries of context-annotated regulatory networks. These networks can be integrated into biological knowledge graphs where proteins are nodes, regulatory relationships are edges, and edge metadata specifies the conditions under which each relationship was observed. This enables queries impossible with static databases: "What does protein X regulate in condition Y?" "Which proteins have the most dramatically different regulatory roles across contexts?" "What regulatory rewiring distinguishes responders from non-responders to therapy?"
Such condition-specific networks directly address molecular moonlighting by making context-dependence explicit rather than averaging it away. A protein might appear as a master regulator with high out-degree in disease networks but have zero out-degree in healthy networks, revealing that its regulatory function is disease-specific. This information could guide therapeutic strategies: targeting such a protein might modulate disease processes while sparing normal physiology, reducing side effects. Conversely, a protein with constitutively high out-degree across all contexts might be a poor therapeutic target if inhibition would disrupt normal function as much as pathological function.
The framework can be extended to multi-omics integration by testing Granger causality across data types. Combining transcriptomics, proteomics, and metabolomics time series would enable inference of multi-layer regulatory networks capturing how changes in mRNA abundance drive changes in protein levels, which in turn affect metabolite concentrations. Such multi-scale networks would reveal regulatory cascades spanning molecular layers, providing a more complete picture of how cells respond to perturbations. The mediation analysis component becomes even more crucial in this setting, as indirect effects propagating across multiple molecular layers can create many spurious edges if not properly pruned.
While the current implementation uses linear autoregressive models for computational tractability, the mediation testing framework is general and could accommodate nonlinear prediction methods. Neural Granger causality, where neural networks replace linear regressions in the Granger test, has shown promise for capturing complex nonlinear dynamics. Gaussian process models offer another route to nonlinearity while quantifying uncertainty. Adapting these methods within the CausalEdge framework would extend applicability to systems with strong nonlinearities, though careful attention to overfitting and model selection would be required given the moderate sample sizes of typical proteomics datasets.
Integration with prior biological knowledge offers another avenue for enhancement. Protein-protein interaction databases, pathway databases, and literature-derived regulatory relationships could be incorporated as priors or constraints. Edges contradicted by known biology—such as regulatory relationships between proteins in different cellular compartments with no known trafficking mechanism—could be flagged for scrutiny or filtered out. Conversely, edges supported by multiple lines of evidence (detected by CausalEdge and present in literature-curated databases) could receive higher confidence scores. This integration would help distinguish genuine novel discoveries from potential artifacts.
CausalEdge is an open-source project and welcomes contributions from the community. If you encounter issues, have suggestions for improvements, or would like to contribute to the project, feel free to reach out: evanpeikon@gmail.com.