Kendall's tau for censored data in python


The python function kendall in kendall.py calculates a non-parametric correlation coefficient (Kendall's τ) that measures the strength of correlation for a paired sample of ordinal-level data. Here the data may be partially censored (with either upper or lower limits, but not with mixed upper and lower limits). Kendall's τ can also be used as a statistical test of the null hypothesis that the two variables are uncorrelated.
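
For reference, in the absence of censoring τ is defined from the number of concordant pairs n_c and discordant pairs n_d among all n(n-1)/2 pairs (a textbook definition, independent of this code):

```math
\tau \;=\; \frac{n_c - n_d}{n(n-1)/2}, \qquad -1 \le \tau \le 1 .
```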

The calculation of τ and the p-value follows Isobe, Feigelson & Nelson (1986). The formalism was originally developed in the context of medical science¹ by Brown, Hollander & Korwar (1974). With respect to partial correlations, the formalism is also presented in Akritas & Siebert (1996).
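
To illustrate the formalism, here is a minimal sketch of a Brown, Hollander & Korwar style statistic for data with upper limits. It is not the actual implementation in kendall.py; the helper definite_gt and the boolean censoring flags are illustrative assumptions:

```python
def definite_gt(vi, ci, vj, cj):
    """True if datum i is definitely greater than datum j.

    ci, cj flag upper limits (true value <= quoted value), so a
    definite ordering requires the larger datum to be a detection.
    """
    return (not ci) and (vi > vj)

def tau_censored(x, y, cx, cy):
    """Sketch of a generalized Kendall's tau for upper limits.

    For each pair (i, j), a in {-1, 0, +1} encodes whether x_i is
    definitely greater than, definitely smaller than, or not
    comparable to x_j (b likewise for y); tau is the normalized sum
    of a * b over all pairs.
    """
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            a = (int(definite_gt(x[i], cx[i], x[j], cx[j]))
                 - int(definite_gt(x[j], cx[j], x[i], cx[i])))
            b = (int(definite_gt(y[i], cy[i], y[j], cy[j]))
                 - int(definite_gt(y[j], cy[j], y[i], cy[i])))
            s += a * b
    return s / (0.5 * n * (n - 1))
```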

The p-value calculation requires the distribution and the variance of τ under the null hypothesis. For uncensored data and large enough n, the distribution can be approximated by a normal distribution (see, e.g., Wikipedia). In this case the resulting expression depends only on the sample size². Caution is thus advised regarding the p-values calculated here for small samples of uncensored data; use scipy.stats.kendalltau instead.
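
In the uncensored large-n case the null variance of τ is the textbook Var(τ) = 2(2n+5)/(9n(n-1)), so a two-sided p-value follows from the normal approximation. A sketch of this standard result (not code from this repository):

```python
from math import erf, sqrt

def pvalue_normal_approx(tau, n):
    """Two-sided p-value for Kendall's tau under the null hypothesis,
    using the large-n normal approximation valid for uncensored data."""
    var = 2.0 * (2 * n + 5) / (9.0 * n * (n - 1))  # depends on n only
    z = abs(tau) / sqrt(var)
    return 1.0 - erf(z / sqrt(2.0))  # = 2 * (1 - Phi(|z|))
```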

For censored data and large n, the distribution of τ under the null hypothesis is approximately normal as well, but the variance depends on the distribution of the censored values with respect to the sample proportions (Oakes 1982). In practice, an estimate of the variance from the data is thus required. This code follows the approach of Isobe et al. and Brown et al., but more refined approaches exist in the literature. An example developed with astronomical data in mind is given by Akritas, Murphy & LaValley (1995); as of yet, the computation of p-values with this variance estimator is only implemented in R, as part of the package NADA (routine cenken). That formalism also supports simultaneously left- and right-censored data, and a python implementation thus appears desirable³.

Additional functionality is provided by the function tau_conf. This function determines the robustness of the correlation coefficient against the influence of individual data points (via bootstrapping) or against uncertainties in the data (via Monte Carlo sampling). A description of the idea behind these procedures can be found in Curran (2015, arXiv:1411.3816).
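
Schematically, and assuming kendall returns a (τ, p) pair, the two resampling strategies could look like this (a sketch of the idea, not the tau_conf implementation):

```python
import numpy as np
from kendall import kendall  # (tau, p) return values are an assumption

rng = np.random.default_rng(42)

def tau_interval(x, y, x_err, y_err, n_samp=10000,
                 method="montecarlo", p_conf=0.6826):
    """Schematic of the two resampling strategies behind tau_conf."""
    taus = np.empty(n_samp)
    for k in range(n_samp):
        if method == "bootstrap":
            # resample pairs with replacement: sensitivity of tau to
            # individual data points
            idx = rng.integers(0, len(x), len(x))
            taus[k], _ = kendall(x[idx], y[idx])
        else:
            # perturb every datum within its error bar: sensitivity of
            # tau to the measurement uncertainties
            taus[k], _ = kendall(x + rng.normal(0.0, x_err),
                                 y + rng.normal(0.0, y_err))
    # central interval containing a fraction p_conf of the sampled taus
    return np.percentile(taus, [50 * (1 - p_conf), 50 * (1 + p_conf)])
```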

Provided functions

kendall(x, y, censors=None, varcalc="simple", upper=True)

tau_conf(x, y, x_err=None, y_err=None, censors=None, p_conf=0.6826, n_samp=int(1e4), method="montecarlo", varcalc="simple", upper=True)

See the online help of these functions (or the source code) for notes on their usage.
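
A hypothetical usage sketch follows. The return values (a (τ, p) pair for kendall, a (lower, upper) interval for tau_conf) and the censoring convention are assumptions; check them against the docstrings:

```python
import numpy as np
from kendall import kendall, tau_conf

x = np.array([0.8, 1.5, 2.2, 3.0, 3.9, 4.6])
y = np.array([0.4, 0.7, 1.5, 1.3, 2.2, 2.6])
# hypothetical censoring flags; the exact convention (which value marks
# a limit, and which variable it applies to) should be checked in the
# docstring -- with upper=True the limits are treated as upper limits
censors = np.array([1, 1, 0, 1, 1, 1])

tau, p = kendall(x, y, censors=censors, upper=True)  # assumed returns
tau_lo, tau_hi = tau_conf(x, y, x_err=0.1 * x, y_err=0.1 * y,
                          censors=censors, method="montecarlo")
print(f"tau = {tau:.3f}, p = {p:.3g}")
print(f"68% confidence interval on tau: [{tau_lo:.3f}, {tau_hi:.3f}]")
```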

History of this code

A python implementation of the Isobe et al. algorithm was initially written by S. Flury for work presented in Flury et al. (2022). That version assumed the theoretical value of the variance valid for uncensored data and large n. E.C. Herenz modified the code to use the empirical variance calculation described in Isobe et al. (1986) for work presented in Herenz et al. (2025).

Requirements

Acknowledging the use of the code

If your research benefits from this code, please cite Isobe et al. (1986) and include a link to this GitHub repository.

Copyright

The code is released under the GPLv3 license (see LICENSE). Copyright: E.C. Herenz (2024), S. Flury (2023)

Footnotes

  1. Survival time comparison between patients receiving a heart transplant and patients not receiving such treatment.

  2. For small samples of uncensored data the distribution cannot be written down in closed form. It requires the evaluation of all possible permutations of the n pairs under the null hypothesis; the calculation of the p-value then requires computing |τ| for all of these permutations. While some trickery can simplify this calculation, it is not yet implemented here; scipy has provided it since ~2019 and conservatively treats n < 50 as small (R uses n < 60) -- see the resolved issue on github and the snippet after these footnotes. In practice, I found that for n = 15 the critical |τ| values at the p = 0.05 threshold differ by ≈ 10⁻². Critical |τ| values for given p-values are also tabulated in the statistical literature.

  3. PRs are very welcome. The existing code in NADA seems very confusing, and the accompanying book does not shed light on the issue.
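
For completeness, the exact small-sample computation mentioned in footnote 2 is available via scipy's public API (this is not code from this repository):

```python
from scipy.stats import kendalltau

# exact (permutation-based) p-value for a small, uncensored sample
tau, p = kendalltau([1, 2, 3, 4, 5], [2, 1, 4, 3, 5], method="exact")
print(f"tau = {tau:.3f}, p = {p:.4f}")
```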
