Using the mediation package for R, I performed six mediation tests for each multi-QTL. These tests ask the question: given that genotype is causally related to both phenotype X and phenotype Y, is the effect of genotype on Y mediated by X? A p-value of 0.05 or less was taken to mean that there is a significant mediation effect (no multiple test correction was done, so these are generous estimates).
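The tests themselves were run with the mediation package; as a rough base-R sketch of the question being asked (simulated data, hypothetical variable names, and a Sobel test standing in for the package's simulation-based estimate of the mediation effect):

```r
# Does X mediate the effect of genotype on Y? Base-R Sobel-style sketch;
# the real analysis used mediation::mediate(). All data here is simulated.
set.seed(1)
n <- 200
geno <- sample(0:2, n, replace = TRUE)
x <- geno + rnorm(n)              # candidate mediator
y <- x + 0.2 * geno + rnorm(n)    # outcome
fit.m <- lm(x ~ geno)             # mediator model
fit.y <- lm(y ~ x + geno)         # outcome model, adjusting for the mediator
a <- coef(summary(fit.m))["geno", ]
b <- coef(summary(fit.y))["x", ]
z <- (a["Estimate"] * b["Estimate"]) /
  sqrt(a["Estimate"]^2 * b["Std. Error"]^2 +
       b["Estimate"]^2 * a["Std. Error"]^2)
p.sobel <- 2 * pnorm(-abs(z))     # two-sided p-value for the indirect effect
```

A small p.sobel indicates a significant indirect (mediated) path, which is the same question each of the six per-QTL tests asks.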
The below table indicates the number of multi-QTLs where a significant mediation effect was observed. The value in cell (i, j) is the number of QTLs where data type i significantly mediates the effect of genotype on data type j. In other words, the rows are the mediators, and columns are phenotypes. There are 272 multi-QTLs in total.
| | ace | e | me |
|---|---|---|---|
| ace | | 28 | 24 |
| e | 28 | | 25 |
| me | 23 | 26 | |
Mediation effects were observed in about a quarter of the 272 cases. Since multiple test correction was not done, this is likely an overestimate. Also note here that the table is almost symmetric. There were only 3 or 4 cases for each pair of data types where one direction was significant and the other was not. The mediation test is not designed to distinguish between a causal and reactive model, as the test below is. Rather, it is assumed that it is known a priori which variable is the potential mediator and which is the outcome, and the test determines whether there is in fact a mediating effect.
I applied the causal inference test from Millstein et al. to the multi-QTL quadruples, with 10 principal components removed. Given a genotype L, mediator G (in the paper this is gene expression), and observed phenotype T, this test differentiates between a causal model (L → G → T), a reactive model (L → T → G), and a model conforming to neither of these structures. I tested each pair of data types as the mediator and phenotype for each multi-QTL. The model calls are reported below.
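A base-R sketch of the four component tests underlying the causal call, on simulated data conforming to the causal model (the actual analysis used the implementation from Millstein et al.; note that the real CIT uses an equivalence/permutation test for the last condition, not a plain F-test):

```r
# Simulated quadruple under the causal model L -> G -> P.
set.seed(2)
n <- 300
L <- sample(0:2, n, replace = TRUE)   # genotype
G <- L + rnorm(n)                     # candidate mediator
P <- G + rnorm(n)                     # phenotype, driven by G rather than L
p1 <- anova(lm(P ~ L))["L", "Pr(>F)"]               # 1: P associated with L
p2 <- anova(lm(G ~ P), lm(G ~ P + L))[2, "Pr(>F)"]  # 2: G ~ L | P
p3 <- anova(lm(P ~ L), lm(P ~ L + G))[2, "Pr(>F)"]  # 3: P ~ G | L
p4 <- anova(lm(P ~ G), lm(P ~ G + L))[2, "Pr(>F)"]  # 4: want NO L effect given G
causal <- p1 < 0.05 && p2 < 0.05 && p3 < 0.05 && p4 > 0.05
```

Swapping G and P gives the reactive model test; if neither set of conditions holds, the independent/other model is called.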
It was not strictly necessary to test both directions for each mediator/phenotype pair, since the calls should be symmetric. That is, if we call the causal model when testing X as the mediator and Y as the phenotype, we should call the reactive model when testing Y as the mediator and X as the phenotype. I did the symmetric tests anyway, to make sure the method was working properly.

In the vast majority of cases, the model called was independent/other. This is in concordance with the other evidence of independence seen so far. In a smaller number of cases, methylation was selected as a mediator for both gene expression and acetylation. Oddly, the weakest relationship was between acetylation and gene expression. Note that these calls are not corrected for multiple testing. Doing such a correction would increase the proportion of samples identified as independent, since that call is made when the P-values for the causal and reactive models are both above 0.05.
I made an error when computing the pairwise correlations between data types. I realized that it didn't make sense for there to be more genes associated with a methylation probe than the other way around. Each methylation probe is only within cis-distance of one or two genes, but each TSS is proximal to dozens of probes. So, given the same correlations between the features, when multiple test correction is done for the number of genes per CpG, the p-values should increase very little. But, when we correct for the number of CpGs per gene, the p-values will increase a lot. This means that we should see many more CpGs with a significantly associated gene than genes with a significantly associated CpG.
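A toy illustration of this asymmetry, with hypothetical counts (2 tests per CpG vs. 40 tests per gene) and a made-up raw p-value:

```r
# The same raw p-value survives Holm correction over 2 tests per CpG,
# but not over 40 tests per gene. Counts and p-values are hypothetical.
p_raw <- 0.004
other <- rep(0.9, 39)   # unremarkable companion p-values
p_per_cpg  <- p.adjust(c(p_raw, other[1]), method = "holm")[1]  #  2 tests per CpG
p_per_gene <- p.adjust(c(p_raw, other),    method = "holm")[1]  # 40 tests per gene
# p_per_cpg == 0.008 (still significant); p_per_gene == 0.16 (not significant)
```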
This is not what we see in the table I computed last week. It's exactly the opposite: most genes have a CpG, but only a small proportion of the CpGs have a gene. When the error was fixed, the results are more reasonable, as below. The correlation structure is much weaker than before, which is more consistent with the independence observed in the network analysis.
| | e | ace | me |
|---|---|---|---|
| e | 12731 | 1411 | 2185 |
| ace | 1545 | 25219 | 2324 |
| me | 9462 | 8778 | 420081 |
I'm concerned about using the data with 10 PCs removed for this. By removing 10 PCs, we somehow tried to remove all signals "above" genetic ones from the data in order to uncover the QTLs. I suspect that patterns of gene expression, methylation, and acetylation are larger and broader than genetic effects (being often environmentally influenced), and thus we might have removed them when removing principal components.
Identifying correlated gene/peak/CpG triples is not as straightforward as it was with QTLs, since the multiple test correction is not symmetric. As explained above, we will undoubtedly see more CpGs with a correlated gene than genes with a correlated CpG. As such, it is not immediately obvious which set of correlations to use to find triples.
As a first pass, to be as stringent as possible, I chose features which were identified in both pairwise tests. 135 triples were found this way. I analyzed them using deal, by doing some hacky thing to get it to work with continuous-only data.

There is no particular network structure that dominates, and, unlike the QTL data, there is no clear pattern that persists regardless of principal component removal. This is consistent with the rest of the analyses which included the QTLs: there are few cases where all three data types interact, and no standard interaction pattern.
When networks are scored by their log likelihood, the best networks uniformly have all possible edges, differing only in directionality. This depends on using the unpenalized likelihood; it was not observed when a penalized network score (BIC or AIC) or the posterior probability was used. In hindsight this is obvious: with the likelihood alone, more parameters always result in a better fit, so the complete graph always scores best.
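A quick check of this point: adding a pure-noise predictor to a linear model never decreases the log likelihood, while a penalized score like BIC usually rejects it (toy data):

```r
# Maximum likelihood never gets worse as parameters are added.
set.seed(3)
d <- data.frame(y = rnorm(50), x = rnorm(50), junk = rnorm(50))
small <- lm(y ~ x, data = d)
full  <- lm(y ~ x + junk, data = d)   # one extra, pure-noise parameter
logLik(full) >= logLik(small)         # always TRUE
BIC(full) > BIC(small)                # typically TRUE: penalty outweighs the gain
```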
This raises another concern with removing principal components - we might be throwing the baby out with the bathwater. I suspect that at least some of the independence we observed in the QTL data was artificial, as the dependence effects may have been removed along with the PCs. I was missing the point here. We're not interested in dependence relationships among the data without considering genotype - these are already well known. What we care about is whether the genetic effects on molecular phenotypes are mediating each other or not. So removing principal components is okay, because it gets us down to the genetic level.
There's a useful paper here which defines Gaussian networks and describes how prior information is incorporated.
I started looking for a way to assess the significance of, or confidence in, a network. deal uses the "relative probability", defined as Pr(d | D)Pr(D), to score a network D (this is a direct quote from their paper). Since this quantity is proportional to the network's likelihood Pr(d | D), the natural way to assess significance is a likelihood ratio test.
But wait, this doesn't make sense. If the score is an unpenalized likelihood, then a network with a complete set of edges should always outperform every other network. But the best networks reported by deal usually don't have all their edges. Upon further reading, it seems that deal's algorithm relies quite heavily on the prior network specified. By default (and what I have been using), the prior network is the empty network, with no interactions whatsoever between the variables.
It's becoming clear that I don't really understand what's going on in the algorithm. There is some prior assumption on the configuration of the nodes. I have been using an independence distribution for this. Since we pre-selected data types which are associated with some QTL, it's not surprising that the arrows between genotype and the other data types overcome the prior independence assumption: the signal in the data is sufficiently strong. But most of the time, the other interactions are too weak to overcome this assumption. So it's not so surprising that we often see independence, given that I guess we assumed it in the first place... 😞
Also, it turns out that deal doesn't allow discrete nodes to have parents anyway, so blacklisting the edges into genotype is unnecessary.
We need a way to aggregate the effects of all features from one data type which are correlated to the same single feature from a second data type. As a first pass, I redid a previous analysis looking at multi-meQTLs, that is, SNPs which were significantly correlated with two or more separate CpGs.
For each SNP considered here, we have a group of Holm-Bonferroni corrected P-values arising from all the CpGs which were associated to that SNP. On the y-axes below is the -log10 P-value of the correlation between the first PC of these CpGs, and the SNP. On the x-axes are various summary statistics of the -log10 transformed group of individual P-values from the CpGs. The red line is the identity.
We can see that using the first PC as a proxy to the collection of related CpGs works moderately well. In all boxes except Max, most of the SNPs lie above the identity line, indicating that the correlation with PC1 provides a better P-value than the summary statistic. However, the Max box, which shows the best P-value for each SNP among all CpGs tested, is roughly bisected by the identity line. The outliers in the top left quadrant of each box had a P-value of 0 when correlated with PC1, so the -log10 P-value for those points was set to 200. I am still investigating why the correlation with PC1 was apparently perfect in these cases, but since it's Spearman correlation, it may just be a coincidental alignment of ranks.
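A sketch of the comparison on simulated data (toy dimensions; 10 CpGs sharing one genetic signal), showing how the PC1 p-value is computed against the individual-CpG p-values:

```r
# Correlate a SNP with PC1 of its associated CpGs vs. with each CpG alone.
# All data simulated; dimensions are hypothetical.
set.seed(4)
n <- 500
snp    <- sample(0:2, n, replace = TRUE)
shared <- 0.5 * snp + rnorm(n)                    # signal shared by the CpGs
cpgs   <- sapply(1:10, function(i) shared + rnorm(n))
pc1    <- prcomp(cpgs, scale. = TRUE)$x[, 1]
p_pc1  <- cor.test(snp, pc1, method = "spearman", exact = FALSE)$p.value
p_each <- apply(cpgs, 2, function(v)
  cor.test(snp, v, method = "spearman", exact = FALSE)$p.value)
# p_pc1 is typically smaller than even min(p_each), because PC1 averages
# out the probe-level noise.
```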

I computed all pairwise correlations between data types and used the same two-step multiple testing procedure to find the number of significantly correlated features in each data type. The below table summarizes the results. The entry in cell (i, j) is the number of features of data type i which were significantly correlated with a feature of data type j. The values on the diagonal are the total number of features tested. Note that the matrix is not symmetric - this is due to differences in multiple testing (eg. there were more CHiP-seq peaks tested per gene than genes tested per peak).
| | e | ace | me |
|---|---|---|---|
| e | 12731 | 2035 | 12677 |
| ace | 1324 | 25219 | 25134 |
| me | 5015 | 5342 | 420081 |
Notice that nearly 100% of expression and acetylation features were correlated with methylation at some CpG. This is likely due to the density of CpGs throughout the genome, compared to genes and acetylation peaks. Also note that there are very few correlations between expression and acetylation, likely also due to their sparsity. Only 63440 total comparisons were made between genes and peaks, compared to 1081593 between genes and CpGs and 1770799 between peaks and CpGs.
There seem to be a large number of SNPs on the Affy 6 array with non-integral values recorded. A histogram of the number of samples without integral data, per SNP, is shown below. Note that this plot does not include counts for over 140,000 SNPs from the manifest which were not in the data at all (see the next section). The categories indicate the type of SNP being considered. Unfortunately, the distribution does not seem to change if we only consider SNPs proximal to any genomic feature, suggesting that these values are not due to poor SNP calling in intergenic regions. This histogram is on a log scale.
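The per-SNP and per-patient counts amount to the following, sketched on a toy genotype matrix (rows = SNPs, columns = patients; names and values are hypothetical):

```r
# Count non-integral genotype values per SNP and per patient; toy data.
g <- rbind(snp1 = c(0,    1, 2,   1),
           snp2 = c(0.46, 1, 1.8, 2),
           snp3 = c(0,    0, 1,   0.2))
nonint_per_snp     <- rowSums(g != round(g))  # basis of the per-SNP histogram
nonint_per_patient <- colSums(g != round(g))  # basis of the per-patient histogram
```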

To see if there were particular patients who were driving up the counts, I also generated the inverse histogram, namely the number of SNPs with a non-integer count per patient.

We see here that almost all patients have fewer than 1500 SNPs with a non-integer count, meaning that there are no patients with mostly imputed values. No patient has more than 5000 non-integral SNPs, out of a total of over 700,000 SNPs.
The investigation into the SNPs prompted me to consider possible missing data in all the data types. The gene expression data is aggregated from RNA-seq data, so there was no probeset to match against. Thus, I only considered the other three data types. Here I tabulate the features (autosomal only) on each manifest, and list the number of missing features.
| | acetylation | genotype | methylation |
|---|---|---|---|
| features on manifest | 25719 | 867693 | 473697 |
| features missing | 500 | 141598 | 53616 |
The acetylation peaks missing are NA for all patients in the data set. The number 500 is just a coincidence (when including non-autosomal peaks, it increases to 547). The missing SNPs may have had too low MAF, since MAF > 0.05 was a condition of inclusion in the transposed_1kG files. I don't know about the methylation data.
I wrote some helper functions to load the different data types. After several rounds of debugging, I sorted out the script to run Matrix eQTL by removing the PCs from the whole dataset first; it's running now.
Currently I'm rewriting the Matrix eQTL wrapper code to remove principal components before splitting the data up by chromosomes.
I looked at the distribution of summary statistics (mean, median, IQR) across data types to make sure normalization was already done. Everything looks okay, although the methylation data was normalized differently (min/max were set to 0/1).

No post last week because I spent all my time making the poster and talk, writing documentation and Makefiles, and doing homework. Also, everything here does not include chr1 and chr2 for methylation data, because those are still running...
I assessed the number of features with a QTL which were gained and lost with each successive removed principal component. For expression and acetylation, with each successive PC removed, we lose about 150-200 features that were significant in the previous PC, and this number doesn't change much as PCs are removed. Methylation is a little different - it looks like there are a lot of spurious features which get removed with the first 8 PCs.

I successively removed the first 20 principal components from the gene expression data and recalculated the eQTLs. The number of associations increased with each component removed.
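The PC removal can be sketched as subtracting a rank-k SVD reconstruction from the row-centered matrix (toy dimensions; this is a generic sketch, not the actual Matrix eQTL wrapper code):

```r
# Remove the first k principal components from a features-x-samples matrix.
remove_pcs <- function(mat, k) {
  centered <- mat - rowMeans(mat)      # center each feature (row)
  s <- svd(centered)
  centered - s$u[, 1:k, drop = FALSE] %*%
    diag(s$d[1:k], nrow = k) %*% t(s$v[, 1:k, drop = FALSE])
}
set.seed(5)
expr    <- matrix(rnorm(20 * 50), nrow = 20)  # 20 genes x 50 samples, toy
cleaned <- remove_pcs(expr, k = 2)
```

After the subtraction, the largest remaining singular value of `cleaned` equals the (k+1)-th singular value of the centered input.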

After much negotiation with Matrix eQTL, I computed all the QTLs taking into account the covariates pmi, age_death, msex, EV1, EV2, and EV3. This analysis is now using a reduced set of patients (n=389) which have all 3 data types, and all covariates. These numbers keep changing a little bit every time I do this... I think there is some numerical instability.
| | eQTL | meQTL | aceQTL |
|---|---|---|---|
| significant features | 3629 | 79686 | 3301 |
| Spearman's rho | 0.18 [-0.24-0.26] | -0.0014 [-0.13-0.13] | 0.001 [-0.12-0.12] |
| significant features (no cov.) | 3670 | 82547 | 3442 |
| Spearman's rho (no cov.) | 0.18 [-0.24-0.25] | -0.0014 [-0.13-0.13] | 0.00025 [-0.12-0.12] |
I found a couple of errors in the original QTL analysis. First, I should have obtained the list of SNPs to use directly from the Affy 6 chip manifest (these are the non-imputed ones). Second, I neglected to lift over the ChIP-seq peak co-ordinates to hg38, whereas all the other data types (including SNP locations) were lifted over. Fixing this, and using Matrix eQTL for the analysis, resulted in a slightly increased number of significant genes and a roughly fourfold increase in the number of peaks with a significant QTL.
Also, I've come to the realization that using a relational database for this data is not actually any faster than working directly from the flat files (the data.table package for R helps immensely), and is a significant maintenance burden. So I'm mostly abandoning it and redoing some of the old analyses from the flat files.
I compared the eQTLs I found (using only the non-imputed SNPs with MAF >= 1%) with the eQTLs recovered by a separate analysis of the ROSMAP data. My analysis recovered 3757 unique genes, whereas theirs recovered 4796. The difference was likely due to the 10 times larger window size used in the ROSMAP analysis. Of the 3757 significant genes in my analysis, 2660 (71%) were also found by the ROSMAP analysis. This implies that 2660 (55%) of the ROSMAP genes were also found by my analysis.
I tried increasing the maximum TSS-QTL distance to 1 Mb, to match the ROSMAP analysis. Oddly, this resulted in about 20% fewer genes being identified as significant (2984 genes total). In particular, about 1100 genes lost significance, while only about 300 new genes were identified. I suspect this is due to multiple test correction at the gene level: I use Holm correction (essentially Bonferroni) for the number of QTLs tested per gene, whereas the ROSMAP analysis uses a permutation test, which may be more forgiving.
I next focused on the 2660 genes which were identified as significant by both pipelines.
For each of the topologies produced by deal for the best individual multi-QTLs, I performed a mediation analysis dictated by the presence of partial triangles within the graph. For each triplet of nodes b <- a -> c, with two outgoing edges from a to both b and c, I tested the hypothesis that b mediates a -> c, which would correspond to an edge b -> c in the topology. This result was then compared to whether or not the edge was actually present in the network, and then the symmetric test was done (ie. c mediates a -> b). Note that the networks were built with the original data, but the mediation tests were done with rank-transformed data, since they involve two linear model fits.
First, it is clear that mediation analysis cannot distinguish the directionality of the b - c arc. The tests for the forward and reverse arrows produced nearly identical p-values (Pearson correlation = foo, p = foo). Hence, it is not informative to test for the absence of the c -> b arc when the b -> c arc is present in the topology. Therefore, I restricted the tests to two cases:
- when there was an arc between b and c in the graph, I tested for the arc in the direction present in the graph
- when there was no arc, I tested for both arcs, then took the lowest p-value
The p-values were then corrected using Holm's method (a stepwise refinement of Bonferroni).
There were only 14 mediating edges truly present in the topologies. Of those, 9 were supported by mediation analysis, and 4 were not. Likewise, there were 105 potential mediating edges which were not present in the graphs, and only 3 of these were supported by mediation. I take this as evidence in favour of the deal topologies.
I used the mediation package to perform a test of 6 hypotheses per gene:
- the effect of genotype on expression is mediated by acetylation
- the effect of genotype on acetylation is mediated by expression
- etc.
In rough concordance with the deal results, no significant mediation effects were detected in 31 (best data) or 28 (PCA data) QTL groups. Each of the six mediation effects was observed between 15 and 24 times.
Bayesian networks (at least, the most available implementations) assume that continuous variables follow conditional Gaussian distributions. This means that the variables themselves are not necessarily Gaussian. However, in our case, we know that the genotype variable doesn't depend on anything, so if we are to model genotype as a continuous variable, it must follow a Gaussian distribution. Obviously, it doesn't, so, we need to model genotype as a discrete variable.
This brings us back to the deal and CGBayesNets packages, since bnlearn doesn't support mixed data. Here's a histogram of model topologies produced with deal, which is roughly the same as what was produced with most of the bnlearn methods. The most common topology is the one where all epigenetic factors are independent from each other. There is a subset of cases which get assigned a different topology based on which data set was used (best feature for each QTL, vs. PC1 of all features). I wonder if these are the same as those cases where correlation with PC1 is poor.

CGBayesNets was not as consistent, producing a wider variety of model topologies with no particular most common one.

For the same experimental design as below, I tabulated the number of times each distinct model topology was present in the output (ambiguous topologies, ie. with an undirected arc, were counted a fractional number of times). Without using a blacklist, there were about 8 topologies tied for most common. When using a blacklist, most of the designs produced one most common topology, namely [g][e|g][a|g][m|g], in about 20 of 65 cases (including fractional cases where the topology was ambiguous). This topology corresponds to no interaction between the epigenetic phenotypes. The remaining cases were distributed among a large number of other topologies, each having between 0 and 2 cases.
Oddly, when the log likelihood score was used, the result was different. All cases fell into one of the following 6 topologies, each having between 9 and 16 hits.
[g][a|g][e|a:g][m|e:a:g]
[g][a|g][m|a:g][e|a:m:g]
[g][e|g][a|e:g][m|e:a:g]
[g][e|g][m|e:g][a|e:m:g]
[g][m|g][a|m:g][e|a:m:g]
[g][m|g][e|m:g][a|e:m:g]
Performed Bayes net learning with the bnlearn package on each QTL which was strongly associated with all three other data types. Tried a large number of combinations of different learning strategies. The data used was either the best (ie. lowest q-value) feature for each QTL, or the first principal component of all the features.
| approach | score/test | blacklist | data |
|---|---|---|---|
| score | log likelihood | yes | pca |
| score | log likelihood | no | best |
| score | log likelihood | no | pca |
| score | log likelihood | yes | best |
| score | AIC | yes | pca |
| score | AIC | no | best |
| score | AIC | no | pca |
| score | AIC | yes | best |
| score | BIC | yes | pca |
| score | BIC | no | best |
| score | BIC | no | pca |
| score | BIC | yes | best |
| score | Gaussian posterior density | yes | pca |
| score | Gaussian posterior density | no | best |
| score | Gaussian posterior density | no | pca |
| score | Gaussian posterior density | yes | best |
| constraint | Pearson correlation | yes | pca |
| constraint | Pearson correlation | no | best |
| constraint | Pearson correlation | no | pca |
| constraint | Pearson correlation | yes | best |
| constraint | Fisher-Z | yes | pca |
| constraint | Fisher-Z | no | best |
| constraint | Fisher-Z | no | pca |
| constraint | Fisher-Z | yes | best |
| constraint | mutual information | yes | pca |
| constraint | mutual information | no | best |
| constraint | mutual information | no | pca |
| constraint | mutual information | yes | best |
| constraint | mutual information with shrinkage | yes | pca |
| constraint | mutual information with shrinkage | no | best |
| constraint | mutual information with shrinkage | no | pca |
| constraint | mutual information with shrinkage | yes | best |
The only factor which caused much change in network topology was the inclusion of a blacklist. Without the blacklist, there were about an equal number of connections observed to and from genotype. In all cases, the links involving genotype were observed more frequently than other types of links. Below are two typical consensus graphs. The thickness of the arrows corresponds to how many times it was observed.


Several SNPs correlate with more than one CpG. For these "multi-meQTLs", I computed the correlation between the SNP and the first principal component of the methylation data from all their associated CpGs. In the vast majority of cases, the Spearman correlation between the first PC and the SNP was quite a bit smaller than the mean of the individual correlations. This suggests strong correlations between nearby CpGs, as expected.

The black dotted line here is x = y. Points below this line show improved correlation with PC1, while points above show reduced correlation with PC1 compared to with the individual CpGs. Notably, most points are above the line, although there is a subset of multi-meQTLs which show a much reduced correlation to PC1, suggesting that these multi-meQTLs control more than one CpG independently.
Update March 2 2015: I had the wrong idea when making this section and it's not relevant anymore. I was trying to build a network with one node for each data type all aggregated together, instead of one network per gene.
~~There were 68 SNPs which significantly correlated with all three other types of data. For each of these, I took the single gene, ChIP-seq peak, and CpG which had the lowest q-value and were associated to that SNP. I then used the deal and bnlearn packages to make Bayesian networks with these data. deal has only score-based methods, while bnlearn has both score- and constraint-based methods available. Because the data is so small, and for easier comparison between the two packages, I used the score-based method in bnlearn.
bnlearn doesn't have a built-in exhaustive search, but with such a small space of possible networks (218 with no constraints), tabu search is quite likely to hit them all, so that's what I used. bnlearn also has a number of available scoring functions; by default it uses BIC. I changed this to the log likelihood, which is most similar to the scoring function used by deal (see the deal paper).
Unfortunately, the two methods did not produce the same network. deal was robust to the use of a blacklist, whereas bnlearn incorrectly inferred an arrow "e -> g" when this was not blacklisted. All networks agreed on arrows "g -> a" and "e -> a", but not on the other arrows.~~
Added corrected P-values and q-values to the database because it takes too long to re-compute them on the fly.
The best SNP per feature (gene, CpG, or ChIP-seq peak) was taken for all significant features. q-values were calculated from the raw (Spearman's rho) P-values for these SNPs in the other two types of data. Perhaps surprisingly, most QTLs did not overlap with other types of QTLs.
Update March 1 2015: I redid the analysis and it turns out I was wrong about that, most of them do overlap. Mostly I think this was because of not lifting over the peak co-ordinates.



There were only 65 194 SNPs which were simultaneously the best SNP for at least one gene, CpG, and peak.
I noticed something odd when lifting over the methylation probe co-ordinates to hg38. Some pairs of probe co-ordinates mapped to the same locus.
- chr4:75310450 and chr4:75480226 both map to chr4:74444733
- chr10:51844517 and chr10:47900640 both map to chr10:50084757
- chr17:75221243 and chr17:75016973 both map to chr17:79509296
Since it's only three probes and, by eye, the values look roughly (though not exactly) the same for the same patient, I arbitrarily dropped one value for each patient at these locations.
Finished loading all genotype, expression, acetylation, and methylation data into the database. Calculated Spearman's rho for all SNPs within 200 KB of either a TSS (expression), CpG (methylation), or peak centre (ChIP-seq). P-values were first corrected within-feature by Holm's method, then among-feature by FDR with the qvalue package.
Update March 1 2015: these values have changed slightly due to redoing the analysis with Matrix eQTL.
| | eQTL | meQTL | aceQTL |
|---|---|---|---|
| patients | 421 | 581 | 566 |
| features tested | 12624 | 385878 | 25665 |
| SNPs tested | 278341 | 668079 | 427264 |
| total tests done | 521089 | 19064133 | 1234661 |
| SNPs per feature | 38 [24-54] | 46 [29-66] | 45 [29-64] |
| features per SNP | 1 [1-2] | 18 [6-37] | 2 [1-4] |
| significant features | 3870 | 79686 | 3301 |
| Spearman's rho | 0.0048 [-0.14-0.14] | -0.0014 [-0.13-0.13] | 0.001 [-0.12-0.12] |
| significant features (no cov.) | 3895 | 82547 | 3442 |
| Spearman's rho (no cov.) | 0.0055 [-0.13-0.14] | -0.0014 [-0.13-0.13] | 0.00025 [-0.12-0.12] |
Here, "feature" means gene for eQTLs, CpG for meQTLs, and peak for aceQTLs.
There were fewer significant aceQTLs than eQTLs, although about twice as many tests were done so it may be a multiple testing problem. On the other hand, there were about 25 times as many QTL-mediated CpGs as genes, with about 40 times as many tests done. The magnitude of effect for aceQTLs is significantly smaller than that of eQTLs, suggesting a weaker genetic component. The number of QTLs per feature (gene or peak) was comparable between acetylation and gene expression. The correlation between expression and genotype is more often positive, while meQTLs and aceQTLs tend to be more negative associations.
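The two-step correction described above can be sketched on toy p-values (BH standing in for the qvalue package's FDR estimate; feature names and p-values are hypothetical):

```r
# Step 1: Holm within each feature over its tested SNPs;
# Step 2: FDR across features on the per-feature minima.
p_by_feature <- list(
  geneA = c(1e-6, 0.03, 0.4),     # p-values for the SNPs tested against geneA
  geneB = c(0.02, 0.8),
  geneC = c(0.5, 0.6, 0.9, 0.7)
)
holm_min <- sapply(p_by_feature, function(p) min(p.adjust(p, method = "holm")))
q <- p.adjust(holm_min, method = "BH")   # BH as a stand-in for qvalue
significant <- names(holm_min)[q < 0.05]
```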
A Venn diagram of how many patients have different data types.

Usage of the heuristic search in the deal package is fairly straightforward and involves the functions autosearch and heuristic. There is an example in this paper (section 8). It is quite fast to run, although could be made slower if more random restarts are used (I'm using only 2). The degree of nodes is bounded by 10. In the produced network, there are many interconnections between modules, but interestingly, few between modules and phenotypes.

On the other hand, the FullBNLearn function in CGBayesNets is still very fast, which is odd, as it seems from the example file that this is an exhaustive search method. This may be because CGBayesNets is constraint-based, rather than score-based, so its method of building a network (conditional independence tests, or something similar) is more of a deterministic algorithm.

By eye, these networks look completely different. Compare with the below. The red edges were produced by deal but not by CGBayesNets. The blue edges were produced by CGBayesNets but not by deal. The black edges are in both networks.

There is too much data to use SQLite. I have switched to PostgreSQL. I'm using some of the ideas from their documentation on populating databases and partitioning to try to maintain reasonable performance and size.
Update March 1, 2015: I'm going back on this probably. It's too much trouble and it's actually not saving any time in the end, compared to working on flat files.
Currently, the unplaced genomic contigs, as well as the alternate MHC alleles, are being ignored.
Because there is just too much data, I'm using for the genotype data the transformed data containing only SNPs with "good imputation scores and MAF > .01" (see Genotype). Some of these reference rsIDs which are not in the UCSC release of dbSNP, so I have to use the HapMap dbSNP release, which is about twice as large. The experiments were done using hg19 as a reference, so the options are to either back-convert the SNP co-ordinates (which are now on hg38) to hg19, or liftOver the experiment co-ordinates to hg38. I've opted for the latter (newer is better, right?).
Obtaining the list of Ensembl genes from Ensembl directly instead of through UCSC, since UCSC doesn't yet have the Ensembl genes mapped to hg38. Have to use their Perl API.
There are a number of SNPs which have been genotyped which no longer exist in dbSNP, either they have been deleted, or merged into another rsID, or they don't map to hg38. I'm excluding these.
Out of the 509 patients which have been assayed for gene expression, 58 were genotyped with the omni chip, 437 with the affy chip, and 14 were not genotyped. For now, I will restrict analysis to the affy experiment.
There are 191 815 patients with all 5 phenotype variables of interest defined. The following analyses were done on those patients only.
I have found two packages which can impute Bayesian networks from mixed data: deal for R, and CGBayesNets for Matlab.
The deal package is score-based. With only 191 815 observations of 5 variables, it is possible to enumerate all networks using exhaustive search (the networkfamily method) and choose the best one based on score. Four networks were tied for the best score. The differences among these networks are the direction of causation between globcog_random_slope and tangles_sqrt, and between pathoAD and pmAD. The next best networks have a relative score of about 0.02, compared to the top scoring networks.

CGBayesNets seems to be constraint based. There is no documentation for this package, but network construction seems to be based on some kind of case/control structure, as all the network construction functions take an argument called pheno which partitions the data into two sets. The function FullBNLearn seems to be an exhaustive search, so that's what I used.

Most of the edges produced by CGBayesNets are also supported by the deal graphs. However, there are some edges missing from the CGBayesNet graph which are present in the deal graphs. In particular, there is no edge connecting pmAD to pathoAD in the CGBayesNet graph.
Since the theory surrounding discrete Bayesian networks is much better developed, it might be worth trying to discretize the continuous variables. The bnlearn package seems to be very mature.
Amyloid load seems somewhat bimodal, and might be bisectable at 2.

Tangles and cognitive decline exhibit very slight modes but it is difficult to pick out splitting points by eye. Most likely this will need to be done automatically.
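One simple automatic approach, sketched on simulated bimodal data: 2-means clustering, splitting at the midpoint between the cluster centres.

```r
# Automatic split point for a roughly bimodal variable; toy data only.
set.seed(6)
amyloid <- c(rnorm(100, mean = 1, sd = 0.5), rnorm(60, mean = 4, sd = 1))
km <- kmeans(amyloid, centers = 2, nstart = 10)
split_point <- mean(km$centers)   # midpoint between the two cluster centres
```

For genuinely unimodal variables this will still return a split, so it would need a bimodality check (e.g. a dip test) before being applied blindly.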

I created some pairwise plots of all the phenotypic variables of interest to identify any we could potentially use the information geometric inference on.

We can see that the three continuous variables are all correlated, although the relationships are very noisy. I implemented the information geometric causal inference method, including the authors' suggestions for improving robustness on noisy data. Since the method only allows for testing of increasing relationships, I tested the negative of globcog_random_slope against the other two variables. All the variables were rescaled to lie in the interval [0, 1] by shifting and scaling. These tests were only done on patients with all three variables defined (n=820).
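For reference, a minimal sketch of the slope-based IGCI estimator on toy data (without the robustness modifications mentioned above): rescale both variables to [0, 1], then compare the mean log slope in each direction; the smaller value is taken as the causal direction.

```r
# Slope-based IGCI estimate of E[log |f'(x)|]; toy deterministic data,
# not the actual phenotypes.
igci <- function(x, y) {
  rescale <- function(v) (v - min(v)) / (max(v) - min(v))
  x <- rescale(x); y <- rescale(y)
  o  <- order(x)
  dx <- diff(x[o]); dy <- diff(y[o])
  ok <- dx > 0 & dy != 0
  mean(log(abs(dy[ok] / dx[ok])))
}
set.seed(7)
x <- runif(500)
y <- x^3                 # increasing relationship, x -> y
c_xy <- igci(x, y)
c_yx <- igci(y, x)
# c_xy < c_yx suggests the causal direction x -> y
```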
According to this method, there is no support for the following causal relationships:
- amyloid_sqrt -> tangles_sqrt
- amyloid_sqrt -> -globcog_random_slope
- -globcog_random_slope -> tangles_sqrt
The first two exclusions are supported by both Bayesian networks, but the third is not.
I've loaded most of the data into a SQLite database. The schema is below. There is still some confusion around the genotype data, namely why there are two columns per patient. Also, I'm not sure if I need to integrate the Illumina Omni data as well.
Update March 1, 2015: I'm moving away from the database idea. It seems to be more trouble than it's worth.