This repository contains my MSc's final thesis, a quick explanation of its results, and some of the most important Python scripts that I made towards the completion of the thesis.
My final thesis “Development of a single-Subject Predictive model of Alzheimer's disease using fMRI and machine learning techniques in individuals with Mild Cognitive Impairment” was an endeavour that involved working with big data, fMRI and machine learning models. I used Python to carry out the data analysis from data retreived from a neuroimaging project called ADNI1.
For the data analysis I took a multi processing pipeline approach, using a leave one-out cross-validation procedure (the data was scarce so there was a need to reduce dimensionality). In order to select the best fitting algorithm, there were several processing pipelines, and for each one of them a different machine learning model was fitted: artificial neural network (multilayer perceptron), logistic regression, naive bayes, random trees, etc.
The conclusions of the master thesis were that Alzheimer's disease has potential to be predicted by using fMRI in a population at risk for developing it (mild cognitive impairment) with a decent accuracy. This can be done using machine learning and an abstract computational neuroscience concept called functional connectivity (an indirect measure on how interconnected a brain is). However, further research should be done and results are to be interpreted with a grain of salt, given that although best fitting model accuracy was good and specificity was high, the sensitivity of the classification was low, probably due to the low sample size of mild cognitive impairment patients who would turn into Alzheimer’s disease as opposed to the sample size of those who wouldn’t.
- Tech stack: Python 3. Used modules: sklearn (machine learning), matplotlib (data visualization), seaborn (enhanced layer for matplotlib), numpy (matrix laboratory), pandas (data science). In order to write my final thesis I used LaTeX as, to the best of my knowledge, is the highest typographic quality system available as open source.
In the last years scientists have tried to develop predictive models for forecasting a future onset of Alzheimer's disease (AD) in people with Mild Cognitive Impairment (MCI), using fluid, imaging and neuropsychological biomarkers. To the best of our knowledge, the question of whether functional magnetic resonance imaging (fMRI) can serve as a viable predictor for the aforementioned disease in MCI individuals remains unanswered. We have developed a single-subject predictive model using, for each individual, solely a rsfMRI scan obtained at the screening visit of the ADNI and follow-up longitudinal data about future outcomes (AD presence/absence). We included in our model 23 MCI-c (mean time until conversion: 1.65 years) and 51 MCI-nc patients (mean time of follow-up: 4.61 years). Scan preprocessing pipelines consisted on registering each scan to Shen's Atlas (214 ROIs), extract functional connectivity measures and use them as predictors (either with or without dimensionality reduction) and pair them with the future outcomes to train and test supervised machine learning models via 10-fold cross-validation. We found conversion from MCI to AD can be predicted from rsfMRI with a multilayer perceptron with no dimensionality reduction with a reasonable accuracy (77.03%, 95% CI from 67 to 87%), good ROC AUC (0.81), very high specficity (specMLP = 90.20, 95% CI from 83 to 97%) but weak sensitivity (47.83%, 95% CI from 36 to 59%). Logistic regression and the Support Vector machines also obtained reasonable diagnostic accuracies (75.68%, both of them). The models cannot be deployed to clinical practice yet: further research is needed to increase sensitivity.
The most important files and folders the reader can refer to, to see either the full thesis or the code with which I got the results from the reader can be headed to any or all of the following folders or files (see thorough explanation for each relevant file):
-
TFM_FINAL_santiagosanchez.pdf: the final thesis as was handed in to the university.
-
Writing: Contains the LaTeX files and figures that were put together in order to compile * TFM_FINAL_santiagosanchez.pdf *
-
DataAnalysis: this folder contains the most important programs I made in Python to do the analysis. Within the sub folder inContext you'll find the same programs, but in context with some of the raw data behind (and other files that contributed to the development of the final thesis)2. The most important files are linked here and you can look into them:
-
mira_dades.py: In this script the most important functions are these:
- Merge two datasets with complementary information: you can see
IMPORTANT_depura_ADNI_MERGE_i_ARXIU_fMRIs() - See which subjects convert to alzheimer's and which do not: see
temps_de_conversio(SUBMODALITAT_fMRI)
- Merge two datasets with complementary information: you can see
-
a. fes_merge_en_tensor_3d (obte l'array 3d).py: In this file the most important snippet of code is the funcion
fes_merge(n_timeseries)(see down below).On the one hand, this function went to each folder name of syntax "SUBJECTID__UID" and got the "SUBJECTID" (identifier for subject) and "LLUID" (identifier for scanner type made to a given subject) to save them in list objects and finally store them in respective .txt files (
__subjectes.txtand__uids.txt). On the other hand, the function also went into each subject folder to findtotal.txta file that contained fMRI data extracted from .NIFTI files after registering that information to Shen's atlas3 via FSL. The information inside each subject'stotal.txtwas structured as shape(214,140)(the first dimension being the Regions of interest (ROIs4) of Shen's atlas -214- and the second one was the number of time series data points available -140 time series data points for each ROI-) and finally I stacked together all these matrices as a numpy ndarray object (.npy) to have it all within one single file (__arr_ADNI_3d_preprocessada.npy) of shape(93,214,140)5. -
b. analisisfinal.py: The most important functions of this file are:
fes_dataframe_despres_dels_3_criteris_dinclusio(d): This function narrowed down the subjects that went into the machine learning models according to inclusion criteria of the methods section of my final thesis.The most important inclusion criteria within that function was how I decided to filter out the subjects who would not convert to Alzheimer's (see the footnote to understand the reasoning that was to be followed6). The idea was to choose the criterion of considering those who wouldn't turn to alzheimer's (MCI-nc) as those subjects who hadn't turn to alzheimers above the time period
$t$ , such that:$t > \mu + n\rho$ Where
μ = 2.3 years (average time of turning to alzheimers for those who turn) ρ = standard deviation of those who turned to AD n = 1.85 (arbitrary number, but if we consider the distribution of time for MCIc to be normal, that would way above the central tendency of the data).In the following code, you can see I stored the subjects I would finally analyse within the
df_MCIncdataframe, by writing this:mat_corr_labels_i_pipeline(): This function took the numpy ndarrayarr_TSwith shape (57,214,140) (57 subjects with 214 ROIs each and 140 Time Series per ROI) applied the Functional Connectivity definition. The Functional Connectivity (FC) is a construct that tells how well connected a brain is. Here, the FC of one given subject was defined as all possible correlations between the time series of the subject's ROIs, vectorized in a single line. Namely, for each matrix (n,m) = (214,140) we would get a (n,n) = (214,214) pearson correlation matrix (or the adjacency matrix of a complete graph with 214 nodes) calledarr_corand stored as out_FC(57 subjectes).npy7 as obtained in this lines of code:And after that, we would take the
arr_corrfor each$i$ subject in a for loop (arr_corr[i]) and obtain the flattened vector with the pearson correlations calledll_vectorand stored within theXmatrix that will go to the machine learning model input matrix.ll_vectorwill have shape$(1/2) \cdot n(n-1)$ :The auxiliary function call in line 378 that allows us to do obtain
ll_vector(ll_ROIxROI_flattenedwithin the function) is here:Finally we obtain the Y matrix, with the labels (wheter or not the MCI will convert to AD or no, with a 1 or 0 respectively) and run the machine learning models:
apc.PCA_10foldCV_ROC(X=X .. )is the function call to thePCA_10foldCV_ROCof another file, which are described down below: -
aux__plot_roc_crossval.py: As we said, this is the file where the machine learning models are run. 10-fold-crossvalidation, PCA, Artificial Neural networks., etc. Within this file the most important function is the one we called at line 410 in the previous file above and goes as follows:
-
aux__reduccio_dimensions.py: This file was done in order to implement a graphical interpretation for the explained variance of a principal components analysis (i.e it plots the eigenvalues of each component for the previously chosen number of components). The scree plot is a graphical way of seeing whether each component of a PCA adds value to what we are trying to model in terms of parsimony. Softwares such as SPSS or R have the possibility of getting a Scree plot easily, but python doesn't (at least, not while I was doing this thesis). So I coded this program to get the scree plot.
-
The packages I've used in this thesis are sklearn for machine learning, matplotlib and seaborn for graph generation and visualization, pandas for data analysis, numpy for treating the multidimensional arrays that came out of the NIFTI files that came out of the fMRIs.
Footnotes
-
The ADNI(Alzheimer's disease neuroimaging iniciative) gathered together high quality, free to use data on the alzheimer's disease, an without this data this thesis wouldn't have been possible. ↩
-
Lots of files are missing as github has limited space, so this is a simplification of the directories of my final thesis, not an exhaustive tour to it! ↩
-
NIFTI files are not here, as they occupy several Gigabytes of data. This is just for showcasing purposes. ↩
-
Region of Interest (ROI). ↩
-
(93,214,140) --> (SUBJECTS, ROIs,TIME SERIES)↩ -
In my final thesis I took subjects that were at risk of developing Alzheimer's disease but didn't have the disease (these are what are called Mild Cognitive Impairment or MCI). At the moment of diagnosis as MCI, they had to undergo a brain fMRI scan. Then time went on and after some years some of them became Alzheimer's disease (MCI-c), and some didn't (MCI-nc). The ones who I labeled as MCI-c are clear to label, but the ones who are MCI-nc are not clear (because there might be not enough longitudinal data because the patient might die, get discontinued from the study). So it was important to choose a criterion to decide how many years of follow up are enough for a patient to be considered free of Alzheimer's. ↩
-
The actual code got 74 subjects, as changes in the inclusion criteria (MCI-nc cutoff time point) varied. ↩