Description

This repository contains my MSc's final thesis, a quick explanation of its results, and some of the most important Python scripts that I made towards the completion of the thesis.

Extended Summary

My final thesis “Development of a single-Subject Predictive model of Alzheimer's disease using fMRI and machine learning techniques in individuals with Mild Cognitive Impairment” was an endeavour that involved working with big data, fMRI and machine learning models. I used Python to carry out the data analysis from data retreived from a neuroimaging project called ADNI¹.

For the data analysis I took a multi processing pipeline approach, using a leave one-out cross-validation procedure (the data was scarce so there was a need to reduce dimensionality). In order to select the best fitting algorithm, there were several processing pipelines, and for each one of them a different machine learning model was fitted: artificial neural network (multilayer perceptron), logistic regression, naive bayes, random trees, etc.

The conclusions of the master thesis were that Alzheimer's disease has potential to be predicted by using fMRI in a population at risk for developing it (mild cognitive impairment) with a decent accuracy. This can be done using machine learning and an abstract computational neuroscience concept called functional connectivity (an indirect measure on how interconnected a brain is). However, further research should be done and results are to be interpreted with a grain of salt, given that although best fitting model accuracy was good and specificity was high, the sensitivity of the classification was low, probably due to the low sample size of mild cognitive impairment patients who would turn into Alzheimer’s disease as opposed to the sample size of those who wouldn’t.

Tech stack: Python 3. Used modules: sklearn (machine learning), matplotlib (data visualization), seaborn (enhanced layer for matplotlib), numpy (matrix laboratory), pandas (data science). In order to write my final thesis I used LaTeX as, to the best of my knowledge, is the highest typographic quality system available as open source.

Abstract (for a more technical read)

In the last years scientists have tried to develop predictive models for forecasting a future onset of Alzheimer's disease (AD) in people with Mild Cognitive Impairment (MCI), using fluid, imaging and neuropsychological biomarkers. To the best of our knowledge, the question of whether functional magnetic resonance imaging (fMRI) can serve as a viable predictor for the aforementioned disease in MCI individuals remains unanswered. We have developed a single-subject predictive model using, for each individual, solely a rsfMRI scan obtained at the screening visit of the ADNI and follow-up longitudinal data about future outcomes (AD presence/absence). We included in our model 23 MCI-c (mean time until conversion: 1.65 years) and 51 MCI-nc patients (mean time of follow-up: 4.61 years). Scan preprocessing pipelines consisted on registering each scan to Shen's Atlas (214 ROIs), extract functional connectivity measures and use them as predictors (either with or without dimensionality reduction) and pair them with the future outcomes to train and test supervised machine learning models via 10-fold cross-validation. We found conversion from MCI to AD can be predicted from rsfMRI with a multilayer perceptron with no dimensionality reduction with a reasonable accuracy (77.03%, 95% CI from 67 to 87%), good ROC AUC (0.81), very high specficity (specMLP = 90.20, 95% CI from 83 to 97%) but weak sensitivity (47.83%, 95% CI from 36 to 59%). Logistic regression and the Support Vector machines also obtained reasonable diagnostic accuracies (75.68%, both of them). The models cannot be deployed to clinical practice yet: further research is needed to increase sensitivity.

Folders and files to comment

The most important files and folders the reader can refer to, to see either the full thesis or the code with which I got the results from the reader can be headed to any or all of the following folders or files (see thorough explanation for each relevant file):

TFM_FINAL_santiagosanchez.pdf: the final thesis as was handed in to the university.
Writing: Contains the LaTeX files and figures that were put together in order to compile * TFM_FINAL_santiagosanchez.pdf *

DataAnalysis: this folder contains the most important programs I made in Python to do the analysis. Within the sub folder inContext you'll find the same programs, but in context with some of the raw data behind (and other files that contributed to the development of the final thesis)². The most important files are linked here and you can look into them:

mira_dades.py: In this script the most important functions are these:
1. Merge two datasets with complementary information: you can see IMPORTANT_depura_ADNI_MERGE_i_ARXIU_fMRIs()
2. See which subjects convert to alzheimer's and which do not: see temps_de_conversio(SUBMODALITAT_fMRI)

a. fes_merge_en_tensor_3d (obte l'array 3d).py: In this file the most important snippet of code is the funcion fes_merge(n_timeseries) (see down below).

MSc-FinalThesis/DataAnalysis/inContext/DADES_PREPROCESSADES_FINALS/a. fes_merge_en_tensor_3d (obteArray3d).py

Lines 39 to 114 in bd8e13d

    
           def fes_merge(n_timeseries): 
        
           	""" 
        
           	GUARDATS. 
        
           		arr_3d --> numpy ndarray amb shape [93,214,140] >-->[SUBJECTES, ROIsSHEN,TIME SERIES] 
        
           		ll_subs =  [SUBJECTE 1, SUBJECTE 2, ... , SUBJECTE 93] 
        
           		ll_uids = [UID 1, UID 2, ..., UID 93] 
        
           	NOTA: 
        
           		les llistes son LA INFO dels subjectes. son correlatives a la primera dimensio (dimensio de subjectes) 
        
           		de la matriu arr_3d. 
        
           		Qualsevol dels dos valors són útils per identificar els subjectes. UID es un identificador unic que hi ha 
        
           		per a cada escaner de l'ADNI.  
        
           	""" 
        
           	j = 0 
        
           	carpetes = os.listdir(os.getcwd()) 
        
           	carpetes.sort() 
        
           	ll_matriu_3d = [] 
        
           	ll = [] 
        
           	ll_subs = [] 
        
           	ll_uids = [] 
        
           	for carpeta in carpetes: 
        
           		if "." not in carpeta: 
        
           			j = j + 1 
        
           			ids = carpeta.split("__") 
        
           			#GUARDO ELS IDS DELS SUBJECTES I ELS LL_UIDS (correlatius a la primera dimensió de arr_3d) 
        
           			ll_subs += [ids[0]] 
        
           			ll_uids += [ids[1]] 
        
           			os.chdir("./"+carpeta+os.sep+"tc") 
        
           			print("Subjecte: "+carpeta) 
        
           			print("\n") 
        
           			ll_matriu_subjecte= [] 
        
           			k = 0 
        
           			with open("total.txt","r") as f: 
        
           				for linia in f: 
        
           					k = k +1  
        
           					ll_linia = linia.split() #fora salts de linia i espais 
        
           					for i in range(len(ll_linia)): 
        
           						ll_linia[i] = float(ll_linia[i]) #passem tipus de dades a real 
        
           					if not len(ll_linia) == n_timeseries: 
        
           						raise ValueError("Compte! Que el subjecte i escaner amb id "+carpeta+" \nno té les "+str(n_timeseries)+" time series que esperaves") 
        
           					ll_matriu_subjecte = ll_matriu_subjecte + [ll_linia] #ll_matriu_subjecte conte matriu (214,140) (ROIs, TS) 
        
           			print("Subj_escan "+carpeta+" || "+"TS (columnes): "+str(len(ll_matriu_subjecte[0]))," | ROIs (files): "+str(len(ll_matriu_subjecte))) 
        
           			os.chdir("../../") 
        
           			ll_matriu_3d = ll_matriu_3d + [ll_matriu_subjecte] #COMPROVA QUE TINGUI [93,214,140] >-->[SUB,ROIs,TS] 
        
           			ll = ll + [k]  
        
           			#if j == 10: 
        
           			#	break 
        
           	print(len(ll_matriu_3d),len(ll_matriu_3d[0]),len(ll_matriu_3d[0][0]))  
        
           	arr_3d = np.array(ll_matriu_3d) #COMPROVA QUE TINGUI [93,214,140] >-->[SUB,ROIs,TS] 
        
           	print("TOTS ELS SUBJECTES HAN DE TENIR ELS MATEIXOS VALORS EN CADA ELEMENT DE LA SEGUENT LLISTA. SI NO, CAL BORRAR TOTAL.TXT, REEXECUTAR MAKEVECTOR I TORNAR A EXECUTAR FES_MERGE") 
        
           	print(ll) 
        
           	#FINALMENT GUARDEM LA NUMPY ARRAY I TAMBÉ ELS NOMS DELS SUBJECTES I ELS UIDS DELS ESCANERS 
        
           	# EN LLISTES CONTINGUDES EN TXTS, CORRELATIVES A LA PRIMERA DIMENSIO DE ARR_3D (DIMENSIO STACK) 
        
           	np.save("__arr_ADNI_3d_preprocessada",arr_3d) 
        
           	f1 = open("__subjectes.txt","w") 
        
           	f2 = open("__uids.txt","w") 
        
           	for i in range(len(ll_subs)): #HAN DE TENIR LA MATEIXA LLARGARIA LES LLISTES LL_SUBS I LL_UIDS. 
        
           		f1.write(ll_subs[i]+'\n') 
        
           		f2.write(ll_uids[i]+'\n') 
        
           	f1.close() 
        
           	f2.close() 
        
           	return "\nshape array subjectes: "+str(arr_3d.shape)+"\nnresubjectes: "+str(len(ll_subs))+"\nnre uids: "+str(len(ll_uids)) 
        
           print(fes_merge(140)) #Comprova que les times series son exactament les que posem al parametre de la funcio per a tots els subjectes. SI no fos el cas, es queixaria i donaria un VAlueerror

On the one hand, this function went to each folder name of syntax "SUBJECTID__UID" and got the "SUBJECTID" (identifier for subject) and "LLUID" (identifier for scanner type made to a given subject) to save them in list objects and finally store them in respective .txt files (__subjectes.txt and __uids.txt). On the other hand, the function also went into each subject folder to find total.txt a file that contained fMRI data extracted from .NIFTI files after registering that information to Shen's atlas³ via FSL. The information inside each subject's total.txt was structured as shape (214,140) (the first dimension being the Regions of interest (ROIs⁴) of Shen's atlas -214- and the second one was the number of time series data points available -140 time series data points for each ROI-) and finally I stacked together all these matrices as a numpy ndarray object (.npy) to have it all within one single file (__arr_ADNI_3d_preprocessada.npy) of shape (93,214,140)⁵.

b. analisisfinal.py: The most important functions of this file are: fes_dataframe_despres_dels_3_criteris_dinclusio(d): This function narrowed down the subjects that went into the machine learning models according to inclusion criteria of the methods section of my final thesis.

MSc-FinalThesis/DataAnalysis/inContext/DADES_PREPROCESSADES_FINALS/b. analisisfinal.py

Lines 48 to 243 in 3cfe02d

    
           def fes_dataframe_despres_dels_3_criteris_dinclusio(desv_estandar_de_tall): 
        
           	""" 
        
           	Funcio que elimina part dels 93 subjectes segons els tres criteris d'inclusió 
        
           	que haviem decidit aplicar a l'apartat 2.3.2 Inclusion and exclusion criteria del protocol / apartat del mètode i també els d'exclusió. 
        
           	INPUT: Valor de desviació estandar a sumar a la mitjana de la distribució de temps 
        
           	que el subjecte MCI-C tarda a convertir-se, que serveix com a anys de tall per a  
        
           	la distribució de MCI-NCs 
        
           	RETURNS: pandas dataframe REDUIT, 2 dataframes redundants (que tenen el mateix, pero separats epr MCI-C i MCI-NC) 
        
           	0) COMENÇO AMB EL CRITERI EXCLUSIO: is consistency will be evaluated as the absence of diagnostics that lower in severity after already having been diagnosed as rollover: i.e. if one MCI participant is later diagnosed as AD but in the next follow-up it comes back to being diagnosed as MCI or even diagnosed as Healthy the subject will be excluded. 
        
           	a) Participants that have an fMRI scan of the same submodality. 
        
           	b) Those scans will need to be within at least a \textit{$\pm 2 $ month interval} from the baseline diagnostic (a excpeció dels rollover). FA FALTA JUSTIFICAR B LA DIFERENCIA ENTRE ELS FOLLOW-UP MES PROPERS PER ALS ADNI 1 I LA EXAM DATE 
        
           	c) Among this sample, at least 80\% of the individuals need to have either CSF total, or CSF A$\beta_{1-42} $ or ADAS or MMSE scores to allow a multimodal approach according to \textbf{\textit{$H_{2}$}}.  
        
           	d)  MCI-NC individuals must have been in a minimum of \textbf{$n$ years of a follow-up}. 
        
           		The value $n$ will be chosen by sorting in descending order 
        
           		follow-up times in MCI-NC and eliminating those subjects whose time value is below the mean time  
        
           		of conversion for MCI-C + 1.5 standard deviations". Means and SDs come from the initial 332 MCI patient sample. 
        
           				MCI-C [time until conversion]--->  ($x$ = 2.3 years; $s$ = 1.85) --> value n calculation = 2.3 + 1.85*1 --> minimum of 4.15 years of followup  
        
           				MCI-NC[time until loss of followup]---> ($x$ = 4.69 years; $s$ = 2.357)  
        
           	""" 
        
           	df = pd.read_csv("out_df_final_pleedeADNIMERGE.csv") # --> té 93 subjectes 
        
           	# criterion a) 		son 93 
        
           	registres = df.shape[0] 
        
           	swap = registres 
        
           	print("criteri inclusio a:  inclosos "+str(registres)+" (d entre els 332 de la seccio de metodes).") 
        
           	# EXCLUSION CRITERION  Out of the 93 initial subjects only four present a lowering in severity after an AD diagnosis. Among those the exclusion criteria (see two definitions of "inconsistency" in the text) were applied. 
        
           	#df = df[df.PTID != '002_S_4746']	# WE DO NOT EXCLUDE HIM --> 002_S_4746 (4 consecutius de MCI (amb 2 missings entre mig). Dpesres 1 a AD i inmediatament despres de nou a MCI i depsres perdua followup.  
        
           	df = df[df.PTID != '031_S_4005'] 	# WE MUST EXCLUDE THEM FOR TWO MCI DIAGNOSIS AFTER THE AD ONE! --> 031_S_4005 xx index 1 (12 mesos per conv)-- DOLENT -------------- (First 2 follow-ups AD. Then Goes back to MCI (2 consecutive followups corroborate the MCI and Then follow-up stops.  
        
           	df = df[df.PTID != '019_S_4293']    # EXCLUDE THEM FOR OSCILATING PATTERN --> 019_S_4293 (24 mesos per conv) t (mci --> ad --> mci --> ad) 
        
           	df = df[df.PTID != '031_S_4947']    # EXCLUDE THEM FOR OSCILATING PATTERN -->031_S_4947 x -- POTSER DOLENT ----(MCI --> x --> AD --> MCI --> AD --> loss followup) --> varia, oscila ?? 
        
           	registres = df.shape[0] 
        
           	print("criteri exclusio (inconsistencia):\n   ara queden "+str(registres)+" (dels originals "+str(swap)+" de metodes): \no sigui, "+str(swap - registres)+"exclosos.") 
        
           	# INCLUSION CRITERION 
        
           	#time.sleep(5) 
        
           	#criterion  b) 		Check whether at least 80 % of subjects have +- 2 months difference from bl.dx vs (examdate), EXCEPT those ADNI 1 (because dx.bl was long time ago).			 
        
           	df_no_ADNI1 = df[df["fase_estudi"] != "ADNI_1"] 
        
           	MCIc_seguits_minim_2mesos = df_no_ADNI1[df_no_ADNI1["dif_examdate_vs_dxbl"].abs() <= 60][df["s_CONVER"] == 1].shape[0] 
        
           	MCInc_seguits_minim_2mesos = df_no_ADNI1[df_no_ADNI1["dif_examdate_vs_dxbl"].abs() <= 60][df["s_CONVER"] == 0].shape[0] 
        
           	MCIc, MCInc = df_no_ADNI1[df_no_ADNI1['s_CONVER'] == 1].shape[0], df_no_ADNI1[df_no_ADNI1['s_CONVER'] == 0].shape[0] 
        
           	print("CRITERI EXCLUSIO B requereix 80% min") #NO BORREM A NINGU 
        
           	print(" percentatge MCIc amb <60 dies bl vs examdate: ", MCIc_seguits_minim_2mesos/MCIc) #90% 
        
           	print(" percentatge MCIncc amb <60 dies bl vs examdate: ", MCInc_seguits_minim_2mesos/MCInc) #91 % 
        
           	#time.sleep(2) 
        
           	# CROTERION c) 		defined below (we cut)! 
        
           	# STEP 1) SPLIT MCI_C and MCI_NC.  
        
           	df_MCIc, df_MCInc = df[df['s_CONVER'] == 1], df[df['s_CONVER'] == 0] #26 MCI_C i len 67 MCI_NC respectivament 
        
           	# STEP 2) Delete MCI_NC with follow up times below the one defined in methods section (2.3 + 1.85 * 1 = 4.15 years of minimum follow up) 
        
           	# This is done to delete those MCI_NC who are prone to be false negatives (see -threshold n- listed and below) 
        
           	n = 2.3 + 1.85 * desv_estandar_de_tall  #YEARS THRESHOLD CUT-OFF 
        
           	df_MCInc, df_MCInc_NOINCLOSOS_PERCRITERI_C = df_MCInc[df_MCInc["DIES-SEGUIM"] > n*365.25], df_MCInc[df_MCInc["DIES-SEGUIM"] <= n*365.25] 
        
           	#df_MCIc = df_MCIc[df_MCIc["DIES-SEGUIM"] >= 365]  Amn aixo pots tallar els MCInc per damunt o per sota d'un temps de conversio. Empitjora la sensibilitat. 
        
           	#       REPORTING INFORMATION FOR INCLUSION CRITERION D. 
        
           	print('{:.0f} selected MCInc with follow-up time above cutoff in *years\n{:.0f} selected MCIc'.format(len(df_MCInc),len(df_MCIc))) 
        
           	print("* "+str(n)+" punt de tall -en anys- de minim follow-up assegurat per a als MCI_nc") 
        
           	print('El cutoff de n fa excloure un total de '+str(len(df_MCInc_NOINCLOSOS_PERCRITERI_C['PTID'].tolist()))+" subjectes (concretament... DESMARCACOMENTARI I MIRA "+str(df_MCInc_NOINCLOSOS_PERCRITERI_C['PTID'].tolist())) 
        
           	time.sleep(1) 
        
           	# STEP 3) WE CREATE a df with two keys AND ALSO A NEW CSV for analysis with other programs (such as SPSS) 
        
           	df = pd.concat([df_MCIc, df_MCInc]) 
        
           	# bonus: WE ADD EXACT SUBJECT_AGES AT THE MOMENT OF ACQUIRING THE FMRI SCANS! (THEY COME FROM XML FILES ) 
        
           	df_xmls = pd.read_csv("in__resum_parametres_fmri_EDATS_EXACTES_EN_MOMENTS_D'ESCANER.csv") 
        
           	df_uid_edat = df_xmls[['UIDs','AGE_ambdecimals']] 
        
           	df = pd.merge(df,df_uid_edat, how = "left", on="UIDs",sort=False) 
        
           	#bonus 2: WE WANT TO KNOW THE TIME THAT GOES FROM SCANNING TIME AND MOMENT OF END OF FOLLOW UP(MCI-NC)/CONVERSION (MCI-C).  
        
           	# EXAM-DATE VARIABLE GIVES US THE TIME THAT GOES FROM BL.DX UNTIL END OF FOLLOUP / CONVERSION, and it is pretty informative  
        
           	# for all participants, except those who do not belong to the ADNI 1 originally. For those DIES-SEGUIM are way too big. 
        
           	# Then we have dif_examdate_vs_dxbl, which gives us the diference examdate (scan time) vs baseline-diagnostic.  
        
           	df['FI-FOLLOWUP-O-CONV_vs_data-escaneig'] = pd.Series((df["DIES-SEGUIM"] - df["dif_examdate_vs_dxbl"])/365.25) 
        
           	# Thus, to know the time SINCE rsfMRI scan until conversion/loss of followup for all subjects we simply need to compute the new variable 
        
           	# doing DIES-SEGUIM - dif_examdate_vs_dxbl will. NEW VARIABLE will be called ENDFOLLOW-or-CONV_vs_data-escaneig  
        
           	#EVALUEM QUINS CENTRES HI HA  
        
           	print("CENTRE, FREQUENCIA PER CENTRE") 
        
           	print(df["SITE"].value_counts()) 
        
           	#EVALUEM QUINS CENTRES HI HA (GRUP MCI-C) 
        
           	print("CENTRE, FREQUENCIA PER CENTRE (MCI-c)") 
        
           	print(df["SITE"][df["s_CONVER"]==1].value_counts()) 
        
           	#EVALUEM QUINS CENTRES HI HA (GRUP MCI-nc) 
        
           	print("CENTRE, FREQUENCIA PER CENTRE (MCI-nc)") 
        
           	print(df["SITE"][df["s_CONVER"]==0].value_counts()) 
        
           	time.sleep(1) 
        
           	#guardem el df a csv  
        
           	df["ABETA"] = df["ABETA"].replace(to_replace = ">1700", value = '1701').astype(float) #treiem primer el caracter especial superior o igual que dels valors ">1700" i ho substituim per un valor factible: 1701] cal passar a float perquè despres de la conversio tot esta en string i no se pq 
        
           	df["ABETA/PTAU"] = df["ABETA"]/df["PTAU"]	# [creem la nova variable ABETA/PTAU, que suposadament es un index que podria contribuir al diagnostic d'acord amb la lite previa (compte que no sabem si es abeta 1 42, crec que es abeta general)] 
        
           	df.to_csv("out_df_final_pleedeADNIMERGE_postcriterisInclusio.csv",sep = ",", index = False) 
        
           	#fem estadística descriptiva per variables importants dins ll_variables 
        
           	v_quantis=[('escan_vs_fi','FI-FOLLOWUP-O-CONV_vs_data-escaneig'),("AGE","AGE_ambdecimals"),("MMSE","MMSE"),("TAU","TAU"),("PTAU","PTAU"),("ABETA","ABETA"),("AB/PTAU","ABETA/PTAU"),("FDG","FDG"),("ADAS11","ADAS11"),("ADAS13","ADAS13"),("ADASQ4","ADASQ4")]#[(label_variable, nom_variable_dins_Dataframe)"}] 
        
           	v_categ = [("EMCI/LMCI","BL.dx"),("MALE/FEMALE","PTGENDER")] 
        
           	# criterion d) 		 
        
           		# CHECKED MANUALLY. no subjects excluded. dins els MCIc 1/23 (96%) no tenien tau, ptau o beta a. I dins els MCInc eren 3/51 (94%) 
        
           	#__________INFORMEM DELS NaNs__________________________ 
        
           	print("\n##############################") 
        
           	print("_____NaNs per variable_______") 
        
           	for variable in v_quantis: 
        
           		label_variable, nom_variable_dins_Dataframe = variable 
        
           		nre_NaNs_a_variable = df[variable[1]].isna().values.tolist().count(True) 
        
           		print(variable[0],": ",nre_NaNs_a_variable," sobre "+str(len(df))+"({:.2f} %)".format(100*nre_NaNs_a_variable/len(df))+")") #Conto els NaNs 
        
           	time.sleep(1) 
        
           	#______Evaluem normalitat i homocedasticitat de les proves per a cada variable quantitativa en cada subgrup (MCI-C i MCI-NC) (condicions aplicacio t-test!)_________ 
        
           	print("\n                SHAPIRO-WILK (NORMALITAT)          |||     levene ") 
        
           	##https://es.wikipedia.org/wiki/Test_de_Shapiro%E2%80%93Wilk | 
        
           	print("                   MCIc                 MCInc      |||   MCIc vs MCInc ") 
        
           	t_test = [] #guardo les variables que requereixen ttest. les altres fem mann whitney. 
        
           	for variable in v_quantis: 
        
           		label_variable, nom_variable_dins_Dataframe = variable 
        
           		variable_MCIc, variable_MCInc = df[nom_variable_dins_Dataframe][df['s_CONVER']==1].dropna(), df[nom_variable_dins_Dataframe][df['s_CONVER']==0].dropna() 
        
           		shapiro_MCIc, p_shapiro_MCIc = stats.shapiro(variable_MCIc);  shapiro_MCInc, p_shapiro_MCInc = stats.shapiro(variable_MCInc) 
        
           		levene_statistic, p_levene = stats.levene(variable_MCIc, variable_MCInc) 
        
           		if p_shapiro_MCIc < 0.05 or p_shapiro_MCInc < 0.05: #Si la variable NO ES NORMAL segons shapiro wilk 
        
           			if p_levene < 0.05: #si no hi ha homogeneitat de variances 
        
           				print('{:7}   {:.2f} (p = {:.5f}) | {:.2f} (p = {:.5f})* ||| {:.2f} (p = {:.3f})+[mw]'.format(label_variable,shapiro_MCIc, p_shapiro_MCIc, shapiro_MCInc, p_shapiro_MCInc, levene_statistic, p_levene)) 
        
           				t_test += [False]		 
        
           			else:#si hi ha homogeneitat de variances 
        
           				print('{:7}   {:.2f} (p = {:.5f}) | {:.2f} (p = {:.5f})* ||| {:.2f} (p = {:.3f})[mw]'.format(label_variable,shapiro_MCIc, p_shapiro_MCIc, shapiro_MCInc, p_shapiro_MCInc, levene_statistic, p_levene)) 
        
           				t_test += [False] 
        
           		else:#si la variable es normal segons shapiro wilk 
        
           			if p_levene < 0.05:#si no hi ha homogeneitat de variances 
        
           				print('{:7}   {:.2f} (p = {:.5f}) | {:.2f} (p = {:.5f}) ||| {:.2f} (p = {:.3f})+[mw]'.format(label_variable,shapiro_MCIc, p_shapiro_MCIc, shapiro_MCInc, p_shapiro_MCInc, levene_statistic, p_levene)) 
        
           				t_test += [False] 
        
           			else: #si hi ha homogeneitat de variances 
        
           				print('{:7}   {:.2f} (p = {:.5f}) | {:.2f} (p = {:.5f})  ||| {:.2f} (p = {:.3f}) [t]'.format(label_variable,shapiro_MCIc, p_shapiro_MCIc, shapiro_MCInc, p_shapiro_MCInc, levene_statistic, p_levene)) 
        
           				t_test += [True] 
        
           	print("_____________________________________________")  
        
           	print("Ho (shapiro wilk): Poblacio del subgrup donat, per a la variabledonada, \nesta normalment distribuida. L'Asterisc surt quan refutem Ho en qualsevol dels dos grups, o sigui, quan qualsevol dels dos grups no mostra normalitat.") 
        
           	print("Ho (levene_test): els dos subgrups tenen homogeneitat de variances per a la variable donada. La creu + surt quan refutem Ho, o sigui quan no hi ha homogeneitat de variances.") 
        
           	print("[t] apareix quan la variable es susceptible de comparar-se entre grups amb t-test (exigeix homocedasticitat entre subgrups i distrib normal dins cada subgrup)") 
        
           	print("[?] Si no hi ha homogeneitat de variances pero la distribucio es normal... aleshores no se si aplicar mann whitney o t-test...") 
        
           	#__________CREEM LA TAULA 1 DE DESCRIPTIUS_____________ 
        
           	print("\n      ***********************  DESCRIPTIUS  ***********************") 
        
           	print("\n#################################################################") 
        
           	print("__________MCI-C___________________MCI-NC__________valor (p-value)") 
        
           	# PRIMER VAN LES QUANTITATIVES 
        
           	for i in range(len(v_quantis)): 
        
           		variable = v_quantis[i] 
        
           		label_variable, nom_variable_dins_Dataframe = variable 
        
           		variable_MCIc, variable_MCInc = df[nom_variable_dins_Dataframe][df['s_CONVER']==1].dropna(), df[nom_variable_dins_Dataframe][df['s_CONVER']==0].dropna() 
        
           		valor_contrast_hipotesis, p_valor, boolea_t_test = funcions_auxiliars.bivariant_comparacio_mitjanes(variable_MCIc, variable_MCInc, t_test[i]) 
        
           		if boolea_t_test: 
        
           			print("{:10}".format(label_variable), "{:.1f} ({:.1f})        {:.1f} ({:.1f})      {:.2f} (p = {:.5f})[t]   ".format(variable_MCIc.mean(), variable_MCIc.std(), variable_MCInc.mean(), variable_MCInc.std(),valor_contrast_hipotesis, p_valor)) 
        
           		else: 
        
           			print("{:10}".format(label_variable), "{:.1f} ({:.1f})        {:.1f} ({:.1f})      {:.2f} (p = {:.5f})[mw]   ".format(variable_MCIc.mean(), variable_MCIc.std(), variable_MCInc.mean(), variable_MCInc.std(),valor_contrast_hipotesis, p_valor)) 
        
           	#DESPRES LES CATEGORIQUES 
        
           	for variable in v_categ: 
        
           		label_variable, nom_variable_dins_Dataframe = variable 
        
           		variable_MCIc, variable_MCInc = df[nom_variable_dins_Dataframe][df['s_CONVER']==1].dropna(), df[nom_variable_dins_Dataframe][df['s_CONVER']==0].dropna() 
        
           		if label_variable == "EMCI/LMCI": 
        
           			EMCIs_MCIc, LMCIs_MCIc, EMCIs_MCInc, LMCIs_MCInc = variable_MCIc[variable_MCIc == 3].count(), variable_MCIc[variable_MCIc == 4].count(), variable_MCInc[variable_MCInc == 3].count(), variable_MCInc[variable_MCInc == 4].count() 
        
           			valor_contrast_hipotesis, p_valor = 99.99, 0.999 
        
           			print("{:10}".format(label_variable), "{:10}        {:10}    {:.2f} (p = {:.5f})[?]   ".format(str(EMCIs_MCIc)+"/"+str(LMCIs_MCIc), str(EMCIs_MCInc)+"/"+str(LMCIs_MCInc), valor_contrast_hipotesis, p_valor)) 
        
           		if label_variable == "MALE/FEMALE": 
        
           			male_MCIc, female_MCIc, male_MCInc, female_MCInc = variable_MCIc[variable_MCIc == "Male"].count(), variable_MCIc[variable_MCIc == "Female"].count(), variable_MCInc[variable_MCInc == "Male"].count(), variable_MCInc[variable_MCInc == "Female"].count() 
        
           			valor_contrast_hipotesis, p_valor = 99.99, 0.999 
        
           			print("{:10}".format(label_variable), "{:10}        {:10}    {:.2f} (p = {:.5f})[?]   ".format(str(male_MCIc)+"/"+str(female_MCIc), str(male_MCInc)+"/"+str(female_MCInc), valor_contrast_hipotesis, p_valor)) 
        
           	print("________________________________________________________________") 
        
           	print("[t] son t-test per a mostres independents. \n [mw] son Mann whitney") 
        
           	print("**COMPTE! Per a aquestes proves de comparacio de mitjanes els NaNs s'han eliminat. Diferent al que farem per al multimodal aproach, que sera mean o median imputation per a cada subgrup.") 
        
           	print("? te les odds... pero podria posar proporcions crec") 
        
           	print("________________________________________________________________")	 
        
           	print("################################################################") 
        
           	time.sleep(1) 
        
           	return df, df_MCIc, df_MCInc

The most important inclusion criteria within that function was how I decided to filter out the subjects who would not convert to Alzheimer's (see the footnote to understand the reasoning that was to be followed⁶). The idea was to choose the criterion of considering those who wouldn't turn to alzheimer's (MCI-nc) as those subjects who hadn't turn to alzheimers above the time period $t$, such that: $t > \mu + n\rho$

Where

  μ = 2.3 years (average time of turning to alzheimers for those who turn)
  ρ = standard deviation of those who turned to AD
  n = 1.85 (arbitrary number, but if we consider the distribution of time for MCIc to be normal, that would way above the central tendency of the data).

In the following code, you can see I stored the subjects I would finally analyse within the df_MCInc dataframe, by writing this:

MSc-FinalThesis/DataAnalysis/inContext/DADES_PREPROCESSADES_FINALS/b. analisisfinal.py

Lines 108 to 111 in a247922

    
           # STEP 2) Delete MCI_NC with follow up times below the one defined in methods section (2.3 + 1.85 * 1 = 4.15 years of minimum follow up) 
        
           # This is done to delete those MCI_NC who are prone to be false negatives (see -threshold n- listed and below) 
        
           n = 2.3 + 1.85 * desv_estandar_de_tall  #YEARS THRESHOLD CUT-OFF 
        
           df_MCInc, df_MCInc_NOINCLOSOS_PERCRITERI_C = df_MCInc[df_MCInc["DIES-SEGUIM"] > n*365.25], df_MCInc[df_MCInc["DIES-SEGUIM"] <= n*365.25]

mat_corr_labels_i_pipeline(): This function took the numpy ndarray arr_TS with shape (57,214,140) (57 subjects with 214 ROIs each and 140 Time Series per ROI) applied the Functional Connectivity definition. The Functional Connectivity (FC) is a construct that tells how well connected a brain is. Here, the FC of one given subject was defined as all possible correlations between the time series of the subject's ROIs, vectorized in a single line. Namely, for each matrix (n,m) = (214,140) we would get a (n,n) = (214,214) pearson correlation matrix (or the adjacency matrix of a complete graph with 214 nodes) called arr_cor and stored as out_FC(57 subjectes).npy⁷ as obtained in this lines of code:

MSc-FinalThesis/DataAnalysis/inContext/DADES_PREPROCESSADES_FINALS/b. analisisfinal.py

Lines 305 to 321 in 3cfe02d

    
           def mat_corr_labels_i_pipeline(ploteja_totes_les_matrius_de_conectivitat, ploteja_una_matriu_de_conectivitat_mitjana_per_cada_grup, ploteja_funcions_densitat_probabilitat_FCON_per_centre): 
        
           	""" 
        
           	arr_TS, ll_sub_txt, ll_uids i df tenen dimensions correlatives. 
        
           	""" 
        
           	t1 = time.clock() 
        
           	arr_TS, ll_sub_txt, ll_uids, df = redueix_i_organitza() #arr_TS --> (57,214,140)  (SUB, ROI, TS) 
        
           	arr_TS = np.transpose(arr_TS,(0,2,1)) # --> arr_TS ara te shape (57,140,214)  (SUB, TS, ROI) 
        
           	#CREO VARIABLE arr_corr, que contindrà un stack de matrius de correlacions (tantes com subjectes) 
        
           	arr_corr = [] 
        
           	for i in range(len(arr_TS)):  
        
           		ll_arr_corr = pd.DataFrame(arr_TS[i]).corr().values.tolist() #ll_ARR_CORR --> (214,214)) --> faig pas DF --> NDARRAY --> LIST per poder fer l'stack bé (no se com fer-ho en numpy) 
        
           		arr_corr += [ll_arr_corr] 
        
           	arr_corr = np.array(arr_corr) # shape (57,214,214) (Shape pot canviar en funcio del parametre de desviacio tipica, recorda) 
        
           	np.save("out_FC ("+str(len(arr_corr))+" subjectes)", arr_TS) 
        
           	print("arr_corr shape", arr_corr.shape)

And after that, we would take the arr_corr for each $i$ subject in a for loop (arr_corr[i]) and obtain the flattened vector with the pearson correlations called ll_vector and stored within the X matrix that will go to the machine learning model input matrix. ll_vector will have shape $(1/2) \cdot n(n-1)$:

MSc-FinalThesis/DataAnalysis/inContext/DADES_PREPROCESSADES_FINALS/b. analisisfinal.py

Lines 374 to 379 in a247922

    
           #CREO VARIABLE X per a a fer ML. X conté la F.Connectivity vectoritzada, o sigui conté 
        
           #CADA MATRIU DE l'stac arr corr queda vectoritzada i guardada com a matriu 2D  
        
           X = []  
        
           for i in range(len(arr_corr)): 
        
           	ll_vector = funcions_auxiliars.elimina_tri_superior_i_flateneja_ROIxROI(arr_corr[i]) 
        
           	X += [ll_vector]

The auxiliary function call in line 378 that allows us to do obtain ll_vector (ll_ROIxROI_flattened within the function) is here:

MSc-FinalThesis/DataAnalysis/inContext/DADES_PREPROCESSADES_FINALS/funcions_auxiliars.py

Lines 29 to 44 in 7592ad1

    
           def elimina_tri_superior_i_flateneja_ROIxROI(m_correl): 
        
           	""" 
        
           	PRE: This method takes a numpy array that contains a matrix of correlationsnxn called m_correl (ROIxROI) 
        
           	POST: Returns ll_ROIxROI_flattened (i.e upper triangle of m_correl and the main diagonal gets eliminated, and the lower triangle vectorized and 
        
           	returned as a a list type object as a ROW of (1/2) * n (n-1) correlations (elements). Thus, If m_correl input size us 214x214 -shen atlas- you will  
        
           	get 22,791 valies within ll_ROIxROI_flattened 
        
           	NOTE: This Does the same as numpy.tril function 
        
           	""" 
        
           	#m_correl es dataframe de pandas. com que vull operar com a llista... || FAIG A LA SEGUENT LINIA LE PAS dataframe ----> array numpy ----> llista 
        
           	ll_correl = m_correl.tolist() #ndarray --> llista  
        
           	ll_ROIxROI_flattened = [] 
        
           	for i in range(1,len(ll_correl)): 
        
           		bufer = ll_correl[i][:i] 
        
           		ll_ROIxROI_flattened = ll_ROIxROI_flattened + bufer 
        
           	return ll_ROIxROI_flattened

Finally we obtain the Y matrix, with the labels (wheter or not the MCI will convert to AD or no, with a 1 or 0 respectively) and run the machine learning models:

MSc-FinalThesis/DataAnalysis/inContext/DADES_PREPROCESSADES_FINALS/b. analisisfinal.py

Lines 403 to 417 in 60de1f5

    
           #CREO VARIABLE Y, que contindrà les labels (1 --> MCI-C | 0 --> MCI-NC)	 	 
        
           Y = df["s_CONVER"].values;	print("Y shape: ", Y.shape) # Y te shape (57,) 
        
           print("Y shape: ",Y.shape)  
        
           time.sleep(2) 
        
           #ANALISI INDEPENDENT 2)  FEM 10-fold-crossvalidation + SVM --> RESULTATS MOLT POBRES AMB PCA, tant estandaritzant com sense fer-ho. No diagnostica cap malalt, specificiat 0 amb PCA. probats varis components 
        
           apc.PCA_10foldCV_ROC(X=X,  
        
           					 y=Y,					# NOTA_ PCA_by_santi i rfecv son mutuament excloents 
        
           					 PCA_by_santi=False, 	# primer boolea True si vols fer PCA (false si no el fas)#Si vols estandaritzar columnes en fer el PCA (Si no el fas es igual si es True o false)has d'anar a dins el fitxer i buscar "estandaritzar_variables" 
        
           					 rfecv=False,  			# no funciona... si poses True hauria de fer Recursive feature elimination. Si poses False no te la fa 
        
           					 fes_violin_plot_accuracies_per_folds_i_classificador=True, 
        
           					 carpeta='BIOMARCADORS_models_finals')	# SI TOQUES COSES DE fMRI treu-ho a la carpeta 'fMRI_models_finals' 
        
           					 #NOTA: LA CARPETA ON GUARDARAS TOTES LES ACCURACY METRICS. Ha d'existir primer o dona error.

apc.PCA_10foldCV_ROC(X=X .. ) is the function call to the PCA_10foldCV_ROC of another file, which are described down below:

aux__plot_roc_crossval.py: As we said, this is the file where the machine learning models are run. 10-fold-crossvalidation, PCA, Artificial Neural networks., etc. Within this file the most important function is the one we called at line 410 in the previous file above and goes as follows:

MSc-FinalThesis/DataAnalysis/inContext/DADES_PREPROCESSADES_FINALS/aux__plot_roc_crossval.py

Lines 64 to 307 in 7592ad1

    
           def PCA_10foldCV_ROC(X,y,PCA_by_santi,rfecv,fes_violin_plot_accuracies_per_folds_i_classificador,carpeta): 
        
               from aux__reduccio_dimensions import Analisi_components_principals as dades_acp 
        
               """ 
        
               X = Dades d'entrada  
        
               y = labels 
        
               PCA = BOOLEA sobre si vols PCA o no abans de introduir aquestes dades al model 
        
               RFECV = BOOLEA sobre si vols aplicar Recursive Feature Elimination per a cada classificador o no 
        
               """ 
        
               #n_samples, n_features = X.shape 
        
               n_samples = X.shape[0] 
        
               ################################## Classification and ROC analysis######################### 
        
               #Posa els noms dels classificadors i els noms dels objectes. Aixi pots iterar en cada un d'ells i fer una 10-fold CV per a cada un. 
        
               noms_classificadors = ["Logistic Regression", 
        
                                       "Gaussian Naive Bayes", 
        
                                       "Support Vector Machines (linear kernel)",  
        
                                       "Nearest Neighbours", 
        
                                       "Multilayer perceptron (ANN)"] 
        
               objectes_classificadors = [LogisticRegression(C=10000, penalty='l2', multi_class= 'multinomial', solver='lbfgs', class_weight = "balanced"), 
        
                                           GaussianNB(),  
        
                                           svm.SVC(kernel='linear', probability=True, random_state=1, class_weight = "balanced"), 
        
                                           KNeighborsClassifier(3),  
        
                                           MLPClassifier(alpha=1, hidden_layer_sizes=(100,), activation='relu', solver='sgd')] #PROVA ADAM 
        
               #aqui guardo les accuracies per classificador, i en faig un violin plot després. 
        
               #NOTA QUE AMB AIXO PUC CREAR FACILMENT UN PANDAS DATAFRAME I FER EL VIOLINPLOT DESPRÉS 
        
               dic_acc_per_fold_i_classificador = {} #{"Logistic Regression":[FLOAT_accuracy_fold_1, FLOAT_accuracy_fold_2, [...], FLOAT_accuracy fold 10], "Gaussian Naive Bayes":[], ..., "Multilayer perceptron (ANN)":[]} 
        
               #CREEM ELS FITXERS ON GUARDAR LES METRIQUES PER CADA MODEL (TANT LES MESURES DE DIAGNOSTIC ACCURACY COM LES BRIER SCORES) 
        
               f1 = open("./"+carpeta+"/LATEX_acc_s_e_ppv_npv_lrmes_lrmenys_PERCADAMODEL_.txt","w"); f1.write("\n\\toprule\nclassifier&accuracy&sensibility&specificity&ppv&npv&LR+&LR-\n\midrule\\\\\n") #fes la capsalera per a latex 
        
               f2 = open("./"+carpeta+"/LATEX_brierScores_PERCADAMODEL_.txt","w"); f2.write("\\toprule\nBrier Scores & H.L. test\\\n\midrule\n") #fes la capsalera per a latex 
        
               f4 = open("./"+carpeta+"/intervalsConfiansaIconfusionsMatrix.txt","w") 
        
               f_readme = open("./"+carpeta+"/README_PREDICT_PROBAS.txt","w") 
        
               f_readme.write("L'arxiu PREDICT_PROBA_NOMMODEL conte columnes separades per espais blanc segons:\n\n'PREDICCIO' --> (0 mci-nc, 1 mci-c)\n'LABEL'--> (idem a valor predit pero el valor que prove de ADNIMERGE, el valor real)\n'P_mci_nc'--> prob de pertanyer a grup 0 o mci-nc.\n'P_mci_c' --> probabilitat de pertanyer a grup MCI-c (1-p) o grup codificat amb 1.\n'Fold' --> el fold la crossvalidation don prove el subjecte.\n\nNoteu que aquest arxiu ajunta tots els folds de la cross validation i tindra tants subjectes\ncom subjectes hi hagi a la mostra. Cal tenir en compte que l'ordre es preserva perque CV.split() -on CV es objecte Stratified k-fold-\nens permet aleatoritzar sempre de la mateixa\nmanera, perque random_state=None per defecte.\nAixo permet que aquestes dades es puguin prendre per fer un multimodal approach de late integration.") 
        
               f_readme.close() 
        
               matriu_violinplot_sensibEsp = [] 
        
               ############################################################################################ 
        
               #ITEREM EN CADA CLASSIFICADOR. FEM 10 FOLD CROSS VALIDATION. FEM LES ROC CURVES PER CADA CLASSIFICADOR. 
        
               ll_sens_espe_classificador = [] #per fer violinplot sens i espec 
        
               for i in range(len(objectes_classificadors)): 
        
                   f3 = open("./fMRI_models_finals/PREDICT_PROBA_"+noms_classificadors[i].replace(" ","_")+".csv","w") #GUARDO PREDIT, GS, 
        
                   f3.write("PREDICCIO,LABEL,P_mci_nc,P_mci_c\n") 
        
                   clf, nom_clf =  objectes_classificadors[i], noms_classificadors[i] #carrego l'objecte classificador i l'string que descriu cada classificador 
        
                   CV = StratifiedKFold(n_splits=10) 
        
                   #IMPLEMENT RECURSIVE FEATURE ELIMINATION [NO FUNCIONA] 
        
                   if rfecv:         
        
                       clf = RFECV(estimator=clf, step=1, cv=CV, scoring='accuracy') 
        
                   tprs, aucs = [], []; mean_fpr = np.linspace(0, 1, 100) #COSES DE LA PLANTILLA 
        
                   #creem llistes buides per acumular els true negatives, flase positives, false negatives i true positives per a despres fer una confusion matrix final. 
        
                   suma_tn, suma_fp, suma_fn, suma_tp = 0,0,0,0 
        
                   accuracies_mitjanes_mal_procediment = 0  
        
                   ll_acc_classificador = []  #creo llista buida 
        
                   j = 1 #per contar els folds 
        
                   #ITEREM EN ELS DIVERSOS FOLDS CREATS PER SKLEARN, DINS CADA CLASSIFICADOR 
        
                   y_certes, y_predites = [], [] #per a crear les brier scores per a cada classificador 
        
                   for train, test in CV.split(X, y): #CSEPARA ELS INDEX  
        
                       #OJO AMB ELS EFECTES ALIAS! PER AIXO LA COPIO... 
        
                       X_entrenament = np.copy(X[train])  
        
                       X_test = np.copy(X[test]) 
        
                       if PCA_by_santi: 
        
                           estandaritza_variables = False #POSA TRUE SI VOLS ESTANDARITZAR LES VARIABLE SPER COLUMNES. POSA FALSE EN CAS CONTRARI 
        
                           dades = dades_acp(np.array(X_entrenament,dtype = np.float64), nombre_components = None, estandaritza_variables = estandaritza_variables, llista_variables=[]) 
        
                           #dades.matriu_correlacions(grafic_correlacions = False)  # ACP es basa en la matriu de corr. davant absencia de proves de esfericitat (bartlett,KMO) es importantissim mirar que la matriu de correlacions no s'apropa a la matriu identitat. Volem variables correlacionades o no te sentit aplicar PCA 
        
                           #dades.autovalors_i_variansssa(scree_plot=True)       # veure autovalors, variança explicada i screeplot. Amb això (especialment amb l'screeplot), decideix quants components retenir combinant variança explicada i parsimònia. 
        
                           #dades.carregues_factorials()                        # veure les correlacions entre cada factor i cada variable 
        
                           dades.carregues_factorials()#vaig fer un castell de naipes i cal execitar aquesta funcio si o si, o la eguent no va. 
        
                           X_entrenament, ll_mitjana_train, ll_desvest_train, carregues_factorials_train = dades.puntuacions_factorials(imprimeix_dataset_reduit = False) # METODE QUE RETONRA EL DATAET REDUIT. ndarray reduit! em prenc les mitjanes i desviacions estandar estimades per a fer la estandaritzacio des de les dades d'entrenament 
        
                           if estandaritza_variables: 
        
                               X_test = ((X_test - ll_mitjana_train) / ll_desvest_train).dot(carregues_factorials_train) #apliquem column wise les operacions   
        
                           else: 
        
                               print(X_test.shape) 
        
                               print(carregues_factorials_train.shape) 
        
                               X_test = X_test.dot(carregues_factorials_train) # X_test * carregues_factorials_train  -->  https://stackoverflow.com/questions/24560298/python-numpy-valueerror-operands-could-not-be-broadcast-together-with-shapes 
        
                               print(X_test.shape) 
        
                           print("###########################") 
        
                           print("X_entrenament: ",X_entrenament.shape) #(57,ncomponents) 
        
                           print(len(ll_mitjana_train))#(22791,)  
        
                           print(len(ll_desvest_train)) #(22791,) CONTE DESVIACIONS TIPIQUES DUTES DE TRAIN 
        
                           print(carregues_factorials_train.shape) #(22791,components) 
        
                           print("X_test_post pca: ",X_test) #9,ncomp  
        
                           print("###########################") 
        
                       print("Ara entrenant: ",nom_clf) 
        
                       ##### CREO ESTADISTIQUES PER A CADA FOLD (DIAGNOSTIC ACCURACY) I LES VAIG GUARDANT PER CREAR LA CONFUSION MATRIX FINAL  ######### 
        
                       print("###### FOLD"+str(j)+" ######") 
        
                       print("#############################") 
        
                       #################################################################### 
        
                       clf.fit(X_entrenament, y[train])    #pas 1 --> entreno les dades, per a cada fold 
        
                       probas_ = clf.predict_proba(X_test) #pas 2 --> testejo el model a les dades d'entreno (TREU MATRIU PROBABILITATS [p, 1-p] per a cada fold) 
        
                       y_predits_test = np.argmax(probas_, axis=1) 
        
                       ################################################################### 
        
                       #DEFINEIXO LA CONFUSION MATRIX PER A CADA FOLD I IMPRIMEIXO ELS INDEXOS  
        
                       tn, fp, fn, tp = confusion_matrix(y[test], y_predits_test).ravel() 
        
                       accuracies_mitjanes_mal_procediment += (tp+tn)/(tp+tn+fp+fn) 
        
                       acc, s, e = (tp+tn)/(tp+tn+fp+fn), tp/(tp+fn), tn/(fp+tn) 
        
                       imprimeix_diagnostic_accuracy_a_cada_fold = True 
        
                       for k in range(len(probas_)): 
        
                           p, u_menys_p = probas_[k] 
        
                           predit, gold_standard = y_predits_test[k], y[test][k] 
        
                           f3.write(str(predit)+","+str(gold_standard)+","+str(p)+","+str(u_menys_p)+","+str(j)+'\n')#PREDICCIO,LABEL,P_mci_nc,P_mci_c,fold 
        
                           y_certes += [gold_standard]#per calcular brier scores 
        
                           y_predites += [predit]#per calcular brier scores 
        
                       if imprimeix_diagnostic_accuracy_a_cada_fold: 
        
                           print("probabilitats  ||  y_predits_Test ||   y_labels_test") 
        
                           print(probas_,np.argmax(probas_,axis=1),y[test]) # FET PER MI_ IMPRIMEIXO PROBABILITATS, PROBABILITATS TRANSFORMADES A VALOR MAX (ARGMAX) I LABELS) 
        
                           print("\n#######################") 
        
                           print("Accuracy: {:.5f}".format(acc)) 
        
                           print("Sensibilitat: {:.5f}".format(s)) 
        
                           print("Especificitat: {:.5f}".format(e)) 
        
                           print("Valor predictiu + (PPV): {:.5f}".format(tp/(tp+fp))) #tambe anomenat precision score. hi ha una funcio a sklearn que ho fa 
        
                           print("Valor predictiu - (NPV): {:.5f}".format(tn/(fn+tn))) 
        
                           print("LR+: {:.5f}".format(s / (1 - e))) 
        
                           print("LR-: {:.5f}".format((1 - s)/e)) 
        
                           print("#######################") 
        
                           print("tn,fp,fn,tp: ",tn,fp,fn,tp) 
        
                           print("#######################\n")       
        
                       if fes_violin_plot_accuracies_per_folds_i_classificador: 
        
                           ll_acc_classificador += [acc] 
        
                           ll_sens_espe_classificador += [[noms_classificadors[i],"sensitivity",s]] #faig llista [[clsf[i], s1, e1],[[clsf[i], s2, e2],...,[clsf[i], sk, ek]] on k es el nombre de folds. Ha de tenir dimensio (50,3) --> 10 folds * 5 CLASSIFICADORS * 2 proporcions per fold (una sens i una esp.) = 100 files | 3 variables (COMPROVAT) 
        
                           ll_sens_espe_classificador += [[noms_classificadors[i],"specificity",e]] 
        
                       #VAIG ACUMULANT ELS VALORS PER A CADA FOLD PER TAL DE TENIR UNA CONFUSION MATRIX FINAL, DESPRÉS DE LA VALIDACIÓ CREUADA 
        
                       suma_tn += tn 
        
                       suma_fp += fp 
        
                       suma_fn += fn 
        
                       suma_tp += tp 
        
                       # Compute ROC curve and area the curve 
        
                       fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1]) #PER QUE DIMONIS TREU DIVERSOS INDEXOS DE FPR i TPR PER A CADA FOLD? SI NOMÉS N'HI HA D'HAVER UN... 
        
                       tprs.append(interp(mean_fpr, fpr, tpr)) 
        
                       tprs[-1][0] = 0.0 
        
                       roc_auc = auc(fpr, tpr) 
        
                       aucs.append(roc_auc) 
        
                       plt.plot(fpr, tpr, lw=1, alpha=0.3, 
        
                                label='ROC fold %d (AUC = %0.2f)' % (j, roc_auc)) 
        
                       j = j + 1 
        
                   f3.close() 
        
                   if fes_violin_plot_accuracies_per_folds_i_classificador: 
        
                       dic_acc_per_fold_i_classificador[noms_classificadors[i]] = [] #CREO LES CLAUS DEL DICCIONARI, CAL INICIALITZAR-LES PRIMER RECORDA, SINO DONA ERROR 
        
                       dic_acc_per_fold_i_classificador[noms_classificadors[i]] = ll_acc_classificador #afegeixo al diccionari una llista amb les accuracies per fold 
        
                   #print("######################################");print("######################################",file=f4) 
        
                   #print("Accuracy mitjana (mitjana 10 accuracies): {:.4f}".format(accuracies_mitjanes_mal_procediment/10));print("Accuracy mitjana (mitjana 10 accuracies): {:.4f}".format(accuracies_mitjanes_mal_procediment/10), file=f4) 
        
                   print("\n\n##################################################");print("\n\n\n\n##################################################",file=f4) 
        
                   print("RESULTAT DESPRES DE SUMAR ELS FP,TN,TP,FN DELS 10 FOLDs\n               ("+nom_clf+")");print("RESULTAT DESPRES DE SUMAR ELS FP,TN,TP,FN DELS 10 FOLDs\n               ("+nom_clf+")",file=f4) 
        
                   print("####################################################"); print("####################################################", file=f4) 
        
                   brier_score = brier_score_loss(y_certes, y_predites)   
        
                   accuracy, s, e, ppv, npv = (suma_tp+suma_tn)/(suma_tp+suma_tn+suma_fp+suma_fn), suma_tp/(suma_tp+suma_fn), suma_tn/(suma_fp+suma_tn), suma_tp/(suma_tp+suma_fp), suma_tn/(suma_fn+suma_tn) 
        
                   LRmes, LRmenys = s/(1 - e), (1 - s)/e 
        
                   #faig intervals de confiansssa! 
        
                   print("\nAccuracy: {:.5f}         (IC95 % de {:.2f} A {:.2f})".format(accuracy, IC95(accuracy,n_samples)[0],IC95(accuracy,n_samples)[1]));print("\nAccuracy: {:.5f}         (IC95 % de {:.2f} A {:.2f})".format(accuracy, IC95(accuracy,n_samples)[0],IC95(accuracy,n_samples)[1]),file=f4) 
        
                   print("Sensibilitat: {:.5f}     (IC95 % de {:.2f} A {:.2f})".format(s,IC95(s,n_samples)[0],IC95(s,n_samples)[1]));print("Sensibilitat: {:.5f}     (IC95 % de {:.2f} A {:.2f})".format(s,IC95(s,n_samples)[0],IC95(s,n_samples)[1]), file=f4) 
        
                   print("Especificitat: {:.5f}   (IC95 % de {:.2f} A {:.2f})".format(e,IC95(e,n_samples)[0],IC95(e,n_samples)[1]));print("Especificitat: {:.5f}   (IC95 % de {:.2f} A {:.2f})".format(e,IC95(e,n_samples)[0],IC95(e,n_samples)[1]),file=f4) 
        
                   print("Valor pred + (PPV): {:.5f} (IC95 % de {:.2f} A {:.2f})".format(ppv,IC95(ppv,n_samples)[0],IC95(ppv,n_samples)[1]));print("Valor pred + (PPV): {:.5f} (IC95 % de {:.2f} A {:.2f})".format(ppv,IC95(ppv,n_samples)[0],IC95(ppv,n_samples)[1]), file=f4) #tambe anomenat precision score. hi ha una funcio a sklearn que ho fa#tambe anomenat precision score. hi ha una funcio a sklearn que ho fa 
        
                   print("Valor pred - (NPV): {:.5f} (IC95 % de {:.2f} A {:.2f})".format(npv,IC95(npv,n_samples)[0],IC95(npv,n_samples)[1]));       print("Valor pred - (NPV): {:.5f} (IC95 % de {:.2f} A {:.2f})".format(npv,IC95(npv,n_samples)[0],IC95(npv,n_samples)[1]),file=f4) 
        
                   print("LR+: {:.5f}".format(LRmes));print("LR+: {:.5f}".format(LRmes),file=f4) 
        
                   print("LR-: {:.5f}".format(LRmenys));print("LR-: {:.5f}".format(LRmenys),file=f4) 
        
                   print("Brier score final: {:.3f} ".format(brier_score));print("Brier score final: {:.3f} ".format(brier_score),file=f4) #calibracio & performance del model 
        
                   print("###############C O N F U S I O N   M A T R I X##############");print("###############C O N F U S I O N   M A T R I X##############",file=f4) 
        
                   print("                              alzh    no alzh");print("                              alzh    no alzh",file=f4) 
        
                   print("sumatoris (tp,fp) |  +test     {:2}     {:2}".format(suma_tp, suma_fp)); print("sumatoris (tp,fp) |  +test     {:2}     {:2}".format(suma_tp, suma_fp),file=f4) 
        
                   print("          (fn,tn) |  -test     {:2}     {:2}".format(suma_fn,suma_tn));print("          (fn,tn) |  -test     {:2}     {:2}".format(suma_fn,suma_tn),file=f4) 
        
                   print("############################################################\n");print("############################################################\n", file=f4)     
        
                   #guardo les confusion matrix per fer un heatmap despres 
        
                   os.chdir("."+os.sep+carpeta) 
        
                   np.save("confusion_matrix__"+noms_classificadors[i], np.array([[suma_tp, suma_fp],[suma_fn, suma_tn]])) 
        
                   os.chdir("../") 
        
                   #GUARDO AL FITXER DE RESUM PER CLASSIFICADOR (QUE CREA EL TXT QUE ANIRA A LATEX A LA TAULA DE ACCURACIES) 
        
                   #accuracy&sensibility&specificity&ppv&npv&LR+&LR str( 
        
                   f1.write(noms_classificadors[i]+" & {:.2f} & {:.2f} & {:.2f} & {:.2f} & {:.2f} & {:.3f} & {:.3f}\\\\\n".format(accuracy*100, s*100, e*100, ppv*100, npv*100, LRmes, LRmenys)) 
        
                   f2.write("{:.3f}".format(brier_score)+"& HOSMER SIMPSON\\\\\n") #https://www.data-essential.com/hosman-lemeshow-in-python/ 
        
                   plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', 
        
                            label='Luck', alpha=.8) 
        
                   mean_tpr = np.mean(tprs, axis=0) 
        
                   mean_tpr[-1] = 1.0 
        
                   mean_auc = auc(mean_fpr, mean_tpr) 
        
                   std_auc = np.std(aucs) 
        
                   plt.plot(mean_fpr, mean_tpr, color='b', 
        
                            label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc), 
        
                            lw=2, alpha=.8) 
        
                   std_tpr = np.std(tprs, axis=0) 
        
                   print("\n######################################");print("\n######################################",file=f4) 
        
                   #print("mean_tpr: ",mean_tpr,"std_tpr",std_tpr) 
        
                   print("mean_auc:",mean_auc, "std_auc",std_auc);print("mean_auc:",mean_auc, "std_auc",std_auc, file=f4) 
        
                   print("######################################\n");print("######################################",file=f4) 
        
                   tprs_upper = np.minimum(mean_tpr + std_tpr, 1) 
        
                   tprs_lower = np.maximum(mean_tpr - std_tpr, 0) 
        
                   plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2, 
        
                                    label=r'$\pm$ 1 std. dev.') 
        
                   plt.xlim([-0.05, 1.05]) 
        
                   plt.ylim([-0.05, 1.05]) 
        
                   plt.xlabel('False Positive Rate (1-especificity)') 
        
                   plt.ylabel('True Positive Rate (sensitivity)') 
        
                   plt.title(nom_clf) 
        
                   plt.legend(loc="lower right") 
        
                   plt.show() 
        
               f1.write("\\bottomrule") 
        
               f2.write("\\bottomrule") 
        
               f4.close() 
        
               f1.close()  
        
               f2.close() 
        
               #UN COP HA ENTRENAT TOTS ELS MODELS CREO EL DATAFRAME D'ACCURACIES PER CLASSIFICADOR I EL PLOTEJO AMB SEABORN 
        
               df_acc_per_classificador = pd.DataFrame(dic_acc_per_fold_i_classificador) #com a claus dels diccionaris hi ha els noms dels classificadors ja definits abans en la llista del principi 
        
               sns.set(style="whitegrid") 
        
               ax = sns.violinplot(data=df_acc_per_classificador, palette="muted", order=["Multilayer perceptron (ANN)","Logistic Regression","Support Vector Machines (linear kernel)", "Nearest Neighbours","Gaussian Naive Bayes"]) 
        
               plt.ylabel("Accuracy") 
        
               plt.xlabel("Machine learning model (classifier)") 
        
               #plt.ylim(0,1) 
        
               plt.show() 
        
               #FAIG ARA EL MATEIX AMB LES SENSIBILITATS I ESPECIFITATS PER CLASSIFICADOR 
        
               plt.close() 
        
               sns.set(style="whitegrid")                          #strin del model, si es Sensitivity o specificity, mesura (proporcio) 
        
               df_sensEsp_per_classificador = pd.DataFrame(columns=["model","metric","p"], data=ll_sens_espe_classificador) 
        
               ax = sns.violinplot(x="model", y="p", hue="metric",data=df_sensEsp_per_classificador, palette="muted",order=["Multilayer perceptron (ANN)","Logistic Regression","Support Vector Machines (linear kernel)", "Nearest Neighbours","Gaussian Naive Bayes"]) 
        
               plt.ylabel("sensibility/specificity") 
        
               plt.xlabel("Machine learning model (classifier)") 
        
               #plt.ylim(0,1) 
        
               plt.show()

aux__ploteja_matrius_correlacions.py
aux__reduccio_dimensions.py: This file was done in order to implement a graphical interpretation for the explained variance of a principal components analysis (i.e it plots the eigenvalues of each component for the previously chosen number of components). The scree plot is a graphical way of seeing whether each component of a PCA adds value to what we are trying to model in terms of parsimony. Softwares such as SPSS or R have the possibility of getting a Scree plot easily, but python doesn't (at least, not while I was doing this thesis). So I coded this program to get the scree plot.

The packages I've used in this thesis are sklearn for machine learning, matplotlib and seaborn for graph generation and visualization, pandas for data analysis, numpy for treating the multidimensional arrays that came out of the NIFTI files that came out of the fMRIs.

The ADNI(Alzheimer's disease neuroimaging iniciative) gathered together high quality, free to use data on the alzheimer's disease, an without this data this thesis wouldn't have been possible. ↩
Lots of files are missing as github has limited space, so this is a simplification of the directories of my final thesis, not an exhaustive tour to it! ↩
NIFTI files are not here, as they occupy several Gigabytes of data. This is just for showcasing purposes. ↩
Region of Interest (ROI). ↩
(93,214,140) --> (SUBJECTS, ROIs,TIME SERIES) ↩
In my final thesis I took subjects that were at risk of developing Alzheimer's disease but didn't have the disease (these are what are called Mild Cognitive Impairment or MCI). At the moment of diagnosis as MCI, they had to undergo a brain fMRI scan. Then time went on and after some years some of them became Alzheimer's disease (MCI-c), and some didn't (MCI-nc). The ones who I labeled as MCI-c are clear to label, but the ones who are MCI-nc are not clear (because there might be not enough longitudinal data because the patient might die, get discontinued from the study). So it was important to choose a criterion to decide how many years of follow up are enough for a patient to be considered free of Alzheimer's. ↩
The actual code got 74 subjects, as changes in the inclusion criteria (MCI-nc cutoff time point) varied. ↩

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Description

Extended Summary

Abstract (for a more technical read)

Folders and files to comment

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
DataAnalysis		DataAnalysis
Writing		Writing
README.md		README.md
TFM_FINAL_santiagosanchez.pdf		TFM_FINAL_santiagosanchez.pdf
desktop.ini		desktop.ini

	def fes_merge(n_timeseries):
	"""
	GUARDATS.
	arr_3d --> numpy ndarray amb shape [93,214,140] >-->[SUBJECTES, ROIsSHEN,TIME SERIES]
	ll_subs = [SUBJECTE 1, SUBJECTE 2, ... , SUBJECTE 93]
	ll_uids = [UID 1, UID 2, ..., UID 93]


	NOTA:

	les llistes son LA INFO dels subjectes. son correlatives a la primera dimensio (dimensio de subjectes)
	de la matriu arr_3d.

	Qualsevol dels dos valors són útils per identificar els subjectes. UID es un identificador unic que hi ha
	per a cada escaner de l'ADNI.


	"""
	j = 0
	carpetes = os.listdir(os.getcwd())
	carpetes.sort()
	ll_matriu_3d = []
	ll = []
	ll_subs = []
	ll_uids = []
	for carpeta in carpetes:
	if "." not in carpeta:
	j = j + 1
	ids = carpeta.split("__")

	#GUARDO ELS IDS DELS SUBJECTES I ELS LL_UIDS (correlatius a la primera dimensió de arr_3d)
	ll_subs += [ids[0]]
	ll_uids += [ids[1]]

	os.chdir("./"+carpeta+os.sep+"tc")
	print("Subjecte: "+carpeta)
	print("\n")
	ll_matriu_subjecte= []
	k = 0

	with open("total.txt","r") as f:
	for linia in f:
	k = k +1
	ll_linia = linia.split() #fora salts de linia i espais

	for i in range(len(ll_linia)):
	ll_linia[i] = float(ll_linia[i]) #passem tipus de dades a real
	if not len(ll_linia) == n_timeseries:
	raise ValueError("Compte! Que el subjecte i escaner amb id "+carpeta+" \nno té les "+str(n_timeseries)+" time series que esperaves")
	ll_matriu_subjecte = ll_matriu_subjecte + [ll_linia] #ll_matriu_subjecte conte matriu (214,140) (ROIs, TS)

	print("Subj_escan "+carpeta+" \|\| "+"TS (columnes): "+str(len(ll_matriu_subjecte[0]))," \| ROIs (files): "+str(len(ll_matriu_subjecte)))
	os.chdir("../../")
	ll_matriu_3d = ll_matriu_3d + [ll_matriu_subjecte] #COMPROVA QUE TINGUI [93,214,140] >-->[SUB,ROIs,TS]
	ll = ll + [k]
	#if j == 10:
	# break
	print(len(ll_matriu_3d),len(ll_matriu_3d[0]),len(ll_matriu_3d[0][0]))
	arr_3d = np.array(ll_matriu_3d) #COMPROVA QUE TINGUI [93,214,140] >-->[SUB,ROIs,TS]
	print("TOTS ELS SUBJECTES HAN DE TENIR ELS MATEIXOS VALORS EN CADA ELEMENT DE LA SEGUENT LLISTA. SI NO, CAL BORRAR TOTAL.TXT, REEXECUTAR MAKEVECTOR I TORNAR A EXECUTAR FES_MERGE")
	print(ll)

	#FINALMENT GUARDEM LA NUMPY ARRAY I TAMBÉ ELS NOMS DELS SUBJECTES I ELS UIDS DELS ESCANERS
	# EN LLISTES CONTINGUDES EN TXTS, CORRELATIVES A LA PRIMERA DIMENSIO DE ARR_3D (DIMENSIO STACK)
	np.save("__arr_ADNI_3d_preprocessada",arr_3d)
	f1 = open("__subjectes.txt","w")
	f2 = open("__uids.txt","w")

	for i in range(len(ll_subs)): #HAN DE TENIR LA MATEIXA LLARGARIA LES LLISTES LL_SUBS I LL_UIDS.
	f1.write(ll_subs[i]+'\n')
	f2.write(ll_uids[i]+'\n')
	f1.close()
	f2.close()
	return "\nshape array subjectes: "+str(arr_3d.shape)+"\nnresubjectes: "+str(len(ll_subs))+"\nnre uids: "+str(len(ll_uids))

	print(fes_merge(140)) #Comprova que les times series son exactament les que posem al parametre de la funcio per a tots els subjectes. SI no fos el cas, es queixaria i donaria un VAlueerror

	def fes_dataframe_despres_dels_3_criteris_dinclusio(desv_estandar_de_tall):
	"""
	Funcio que elimina part dels 93 subjectes segons els tres criteris d'inclusió
	que haviem decidit aplicar a l'apartat 2.3.2 Inclusion and exclusion criteria del protocol / apartat del mètode i també els d'exclusió.

	INPUT: Valor de desviació estandar a sumar a la mitjana de la distribució de temps
	que el subjecte MCI-C tarda a convertir-se, que serveix com a anys de tall per a
	la distribució de MCI-NCs

	RETURNS: pandas dataframe REDUIT, 2 dataframes redundants (que tenen el mateix, pero separats epr MCI-C i MCI-NC)
	0) COMENÇO AMB EL CRITERI EXCLUSIO: is consistency will be evaluated as the absence of diagnostics that lower in severity after already having been diagnosed as rollover: i.e. if one MCI participant is later diagnosed as AD but in the next follow-up it comes back to being diagnosed as MCI or even diagnosed as Healthy the subject will be excluded.
	a) Participants that have an fMRI scan of the same submodality.
	b) Those scans will need to be within at least a \textit{$\pm 2 $ month interval} from the baseline diagnostic (a excpeció dels rollover). FA FALTA JUSTIFICAR B LA DIFERENCIA ENTRE ELS FOLLOW-UP MES PROPERS PER ALS ADNI 1 I LA EXAM DATE
	c) Among this sample, at least 80\% of the individuals need to have either CSF total, or CSF A$\beta_{1-42} $ or ADAS or MMSE scores to allow a multimodal approach according to \textbf{\textit{$H_{2}$}}.
	d) MCI-NC individuals must have been in a minimum of \textbf{$n$ years of a follow-up}.

	The value $n$ will be chosen by sorting in descending order
	follow-up times in MCI-NC and eliminating those subjects whose time value is below the mean time
	of conversion for MCI-C + 1.5 standard deviations". Means and SDs come from the initial 332 MCI patient sample.
	MCI-C [time until conversion]---> ($x$ = 2.3 years; $s$ = 1.85) --> value n calculation = 2.3 + 1.85*1 --> minimum of 4.15 years of followup
	MCI-NC[time until loss of followup]---> ($x$ = 4.69 years; $s$ = 2.357)
	"""

	df = pd.read_csv("out_df_final_pleedeADNIMERGE.csv") # --> té 93 subjectes

	# criterion a) son 93
	registres = df.shape[0]
	swap = registres
	print("criteri inclusio a: inclosos "+str(registres)+" (d entre els 332 de la seccio de metodes).")

	# EXCLUSION CRITERION Out of the 93 initial subjects only four present a lowering in severity after an AD diagnosis. Among those the exclusion criteria (see two definitions of "inconsistency" in the text) were applied.
	#df = df[df.PTID != '002_S_4746'] # WE DO NOT EXCLUDE HIM --> 002_S_4746 (4 consecutius de MCI (amb 2 missings entre mig). Dpesres 1 a AD i inmediatament despres de nou a MCI i depsres perdua followup.
	df = df[df.PTID != '031_S_4005'] # WE MUST EXCLUDE THEM FOR TWO MCI DIAGNOSIS AFTER THE AD ONE! --> 031_S_4005 xx index 1 (12 mesos per conv)-- DOLENT -------------- (First 2 follow-ups AD. Then Goes back to MCI (2 consecutive followups corroborate the MCI and Then follow-up stops.
	df = df[df.PTID != '019_S_4293'] # EXCLUDE THEM FOR OSCILATING PATTERN --> 019_S_4293 (24 mesos per conv) t (mci --> ad --> mci --> ad)
	df = df[df.PTID != '031_S_4947'] # EXCLUDE THEM FOR OSCILATING PATTERN -->031_S_4947 x -- POTSER DOLENT ----(MCI --> x --> AD --> MCI --> AD --> loss followup) --> varia, oscila ??

	registres = df.shape[0]
	print("criteri exclusio (inconsistencia):\n ara queden "+str(registres)+" (dels originals "+str(swap)+" de metodes): \no sigui, "+str(swap - registres)+"exclosos.")

	# INCLUSION CRITERION
	#time.sleep(5)

	#criterion b) Check whether at least 80 % of subjects have +- 2 months difference from bl.dx vs (examdate), EXCEPT those ADNI 1 (because dx.bl was long time ago).
	df_no_ADNI1 = df[df["fase_estudi"] != "ADNI_1"]

	MCIc_seguits_minim_2mesos = df_no_ADNI1[df_no_ADNI1["dif_examdate_vs_dxbl"].abs() <= 60][df["s_CONVER"] == 1].shape[0]
	MCInc_seguits_minim_2mesos = df_no_ADNI1[df_no_ADNI1["dif_examdate_vs_dxbl"].abs() <= 60][df["s_CONVER"] == 0].shape[0]
	MCIc, MCInc = df_no_ADNI1[df_no_ADNI1['s_CONVER'] == 1].shape[0], df_no_ADNI1[df_no_ADNI1['s_CONVER'] == 0].shape[0]

	print("CRITERI EXCLUSIO B requereix 80% min") #NO BORREM A NINGU
	print(" percentatge MCIc amb <60 dies bl vs examdate: ", MCIc_seguits_minim_2mesos/MCIc) #90%
	print(" percentatge MCIncc amb <60 dies bl vs examdate: ", MCInc_seguits_minim_2mesos/MCInc) #91 %
	#time.sleep(2)


	# CROTERION c) defined below (we cut)!

	# STEP 1) SPLIT MCI_C and MCI_NC.
	df_MCIc, df_MCInc = df[df['s_CONVER'] == 1], df[df['s_CONVER'] == 0] #26 MCI_C i len 67 MCI_NC respectivament

	# STEP 2) Delete MCI_NC with follow up times below the one defined in methods section (2.3 + 1.85 * 1 = 4.15 years of minimum follow up)
	# This is done to delete those MCI_NC who are prone to be false negatives (see -threshold n- listed and below)
	n = 2.3 + 1.85 * desv_estandar_de_tall #YEARS THRESHOLD CUT-OFF
	df_MCInc, df_MCInc_NOINCLOSOS_PERCRITERI_C = df_MCInc[df_MCInc["DIES-SEGUIM"] > n365.25], df_MCInc[df_MCInc["DIES-SEGUIM"] <= n365.25]

	#df_MCIc = df_MCIc[df_MCIc["DIES-SEGUIM"] >= 365] Amn aixo pots tallar els MCInc per damunt o per sota d'un temps de conversio. Empitjora la sensibilitat.
	# REPORTING INFORMATION FOR INCLUSION CRITERION D.
	print('{:.0f} selected MCInc with follow-up time above cutoff in *years\n{:.0f} selected MCIc'.format(len(df_MCInc),len(df_MCIc)))

	print("* "+str(n)+" punt de tall -en anys- de minim follow-up assegurat per a als MCI_nc")
	print('El cutoff de n fa excloure un total de '+str(len(df_MCInc_NOINCLOSOS_PERCRITERI_C['PTID'].tolist()))+" subjectes (concretament... DESMARCACOMENTARI I MIRA "+str(df_MCInc_NOINCLOSOS_PERCRITERI_C['PTID'].tolist()))

	time.sleep(1)
	# STEP 3) WE CREATE a df with two keys AND ALSO A NEW CSV for analysis with other programs (such as SPSS)
	df = pd.concat([df_MCIc, df_MCInc])

	# bonus: WE ADD EXACT SUBJECT_AGES AT THE MOMENT OF ACQUIRING THE FMRI SCANS! (THEY COME FROM XML FILES )
	df_xmls = pd.read_csv("in__resum_parametres_fmri_EDATS_EXACTES_EN_MOMENTS_D'ESCANER.csv")
	df_uid_edat = df_xmls[['UIDs','AGE_ambdecimals']]
	df = pd.merge(df,df_uid_edat, how = "left", on="UIDs",sort=False)
	#bonus 2: WE WANT TO KNOW THE TIME THAT GOES FROM SCANNING TIME AND MOMENT OF END OF FOLLOW UP(MCI-NC)/CONVERSION (MCI-C).
	# EXAM-DATE VARIABLE GIVES US THE TIME THAT GOES FROM BL.DX UNTIL END OF FOLLOUP / CONVERSION, and it is pretty informative
	# for all participants, except those who do not belong to the ADNI 1 originally. For those DIES-SEGUIM are way too big.
	# Then we have dif_examdate_vs_dxbl, which gives us the diference examdate (scan time) vs baseline-diagnostic.
	df['FI-FOLLOWUP-O-CONV_vs_data-escaneig'] = pd.Series((df["DIES-SEGUIM"] - df["dif_examdate_vs_dxbl"])/365.25)


	# Thus, to know the time SINCE rsfMRI scan until conversion/loss of followup for all subjects we simply need to compute the new variable
	# doing DIES-SEGUIM - dif_examdate_vs_dxbl will. NEW VARIABLE will be called ENDFOLLOW-or-CONV_vs_data-escaneig

	#EVALUEM QUINS CENTRES HI HA
	print("CENTRE, FREQUENCIA PER CENTRE")
	print(df["SITE"].value_counts())

	#EVALUEM QUINS CENTRES HI HA (GRUP MCI-C)
	print("CENTRE, FREQUENCIA PER CENTRE (MCI-c)")
	print(df["SITE"][df["s_CONVER"]==1].value_counts())

	#EVALUEM QUINS CENTRES HI HA (GRUP MCI-nc)
	print("CENTRE, FREQUENCIA PER CENTRE (MCI-nc)")
	print(df["SITE"][df["s_CONVER"]==0].value_counts())


	time.sleep(1)




	#guardem el df a csv
	df["ABETA"] = df["ABETA"].replace(to_replace = ">1700", value = '1701').astype(float) #treiem primer el caracter especial superior o igual que dels valors ">1700" i ho substituim per un valor factible: 1701] cal passar a float perquè despres de la conversio tot esta en string i no se pq
	df["ABETA/PTAU"] = df["ABETA"]/df["PTAU"] # [creem la nova variable ABETA/PTAU, que suposadament es un index que podria contribuir al diagnostic d'acord amb la lite previa (compte que no sabem si es abeta 1 42, crec que es abeta general)]
	df.to_csv("out_df_final_pleedeADNIMERGE_postcriterisInclusio.csv",sep = ",", index = False)

	#fem estadística descriptiva per variables importants dins ll_variables
	v_quantis=[('escan_vs_fi','FI-FOLLOWUP-O-CONV_vs_data-escaneig'),("AGE","AGE_ambdecimals"),("MMSE","MMSE"),("TAU","TAU"),("PTAU","PTAU"),("ABETA","ABETA"),("AB/PTAU","ABETA/PTAU"),("FDG","FDG"),("ADAS11","ADAS11"),("ADAS13","ADAS13"),("ADASQ4","ADASQ4")]#[(label_variable, nom_variable_dins_Dataframe)"}]
	v_categ = [("EMCI/LMCI","BL.dx"),("MALE/FEMALE","PTGENDER")]

	# criterion d)
	# CHECKED MANUALLY. no subjects excluded. dins els MCIc 1/23 (96%) no tenien tau, ptau o beta a. I dins els MCInc eren 3/51 (94%)

	#__________INFORMEM DELS NaNs__________________________
	print("\n##############################")
	print("_____NaNs per variable_______")
	for variable in v_quantis:
	label_variable, nom_variable_dins_Dataframe = variable
	nre_NaNs_a_variable = df[variable[1]].isna().values.tolist().count(True)
	print(variable[0],": ",nre_NaNs_a_variable," sobre "+str(len(df))+"({:.2f} %)".format(100*nre_NaNs_a_variable/len(df))+")") #Conto els NaNs
	time.sleep(1)

	#______Evaluem normalitat i homocedasticitat de les proves per a cada variable quantitativa en cada subgrup (MCI-C i MCI-NC) (condicions aplicacio t-test!)_________
	print("\n SHAPIRO-WILK (NORMALITAT) \|\|\| levene ")
	##https://es.wikipedia.org/wiki/Test_de_Shapiro%E2%80%93Wilk \|
	print(" MCIc MCInc \|\|\| MCIc vs MCInc ")
	t_test = [] #guardo les variables que requereixen ttest. les altres fem mann whitney.
	for variable in v_quantis:
	label_variable, nom_variable_dins_Dataframe = variable
	variable_MCIc, variable_MCInc = df[nom_variable_dins_Dataframe][df['s_CONVER']==1].dropna(), df[nom_variable_dins_Dataframe][df['s_CONVER']==0].dropna()
	shapiro_MCIc, p_shapiro_MCIc = stats.shapiro(variable_MCIc); shapiro_MCInc, p_shapiro_MCInc = stats.shapiro(variable_MCInc)
	levene_statistic, p_levene = stats.levene(variable_MCIc, variable_MCInc)
	if p_shapiro_MCIc < 0.05 or p_shapiro_MCInc < 0.05: #Si la variable NO ES NORMAL segons shapiro wilk
	if p_levene < 0.05: #si no hi ha homogeneitat de variances
	print('{:7} {:.2f} (p = {:.5f}) \| {:.2f} (p = {:.5f})* \|\|\| {:.2f} (p = {:.3f})+[mw]'.format(label_variable,shapiro_MCIc, p_shapiro_MCIc, shapiro_MCInc, p_shapiro_MCInc, levene_statistic, p_levene))
	t_test += [False]
	else:#si hi ha homogeneitat de variances
	print('{:7} {:.2f} (p = {:.5f}) \| {:.2f} (p = {:.5f})* \|\|\| {:.2f} (p = {:.3f})[mw]'.format(label_variable,shapiro_MCIc, p_shapiro_MCIc, shapiro_MCInc, p_shapiro_MCInc, levene_statistic, p_levene))
	t_test += [False]
	else:#si la variable es normal segons shapiro wilk
	if p_levene < 0.05:#si no hi ha homogeneitat de variances
	print('{:7} {:.2f} (p = {:.5f}) \| {:.2f} (p = {:.5f}) \|\|\| {:.2f} (p = {:.3f})+[mw]'.format(label_variable,shapiro_MCIc, p_shapiro_MCIc, shapiro_MCInc, p_shapiro_MCInc, levene_statistic, p_levene))
	t_test += [False]
	else: #si hi ha homogeneitat de variances
	print('{:7} {:.2f} (p = {:.5f}) \| {:.2f} (p = {:.5f}) \|\|\| {:.2f} (p = {:.3f}) [t]'.format(label_variable,shapiro_MCIc, p_shapiro_MCIc, shapiro_MCInc, p_shapiro_MCInc, levene_statistic, p_levene))
	t_test += [True]

	print("_____________________________________________")
	print("Ho (shapiro wilk): Poblacio del subgrup donat, per a la variabledonada, \nesta normalment distribuida. L'Asterisc surt quan refutem Ho en qualsevol dels dos grups, o sigui, quan qualsevol dels dos grups no mostra normalitat.")
	print("Ho (levene_test): els dos subgrups tenen homogeneitat de variances per a la variable donada. La creu + surt quan refutem Ho, o sigui quan no hi ha homogeneitat de variances.")
	print("[t] apareix quan la variable es susceptible de comparar-se entre grups amb t-test (exigeix homocedasticitat entre subgrups i distrib normal dins cada subgrup)")
	print("[?] Si no hi ha homogeneitat de variances pero la distribucio es normal... aleshores no se si aplicar mann whitney o t-test...")

	#__________CREEM LA TAULA 1 DE DESCRIPTIUS_____________
	print("\n ********************* DESCRIPTIUS *********************")
	print("\n#################################################################")
	print("__________MCI-C___________________MCI-NC__________valor (p-value)")
	# PRIMER VAN LES QUANTITATIVES
	for i in range(len(v_quantis)):
	variable = v_quantis[i]
	label_variable, nom_variable_dins_Dataframe = variable
	variable_MCIc, variable_MCInc = df[nom_variable_dins_Dataframe][df['s_CONVER']==1].dropna(), df[nom_variable_dins_Dataframe][df['s_CONVER']==0].dropna()
	valor_contrast_hipotesis, p_valor, boolea_t_test = funcions_auxiliars.bivariant_comparacio_mitjanes(variable_MCIc, variable_MCInc, t_test[i])
	if boolea_t_test:
	print("{:10}".format(label_variable), "{:.1f} ({:.1f}) {:.1f} ({:.1f}) {:.2f} (p = {:.5f})[t] ".format(variable_MCIc.mean(), variable_MCIc.std(), variable_MCInc.mean(), variable_MCInc.std(),valor_contrast_hipotesis, p_valor))
	else:
	print("{:10}".format(label_variable), "{:.1f} ({:.1f}) {:.1f} ({:.1f}) {:.2f} (p = {:.5f})[mw] ".format(variable_MCIc.mean(), variable_MCIc.std(), variable_MCInc.mean(), variable_MCInc.std(),valor_contrast_hipotesis, p_valor))

	#DESPRES LES CATEGORIQUES
	for variable in v_categ:
	label_variable, nom_variable_dins_Dataframe = variable
	variable_MCIc, variable_MCInc = df[nom_variable_dins_Dataframe][df['s_CONVER']==1].dropna(), df[nom_variable_dins_Dataframe][df['s_CONVER']==0].dropna()
	if label_variable == "EMCI/LMCI":
	EMCIs_MCIc, LMCIs_MCIc, EMCIs_MCInc, LMCIs_MCInc = variable_MCIc[variable_MCIc == 3].count(), variable_MCIc[variable_MCIc == 4].count(), variable_MCInc[variable_MCInc == 3].count(), variable_MCInc[variable_MCInc == 4].count()
	valor_contrast_hipotesis, p_valor = 99.99, 0.999
	print("{:10}".format(label_variable), "{:10} {:10} {:.2f} (p = {:.5f})[?] ".format(str(EMCIs_MCIc)+"/"+str(LMCIs_MCIc), str(EMCIs_MCInc)+"/"+str(LMCIs_MCInc), valor_contrast_hipotesis, p_valor))
	if label_variable == "MALE/FEMALE":
	male_MCIc, female_MCIc, male_MCInc, female_MCInc = variable_MCIc[variable_MCIc == "Male"].count(), variable_MCIc[variable_MCIc == "Female"].count(), variable_MCInc[variable_MCInc == "Male"].count(), variable_MCInc[variable_MCInc == "Female"].count()
	valor_contrast_hipotesis, p_valor = 99.99, 0.999
	print("{:10}".format(label_variable), "{:10} {:10} {:.2f} (p = {:.5f})[?] ".format(str(male_MCIc)+"/"+str(female_MCIc), str(male_MCInc)+"/"+str(female_MCInc), valor_contrast_hipotesis, p_valor))

	print("________________________________________________________________")
	print("[t] son t-test per a mostres independents. \n [mw] son Mann whitney")
	print("**COMPTE! Per a aquestes proves de comparacio de mitjanes els NaNs s'han eliminat. Diferent al que farem per al multimodal aproach, que sera mean o median imputation per a cada subgrup.")
	print("? te les odds... pero podria posar proporcions crec")
	print("________________________________________________________________")
	print("################################################################")
	time.sleep(1)
	return df, df_MCIc, df_MCInc

	def mat_corr_labels_i_pipeline(ploteja_totes_les_matrius_de_conectivitat, ploteja_una_matriu_de_conectivitat_mitjana_per_cada_grup, ploteja_funcions_densitat_probabilitat_FCON_per_centre):
	"""
	arr_TS, ll_sub_txt, ll_uids i df tenen dimensions correlatives.
	"""
	t1 = time.clock()
	arr_TS, ll_sub_txt, ll_uids, df = redueix_i_organitza() #arr_TS --> (57,214,140) (SUB, ROI, TS)
	arr_TS = np.transpose(arr_TS,(0,2,1)) # --> arr_TS ara te shape (57,140,214) (SUB, TS, ROI)


	#CREO VARIABLE arr_corr, que contindrà un stack de matrius de correlacions (tantes com subjectes)
	arr_corr = []
	for i in range(len(arr_TS)):
	ll_arr_corr = pd.DataFrame(arr_TS[i]).corr().values.tolist() #ll_ARR_CORR --> (214,214)) --> faig pas DF --> NDARRAY --> LIST per poder fer l'stack bé (no se com fer-ho en numpy)
	arr_corr += [ll_arr_corr]
	arr_corr = np.array(arr_corr) # shape (57,214,214) (Shape pot canviar en funcio del parametre de desviacio tipica, recorda)
	np.save("out_FC ("+str(len(arr_corr))+" subjectes)", arr_TS)
	print("arr_corr shape", arr_corr.shape)

	#CREO VARIABLE X per a a fer ML. X conté la F.Connectivity vectoritzada, o sigui conté
	#CADA MATRIU DE l'stac arr corr queda vectoritzada i guardada com a matriu 2D
	X = []
	for i in range(len(arr_corr)):
	ll_vector = funcions_auxiliars.elimina_tri_superior_i_flateneja_ROIxROI(arr_corr[i])
	X += [ll_vector]

	def elimina_tri_superior_i_flateneja_ROIxROI(m_correl):
	"""
	PRE: This method takes a numpy array that contains a matrix of correlationsnxn called m_correl (ROIxROI)
	POST: Returns ll_ROIxROI_flattened (i.e upper triangle of m_correl and the main diagonal gets eliminated, and the lower triangle vectorized and
	returned as a a list type object as a ROW of (1/2) * n (n-1) correlations (elements). Thus, If m_correl input size us 214x214 -shen atlas- you will
	get 22,791 valies within ll_ROIxROI_flattened

	NOTE: This Does the same as numpy.tril function
	"""
	#m_correl es dataframe de pandas. com que vull operar com a llista... \|\| FAIG A LA SEGUENT LINIA LE PAS dataframe ----> array numpy ----> llista
	ll_correl = m_correl.tolist() #ndarray --> llista
	ll_ROIxROI_flattened = []
	for i in range(1,len(ll_correl)):
	bufer = ll_correl[i][:i]
	ll_ROIxROI_flattened = ll_ROIxROI_flattened + bufer
	return ll_ROIxROI_flattened

	#CREO VARIABLE Y, que contindrà les labels (1 --> MCI-C \| 0 --> MCI-NC)
	Y = df["s_CONVER"].values; print("Y shape: ", Y.shape) # Y te shape (57,)
	print("Y shape: ",Y.shape)
	time.sleep(2)


	#ANALISI INDEPENDENT 2) FEM 10-fold-crossvalidation + SVM --> RESULTATS MOLT POBRES AMB PCA, tant estandaritzant com sense fer-ho. No diagnostica cap malalt, specificiat 0 amb PCA. probats varis components
	apc.PCA_10foldCV_ROC(X=X,
	y=Y, # NOTA_ PCA_by_santi i rfecv son mutuament excloents
	PCA_by_santi=False, # primer boolea True si vols fer PCA (false si no el fas)#Si vols estandaritzar columnes en fer el PCA (Si no el fas es igual si es True o false)has d'anar a dins el fitxer i buscar "estandaritzar_variables"
	rfecv=False, # no funciona... si poses True hauria de fer Recursive feature elimination. Si poses False no te la fa
	fes_violin_plot_accuracies_per_folds_i_classificador=True,
	carpeta='BIOMARCADORS_models_finals') # SI TOQUES COSES DE fMRI treu-ho a la carpeta 'fMRI_models_finals'

	#NOTA: LA CARPETA ON GUARDARAS TOTES LES ACCURACY METRICS. Ha d'existir primer o dona error.

	def PCA_10foldCV_ROC(X,y,PCA_by_santi,rfecv,fes_violin_plot_accuracies_per_folds_i_classificador,carpeta):
	from aux__reduccio_dimensions import Analisi_components_principals as dades_acp
	"""
	X = Dades d'entrada
	y = labels
	PCA = BOOLEA sobre si vols PCA o no abans de introduir aquestes dades al model
	RFECV = BOOLEA sobre si vols aplicar Recursive Feature Elimination per a cada classificador o no
	"""

	#n_samples, n_features = X.shape
	n_samples = X.shape[0]
	################################## Classification and ROC analysis#########################


	#Posa els noms dels classificadors i els noms dels objectes. Aixi pots iterar en cada un d'ells i fer una 10-fold CV per a cada un.
	noms_classificadors = ["Logistic Regression",
	"Gaussian Naive Bayes",
	"Support Vector Machines (linear kernel)",
	"Nearest Neighbours",
	"Multilayer perceptron (ANN)"]
	objectes_classificadors = [LogisticRegression(C=10000, penalty='l2', multi_class= 'multinomial', solver='lbfgs', class_weight = "balanced"),
	GaussianNB(),
	svm.SVC(kernel='linear', probability=True, random_state=1, class_weight = "balanced"),
	KNeighborsClassifier(3),
	MLPClassifier(alpha=1, hidden_layer_sizes=(100,), activation='relu', solver='sgd')] #PROVA ADAM

	#aqui guardo les accuracies per classificador, i en faig un violin plot després.
	#NOTA QUE AMB AIXO PUC CREAR FACILMENT UN PANDAS DATAFRAME I FER EL VIOLINPLOT DESPRÉS
	dic_acc_per_fold_i_classificador = {} #{"Logistic Regression":[FLOAT_accuracy_fold_1, FLOAT_accuracy_fold_2, [...], FLOAT_accuracy fold 10], "Gaussian Naive Bayes":[], ..., "Multilayer perceptron (ANN)":[]}

	#CREEM ELS FITXERS ON GUARDAR LES METRIQUES PER CADA MODEL (TANT LES MESURES DE DIAGNOSTIC ACCURACY COM LES BRIER SCORES)
	f1 = open("./"+carpeta+"/LATEX_acc_s_e_ppv_npv_lrmes_lrmenys_PERCADAMODEL_.txt","w"); f1.write("\n\\toprule\nclassifier&accuracy&sensibility&specificity&ppv&npv&LR+&LR-\n\midrule\\\\\n") #fes la capsalera per a latex
	f2 = open("./"+carpeta+"/LATEX_brierScores_PERCADAMODEL_.txt","w"); f2.write("\\toprule\nBrier Scores & H.L. test\\\n\midrule\n") #fes la capsalera per a latex
	f4 = open("./"+carpeta+"/intervalsConfiansaIconfusionsMatrix.txt","w")
	f_readme = open("./"+carpeta+"/README_PREDICT_PROBAS.txt","w")
	f_readme.write("L'arxiu PREDICT_PROBA_NOMMODEL conte columnes separades per espais blanc segons:\n\n'PREDICCIO' --> (0 mci-nc, 1 mci-c)\n'LABEL'--> (idem a valor predit pero el valor que prove de ADNIMERGE, el valor real)\n'P_mci_nc'--> prob de pertanyer a grup 0 o mci-nc.\n'P_mci_c' --> probabilitat de pertanyer a grup MCI-c (1-p) o grup codificat amb 1.\n'Fold' --> el fold la crossvalidation don prove el subjecte.\n\nNoteu que aquest arxiu ajunta tots els folds de la cross validation i tindra tants subjectes\ncom subjectes hi hagi a la mostra. Cal tenir en compte que l'ordre es preserva perque CV.split() -on CV es objecte Stratified k-fold-\nens permet aleatoritzar sempre de la mateixa\nmanera, perque random_state=None per defecte.\nAixo permet que aquestes dades es puguin prendre per fer un multimodal approach de late integration.")
	f_readme.close()
	matriu_violinplot_sensibEsp = []
	############################################################################################
	#ITEREM EN CADA CLASSIFICADOR. FEM 10 FOLD CROSS VALIDATION. FEM LES ROC CURVES PER CADA CLASSIFICADOR.
	ll_sens_espe_classificador = [] #per fer violinplot sens i espec
	for i in range(len(objectes_classificadors)):
	f3 = open("./fMRI_models_finals/PREDICT_PROBA_"+noms_classificadors[i].replace(" ","_")+".csv","w") #GUARDO PREDIT, GS,
	f3.write("PREDICCIO,LABEL,P_mci_nc,P_mci_c\n")
	clf, nom_clf = objectes_classificadors[i], noms_classificadors[i] #carrego l'objecte classificador i l'string que descriu cada classificador
	CV = StratifiedKFold(n_splits=10)

	#IMPLEMENT RECURSIVE FEATURE ELIMINATION [NO FUNCIONA]
	if rfecv:
	clf = RFECV(estimator=clf, step=1, cv=CV, scoring='accuracy')
	tprs, aucs = [], []; mean_fpr = np.linspace(0, 1, 100) #COSES DE LA PLANTILLA


	#creem llistes buides per acumular els true negatives, flase positives, false negatives i true positives per a despres fer una confusion matrix final.
	suma_tn, suma_fp, suma_fn, suma_tp = 0,0,0,0
	accuracies_mitjanes_mal_procediment = 0
	ll_acc_classificador = [] #creo llista buida


	j = 1 #per contar els folds
	#ITEREM EN ELS DIVERSOS FOLDS CREATS PER SKLEARN, DINS CADA CLASSIFICADOR
	y_certes, y_predites = [], [] #per a crear les brier scores per a cada classificador
	for train, test in CV.split(X, y): #CSEPARA ELS INDEX
	#OJO AMB ELS EFECTES ALIAS! PER AIXO LA COPIO...
	X_entrenament = np.copy(X[train])
	X_test = np.copy(X[test])
	if PCA_by_santi:
	estandaritza_variables = False #POSA TRUE SI VOLS ESTANDARITZAR LES VARIABLE SPER COLUMNES. POSA FALSE EN CAS CONTRARI
	dades = dades_acp(np.array(X_entrenament,dtype = np.float64), nombre_components = None, estandaritza_variables = estandaritza_variables, llista_variables=[])
	#dades.matriu_correlacions(grafic_correlacions = False) # ACP es basa en la matriu de corr. davant absencia de proves de esfericitat (bartlett,KMO) es importantissim mirar que la matriu de correlacions no s'apropa a la matriu identitat. Volem variables correlacionades o no te sentit aplicar PCA
	#dades.autovalors_i_variansssa(scree_plot=True) # veure autovalors, variança explicada i screeplot. Amb això (especialment amb l'screeplot), decideix quants components retenir combinant variança explicada i parsimònia.
	#dades.carregues_factorials() # veure les correlacions entre cada factor i cada variable
	dades.carregues_factorials()#vaig fer un castell de naipes i cal execitar aquesta funcio si o si, o la eguent no va.
	X_entrenament, ll_mitjana_train, ll_desvest_train, carregues_factorials_train = dades.puntuacions_factorials(imprimeix_dataset_reduit = False) # METODE QUE RETONRA EL DATAET REDUIT. ndarray reduit! em prenc les mitjanes i desviacions estandar estimades per a fer la estandaritzacio des de les dades d'entrenament
	if estandaritza_variables:
	X_test = ((X_test - ll_mitjana_train) / ll_desvest_train).dot(carregues_factorials_train) #apliquem column wise les operacions
	else:
	print(X_test.shape)
	print(carregues_factorials_train.shape)
	X_test = X_test.dot(carregues_factorials_train) # X_test * carregues_factorials_train --> https://stackoverflow.com/questions/24560298/python-numpy-valueerror-operands-could-not-be-broadcast-together-with-shapes
	print(X_test.shape)
	print("###########################")
	print("X_entrenament: ",X_entrenament.shape) #(57,ncomponents)
	print(len(ll_mitjana_train))#(22791,)
	print(len(ll_desvest_train)) #(22791,) CONTE DESVIACIONS TIPIQUES DUTES DE TRAIN
	print(carregues_factorials_train.shape) #(22791,components)
	print("X_test_post pca: ",X_test) #9,ncomp
	print("###########################")
	print("Ara entrenant: ",nom_clf)
	##### CREO ESTADISTIQUES PER A CADA FOLD (DIAGNOSTIC ACCURACY) I LES VAIG GUARDANT PER CREAR LA CONFUSION MATRIX FINAL #########
	print("###### FOLD"+str(j)+" ######")
	print("#############################")
	####################################################################
	clf.fit(X_entrenament, y[train]) #pas 1 --> entreno les dades, per a cada fold
	probas_ = clf.predict_proba(X_test) #pas 2 --> testejo el model a les dades d'entreno (TREU MATRIU PROBABILITATS [p, 1-p] per a cada fold)
	y_predits_test = np.argmax(probas_, axis=1)
	###################################################################
	#DEFINEIXO LA CONFUSION MATRIX PER A CADA FOLD I IMPRIMEIXO ELS INDEXOS
	tn, fp, fn, tp = confusion_matrix(y[test], y_predits_test).ravel()
	accuracies_mitjanes_mal_procediment += (tp+tn)/(tp+tn+fp+fn)
	acc, s, e = (tp+tn)/(tp+tn+fp+fn), tp/(tp+fn), tn/(fp+tn)
	imprimeix_diagnostic_accuracy_a_cada_fold = True


	for k in range(len(probas_)):
	p, u_menys_p = probas_[k]
	predit, gold_standard = y_predits_test[k], y[test][k]
	f3.write(str(predit)+","+str(gold_standard)+","+str(p)+","+str(u_menys_p)+","+str(j)+'\n')#PREDICCIO,LABEL,P_mci_nc,P_mci_c,fold
	y_certes += [gold_standard]#per calcular brier scores
	y_predites += [predit]#per calcular brier scores
	if imprimeix_diagnostic_accuracy_a_cada_fold:
	print("probabilitats \|\| y_predits_Test \|\| y_labels_test")
	print(probas_,np.argmax(probas_,axis=1),y[test]) # FET PER MI_ IMPRIMEIXO PROBABILITATS, PROBABILITATS TRANSFORMADES A VALOR MAX (ARGMAX) I LABELS)
	print("\n#######################")
	print("Accuracy: {:.5f}".format(acc))
	print("Sensibilitat: {:.5f}".format(s))
	print("Especificitat: {:.5f}".format(e))
	print("Valor predictiu + (PPV): {:.5f}".format(tp/(tp+fp))) #tambe anomenat precision score. hi ha una funcio a sklearn que ho fa
	print("Valor predictiu - (NPV): {:.5f}".format(tn/(fn+tn)))
	print("LR+: {:.5f}".format(s / (1 - e)))
	print("LR-: {:.5f}".format((1 - s)/e))
	print("#######################")
	print("tn,fp,fn,tp: ",tn,fp,fn,tp)
	print("#######################\n")
	if fes_violin_plot_accuracies_per_folds_i_classificador:
	ll_acc_classificador += [acc]
	ll_sens_espe_classificador += [[noms_classificadors[i],"sensitivity",s]] #faig llista [[clsf[i], s1, e1],[[clsf[i], s2, e2],...,[clsf[i], sk, ek]] on k es el nombre de folds. Ha de tenir dimensio (50,3) --> 10 folds * 5 CLASSIFICADORS * 2 proporcions per fold (una sens i una esp.) = 100 files \| 3 variables (COMPROVAT)
	ll_sens_espe_classificador += [[noms_classificadors[i],"specificity",e]]




	#VAIG ACUMULANT ELS VALORS PER A CADA FOLD PER TAL DE TENIR UNA CONFUSION MATRIX FINAL, DESPRÉS DE LA VALIDACIÓ CREUADA
	suma_tn += tn
	suma_fp += fp
	suma_fn += fn
	suma_tp += tp


	# Compute ROC curve and area the curve
	fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1]) #PER QUE DIMONIS TREU DIVERSOS INDEXOS DE FPR i TPR PER A CADA FOLD? SI NOMÉS N'HI HA D'HAVER UN...
	tprs.append(interp(mean_fpr, fpr, tpr))
	tprs[-1][0] = 0.0
	roc_auc = auc(fpr, tpr)
	aucs.append(roc_auc)
	plt.plot(fpr, tpr, lw=1, alpha=0.3,
	label='ROC fold %d (AUC = %0.2f)' % (j, roc_auc))

	j = j + 1

	f3.close()





	if fes_violin_plot_accuracies_per_folds_i_classificador:
	dic_acc_per_fold_i_classificador[noms_classificadors[i]] = [] #CREO LES CLAUS DEL DICCIONARI, CAL INICIALITZAR-LES PRIMER RECORDA, SINO DONA ERROR
	dic_acc_per_fold_i_classificador[noms_classificadors[i]] = ll_acc_classificador #afegeixo al diccionari una llista amb les accuracies per fold

	#print("######################################");print("######################################",file=f4)
	#print("Accuracy mitjana (mitjana 10 accuracies): {:.4f}".format(accuracies_mitjanes_mal_procediment/10));print("Accuracy mitjana (mitjana 10 accuracies): {:.4f}".format(accuracies_mitjanes_mal_procediment/10), file=f4)
	print("\n\n##################################################");print("\n\n\n\n##################################################",file=f4)
	print("RESULTAT DESPRES DE SUMAR ELS FP,TN,TP,FN DELS 10 FOLDs\n ("+nom_clf+")");print("RESULTAT DESPRES DE SUMAR ELS FP,TN,TP,FN DELS 10 FOLDs\n ("+nom_clf+")",file=f4)
	print("####################################################"); print("####################################################", file=f4)
	brier_score = brier_score_loss(y_certes, y_predites)
	accuracy, s, e, ppv, npv = (suma_tp+suma_tn)/(suma_tp+suma_tn+suma_fp+suma_fn), suma_tp/(suma_tp+suma_fn), suma_tn/(suma_fp+suma_tn), suma_tp/(suma_tp+suma_fp), suma_tn/(suma_fn+suma_tn)
	LRmes, LRmenys = s/(1 - e), (1 - s)/e
	#faig intervals de confiansssa!
	print("\nAccuracy: {:.5f} (IC95 % de {:.2f} A {:.2f})".format(accuracy, IC95(accuracy,n_samples)[0],IC95(accuracy,n_samples)[1]));print("\nAccuracy: {:.5f} (IC95 % de {:.2f} A {:.2f})".format(accuracy, IC95(accuracy,n_samples)[0],IC95(accuracy,n_samples)[1]),file=f4)
	print("Sensibilitat: {:.5f} (IC95 % de {:.2f} A {:.2f})".format(s,IC95(s,n_samples)[0],IC95(s,n_samples)[1]));print("Sensibilitat: {:.5f} (IC95 % de {:.2f} A {:.2f})".format(s,IC95(s,n_samples)[0],IC95(s,n_samples)[1]), file=f4)
	print("Especificitat: {:.5f} (IC95 % de {:.2f} A {:.2f})".format(e,IC95(e,n_samples)[0],IC95(e,n_samples)[1]));print("Especificitat: {:.5f} (IC95 % de {:.2f} A {:.2f})".format(e,IC95(e,n_samples)[0],IC95(e,n_samples)[1]),file=f4)
	print("Valor pred + (PPV): {:.5f} (IC95 % de {:.2f} A {:.2f})".format(ppv,IC95(ppv,n_samples)[0],IC95(ppv,n_samples)[1]));print("Valor pred + (PPV): {:.5f} (IC95 % de {:.2f} A {:.2f})".format(ppv,IC95(ppv,n_samples)[0],IC95(ppv,n_samples)[1]), file=f4) #tambe anomenat precision score. hi ha una funcio a sklearn que ho fa#tambe anomenat precision score. hi ha una funcio a sklearn que ho fa
	print("Valor pred - (NPV): {:.5f} (IC95 % de {:.2f} A {:.2f})".format(npv,IC95(npv,n_samples)[0],IC95(npv,n_samples)[1])); print("Valor pred - (NPV): {:.5f} (IC95 % de {:.2f} A {:.2f})".format(npv,IC95(npv,n_samples)[0],IC95(npv,n_samples)[1]),file=f4)
	print("LR+: {:.5f}".format(LRmes));print("LR+: {:.5f}".format(LRmes),file=f4)
	print("LR-: {:.5f}".format(LRmenys));print("LR-: {:.5f}".format(LRmenys),file=f4)
	print("Brier score final: {:.3f} ".format(brier_score));print("Brier score final: {:.3f} ".format(brier_score),file=f4) #calibracio & performance del model
	print("###############C O N F U S I O N M A T R I X##############");print("###############C O N F U S I O N M A T R I X##############",file=f4)
	print(" alzh no alzh");print(" alzh no alzh",file=f4)
	print("sumatoris (tp,fp) \| +test {:2} {:2}".format(suma_tp, suma_fp)); print("sumatoris (tp,fp) \| +test {:2} {:2}".format(suma_tp, suma_fp),file=f4)
	print(" (fn,tn) \| -test {:2} {:2}".format(suma_fn,suma_tn));print(" (fn,tn) \| -test {:2} {:2}".format(suma_fn,suma_tn),file=f4)
	print("############################################################\n");print("############################################################\n", file=f4)

	#guardo les confusion matrix per fer un heatmap despres
	os.chdir("."+os.sep+carpeta)
	np.save("confusion_matrix__"+noms_classificadors[i], np.array([[suma_tp, suma_fp],[suma_fn, suma_tn]]))
	os.chdir("../")

	#GUARDO AL FITXER DE RESUM PER CLASSIFICADOR (QUE CREA EL TXT QUE ANIRA A LATEX A LA TAULA DE ACCURACIES)
	#accuracy&sensibility&specificity&ppv&npv&LR+&LR str(
	f1.write(noms_classificadors[i]+" & {:.2f} & {:.2f} & {:.2f} & {:.2f} & {:.2f} & {:.3f} & {:.3f}\\\\\n".format(accuracy100, s100, e100, ppv100, npv*100, LRmes, LRmenys))
	f2.write("{:.3f}".format(brier_score)+"& HOSMER SIMPSON\\\\\n") #https://www.data-essential.com/hosman-lemeshow-in-python/

	plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
	label='Luck', alpha=.8)
	mean_tpr = np.mean(tprs, axis=0)
	mean_tpr[-1] = 1.0
	mean_auc = auc(mean_fpr, mean_tpr)
	std_auc = np.std(aucs)
	plt.plot(mean_fpr, mean_tpr, color='b',
	label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
	lw=2, alpha=.8)
	std_tpr = np.std(tprs, axis=0)
	print("\n######################################");print("\n######################################",file=f4)
	#print("mean_tpr: ",mean_tpr,"std_tpr",std_tpr)
	print("mean_auc:",mean_auc, "std_auc",std_auc);print("mean_auc:",mean_auc, "std_auc",std_auc, file=f4)
	print("######################################\n");print("######################################",file=f4)
	tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
	tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
	plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
	label=r'$\pm$ 1 std. dev.')
	plt.xlim([-0.05, 1.05])
	plt.ylim([-0.05, 1.05])
	plt.xlabel('False Positive Rate (1-especificity)')
	plt.ylabel('True Positive Rate (sensitivity)')
	plt.title(nom_clf)
	plt.legend(loc="lower right")
	plt.show()

	f1.write("\\bottomrule")
	f2.write("\\bottomrule")
	f4.close()
	f1.close()
	f2.close()
	#UN COP HA ENTRENAT TOTS ELS MODELS CREO EL DATAFRAME D'ACCURACIES PER CLASSIFICADOR I EL PLOTEJO AMB SEABORN
	df_acc_per_classificador = pd.DataFrame(dic_acc_per_fold_i_classificador) #com a claus dels diccionaris hi ha els noms dels classificadors ja definits abans en la llista del principi
	sns.set(style="whitegrid")
	ax = sns.violinplot(data=df_acc_per_classificador, palette="muted", order=["Multilayer perceptron (ANN)","Logistic Regression","Support Vector Machines (linear kernel)", "Nearest Neighbours","Gaussian Naive Bayes"])
	plt.ylabel("Accuracy")
	plt.xlabel("Machine learning model (classifier)")
	#plt.ylim(0,1)
	plt.show()


	#FAIG ARA EL MATEIX AMB LES SENSIBILITATS I ESPECIFITATS PER CLASSIFICADOR
	plt.close()
	sns.set(style="whitegrid") #strin del model, si es Sensitivity o specificity, mesura (proporcio)
	df_sensEsp_per_classificador = pd.DataFrame(columns=["model","metric","p"], data=ll_sens_espe_classificador)

	ax = sns.violinplot(x="model", y="p", hue="metric",data=df_sensEsp_per_classificador, palette="muted",order=["Multilayer perceptron (ANN)","Logistic Regression","Support Vector Machines (linear kernel)", "Nearest Neighbours","Gaussian Naive Bayes"])
	plt.ylabel("sensibility/specificity")
	plt.xlabel("Machine learning model (classifier)")
	#plt.ylim(0,1)
	plt.show()

blackcub3s/MSc-FinalThesis

Folders and files

Latest commit

History

Repository files navigation

Description

Extended Summary

Abstract (for a more technical read)

Folders and files to comment

Footnotes

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages