The most important programming files, code functions and data processing pipelines for the Machine learning final thesis of my Master's degree. Also, the LaTeX code of the thesis.

blackcub3s/MSc-FinalThesis


Description

This repository contains my MSc final thesis, a quick explanation of its results, and some of the most important Python scripts I wrote while completing it.

Extended Summary

My final thesis, “Development of a single-Subject Predictive model of Alzheimer's disease using fMRI and machine learning techniques in individuals with Mild Cognitive Impairment”, was an endeavour that involved working with big data, fMRI and machine learning models. I used Python to carry out the data analysis on data retrieved from a neuroimaging project called ADNI1.

For the data analysis I took a multi-pipeline approach combined with a cross-validation procedure (the data was scarce, so dimensionality also had to be reduced). To select the best-fitting algorithm, several processing pipelines were built, and for each of them a different machine learning model was fitted: an artificial neural network (multilayer perceptron), logistic regression, naive Bayes, random trees, etc.

The conclusion of the master thesis was that Alzheimer's disease can potentially be predicted with decent accuracy from fMRI in a population at risk of developing it (people with mild cognitive impairment). This can be done using machine learning and a computational neuroscience construct called functional connectivity (an indirect measure of how interconnected a brain is). However, further research is needed and the results should be taken with a grain of salt: although the best-fitting model's accuracy was good and its specificity was high, the sensitivity of the classification was low, probably due to the small sample of mild cognitive impairment patients who would convert to Alzheimer's disease compared with the sample of those who would not.

  • Tech stack: Python 3. Modules used: sklearn (machine learning), matplotlib (data visualization), seaborn (a higher-level layer on top of matplotlib), numpy (numerical arrays), pandas (tabular data handling). To write the thesis itself I used LaTeX, which, to the best of my knowledge, is the open-source typesetting system with the highest typographic quality.

Abstract (for a more technical read)

In recent years scientists have tried to develop predictive models for forecasting a future onset of Alzheimer's disease (AD) in people with Mild Cognitive Impairment (MCI), using fluid, imaging and neuropsychological biomarkers. To the best of our knowledge, the question of whether functional magnetic resonance imaging (fMRI) can serve as a viable predictor for the aforementioned disease in MCI individuals remains unanswered. We developed a single-subject predictive model using, for each individual, solely an rsfMRI scan obtained at the screening visit of the ADNI and follow-up longitudinal data about future outcomes (AD presence/absence). We included in our model 23 MCI-c patients (mean time until conversion: 1.65 years) and 51 MCI-nc patients (mean time of follow-up: 4.61 years). Scan preprocessing pipelines consisted of registering each scan to Shen's atlas (214 ROIs), extracting functional connectivity measures, using them as predictors (either with or without dimensionality reduction) and pairing them with the future outcomes to train and test supervised machine learning models via 10-fold cross-validation. We found that conversion from MCI to AD can be predicted from rsfMRI with a multilayer perceptron with no dimensionality reduction, with reasonable accuracy (77.03%, 95% CI from 67 to 87%), good ROC AUC (0.81), very high specificity (specMLP = 90.20%, 95% CI from 83 to 97%) but weak sensitivity (47.83%, 95% CI from 36 to 59%). Logistic regression and support vector machines also obtained reasonable diagnostic accuracies (75.68%, both of them). The models cannot be deployed to clinical practice yet: further research is needed to increase sensitivity.

Folders and files worth commenting on

To see either the full thesis or the code with which I obtained the results, the reader can head to any or all of the following folders and files (a thorough explanation is given for each relevant one):

  • TFM_FINAL_santiagosanchez.pdf: the final thesis, as it was handed in to the university.

  • Writing: contains the LaTeX files and figures that were put together to compile TFM_FINAL_santiagosanchez.pdf.

  • DataAnalysis: this folder contains the most important programs I wrote in Python to do the analysis. Within the subfolder inContext you'll find the same programs, but in context with some of the raw data behind them (and other files that contributed to the development of the final thesis)2. The most important files are linked here so you can look into them:

    • mira_dades.py: the most important functions in this script are the following (a small, hypothetical sketch of both steps is given below):

      1. Merging two datasets with complementary information: see IMPORTANT_depura_ADNI_MERGE_i_ARXIU_fMRIs()
      2. Determining which subjects convert to Alzheimer's disease and which do not: see temps_de_conversio(SUBMODALITAT_fMRI)
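
      Neither function is reproduced in this README, so the following is only a minimal, hypothetical sketch of the two ideas (merging complementary tables and flagging converters); the file names ADNIMERGE.csv and fmri_archive.csv and the columns PTID and DX are assumptions, not the repository's actual code.

      import pandas as pd

      # Hypothetical sketch: merge a clinical table with an fMRI-archive table on the subject id (PTID).
      clinical = pd.read_csv("ADNIMERGE.csv")      # longitudinal diagnoses, one row per visit (assumed file)
      scans = pd.read_csv("fmri_archive.csv")      # one row per fMRI scan (assumed file)
      merged = pd.merge(scans, clinical, how="left", on="PTID", sort=False)

      # Hypothetical sketch: a subject "converts" if any of their follow-up diagnoses is Dementia (AD).
      converters = (merged.groupby("PTID")["DX"]
                          .apply(lambda dx: (dx == "Dementia").any()))
      print(converters.value_counts())
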
    • a. fes_merge_en_tensor_3d (obte l'array 3d).py: in this file the most important snippet of code is the function fes_merge(n_timeseries) (see below).

      def fes_merge(n_timeseries):
          """
          SAVES:
          arr_3d --> numpy ndarray with shape [93,214,140] >-->[SUBJECTS, SHEN ROIs, TIME SERIES]
          ll_subs = [SUBJECT 1, SUBJECT 2, ... , SUBJECT 93]
          ll_uids = [UID 1, UID 2, ..., UID 93]
          NOTE:
          the lists hold the subjects' info. They are correlative to the first dimension
          (the subject dimension) of the arr_3d matrix.
          Either of the two values can be used to identify a subject. UID is a unique identifier
          that exists for each ADNI scan.
          """
          j = 0
          carpetes = os.listdir(os.getcwd())
          carpetes.sort()
          ll_matriu_3d = []
          ll = []
          ll_subs = []
          ll_uids = []
          for carpeta in carpetes:
              if "." not in carpeta:
                  j = j + 1
                  ids = carpeta.split("__")
                  # SAVE THE SUBJECT IDS AND THE LL_UIDS (correlative to the first dimension of arr_3d)
                  ll_subs += [ids[0]]
                  ll_uids += [ids[1]]
                  os.chdir("./"+carpeta+os.sep+"tc")
                  print("Subjecte: "+carpeta)
                  print("\n")
                  ll_matriu_subjecte = []
                  k = 0
                  with open("total.txt","r") as f:
                      for linia in f:
                          k = k + 1
                          ll_linia = linia.split() # strips newlines and whitespace
                          for i in range(len(ll_linia)):
                              ll_linia[i] = float(ll_linia[i]) # cast the values to float
                          if not len(ll_linia) == n_timeseries:
                              raise ValueError("Compte! Que el subjecte i escaner amb id "+carpeta+" \nno té les "+str(n_timeseries)+" time series que esperaves")
                          ll_matriu_subjecte = ll_matriu_subjecte + [ll_linia] # ll_matriu_subjecte holds a (214,140) matrix (ROIs, TS)
                  print("Subj_escan "+carpeta+" || "+"TS (columnes): "+str(len(ll_matriu_subjecte[0]))," | ROIs (files): "+str(len(ll_matriu_subjecte)))
                  os.chdir("../../")
                  ll_matriu_3d = ll_matriu_3d + [ll_matriu_subjecte] # CHECK THAT IT HAS SHAPE [93,214,140] >-->[SUB,ROIs,TS]
                  ll = ll + [k]
                  #if j == 10:
                  #    break
          print(len(ll_matriu_3d),len(ll_matriu_3d[0]),len(ll_matriu_3d[0][0]))
          arr_3d = np.array(ll_matriu_3d) # CHECK THAT IT HAS SHAPE [93,214,140] >-->[SUB,ROIs,TS]
          print("TOTS ELS SUBJECTES HAN DE TENIR ELS MATEIXOS VALORS EN CADA ELEMENT DE LA SEGUENT LLISTA. SI NO, CAL BORRAR TOTAL.TXT, REEXECUTAR MAKEVECTOR I TORNAR A EXECUTAR FES_MERGE")
          print(ll)
          # FINALLY WE SAVE THE NUMPY ARRAY AND ALSO THE SUBJECT NAMES AND THE SCAN UIDS
          # IN LISTS STORED IN TXT FILES, CORRELATIVE TO THE FIRST DIMENSION OF ARR_3D (THE STACK DIMENSION)
          np.save("__arr_ADNI_3d_preprocessada",arr_3d)
          f1 = open("__subjectes.txt","w")
          f2 = open("__uids.txt","w")
          for i in range(len(ll_subs)): # LL_SUBS AND LL_UIDS MUST HAVE THE SAME LENGTH.
              f1.write(ll_subs[i]+'\n')
              f2.write(ll_uids[i]+'\n')
          f1.close()
          f2.close()
          return "\nshape array subjectes: "+str(arr_3d.shape)+"\nnresubjectes: "+str(len(ll_subs))+"\nnre uids: "+str(len(ll_uids))

      print(fes_merge(140)) # Checks that the time series are exactly the ones given as the function parameter, for every subject. If not, it complains and raises a ValueError.

      On the one hand, this function went into each folder whose name follows the syntax "SUBJECTID__UID", took the "SUBJECTID" (subject identifier) and the "UID" (unique identifier of the scan performed on that subject), saved them in list objects and finally stored them in their respective .txt files (__subjectes.txt and __uids.txt). On the other hand, the function also went into each subject folder to find total.txt, a file containing fMRI data extracted from .NIFTI files after registering that information to Shen's atlas3 via FSL. The information inside each subject's total.txt was structured with shape (214,140), the first dimension being the regions of interest (ROIs4) of Shen's atlas -214- and the second the number of time series data points available -140 per ROI-. Finally I stacked all these matrices together as a numpy ndarray object (.npy) so that everything sits within one single file (__arr_ADNI_3d_preprocessada.npy) of shape (93,214,140)5. A minimal sketch of how these outputs can be loaded back is shown below.
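
      As a quick sanity check, the three outputs the function saves can be loaded back and their first dimension compared; this is only a minimal sketch based on the file names above, not code from the repository.

      import numpy as np

      # Load the stacked time series and the two index-aligned identifier lists saved by fes_merge().
      arr_3d = np.load("__arr_ADNI_3d_preprocessada.npy")   # expected shape: (93, 214, 140)
      with open("__subjectes.txt") as f:
          subjects = [line.strip() for line in f]
      with open("__uids.txt") as f:
          uids = [line.strip() for line in f]

      # The identifier lists are correlative to the first (subject) dimension of arr_3d.
      assert arr_3d.shape[0] == len(subjects) == len(uids)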

    • b. analisisfinal.py: the most important functions of this file are the following. fes_dataframe_despres_dels_3_criteris_dinclusio(d): this function narrowed down the subjects that went into the machine learning models according to the inclusion criteria of the methods section of my final thesis.

      def fes_dataframe_despres_dels_3_criteris_dinclusio(desv_estandar_de_tall):
          """
          Function that removes part of the 93 subjects according to the three inclusion criteria
          we had decided to apply in section 2.3.2 Inclusion and exclusion criteria of the protocol / methods section, plus the exclusion criteria.
          INPUT: number of standard deviations to add to the mean of the distribution of times
          that MCI-C subjects take to convert; this serves as the cut-off, in years, for the
          MCI-NC distribution.
          RETURNS: the REDUCED pandas dataframe, plus 2 redundant dataframes (same content, but split into MCI-C and MCI-NC)
          0) I START WITH THE EXCLUSION CRITERION: consistency will be evaluated as the absence of diagnoses that lower in severity after the subject has already been diagnosed as a rollover, i.e. if one MCI participant is later diagnosed as AD but in the next follow-up comes back to being diagnosed as MCI, or even as healthy, the subject will be excluded.
          a) Participants that have an fMRI scan of the same submodality.
          b) Those scans will need to be within at least a $\pm 2$ month interval from the baseline diagnostic (except for the rollovers). [b STILL NEEDS JUSTIFYING: the difference between the closest follow-ups for the ADNI 1 participants and the exam date]
          c) Among this sample, at least 80% of the individuals need to have either CSF total, or CSF A$\beta_{1-42}$, or ADAS or MMSE scores to allow a multimodal approach according to $H_{2}$.
          d) MCI-NC individuals must have been in a minimum of $n$ years of follow-up.
          The value $n$ will be chosen by sorting follow-up times in MCI-NC in descending order
          and eliminating those subjects whose time value is below the mean time
          of conversion for MCI-C + 1.5 standard deviations. Means and SDs come from the initial 332 MCI patient sample.
          MCI-C [time until conversion] ---> ($x$ = 2.3 years; $s$ = 1.85) --> value n calculation = 2.3 + 1.85*1 --> minimum of 4.15 years of follow-up
          MCI-NC [time until loss of follow-up] ---> ($x$ = 4.69 years; $s$ = 2.357)
          """
          df = pd.read_csv("out_df_final_pleedeADNIMERGE.csv") # --> has 93 subjects
          # criterion a) there are 93
          registres = df.shape[0]
          swap = registres
          print("criteri inclusio a: inclosos "+str(registres)+" (d entre els 332 de la seccio de metodes).")
          # EXCLUSION CRITERION: out of the 93 initial subjects only four present a lowering in severity after an AD diagnosis. Among those, the exclusion criteria (see the two definitions of "inconsistency" in the text) were applied.
          #df = df[df.PTID != '002_S_4746'] # WE DO NOT EXCLUDE HIM --> 002_S_4746 (4 consecutive MCI diagnoses (with 2 missings in between), then 1 AD and immediately after back to MCI, and then loss of follow-up.
          df = df[df.PTID != '031_S_4005'] # WE MUST EXCLUDE THEM FOR TWO MCI DIAGNOSES AFTER THE AD ONE! --> 031_S_4005 xx index 1 (12 months to conversion) -- BAD -------------- (first 2 follow-ups AD, then goes back to MCI (2 consecutive follow-ups corroborate the MCI), and then the follow-up stops).
          df = df[df.PTID != '019_S_4293'] # EXCLUDE THEM FOR OSCILLATING PATTERN --> 019_S_4293 (24 months to conversion) (mci --> ad --> mci --> ad)
          df = df[df.PTID != '031_S_4947'] # EXCLUDE THEM FOR OSCILLATING PATTERN --> 031_S_4947 x -- MAYBE BAD ---- (MCI --> x --> AD --> MCI --> AD --> loss of follow-up) --> it varies, it oscillates??
          registres = df.shape[0]
          print("criteri exclusio (inconsistencia):\n ara queden "+str(registres)+" (dels originals "+str(swap)+" de metodes): \no sigui, "+str(swap - registres)+" exclosos.")
          # INCLUSION CRITERION
          #time.sleep(5)
          # criterion b) check whether at least 80 % of subjects have a +- 2 month difference between bl.dx and examdate, EXCEPT those from ADNI 1 (because dx.bl was a long time ago).
          df_no_ADNI1 = df[df["fase_estudi"] != "ADNI_1"]
          MCIc_seguits_minim_2mesos = df_no_ADNI1[(df_no_ADNI1["dif_examdate_vs_dxbl"].abs() <= 60) & (df_no_ADNI1["s_CONVER"] == 1)].shape[0]
          MCInc_seguits_minim_2mesos = df_no_ADNI1[(df_no_ADNI1["dif_examdate_vs_dxbl"].abs() <= 60) & (df_no_ADNI1["s_CONVER"] == 0)].shape[0]
          MCIc, MCInc = df_no_ADNI1[df_no_ADNI1['s_CONVER'] == 1].shape[0], df_no_ADNI1[df_no_ADNI1['s_CONVER'] == 0].shape[0]
          print("CRITERI EXCLUSIO B requereix 80% min") # WE DO NOT DROP ANYONE
          print(" percentatge MCIc amb <60 dies bl vs examdate: ", MCIc_seguits_minim_2mesos/MCIc) # 90%
          print(" percentatge MCIncc amb <60 dies bl vs examdate: ", MCInc_seguits_minim_2mesos/MCInc) # 91%
          #time.sleep(2)
          # CRITERION c) defined below (we cut)!
          # STEP 1) SPLIT MCI_C and MCI_NC.
          df_MCIc, df_MCInc = df[df['s_CONVER'] == 1], df[df['s_CONVER'] == 0] # 26 MCI_C and 67 MCI_NC respectively
          # STEP 2) Delete MCI_NC with follow-up times below the one defined in the methods section (2.3 + 1.85 * 1 = 4.15 years of minimum follow-up)
          # This is done to delete those MCI_NC who are prone to be false negatives (see -threshold n- listed and below)
          n = 2.3 + 1.85 * desv_estandar_de_tall # YEARS THRESHOLD CUT-OFF
          df_MCInc, df_MCInc_NOINCLOSOS_PERCRITERI_C = df_MCInc[df_MCInc["DIES-SEGUIM"] > n*365.25], df_MCInc[df_MCInc["DIES-SEGUIM"] <= n*365.25]
          #df_MCIc = df_MCIc[df_MCIc["DIES-SEGUIM"] >= 365] # With this you can cut the MCInc above or below a given conversion time. It worsens sensitivity.
          # REPORTING INFORMATION FOR INCLUSION CRITERION D.
          print('{:.0f} selected MCInc with follow-up time above cutoff in *years\n{:.0f} selected MCIc'.format(len(df_MCInc),len(df_MCIc)))
          print("* "+str(n)+" punt de tall -en anys- de minim follow-up assegurat per a als MCI_nc")
          print('El cutoff de n fa excloure un total de '+str(len(df_MCInc_NOINCLOSOS_PERCRITERI_C['PTID'].tolist()))+" subjectes (concretament... DESMARCACOMENTARI I MIRA "+str(df_MCInc_NOINCLOSOS_PERCRITERI_C['PTID'].tolist()))
          time.sleep(1)
          # STEP 3) WE CREATE a df with the two keys AND ALSO A NEW CSV for analysis with other programs (such as SPSS)
          df = pd.concat([df_MCIc, df_MCInc])
          # bonus: WE ADD THE EXACT SUBJECT AGES AT THE MOMENT OF ACQUIRING THE FMRI SCANS! (THEY COME FROM XML FILES)
          df_xmls = pd.read_csv("in__resum_parametres_fmri_EDATS_EXACTES_EN_MOMENTS_D'ESCANER.csv")
          df_uid_edat = df_xmls[['UIDs','AGE_ambdecimals']]
          df = pd.merge(df,df_uid_edat, how = "left", on="UIDs",sort=False)
          # bonus 2: WE WANT TO KNOW THE TIME THAT GOES FROM SCANNING TIME TO THE MOMENT OF END OF FOLLOW-UP (MCI-NC) / CONVERSION (MCI-C).
          # THE EXAM-DATE VARIABLE GIVES US THE TIME THAT GOES FROM BL.DX UNTIL END OF FOLLOW-UP / CONVERSION, and it is pretty informative
          # for all participants, except those who do not originally belong to ADNI 1. For those, DIES-SEGUIM is way too big.
          # Then we have dif_examdate_vs_dxbl, which gives us the difference between examdate (scan time) and the baseline diagnostic.
          # Thus, to know the time SINCE the rsfMRI scan until conversion/loss of follow-up for all subjects we simply compute
          # DIES-SEGUIM - dif_examdate_vs_dxbl. THE NEW VARIABLE is called FI-FOLLOWUP-O-CONV_vs_data-escaneig.
          df['FI-FOLLOWUP-O-CONV_vs_data-escaneig'] = pd.Series((df["DIES-SEGUIM"] - df["dif_examdate_vs_dxbl"])/365.25)
          # CHECK WHICH SITES THERE ARE
          print("CENTRE, FREQUENCIA PER CENTRE")
          print(df["SITE"].value_counts())
          # CHECK WHICH SITES THERE ARE (MCI-C GROUP)
          print("CENTRE, FREQUENCIA PER CENTRE (MCI-c)")
          print(df["SITE"][df["s_CONVER"]==1].value_counts())
          # CHECK WHICH SITES THERE ARE (MCI-nc GROUP)
          print("CENTRE, FREQUENCIA PER CENTRE (MCI-nc)")
          print(df["SITE"][df["s_CONVER"]==0].value_counts())
          time.sleep(1)
          # save the df to csv
          df["ABETA"] = df["ABETA"].replace(to_replace = ">1700", value = '1701').astype(float) # first remove the special "greater than or equal" character from the ">1700" values and substitute a feasible value: 1701. We must cast to float because after the conversion everything is a string, and I am not sure why.
          df["ABETA/PTAU"] = df["ABETA"]/df["PTAU"] # create the new variable ABETA/PTAU, supposedly an index that could contribute to the diagnosis according to the previous literature (note we do not know whether it is abeta 1-42; I think it is general abeta)
          df.to_csv("out_df_final_pleedeADNIMERGE_postcriterisInclusio.csv",sep = ",", index = False)
          # descriptive statistics for the important variables listed in v_quantis
          v_quantis=[('escan_vs_fi','FI-FOLLOWUP-O-CONV_vs_data-escaneig'),("AGE","AGE_ambdecimals"),("MMSE","MMSE"),("TAU","TAU"),("PTAU","PTAU"),("ABETA","ABETA"),("AB/PTAU","ABETA/PTAU"),("FDG","FDG"),("ADAS11","ADAS11"),("ADAS13","ADAS13"),("ADASQ4","ADASQ4")] # [(variable_label, variable_name_inside_the_dataframe)]
          v_categ = [("EMCI/LMCI","BL.dx"),("MALE/FEMALE","PTGENDER")]
          # criterion d)
          # CHECKED MANUALLY. No subjects excluded. Within the MCIc, 1/23 (96%) did not have tau, ptau or abeta; within the MCInc it was 3/51 (94%).
          # __________ REPORT THE NaNs __________________________
          print("\n##############################")
          print("_____NaNs per variable_______")
          for variable in v_quantis:
              label_variable, nom_variable_dins_Dataframe = variable
              nre_NaNs_a_variable = df[variable[1]].isna().values.tolist().count(True)
              print(variable[0],": ",nre_NaNs_a_variable," sobre "+str(len(df))+"({:.2f} %)".format(100*nre_NaNs_a_variable/len(df))+")") # count the NaNs
          time.sleep(1)
          # ______ Check normality and homoscedasticity for each quantitative variable in each subgroup (MCI-C and MCI-NC) (conditions for applying the t-test!) _________
          print("\n SHAPIRO-WILK (NORMALITAT) ||| levene ")
          ## https://es.wikipedia.org/wiki/Test_de_Shapiro%E2%80%93Wilk |
          print(" MCIc MCInc ||| MCIc vs MCInc ")
          t_test = [] # store which variables allow a t-test; for the others we do a Mann-Whitney test.
          for variable in v_quantis:
              label_variable, nom_variable_dins_Dataframe = variable
              variable_MCIc, variable_MCInc = df[nom_variable_dins_Dataframe][df['s_CONVER']==1].dropna(), df[nom_variable_dins_Dataframe][df['s_CONVER']==0].dropna()
              shapiro_MCIc, p_shapiro_MCIc = stats.shapiro(variable_MCIc); shapiro_MCInc, p_shapiro_MCInc = stats.shapiro(variable_MCInc)
              levene_statistic, p_levene = stats.levene(variable_MCIc, variable_MCInc)
              if p_shapiro_MCIc < 0.05 or p_shapiro_MCInc < 0.05: # if the variable is NOT NORMAL according to Shapiro-Wilk
                  if p_levene < 0.05: # if there is no homogeneity of variances
                      print('{:7} {:.2f} (p = {:.5f}) | {:.2f} (p = {:.5f})* ||| {:.2f} (p = {:.3f})+[mw]'.format(label_variable,shapiro_MCIc, p_shapiro_MCIc, shapiro_MCInc, p_shapiro_MCInc, levene_statistic, p_levene))
                      t_test += [False]
                  else: # if there is homogeneity of variances
                      print('{:7} {:.2f} (p = {:.5f}) | {:.2f} (p = {:.5f})* ||| {:.2f} (p = {:.3f})[mw]'.format(label_variable,shapiro_MCIc, p_shapiro_MCIc, shapiro_MCInc, p_shapiro_MCInc, levene_statistic, p_levene))
                      t_test += [False]
              else: # if the variable is normal according to Shapiro-Wilk
                  if p_levene < 0.05: # if there is no homogeneity of variances
                      print('{:7} {:.2f} (p = {:.5f}) | {:.2f} (p = {:.5f}) ||| {:.2f} (p = {:.3f})+[mw]'.format(label_variable,shapiro_MCIc, p_shapiro_MCIc, shapiro_MCInc, p_shapiro_MCInc, levene_statistic, p_levene))
                      t_test += [False]
                  else: # if there is homogeneity of variances
                      print('{:7} {:.2f} (p = {:.5f}) | {:.2f} (p = {:.5f}) ||| {:.2f} (p = {:.3f}) [t]'.format(label_variable,shapiro_MCIc, p_shapiro_MCIc, shapiro_MCInc, p_shapiro_MCInc, levene_statistic, p_levene))
                      t_test += [True]
          print("_____________________________________________")
          print("Ho (shapiro wilk): Poblacio del subgrup donat, per a la variabledonada, \nesta normalment distribuida. L'Asterisc surt quan refutem Ho en qualsevol dels dos grups, o sigui, quan qualsevol dels dos grups no mostra normalitat.")
          print("Ho (levene_test): els dos subgrups tenen homogeneitat de variances per a la variable donada. La creu + surt quan refutem Ho, o sigui quan no hi ha homogeneitat de variances.")
          print("[t] apareix quan la variable es susceptible de comparar-se entre grups amb t-test (exigeix homocedasticitat entre subgrups i distrib normal dins cada subgrup)")
          print("[?] Si no hi ha homogeneitat de variances pero la distribucio es normal... aleshores no se si aplicar mann whitney o t-test...")
          # __________ BUILD TABLE 1 OF DESCRIPTIVES _____________
          print("\n *********************** DESCRIPTIUS ***********************")
          print("\n#################################################################")
          print("__________MCI-C___________________MCI-NC__________valor (p-value)")
          # QUANTITATIVE VARIABLES FIRST
          for i in range(len(v_quantis)):
              variable = v_quantis[i]
              label_variable, nom_variable_dins_Dataframe = variable
              variable_MCIc, variable_MCInc = df[nom_variable_dins_Dataframe][df['s_CONVER']==1].dropna(), df[nom_variable_dins_Dataframe][df['s_CONVER']==0].dropna()
              valor_contrast_hipotesis, p_valor, boolea_t_test = funcions_auxiliars.bivariant_comparacio_mitjanes(variable_MCIc, variable_MCInc, t_test[i])
              if boolea_t_test:
                  print("{:10}".format(label_variable), "{:.1f} ({:.1f}) {:.1f} ({:.1f}) {:.2f} (p = {:.5f})[t] ".format(variable_MCIc.mean(), variable_MCIc.std(), variable_MCInc.mean(), variable_MCInc.std(),valor_contrast_hipotesis, p_valor))
              else:
                  print("{:10}".format(label_variable), "{:.1f} ({:.1f}) {:.1f} ({:.1f}) {:.2f} (p = {:.5f})[mw] ".format(variable_MCIc.mean(), variable_MCIc.std(), variable_MCInc.mean(), variable_MCInc.std(),valor_contrast_hipotesis, p_valor))
          # THEN THE CATEGORICAL VARIABLES
          for variable in v_categ:
              label_variable, nom_variable_dins_Dataframe = variable
              variable_MCIc, variable_MCInc = df[nom_variable_dins_Dataframe][df['s_CONVER']==1].dropna(), df[nom_variable_dins_Dataframe][df['s_CONVER']==0].dropna()
              if label_variable == "EMCI/LMCI":
                  EMCIs_MCIc, LMCIs_MCIc, EMCIs_MCInc, LMCIs_MCInc = variable_MCIc[variable_MCIc == 3].count(), variable_MCIc[variable_MCIc == 4].count(), variable_MCInc[variable_MCInc == 3].count(), variable_MCInc[variable_MCInc == 4].count()
                  valor_contrast_hipotesis, p_valor = 99.99, 0.999
                  print("{:10}".format(label_variable), "{:10} {:10} {:.2f} (p = {:.5f})[?] ".format(str(EMCIs_MCIc)+"/"+str(LMCIs_MCIc), str(EMCIs_MCInc)+"/"+str(LMCIs_MCInc), valor_contrast_hipotesis, p_valor))
              if label_variable == "MALE/FEMALE":
                  male_MCIc, female_MCIc, male_MCInc, female_MCInc = variable_MCIc[variable_MCIc == "Male"].count(), variable_MCIc[variable_MCIc == "Female"].count(), variable_MCInc[variable_MCInc == "Male"].count(), variable_MCInc[variable_MCInc == "Female"].count()
                  valor_contrast_hipotesis, p_valor = 99.99, 0.999
                  print("{:10}".format(label_variable), "{:10} {:10} {:.2f} (p = {:.5f})[?] ".format(str(male_MCIc)+"/"+str(female_MCIc), str(male_MCInc)+"/"+str(female_MCInc), valor_contrast_hipotesis, p_valor))
          print("________________________________________________________________")
          print("[t] son t-test per a mostres independents. \n [mw] son Mann whitney")
          print("**COMPTE! Per a aquestes proves de comparacio de mitjanes els NaNs s'han eliminat. Diferent al que farem per al multimodal aproach, que sera mean o median imputation per a cada subgrup.")
          print("? te les odds... pero podria posar proporcions crec")
          print("________________________________________________________________")
          print("################################################################")
          time.sleep(1)
          return df, df_MCIc, df_MCInc
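
      A call compatible with the cutoff described in the methods section (one standard deviation above the mean conversion time, i.e. 2.3 + 1.85 * 1 = 4.15 years) would look as follows; the argument value is inferred from the comments in the code, so treat it as an assumption rather than the exact call used for the thesis.

      # desv_estandar_de_tall = 1 reproduces the 4.15-year minimum follow-up cutoff for MCI-NC.
      df_final, df_MCIc, df_MCInc = fes_dataframe_despres_dels_3_criteris_dinclusio(1)
      print(df_final.shape)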

      The most important inclusion criterion within that function was how I decided to filter out the subjects who would not convert to Alzheimer's disease (see the footnote for the reasoning behind it6). The idea was to consider as non-converters (MCI-nc) those subjects who had not converted to Alzheimer's after a follow-up time $t$ such that: $t > \mu + n\rho$

      Where

        μ = 2.3 years (average time to conversion, for those MCI subjects who did convert to Alzheimer's)
        ρ = 1.85 years (standard deviation of the time to conversion, for those who converted)
        n = number of standard deviations added to the mean (somewhat arbitrary, but if we assume the MCI-c conversion-time distribution to be roughly normal, the resulting cutoff lies well above the central tendency of the data); with n = 1 the cutoff is 2.3 + 1·1.85 = 4.15 years.
      

      In the following code, you can see I stored the subjects I would finally analyse within the df_MCInc dataframe, by writing this:

      # STEP 2) Delete MCI_NC with follow up times below the one defined in methods section (2.3 + 1.85 * 1 = 4.15 years of minimum follow up)
      # This is done to delete those MCI_NC who are prone to be false negatives (see -threshold n- listed and below)
      n = 2.3 + 1.85 * desv_estandar_de_tall #YEARS THRESHOLD CUT-OFF
      df_MCInc, df_MCInc_NOINCLOSOS_PERCRITERI_C = df_MCInc[df_MCInc["DIES-SEGUIM"] > n*365.25], df_MCInc[df_MCInc["DIES-SEGUIM"] <= n*365.25]

      mat_corr_labels_i_pipeline(): this function took the numpy ndarray arr_TS with shape (57,214,140) (57 subjects with 214 ROIs each and 140 time series points per ROI) and applied the functional connectivity definition. Functional connectivity (FC) is a construct that tells how well connected a brain is. Here, the FC of a given subject was defined as all possible correlations between the time series of the subject's ROIs, vectorized into a single row. Namely, for each (n,m) = (214,140) matrix we obtained an (n,n) = (214,214) Pearson correlation matrix (equivalently, the adjacency matrix of a complete weighted graph with 214 nodes), called arr_corr and stored as out_FC(57 subjectes).npy7, as obtained in these lines of code:

      def mat_corr_labels_i_pipeline(ploteja_totes_les_matrius_de_conectivitat, ploteja_una_matriu_de_conectivitat_mitjana_per_cada_grup, ploteja_funcions_densitat_probabilitat_FCON_per_centre):
          """
          arr_TS, ll_sub_txt, ll_uids and df have correlative (index-aligned) dimensions.
          """
          t1 = time.clock()
          arr_TS, ll_sub_txt, ll_uids, df = redueix_i_organitza() # arr_TS --> (57,214,140) (SUB, ROI, TS)
          arr_TS = np.transpose(arr_TS,(0,2,1)) # --> arr_TS now has shape (57,140,214) (SUB, TS, ROI)
          # CREATE arr_corr, which will hold a stack of correlation matrices (one per subject)
          arr_corr = []
          for i in range(len(arr_TS)):
              ll_arr_corr = pd.DataFrame(arr_TS[i]).corr().values.tolist() # ll_arr_corr --> (214,214); I go DataFrame --> ndarray --> list to stack it properly (I did not know how to do it directly in numpy)
              arr_corr += [ll_arr_corr]
          arr_corr = np.array(arr_corr) # shape (57,214,214) (the shape can change depending on the standard-deviation parameter, remember)
          np.save("out_FC ("+str(len(arr_corr))+" subjectes)", arr_corr) # save the stack of FC matrices (one (214,214) matrix per subject)
          print("arr_corr shape", arr_corr.shape)

      After that, we take arr_corr for each subject $i$ in a for loop (arr_corr[i]) and obtain the flattened vector of Pearson correlations, called ll_vector, which is appended to the X matrix that will be fed to the machine learning models. ll_vector has length $(1/2) \cdot n(n-1)$:

      # CREATE VARIABLE X for the ML step. X contains the vectorized functional connectivity:
      # EACH MATRIX OF THE arr_corr STACK is vectorized and stored as a row of a 2D matrix
      X = []
      for i in range(len(arr_corr)):
          ll_vector = funcions_auxiliars.elimina_tri_superior_i_flateneja_ROIxROI(arr_corr[i])
          X += [ll_vector]

      The auxiliary function called in line 378, which allows us to obtain ll_vector (ll_ROIxROI_flattened within the function), is shown here:

      def elimina_tri_superior_i_flateneja_ROIxROI(m_correl):
          """
          PRE: takes a numpy array containing an n x n matrix of correlations called m_correl (ROI x ROI).
          POST: returns ll_ROIxROI_flattened, i.e. the upper triangle of m_correl and the main diagonal are eliminated, and the lower triangle is vectorized and
          returned as a list-type object, a ROW of (1/2) * n * (n-1) correlations (elements). Thus, if the m_correl input size is 214x214 -Shen atlas- you will
          get 22,791 values within ll_ROIxROI_flattened.
          NOTE: this is essentially what numpy's tril / tril_indices functions are for.
          """
          # m_correl is a numpy array; since I want to operate on it as a list, the next line goes ndarray ----> list
          ll_correl = m_correl.tolist() # ndarray --> list
          ll_ROIxROI_flattened = []
          for i in range(1,len(ll_correl)):
              bufer = ll_correl[i][:i]
              ll_ROIxROI_flattened = ll_ROIxROI_flattened + bufer
          return ll_ROIxROI_flattened
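
      The same flattening can be written with numpy.tril_indices; again, just an equivalent sketch for reference, not the thesis code.

      import numpy as np

      def flatten_lower_triangle(m_correl):
          # Indices of the strictly lower triangle (diagonal excluded), visited row by row,
          # which matches the order produced by elimina_tri_superior_i_flateneja_ROIxROI.
          rows, cols = np.tril_indices(m_correl.shape[0], k=-1)
          return m_correl[rows, cols].tolist() # 22,791 values for a 214x214 matrix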

      Finally we obtain the Y vector with the labels (whether or not the MCI subject will convert to AD, coded as 1 or 0 respectively) and run the machine learning models:

      # CREATE VARIABLE Y, which will hold the labels (1 --> MCI-C | 0 --> MCI-NC)
      Y = df["s_CONVER"].values; print("Y shape: ", Y.shape) # Y has shape (57,)
      print("Y shape: ",Y.shape)
      time.sleep(2)
      # INDEPENDENT ANALYSIS 2) 10-fold cross-validation + SVM --> VERY POOR RESULTS WITH PCA, both with and without standardizing. With PCA it diagnoses no patient, specificity 0. Several numbers of components tried.
      apc.PCA_10foldCV_ROC(X=X,
                           y=Y, # NOTE: PCA_by_santi and rfecv are mutually exclusive
                           PCA_by_santi=False, # first boolean: True if you want PCA, False otherwise. If you want to standardize the columns when doing the PCA, go inside the file and look for "estandaritzar_variables"
                           rfecv=False, # does not work... True should run Recursive Feature Elimination; False skips it
                           fes_violin_plot_accuracies_per_folds_i_classificador=True,
                           carpeta='BIOMARCADORS_models_finals') # IF YOU CHANGE fMRI-RELATED THINGS, send the output to the folder 'fMRI_models_finals'
      # NOTE: carpeta is the folder where all the accuracy metrics will be saved. It must exist beforehand or an error is raised.

      apc.PCA_10foldCV_ROC(X=X, ...) is the call to the PCA_10foldCV_ROC function of another file, which is described below:

    • aux__plot_roc_crossval.py: as we said, this is the file where the machine learning models are run: 10-fold cross-validation, PCA, artificial neural networks, etc. Within this file the most important function is the one called at line 410 of the previous file, and it goes as follows:

      def PCA_10foldCV_ROC(X,y,PCA_by_santi,rfecv,fes_violin_plot_accuracies_per_folds_i_classificador,carpeta):
          """
          X = input data
          y = labels
          PCA_by_santi = BOOLEAN: whether to apply PCA before feeding these data to the model
          rfecv = BOOLEAN: whether to apply Recursive Feature Elimination for each classifier
          """
          from aux__reduccio_dimensions import Analisi_components_principals as dades_acp
          #n_samples, n_features = X.shape
          n_samples = X.shape[0]
          ################################## Classification and ROC analysis #########################
          # Put the classifier names and the classifier objects in two lists, so we can iterate over them and run a 10-fold CV for each one.
          noms_classificadors = ["Logistic Regression",
                                 "Gaussian Naive Bayes",
                                 "Support Vector Machines (linear kernel)",
                                 "Nearest Neighbours",
                                 "Multilayer perceptron (ANN)"]
          objectes_classificadors = [LogisticRegression(C=10000, penalty='l2', multi_class= 'multinomial', solver='lbfgs', class_weight = "balanced"),
                                     GaussianNB(),
                                     svm.SVC(kernel='linear', probability=True, random_state=1, class_weight = "balanced"),
                                     KNeighborsClassifier(3),
                                     MLPClassifier(alpha=1, hidden_layer_sizes=(100,), activation='relu', solver='sgd')] # TRY ADAM
          # here I store the accuracies per classifier, to make a violin plot afterwards.
          # NOTE THAT WITH THIS I CAN EASILY CREATE A PANDAS DATAFRAME AND MAKE THE VIOLIN PLOT LATER
          dic_acc_per_fold_i_classificador = {} # {"Logistic Regression":[FLOAT_accuracy_fold_1, FLOAT_accuracy_fold_2, [...], FLOAT_accuracy_fold_10], "Gaussian Naive Bayes":[], ..., "Multilayer perceptron (ANN)":[]}
          # CREATE THE FILES WHERE THE METRICS FOR EACH MODEL ARE SAVED (BOTH THE DIAGNOSTIC ACCURACY MEASURES AND THE BRIER SCORES)
          f1 = open("./"+carpeta+"/LATEX_acc_s_e_ppv_npv_lrmes_lrmenys_PERCADAMODEL_.txt","w"); f1.write("\n\\toprule\nclassifier&accuracy&sensibility&specificity&ppv&npv&LR+&LR-\n\midrule\\\\\n") # write the LaTeX header
          f2 = open("./"+carpeta+"/LATEX_brierScores_PERCADAMODEL_.txt","w"); f2.write("\\toprule\nBrier Scores & H.L. test\\\n\midrule\n") # write the LaTeX header
          f4 = open("./"+carpeta+"/intervalsConfiansaIconfusionsMatrix.txt","w")
          f_readme = open("./"+carpeta+"/README_PREDICT_PROBAS.txt","w")
          f_readme.write("L'arxiu PREDICT_PROBA_NOMMODEL conte columnes separades per espais blanc segons:\n\n'PREDICCIO' --> (0 mci-nc, 1 mci-c)\n'LABEL'--> (idem a valor predit pero el valor que prove de ADNIMERGE, el valor real)\n'P_mci_nc'--> prob de pertanyer a grup 0 o mci-nc.\n'P_mci_c' --> probabilitat de pertanyer a grup MCI-c (1-p) o grup codificat amb 1.\n'Fold' --> el fold la crossvalidation don prove el subjecte.\n\nNoteu que aquest arxiu ajunta tots els folds de la cross validation i tindra tants subjectes\ncom subjectes hi hagi a la mostra. Cal tenir en compte que l'ordre es preserva perque CV.split() -on CV es objecte Stratified k-fold-\nens permet aleatoritzar sempre de la mateixa\nmanera, perque random_state=None per defecte.\nAixo permet que aquestes dades es puguin prendre per fer un multimodal approach de late integration.")
          f_readme.close()
          matriu_violinplot_sensibEsp = []
          ############################################################################################
          # ITERATE OVER EACH CLASSIFIER. RUN A 10-FOLD CROSS-VALIDATION. DRAW THE ROC CURVES FOR EACH CLASSIFIER.
          ll_sens_espe_classificador = [] # for the sensitivity/specificity violin plot
          for i in range(len(objectes_classificadors)):
              f3 = open("./fMRI_models_finals/PREDICT_PROBA_"+noms_classificadors[i].replace(" ","_")+".csv","w") # SAVE THE PREDICTED VALUE, THE GOLD STANDARD, THE PROBABILITIES AND THE FOLD
              f3.write("PREDICCIO,LABEL,P_mci_nc,P_mci_c,Fold\n")
              clf, nom_clf = objectes_classificadors[i], noms_classificadors[i] # load the classifier object and the string that describes each classifier
              CV = StratifiedKFold(n_splits=10)
              # IMPLEMENT RECURSIVE FEATURE ELIMINATION [DOES NOT WORK]
              if rfecv:
                  clf = RFECV(estimator=clf, step=1, cv=CV, scoring='accuracy')
              tprs, aucs = [], []; mean_fpr = np.linspace(0, 1, 100) # pieces taken from the template
              # create accumulators for the true negatives, false positives, false negatives and true positives, to build a final confusion matrix afterwards.
              suma_tn, suma_fp, suma_fn, suma_tp = 0,0,0,0
              accuracies_mitjanes_mal_procediment = 0
              ll_acc_classificador = [] # create an empty list
              j = 1 # to count the folds
              # ITERATE OVER THE FOLDS CREATED BY SKLEARN, WITHIN EACH CLASSIFIER
              y_certes, y_predites = [], [] # to compute the Brier scores for each classifier
              for train, test in CV.split(X, y): # splits the indices
                  # BEWARE OF ALIASING EFFECTS! THAT IS WHY I COPY IT...
                  X_entrenament = np.copy(X[train])
                  X_test = np.copy(X[test])
                  if PCA_by_santi:
                      estandaritza_variables = False # SET TRUE IF YOU WANT TO STANDARDIZE THE VARIABLES COLUMN-WISE, FALSE OTHERWISE
                      dades = dades_acp(np.array(X_entrenament,dtype = np.float64), nombre_components = None, estandaritza_variables = estandaritza_variables, llista_variables=[])
                      #dades.matriu_correlacions(grafic_correlacions = False) # PCA is based on the correlation matrix. In the absence of sphericity tests (Bartlett, KMO) it is very important to check that the correlation matrix does not approach the identity matrix. We want correlated variables, otherwise applying PCA makes no sense.
                      #dades.autovalors_i_variansssa(scree_plot=True) # see the eigenvalues, the explained variance and the scree plot. With this (especially the scree plot), decide how many components to retain, combining explained variance and parsimony.
                      #dades.carregues_factorials() # see the correlations between each factor and each variable
                      dades.carregues_factorials() # I built a house of cards here: this function must be executed, otherwise the next one does not work.
                      X_entrenament, ll_mitjana_train, ll_desvest_train, carregues_factorials_train = dades.puntuacions_factorials(imprimeix_dataset_reduit = False) # METHOD THAT RETURNS THE REDUCED DATASET (a reduced ndarray). I keep the means and standard deviations estimated on the training data to standardize the test data with them.
                      if estandaritza_variables:
                          X_test = ((X_test - ll_mitjana_train) / ll_desvest_train).dot(carregues_factorials_train) # apply the operations column-wise
                      else:
                          print(X_test.shape)
                          print(carregues_factorials_train.shape)
                          X_test = X_test.dot(carregues_factorials_train) # X_test * carregues_factorials_train --> https://stackoverflow.com/questions/24560298/python-numpy-valueerror-operands-could-not-be-broadcast-together-with-shapes
                          print(X_test.shape)
                      print("###########################")
                      print("X_entrenament: ",X_entrenament.shape) # (57,ncomponents)
                      print(len(ll_mitjana_train)) # (22791,)
                      print(len(ll_desvest_train)) # (22791,) HOLDS THE STANDARD DEVIATIONS TAKEN FROM THE TRAINING DATA
                      print(carregues_factorials_train.shape) # (22791,components)
                      print("X_test_post pca: ",X_test) # (9,ncomponents)
                      print("###########################")
                  print("Ara entrenant: ",nom_clf)
                  ##### CREATE FOLD-LEVEL STATISTICS (DIAGNOSTIC ACCURACY) AND ACCUMULATE THEM TO BUILD THE FINAL CONFUSION MATRIX #########
                  print("###### FOLD"+str(j)+" ######")
                  print("#############################")
                  ####################################################################
                  clf.fit(X_entrenament, y[train]) # step 1 --> fit the model on the training data, for each fold
                  probas_ = clf.predict_proba(X_test) # step 2 --> test the model on the held-out data (RETURNS THE PROBABILITY MATRIX [p, 1-p] for each fold)
                  y_predits_test = np.argmax(probas_, axis=1)
                  ###################################################################
                  # BUILD THE CONFUSION MATRIX FOR EACH FOLD AND PRINT THE INDICES
                  tn, fp, fn, tp = confusion_matrix(y[test], y_predits_test).ravel()
                  accuracies_mitjanes_mal_procediment += (tp+tn)/(tp+tn+fp+fn)
                  acc, s, e = (tp+tn)/(tp+tn+fp+fn), tp/(tp+fn), tn/(fp+tn)
                  imprimeix_diagnostic_accuracy_a_cada_fold = True
                  for k in range(len(probas_)):
                      p, u_menys_p = probas_[k]
                      predit, gold_standard = y_predits_test[k], y[test][k]
                      f3.write(str(predit)+","+str(gold_standard)+","+str(p)+","+str(u_menys_p)+","+str(j)+'\n') # PREDICCIO,LABEL,P_mci_nc,P_mci_c,fold
                      y_certes += [gold_standard] # to compute the Brier scores
                      y_predites += [predit] # to compute the Brier scores
                  if imprimeix_diagnostic_accuracy_a_cada_fold:
                      print("probabilitats || y_predits_Test || y_labels_test")
                      print(probas_,np.argmax(probas_,axis=1),y[test]) # DONE BY ME: PRINT THE PROBABILITIES, THE PROBABILITIES TRANSFORMED WITH ARGMAX, AND THE LABELS
                      print("\n#######################")
                      print("Accuracy: {:.5f}".format(acc))
                      print("Sensibilitat: {:.5f}".format(s))
                      print("Especificitat: {:.5f}".format(e))
                      print("Valor predictiu + (PPV): {:.5f}".format(tp/(tp+fp))) # also called precision score; there is an sklearn function that computes it
                      print("Valor predictiu - (NPV): {:.5f}".format(tn/(fn+tn)))
                      print("LR+: {:.5f}".format(s / (1 - e)))
                      print("LR-: {:.5f}".format((1 - s)/e))
                      print("#######################")
                      print("tn,fp,fn,tp: ",tn,fp,fn,tp)
                      print("#######################\n")
                  if fes_violin_plot_accuracies_per_folds_i_classificador:
                      ll_acc_classificador += [acc]
                      ll_sens_espe_classificador += [[noms_classificadors[i],"sensitivity",s]] # build a list [[clsf[i], "sensitivity", s1], [clsf[i], "specificity", e1], ...]: 10 folds * 5 CLASSIFIERS * 2 proportions per fold (one sensitivity and one specificity) = 100 rows | 3 variables (CHECKED)
                      ll_sens_espe_classificador += [[noms_classificadors[i],"specificity",e]]
                  # ACCUMULATE THE VALUES OF EACH FOLD SO WE HAVE A FINAL CONFUSION MATRIX AFTER THE CROSS-VALIDATION
                  suma_tn += tn
                  suma_fp += fp
                  suma_fn += fn
                  suma_tp += tp
                  # Compute ROC curve and area under the curve
                  fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1]) # WHY ON EARTH DOES IT RETURN SEVERAL FPR AND TPR VALUES PER FOLD? THERE SHOULD ONLY BE ONE...
                  tprs.append(interp(mean_fpr, fpr, tpr))
                  tprs[-1][0] = 0.0
                  roc_auc = auc(fpr, tpr)
                  aucs.append(roc_auc)
                  plt.plot(fpr, tpr, lw=1, alpha=0.3,
                           label='ROC fold %d (AUC = %0.2f)' % (j, roc_auc))
                  j = j + 1
              f3.close()
              if fes_violin_plot_accuracies_per_folds_i_classificador:
                  dic_acc_per_fold_i_classificador[noms_classificadors[i]] = [] # CREATE THE DICTIONARY KEYS; REMEMBER THEY MUST BE INITIALIZED FIRST OR AN ERROR IS RAISED
                  dic_acc_per_fold_i_classificador[noms_classificadors[i]] = ll_acc_classificador # add to the dictionary a list with the accuracies per fold
              #print("######################################");print("######################################",file=f4)
              #print("Accuracy mitjana (mitjana 10 accuracies): {:.4f}".format(accuracies_mitjanes_mal_procediment/10));print("Accuracy mitjana (mitjana 10 accuracies): {:.4f}".format(accuracies_mitjanes_mal_procediment/10), file=f4)
              print("\n\n##################################################");print("\n\n\n\n##################################################",file=f4)
              print("RESULTAT DESPRES DE SUMAR ELS FP,TN,TP,FN DELS 10 FOLDs\n ("+nom_clf+")");print("RESULTAT DESPRES DE SUMAR ELS FP,TN,TP,FN DELS 10 FOLDs\n ("+nom_clf+")",file=f4)
              print("####################################################"); print("####################################################", file=f4)
              brier_score = brier_score_loss(y_certes, y_predites)
              accuracy, s, e, ppv, npv = (suma_tp+suma_tn)/(suma_tp+suma_tn+suma_fp+suma_fn), suma_tp/(suma_tp+suma_fn), suma_tn/(suma_fp+suma_tn), suma_tp/(suma_tp+suma_fp), suma_tn/(suma_fn+suma_tn)
              LRmes, LRmenys = s/(1 - e), (1 - s)/e
              # compute the confidence intervals!
              print("\nAccuracy: {:.5f} (IC95 % de {:.2f} A {:.2f})".format(accuracy, IC95(accuracy,n_samples)[0],IC95(accuracy,n_samples)[1]));print("\nAccuracy: {:.5f} (IC95 % de {:.2f} A {:.2f})".format(accuracy, IC95(accuracy,n_samples)[0],IC95(accuracy,n_samples)[1]),file=f4)
              print("Sensibilitat: {:.5f} (IC95 % de {:.2f} A {:.2f})".format(s,IC95(s,n_samples)[0],IC95(s,n_samples)[1]));print("Sensibilitat: {:.5f} (IC95 % de {:.2f} A {:.2f})".format(s,IC95(s,n_samples)[0],IC95(s,n_samples)[1]), file=f4)
              print("Especificitat: {:.5f} (IC95 % de {:.2f} A {:.2f})".format(e,IC95(e,n_samples)[0],IC95(e,n_samples)[1]));print("Especificitat: {:.5f} (IC95 % de {:.2f} A {:.2f})".format(e,IC95(e,n_samples)[0],IC95(e,n_samples)[1]),file=f4)
              print("Valor pred + (PPV): {:.5f} (IC95 % de {:.2f} A {:.2f})".format(ppv,IC95(ppv,n_samples)[0],IC95(ppv,n_samples)[1]));print("Valor pred + (PPV): {:.5f} (IC95 % de {:.2f} A {:.2f})".format(ppv,IC95(ppv,n_samples)[0],IC95(ppv,n_samples)[1]), file=f4) # also called precision score; there is an sklearn function that computes it
              print("Valor pred - (NPV): {:.5f} (IC95 % de {:.2f} A {:.2f})".format(npv,IC95(npv,n_samples)[0],IC95(npv,n_samples)[1])); print("Valor pred - (NPV): {:.5f} (IC95 % de {:.2f} A {:.2f})".format(npv,IC95(npv,n_samples)[0],IC95(npv,n_samples)[1]),file=f4)
              print("LR+: {:.5f}".format(LRmes));print("LR+: {:.5f}".format(LRmes),file=f4)
              print("LR-: {:.5f}".format(LRmenys));print("LR-: {:.5f}".format(LRmenys),file=f4)
              print("Brier score final: {:.3f} ".format(brier_score));print("Brier score final: {:.3f} ".format(brier_score),file=f4) # calibration & performance of the model
              print("###############C O N F U S I O N M A T R I X##############");print("###############C O N F U S I O N M A T R I X##############",file=f4)
              print(" alzh no alzh");print(" alzh no alzh",file=f4)
              print("sumatoris (tp,fp) | +test {:2} {:2}".format(suma_tp, suma_fp)); print("sumatoris (tp,fp) | +test {:2} {:2}".format(suma_tp, suma_fp),file=f4)
              print(" (fn,tn) | -test {:2} {:2}".format(suma_fn,suma_tn));print(" (fn,tn) | -test {:2} {:2}".format(suma_fn,suma_tn),file=f4)
              print("############################################################\n");print("############################################################\n", file=f4)
              # save the confusion matrices to build a heatmap afterwards
              os.chdir("."+os.sep+carpeta)
              np.save("confusion_matrix__"+noms_classificadors[i], np.array([[suma_tp, suma_fp],[suma_fn, suma_tn]]))
              os.chdir("../")
              # WRITE TO THE PER-CLASSIFIER SUMMARY FILE (THE TXT THAT GOES INTO THE LATEX ACCURACY TABLE)
              # accuracy&sensibility&specificity&ppv&npv&LR+&LR-
              f1.write(noms_classificadors[i]+" & {:.2f} & {:.2f} & {:.2f} & {:.2f} & {:.2f} & {:.3f} & {:.3f}\\\\\n".format(accuracy*100, s*100, e*100, ppv*100, npv*100, LRmes, LRmenys))
              f2.write("{:.3f}".format(brier_score)+"& HOSMER SIMPSON\\\\\n") # https://www.data-essential.com/hosman-lemeshow-in-python/
              plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
                       label='Luck', alpha=.8)
              mean_tpr = np.mean(tprs, axis=0)
              mean_tpr[-1] = 1.0
              mean_auc = auc(mean_fpr, mean_tpr)
              std_auc = np.std(aucs)
              plt.plot(mean_fpr, mean_tpr, color='b',
                       label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
                       lw=2, alpha=.8)
              std_tpr = np.std(tprs, axis=0)
              print("\n######################################");print("\n######################################",file=f4)
              #print("mean_tpr: ",mean_tpr,"std_tpr",std_tpr)
              print("mean_auc:",mean_auc, "std_auc",std_auc);print("mean_auc:",mean_auc, "std_auc",std_auc, file=f4)
              print("######################################\n");print("######################################",file=f4)
              tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
              tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
              plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                               label=r'$\pm$ 1 std. dev.')
              plt.xlim([-0.05, 1.05])
              plt.ylim([-0.05, 1.05])
              plt.xlabel('False Positive Rate (1-especificity)')
              plt.ylabel('True Positive Rate (sensitivity)')
              plt.title(nom_clf)
              plt.legend(loc="lower right")
              plt.show()
          f1.write("\\bottomrule")
          f2.write("\\bottomrule")
          f4.close()
          f1.close()
          f2.close()
          # ONCE ALL THE MODELS HAVE BEEN TRAINED, CREATE THE DATAFRAME OF ACCURACIES PER CLASSIFIER AND PLOT IT WITH SEABORN
          df_acc_per_classificador = pd.DataFrame(dic_acc_per_fold_i_classificador) # the dictionary keys are the classifier names already defined in the list at the beginning
          sns.set(style="whitegrid")
          ax = sns.violinplot(data=df_acc_per_classificador, palette="muted", order=["Multilayer perceptron (ANN)","Logistic Regression","Support Vector Machines (linear kernel)", "Nearest Neighbours","Gaussian Naive Bayes"])
          plt.ylabel("Accuracy")
          plt.xlabel("Machine learning model (classifier)")
          #plt.ylim(0,1)
          plt.show()
          # NOW DO THE SAME WITH THE SENSITIVITIES AND SPECIFICITIES PER CLASSIFIER
          plt.close()
          sns.set(style="whitegrid") # columns: model string, whether the metric is sensitivity or specificity, and the measure (a proportion)
          df_sensEsp_per_classificador = pd.DataFrame(columns=["model","metric","p"], data=ll_sens_espe_classificador)
          ax = sns.violinplot(x="model", y="p", hue="metric",data=df_sensEsp_per_classificador, palette="muted",order=["Multilayer perceptron (ANN)","Logistic Regression","Support Vector Machines (linear kernel)", "Nearest Neighbours","Gaussian Naive Bayes"])
          plt.ylabel("sensibility/specificity")
          plt.xlabel("Machine learning model (classifier)")
          #plt.ylim(0,1)
          plt.show()
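
      The IC95() helper used above is not reproduced in this repository excerpt. Judging by how it is called (a proportion and the sample size, returning a lower and an upper bound), a minimal Wald-style sketch could look like this; it is an assumption about the helper, not the actual thesis implementation.

      import math

      def IC95(p, n):
          """Approximate 95% Wald confidence interval for a proportion p estimated from n samples."""
          half_width = 1.96 * math.sqrt(p * (1 - p) / n)
          return max(0.0, p - half_width), min(1.0, p + half_width)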

    • aux__ploteja_matrius_correlacions.py

    • aux__reduccio_dimensions.py: this file was written to implement a graphical interpretation of the explained variance of a principal component analysis (i.e. it plots the eigenvalue of each component for the previously chosen number of components). The scree plot is a graphical way of seeing whether each component of a PCA adds value to what we are trying to model, in terms of parsimony. Software such as SPSS or R can produce a scree plot easily, but Python could not (at least not while I was writing this thesis), so I coded this program to get the scree plot. A minimal sketch of the idea is shown below.
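
      The repository file itself is not reproduced here; the following is only a minimal sketch of the same idea (eigenvalue / explained variance per component) using sklearn and matplotlib, not the author's implementation.

      import matplotlib.pyplot as plt
      import numpy as np
      from sklearn.decomposition import PCA

      X = np.random.rand(57, 100)  # placeholder data: 57 subjects x 100 features
      pca = PCA().fit(X)

      # Scree plot: eigenvalue (explained variance) of each principal component.
      components = np.arange(1, len(pca.explained_variance_) + 1)
      plt.plot(components, pca.explained_variance_, marker="o")
      plt.xlabel("Principal component")
      plt.ylabel("Eigenvalue (explained variance)")
      plt.title("Scree plot")
      plt.show()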

The packages I used in this thesis are sklearn for machine learning, matplotlib and seaborn for graph generation and visualization, pandas for data analysis, and numpy for handling the multidimensional arrays that came out of the NIFTI files produced by the fMRI scans.

Footnotes

  1. The ADNI (Alzheimer's Disease Neuroimaging Initiative) gathered high-quality, free-to-use data on Alzheimer's disease, and without this data this thesis would not have been possible.

  2. Lots of files are missing because GitHub has limited space, so this is a simplification of the directories of my final thesis, not an exhaustive tour of it!

  3. NIFTI files are not here, as they occupy several Gigabytes of data. This is just for showcasing purposes.

  4. Region of Interest (ROI).

  5. (93,214,140) --> (SUBJECTS, ROIs,TIME SERIES)

  6. In my final thesis I took subjects who were at risk of developing Alzheimer's disease but did not have the disease yet (this is what is called Mild Cognitive Impairment, or MCI). At the moment of diagnosis as MCI, they had to undergo a brain fMRI scan. Then time went on, and after some years some of them developed Alzheimer's disease (MCI-c) and some did not (MCI-nc). The ones I labeled as MCI-c are straightforward to label, but the MCI-nc ones are not (because there might not be enough longitudinal data: the patient might die or be discontinued from the study). So it was important to choose a criterion to decide how many years of follow-up are enough for a patient to be considered free of Alzheimer's.

  7. The actual code got 74 subjects, as the inclusion criteria (the MCI-nc cutoff time point) changed.
