The following code accompanies the manuscript "Application of a Hermite-based measure of non-Gaussianity to normality tests and Independent Component Analysis" by authors Parul Jain, Bruce W. Knight Jr., and Jonathan D. Victor.
The code was created and run on MATLAB 2018b.
After downloading and extracting the code, run file setupPath.m in the main folder to set up the appropriate paths.
The package contains several folders differentiated by their function. Importantly, folder Hermite_MoNG contains all functions needed for using the Hermite-based normality test. One can estimate non-Gaussianity using Hermite-based measure by calling the function estimateHFPower, which takes two arguments: a row vector of the one-dimensional dataset and the array of orders of Hermite functions. For J15 defined in the manuscript, the second argument will be 0:15. The code to use the Hermite-based measure of non-Gaussianity for ICA is contained in Hermite_MoNG and SwarmFastICA. To use the Hermite-based measure for ICA, run fastica with the Hermite-based measure as the contrast function (feed argument 'g' as 'hfpowl') and supply a list of the order of Hermite functions to use (feed argument 'nList' as 0:15 for J15).
The accompanying code was written by Parul Jain, who can be contacted at jainparul321@gmail.com.
The flow of the package is defined below, along with the folders/files to use. Instructions for recreating manuscript figures are included at the end.
Simulated 1D, 2D, and multidimensional D datasets with different distributions analyzed in the manuscript were generated using code provided in the folder generateData. Files named according to the family of distributions generate 1D distributions. Multidimensional datasets can be generated using the file generateDataND, which takes in an N-D structure containing distribution parameters for each signal and the number of samples. Thirty simulated EEG datasets used in the manuscript were generated using simBCI (Lindgren et al., 2018). Five datasets are provided in the folder datasets. Due to size constraints, others could not be shared via GitHub but are available on request.
The folder demoScripts contains scripts with parameters used in the manuscript. Alternatively, the folder interactiveScripts contains code that prompts the user to input parameters. If the parameters are the same as demo scripts, they will generate the same results as demoscripts. Files provided in demoScripts and interactiveScripts can compare for multiple sample sizes or distribution shapes. They call functions in folder masterScripts to run for each set of parameters, saving results in a new folder named Results on the current path or the path provided by the user. Folders normalityTests and FastICAContrastFunctions contain the functions for standard methods compared in the manuscript.
Results need to be compiled across files to analyze the exploratory analysis in 2D datasets and the 5D and simulated EEG datasets. The manuscript's statistics and figures can be generated using files in analysisScripts. To combine results across files for exploratory 2D datasets, use curate2DExploratory.m. This script also bootstraps angular variance. For multidimensional datasets and simulated EEG datasets, use selectICASignals.m for selecting relevant components, automatically or manually.
These scripts are provided in analysisScripts. Functions to visualize results for 1D (Figures 1, 2) and 2D (Figure 3) require selecting a file generated by a demo/interactive script. To visualize results for exploration in 2D, select the file generated by compiling results.For multidimensional, or simulated EEG datasets, select the folder containing the files with selected components. For the multidimensional and simulated EEG dataset, functions to perform ANOVA and paired statistics, as done in the manuscript, are also provided.
Other supporting functions are included in the folder utility.
Start in the main folder. Create a folder (likely in the main directory) where you would like to save the results of the runs. The rest of the document will refer to this folder as the resultFolder.
Note: By default, these routines are set to run with parameters used in the paper and take several hours or even days (4 hours for 1D datasets to several days for simulated EEG datasets). For shorter run times but more limited statistics, the user can run faster demo scripts provided in the demoScripts folder. These scripts use smaller values of certain parameters finish faster and the details of these parameters are provided in the relevant sections.
-
The paper reports the results for sample sizes 1,000, 10,000, and 100,000, with 1,000 simulations for each sample size. Script demo1D generates these results in a few hours. The analysis is faster for smaller sample sizes with fewer simulations. Results for sample sizes 250 and 500, with 10 simulations for each sample size, can be generated by running fasterDemo1D.m, which takes about 2 minutes.
-
A prompt will ask if you want to select a folder to save results. If you wish to save results in a particular folder, answer 'y' or 'Y' and select a folder from the dialog box. All other inputs, including blank, will create a folder named Results in the current directory.
fasterDemo1D.m sets the relevant parameters and then calls demo1D.m.
The scripts will create a folder named for each distribution on this path. Results will be saved in corresponding folders as [distribution name]_1D_N[sample size]_S[simulations]_HF[maximum order of Hermite functions used].mat.
Also, demo1D.m will call plot1D.m internally to create one figure for each distribution and sample size, each with two panels comparing the new measure to standard normality tests and FastICA contrast functions.
-
The paper reports the results for sample sizes 1,000, 10,000, and 100,000, with 1,000 simulations for each sample size. To find maximum non-Gaussianity for a dataset, 2D data is projected on axes at different orientations defined by their angle with the abscissa. The results reported in the paper are for the angles [-90:-11 -10:0.5:-2 -1.9:0.1:1.9 2:0.5:10 11:90]. Script demo2D generates these results and can take many hours to finish. The analysis is faster for smaller sample sizes, fewer simulations, and a smaller set of angles to project. Results for sample sizes 250 and 500, with 10 simulations for each sample size, projected on the set of angles given by [-90:10:90], can be generated by running fasterDemo2D.m, which takes about 2 minutes.
-
A prompt will ask if you want to select a folder to save results. If you wish to save results in a particular folder, answer 'y' or 'Y' and select a folder from the dialog box. All other inputs, including blank, will create a folder named Results in the current directory.
fasterDemo2D.m sets the relevant parameters and then calls demo2D.m.
The scripts will create a folder named for each distribution on this path. Results will be saved in corresponding folders as [distribution name]_2D_N[sample size]_S[simulations]_HF[maximum order of Hermite functions used].mat.
Also, demo2D.m will call plot2D.m internally to create one figure for each distribution and sample size, comparing the new measure to FastICA contrast functions for the 2D distribution.
- The paper reports the results for sample sizes 1,000, 10,000, and 100,000, with 1,000 simulations for each sample size. Non-Gaussianity is estimated using different sets of Hermite functions that vary in the nature of sets (using all orders, only even orders, and only odd orders) and the size of the sets (the largest order of a set varies from 1 to 50). For each set of Hermite functions, maximum non-Gaussianity is found by projecting the data on axes at different orientations defined by their angle with the abscissa. The results reported are for the angles [-90:-11 -10:0.5:-2 -1.9:0.1:1.9 2:0.5:10 11:90]. Finally, circular variance is estimated by bootstrapping simulation results with the same parameters 1000 times, each with 1000 values. Script demo2DExploratory generates these results and takes many days to finish. The analysis is faster for lower values of the parameters. Results for sample sizes 250 and 500, with 10 simulations for each sample size, the size of sets limited to maximum order 2-4, and 10 bootstraps of 10 values each can be generated by running fasterDemo2DExploratory.m, which takes about 20 minutes.
- A prompt will ask if you want to select a folder to save results. If you wish to save results in a particular folder, answer 'y' or 'Y' and select a folder from the dialog box. All other inputs, including blank, will create a folder named Results in the current directory.
faster2DExploratory.m sets the relevant parameters and then calls demo2DExploratory.m.
The scripts will create a folder named for each distribution on this path. Results will be saved in corresponding folders as [distribution name]_2D_exploratory_N[sample size]_S[simulations]HF[maximum order of Hermite functions used][set structure].mat, where a set structure is "all", "even", or "odd", and defines the set of Hermite functions used.
demo2DExploratory.m internally calls curate2DExploratory.m to combine results across sample sizes and the number of Hermite functions for 'all,' 'even,' and 'odd' cases separately.
Also, demo2DExploratory.m will call plot2DExploratory.m internally to create a figure similar to Figure 4, with one panel per distribution.
- The paper reports the results for sample sizes 1,000, 10,000, and 100,000, with 30 datasets simulated for each sample size. For each dataset, 5 independent components were extracted, with 25 random starts per component. From these, the most non-Gaussian component was automatically selected by selectICASignals.m. Script demoND.m generates these results and takes multiple days to finish. The analysis is faster for lower values of the parameters. Results for sample sizes 250 and 500, with 5 datasets for each sample size, 2 independent components, and 5 random starts can be generated by running fasterDemoND.m, which takes about 10 minutes.
- A prompt will ask if you want to select a folder to save results. If you wish to save results in a particular folder, answer 'y' or 'Y' and select a folder from the dialog box. All other inputs, including blank, will create a folder named Results in the current directory.
fasterND.m sets the relevant parameters and then calls demoND.m.
Results will be saved in corresponding folders as [distribution name]_5D_N[sample size]_HF[maximum order of Hermite functions used]_RS[number of random starts]_D[dataset ID].mat.
demoND.m internally calls selectICASignals.m to automatically select the component with the highest value of non-Gaussianity according to each contrast function. For each file, the results are saved as [distribution name]_5D_N[sample size]_HF[maximum order of Hermite functions used]_RS[number of random starts]_D[dataset ID]_performance.mat.
Also, demoND.m will call plotErrorICA.m and pairedStatsICAinternally to create two figures per distribution, respectively: one figure with a panel per sample size and another figure combining results across sample sizes. These panels are comparable to the panels in Figure 5.
- The paper reports the results for sample sizes 1502, 3752, 7502, 15002, 37502, and 75002, with 30 simulated datasets for each sample size. For each dataset, ICA is performed on the first 50 principal components, and 20 independent components were extracted, with 25 random starts per independent component. From the returned 20 components, 2 components were manually selected. demoEEG.m generates these results for all but the larget sample size (75002) and can take many days to finish. The analysis is faster for lower values of the parameters. Results for sample sizes 1500 and 2000, with 5 datasets for each sample size, 20 principal components, 5 independent components, and 5 random starts can be generated by running fasterDemoEEG.m, and takes about 10 minutes.
- Ten simulated EEG datasets are provided in the folder datasets. If you are in the main folder, it will automatically choose the folder named datasets. If not, you will be prompted to select this folder.
- A prompt will ask if you want to select a folder to save results. If you wish to save results in a particular folder, answer 'y' or 'Y' and select a folder from the dialog box. All other inputs, including blank, will create a folder named Results in the current directory.
fasterDemoEEG sets the relevant parameters and then calls demoEEG.m.
The scripts will create a folder named simulated_eeg on this path. Results will be saved in this folder as simulated_eeg_N[sample size]_HF[maximum order of Hermite functions used]_RS[number of random starts]_D[dataset ID].mat.
Script demoEEG.m runs selectICASignals.m to select components for each dataset.
-
You will be prompted to choose if you want to select two components corresponding to ocular artifacts manually ('y'/'Y') or automatically ('N'/'n'). If you choose to do it manually, two additional prompts will assist in dividing this time-consuming step into batches. First, you will be asked if you would like to select files. You may respond 'y' or 'Y' and select files. All other inputs, including blank, will select all files in the folder. You may interrupt the script at any time due to an error or time constraint and address it via the next prompt. The second prompt asks if you want to reselect components for processed files. If you are interrupting for either reason, saving the workspace and reloading it before the next steps are recommended to load user-made parameter changes. If you interrupted the program due to a mistake that ended in a saved file, you may select the dataset used for the file in the file selection prompt and choose to reselect components for it by entering 'y' or 'Y' here. If you interrupt the program to take a break, input other than 'y' or 'Y' at this prompt will resume where you left off.
-
If you select automatically, the two components that maximize non-Gaussianness will be chosen for further consideration. If you do it manually, for each dataset and each contrast function, a figure will show all independent components, with the highest non-Gaussianity components highlighted in red boxes. The user will be asked if they would like to change the selection. Answer 'y' or 'Y' to change the selection. All other inputs, including blank, will proceed with the highlighted selection. If the user chooses to change components, they will be prompted to enter an array of two values corresponding to the array's indices, which will be highlighted. The user will be asked again if they are satisfied. Thus, the user can change the components any number of times until satisfied. Results will be saved for each dataset as simulated_eeg_N[sample size]_HF[maximum order of Hermite functions used]_RS[number of random starts]_D[dataset ID]_performance.mat.
For the manuscript, we selected components with structured noise. If multiple components with structured noise exist, those with the highest noise magnitude were selected. For this purpose, the scale of all components is made uniform. When using fewer random starts and IC, there may be cases with none or only one of the components extracted. Any component(s) may be selected for the missing component(s) as the results consider this by comparing the components optimizing the contrast functions and not the contrast function used to obtain those components (see Methods of Jain et al. 2023 for details).
fasterDemoEEG sets the relevant parameters and then calls demoEEG.m.
demoEEG also calls plotErrorICA.m, which generates a figure with a panel per sample size. These panels are comparable to those in Figure 6. Script pairedICAStats.m is also called internally, which combines results across sample sizes and generates the right-most column of Figure 6. If you interrupt the file, you can run them by calling plotErrorICA.m and pairedICAStats.m. Please make sure to load the parameters before running these scripts.