n2p2ReductionData - scripts to reduce a structure database using several machine learning approaches
This repository provides several scripts (Python and Bash) to reduce the database of structures used to train neural network potentials with the n2p2 package. Requirements: the n2p2 package and the scikit-learn library.
Build a .h5 file from function.data, which contains the G functions produced by the nnp-scaling program of n2p2.
- Input file: function.data (default), required. Use --infile=otherfile to change the name of the input file.
- Output file: functions.h5 (default). Use --outfile=otherfile to change the name of the output file.
Search for the list of selected structures using the KMeans, DBSCAN or HDBSCAN clustering method.
- Input file: functions.h5 (default), required. Use --infile=otherfile to change the name of the input file.
- Output file: numStructs.csv (default). Use --outfile=otherfile to change the name of the output file.
- Second output file: clusters.csv (default). Use --outclustersfile=othername.csv to change it. This file contains the structure number, Z and the cluster number.
- The default clustering method is KMeans. Change it with --method=DBSCAN, --method=HDBSCAN or --method=None (one cluster per Z: all data for a given Z are in one cluster).
- The default number of clusters for KMeans is 5. To change it, use --k=numberOfClusters.
- The default value of the minsample hyperparameter is 2. Change it with --minsample=OtherValue.
- The percentage of structures selected per cluster is set to 0.20%. To change it, use --p=newValue. Note that for DBSCAN and HDBSCAN, all outlier structures are selected. If newValue<0, int(-newValue) rows are selected per Z (not a percentage).
- By default, the optimal value of the eps hyperparameter of DBSCAN is computed using the NearestNeighbors+Knee method. To fix it instead, use --eps=FixedPositiveValue.
- By default, no dimensionality reduction is applied. If needed, add --reddim=PCA.
- To set the dimension after reduction, use --kr=value, where value is an integer or a real number between 0 and 1.0 (see the scikit-learn documentation for PCA).
- By default, the data are not scaled. To scale them, use --scaling=MinMax, --scaling=Standard or --scaling=MaxAbs.
- By default, all data (rows) are used. To reduce the data, use --reddata=MaxG or --reddata=StdG. In this case, the G column with the largest value of the chosen criterion (MaxG or StdG) is identified, the data are sorted by this column, and they are reduced to kreddata rows taken with a linear step.
- Set the reduced data size with --kreddata=integervalue.
- The default seed is 111. Change it with --seed=OtherInteger.
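The --reddata reduction described in the list above can be pictured with a short NumPy sketch. It is an illustration under assumptions, not the code of the script itself: the function name `reduce_rows` is hypothetical, MaxG is read as "the column holding the largest value", StdG as "the column with the largest standard deviation", and "linear step" as evenly spaced positions along the sorted rows.

```python
import numpy as np


def reduce_rows(G, kreddata, criterion="MaxG"):
    """Return the indices of about `kreddata` rows, evenly spaced after sorting.

    Hypothetical helper; the reading of MaxG and StdG is an assumption.
    """
    G = np.asarray(G, dtype=float)
    if criterion == "MaxG":
        col = int(np.argmax(G.max(axis=0)))   # column containing the largest value
    elif criterion == "StdG":
        col = int(np.argmax(G.std(axis=0)))   # column with the largest spread
    else:
        raise ValueError("criterion must be 'MaxG' or 'StdG'")
    order = np.argsort(G[:, col], kind="stable")   # row indices sorted by that column
    if kreddata >= len(order):
        return order
    # evenly spaced positions along the sorted rows, first and last included
    positions = np.linspace(0, len(order) - 1, kreddata).round().astype(int)
    return order[np.unique(positions)]


rng = np.random.default_rng(111)
G = rng.random((1000, 8))          # 1000 rows of 8 made-up G values
kept = reduce_rows(G, kreddata=50, criterion="StdG")
print(len(kept))
```

Sorting before sampling keeps both extremes of the chosen column in the reduced set, which a purely random subsample would not guarantee.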
Search for the list of selected structures based on G values on a grid.
- Input file: functions.h5 (default), required. Use --infile=otherfile to change the name of the input file.
- Output file: numStructs.csv (default). Use --outfile=otherfile to change the name of the output file.
- The default dimensionality reduction method is PCA. Change it with --method=TSNE or --method=None (no reduction).
- By default, G is not scaled (--scaling=None). Change it with --scaling=Standard, --scaling=MinMax or --scaling=AbsMax.
- The number of dimensions after reduction is 1. To change it, use --k=value, where value is the number of dimensions for t-SNE (from 1 to 3) or PCA. For PCA, k can also be a real number between 0 and 1.0. In this case, the number of dimensions is chosen automatically so that the explained variance exceeds the fraction given by n_components (see the scikit-learn documentation).
- By default, one structure is selected at random from each grid cell. Other options: --minmax=1 selects, in each cell, the structure nearest to the box borders; --minmax=2 selects one structure per cell, the one nearest to xmin or xmax using normalized distances (x rescaled from [xmin, xmax] to [0, 1]); --minmax=3 is the same as 2, but the sum of the minimum distances is used to select the structure.
- The percentage used to set the number of grid points is 0.20%. To change it, use --p=newValue. The number of grid points is m = int((number of data/100*percentage)**(1.0/n_components)). If p<0, m = int(-p) in each direction.
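To make the grid rule above concrete, here is a minimal NumPy sketch of the default behaviour (one random structure per grid cell). The function names are hypothetical, m is treated as the number of cells along each direction, and the binning details (equal-width cells between the minimum and maximum of each coordinate) are assumptions rather than a description of the script.

```python
import numpy as np


def grid_points_per_axis(n_rows, p, n_components):
    """Number of grid points m in each direction, following the rule above."""
    if p < 0:
        return int(-p)
    return int((n_rows / 100 * p) ** (1.0 / n_components))


def select_on_grid(X, p, seed=111):
    """Pick one row at random from every occupied cell of a regular grid."""
    X = np.asarray(X, dtype=float)
    n_rows, n_components = X.shape
    m = max(grid_points_per_axis(n_rows, p, n_components), 1)
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    span = np.where(xmax > xmin, xmax - xmin, 1.0)   # avoid division by zero
    # integer cell index along each direction, in 0..m-1
    cells = np.minimum((m * (X - xmin) / span).astype(int), m - 1)
    by_cell = {}
    for row, cell in enumerate(map(tuple, cells)):
        by_cell.setdefault(cell, []).append(row)
    rng = np.random.default_rng(seed)
    return sorted(int(rng.choice(rows)) for rows in by_cell.values())


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))     # stand-in for reduced G values
chosen = select_on_grid(X, p=20)
print(grid_points_per_axis(1000, 20, 2), len(chosen))
```

Because empty cells contribute nothing, the number of selected rows is at most m**n_components and is usually smaller for clustered data.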
Compute the KDE for each structure of the database.
- Input file: functions.h5 (default), required. Use --databasefile=otherfile to change the name of the input file. The data in this file are used to fit the KDE.
- Second input file (optional): None (default). Use --descfile=othername to give an .h5 file. The KDE is then computed, using the G values from this file, for the structures it contains. If None, the KDE is computed for the structures in the databasefile input file.
- By default, the maximum KDE is searched using the data in databasefile. Otherwise, the one given in maxfile is used: --maxfile=namecontainingmaxKDE.
- Output file: resultsKDE.csv (default). Use --outfile=otherfile to change the name of the output file.
- By default, no dimensionality reduction is applied. If needed, add --reddim=PCA.
- To set the dimension after reduction, use --k=value, where value is an integer or a real number between 0 and 1.0 (see the scikit-learn documentation for PCA).
- By default, the data are not scaled. To scale them, use --scaling=MinMax, --scaling=Standard or --scaling=MaxAbs.
- The default seed is 111. Change it with --seed=OtherInteger.
- scott is the default bandwidth method for the KDE. Change it with --bw=othermethod (see the KernelDensity documentation in scikit-learn).
- 0 is the default value of rtol. To change it, use --rtol=otherpositivevalue (see the KernelDensity documentation in scikit-learn).
- 40 is the default value of leaf_size. To change it, use --leaf_size=otherinteger (see the KernelDensity documentation in scikit-learn).
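The options above point to scikit-learn's KernelDensity. As a rough picture of what a Gaussian KDE with a Scott-type bandwidth computes, here is a NumPy-only sketch. It is a hand-written stand-in, not the script's code: the function name is hypothetical, the kernel is assumed to be Gaussian, and the bandwidth n**(-1/(d+4)) is the usual Scott rule of thumb for n rows in d dimensions.

```python
import numpy as np


def gaussian_kde_log_density(train, query, bandwidth=None):
    """Log-density of each query row under a Gaussian KDE fitted on `train`."""
    train = np.asarray(train, dtype=float)
    query = np.asarray(query, dtype=float)
    n, d = train.shape
    if bandwidth is None:
        bandwidth = n ** (-1.0 / (d + 4))   # Scott's rule of thumb
    # squared distances between every query row and every training row
    sq = ((query[:, None, :] - train[None, :, :]) ** 2).sum(axis=2)
    expo = -0.5 * sq / bandwidth**2
    # log-sum-exp over training rows, stabilised by the row maximum
    top = expo.max(axis=1, keepdims=True)
    log_sum = top[:, 0] + np.log(np.exp(expo - top).sum(axis=1))
    log_norm = np.log(n) + d * np.log(bandwidth) + 0.5 * d * np.log(2 * np.pi)
    return log_sum - log_norm


rng = np.random.default_rng(111)
database = rng.normal(size=(500, 3))           # made-up G values used for the fit
probes = np.array([[0.0, 0.0, 0.0],            # inside the bulk of the data
                   [6.0, 6.0, 6.0]])           # far from every training row
log_dens = gaussian_kde_log_density(database, probes)
print(log_dens[0] > log_dens[1])
```

Structures with a low density sit in sparsely sampled regions of G space, which is why a density estimate is useful for deciding what to keep.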
Build a numStructs.csv file from the clusters.csv file produced by Clustering.py.
- Input file: clusters.csv (default), required. Use --infile=otherfile to change the name of the input file.
- Output file: numStructs.csv (default). Use --outfile=otherfile to change the name of the output file.
- The percentage of structures selected per cluster is set to 0.20%. To change it, use --p=newValue. If newValue<0, int(-newValue) rows are selected per Z (not a percentage).
- The default seed is 111. Change it with --seed=OtherInteger.
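The percentage-per-cluster selection above can be sketched with the standard library alone. This is an illustration under assumptions: the function name is hypothetical, each row is taken to be a (structure number, Z, cluster number) triple, rows are drawn at random within each (Z, cluster) group, and at least one row is kept per group. The negative --p mode is left out because its split across clusters is not specified here.

```python
import random
from collections import defaultdict


def select_from_clusters(rows, p, seed=111):
    """Return sorted structure numbers, drawing p percent of rows per (Z, cluster)."""
    groups = defaultdict(list)
    for struct, z, cluster in rows:
        groups[(z, cluster)].append(struct)
    rng = random.Random(seed)
    selected = set()
    for key in sorted(groups):                 # fixed order keeps runs reproducible
        members = groups[key]
        count = max(1, int(len(members) * p / 100))   # assumption: never drop a group
        count = min(count, len(members))
        selected.update(rng.sample(members, count))
    return sorted(selected)


# Made-up table: 120 structures, two elements (Z = 1 and 8), three clusters
rows = [(s, 8 if s % 2 == 0 else 1, s % 3) for s in range(120)]
structures = select_from_clusters(rows, p=20)
print(len(structures))
```

A structure can appear in several rows when it contains several atoms of the same element, so the result is collected in a set before sorting.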
Build a numStructs.csv file from the resultsKDE.csv file produced by KDE.py.
- Input file: resultsKDE.csv (default), required. Use --infile=otherfile to change the name of the input file.
- Output file: numStructs.csv (default). Use --numfile=othernumfile to change the name of this file.
- The selection method can be Regular, Logarithmic or Smallest KDE. Use --method=yourmethod to set it.
- The percentage of selected structures is set to 10.0%. To change it, use --p=newValue.
Build a selInput.data file from input.data (the database for nnp-train) and numStructs.csv.
- Input file: input.data (default), required. Use --infile=otherfile to change the name of the input file.
- Input file: numStructs.csv (default), required. Use --numfile=othernumfile to change the name of this file.
- Output file: selInput.data (default). Use --outfile=otherfile to change the name of the output file.
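The filtering step above amounts to copying only some of the begin ... end blocks of input.data. The sketch below shows the idea on an in-memory sample. The function name is hypothetical, and the numbering convention (structures counted from `first_index` in file order) is an assumption, since the layout of numStructs.csv is not described here.

```python
def filter_structures(lines, selected, first_index=0):
    """Yield the lines of the begin...end blocks whose running index is selected."""
    selected = set(selected)
    index = first_index - 1
    keep = False
    for line in lines:
        word = line.strip()
        if word == "begin":        # a new structure starts
            index += 1
            keep = index in selected
        if keep:
            yield line
        if word == "end":          # the structure is complete
            keep = False


# Tiny made-up database with three structures
sample = """\
begin
comment structure A
atom 0.0 0.0 0.0 H 0.0 0.0 0.0 0.0 0.0
energy -0.5
charge 0.0
end
begin
comment structure B
atom 0.0 0.0 0.0 O 0.0 0.0 0.0 0.0 0.0
energy -75.0
charge 0.0
end
begin
comment structure C
atom 0.0 0.0 0.0 H 0.0 0.0 0.0 0.0 0.0
energy -0.4
charge 0.0
end
"""

kept = "".join(filter_structures(sample.splitlines(keepends=True), selected=[0, 2]))
print(kept)
```

Working line by line keeps memory use flat, which matters because training databases can be large.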
To reduce the database, run the following scripts in this order:
python buildGh5.py
python Clustering.py
# and optionally python buildListFromClustersFile.py
python buildSelectedData.py
Or
python buildGh5.py
python SelectionOnGrid.py
python buildSelectedData.py
Or
python buildGh5.py
python KDE.py
python SelectionByKDE.py
python buildSelectedData.py
As examples, see xAllKMeans, xAllDBSCAN, xAllSelOnGrid, xAllKDE and xAllHDBSCAN in the testH2O folder.
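For readers who prefer to drive the clustering route from Python, the sequence above can be chained with subprocess. This is only a sketch: the helper names are hypothetical, the scripts are assumed to be in the current directory, and only options listed in this README are passed.

```python
import subprocess
import sys


def clustering_pipeline(method="KMeans", k=5, p=0.20):
    """Commands for the Clustering.py route, in the order given above."""
    py = sys.executable
    return [
        [py, "buildGh5.py", "--infile=function.data", "--outfile=functions.h5"],
        [py, "Clustering.py", "--infile=functions.h5", "--outfile=numStructs.csv",
         f"--method={method}", f"--k={k}", f"--p={p}"],
        [py, "buildSelectedData.py", "--infile=input.data",
         "--numfile=numStructs.csv", "--outfile=selInput.data"],
    ]


def run_all(commands):
    """Run each step in turn and stop at the first one that fails."""
    for cmd in commands:
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)


commands = clustering_pipeline(method="DBSCAN")
# run_all(commands)   # uncomment in a folder that contains the scripts and data
```

Using check=True means a failed step raises an exception instead of letting later steps run on missing or stale files.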
Build several metrics (min, max, mean, Chi2, ...) for several .h5 data files.
- Input files: funct*h5 (default), the list of .h5 files to read, containing G values from several databases. Change it with --infiles=list of files. Example: --infiles=file1.h5,file2.h5,fileT*.h5
- Output files: metrics (default). Use --outfile=otherprefix to change it. outfile is used as the prefix of the output files: one .h5 file for all Z and one .csv file for each Z, with data sorted by Chi2, ...
- Number of bins: 100 (default). To change it, use --bins=newNumberOfBins.
- To compute the Chi2 in N dimensions (N = number of G), a histogram is built with nbins = int(-p) in each dimension if p<0 (default -10). If p>0, the total number of bins is int(total number of G*p/100).
- Abdulrahman Allouche (Lyon 1 University)
This software is licensed under the GNU General Public License version 3 or any later version (GPL-3.0-or-later).