n2p2ReductionData - some scripts to reduce structure database using several machine learning approaches

This repository provides several scripts (Python and Bash) to reduce the database of structures used to train neural network potentials with the n2p2 package. Requirements: the n2p2 package and the scikit-learn library.

buildGh5.py

Builds a .h5 file from function.data, which contains the G functions produced by the nnp-scaling program of n2p2.

Input file: function.data (default), required. Use --infile=otherfile to change the name of the input file.
Output file: functions.h5 (default). Use --outfile=otherfile to change the name of the output file.

Clustering.py

Builds the list of selected structures using the KMeans, DBSCAN or HDBSCAN clustering method.

Input file: functions.h5 (default), required. Use --infile=otherfile to change the name of the input file.
Output file: numStructs.csv (default). Use --outfile=otherfile to change the name of the output file.
Second output file: clusters.csv (default). Use --outclustersfile=othername.csv to change it. This file contains the structure number, Z and cluster number.
The default clustering method is KMeans. Change it with --method=DBSCAN, --method=HDBSCAN or --method=None (one cluster per Z: all data for a given Z are in one cluster).
The default number of clusters for KMeans is 5. To change it, use --k=numberOfClusters.
The default value of the minsample hyperparameter is 2. Change it with --minsample=OtherValue.
The percentage of structures selected per cluster is 0.20% by default. To change it, use --p=newValue. Note that for DBSCAN and HDBSCAN, all outlier structures are selected. If newValue<0, int(-newValue) rows are selected per Z (not a percentage).
By default, the eps hyperparameter of DBSCAN is computed using the NearestNeighbors + Knee method. To fix its value, use --eps=FixedPositiveValue.
By default, no dimensionality reduction is applied. If needed, add --reddim=PCA.
To set the dimension after reduction, use --kr=value, where value is an integer or a real number between 0 and 1.0 (see the scikit-learn documentation for PCA).
By default, the data are not scaled. To scale them, use --scaling=MinMax, --scaling=Standard or --scaling=MaxAbs.
By default, all data (rows) are used. To reduce the data, use --reddata=MaxG or --reddata=StdG. In this case, the G column with the largest value (maximum for MaxG, standard deviation for StdG) is found, the data are sorted by this column, and the data are reduced to kreddata rows taken with a linear step.
Set the reduced data size with --kreddata=integervalue.
The default seed is 111. Change it with --seed=OtherInteger.
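
As a rough illustration of the default path (KMeans, then a percentage of each cluster), the sketch below works on a single array of G values for one element. The function name `select_by_kmeans`, the use of NumPy's random generator and the rule of keeping at least one row per cluster are assumptions made for this example, not taken from Clustering.py. Only the p > 0 case is covered.

```python
import numpy as np
from sklearn.cluster import KMeans


def select_by_kmeans(G, k=5, p=0.20, seed=111):
    """Cluster the rows of G with KMeans and keep p percent of each cluster.

    Hypothetical helper: G is an (n_rows, n_G) array for a single element Z.
    At least one row is kept per non-empty cluster (an assumption).
    """
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(G)
    selected = []
    for cluster in range(k):
        members = np.flatnonzero(labels == cluster)
        if len(members) == 0:
            continue
        # Number of rows to keep in this cluster, between 1 and its size.
        n_keep = min(len(members), max(1, int(len(members) * p / 100.0)))
        selected.extend(rng.choice(members, size=n_keep, replace=False))
    return np.sort(np.array(selected)), labels
```

With the default p of 0.20%, small clusters contribute exactly one row each under this reading, so the number of selected rows is bounded below by the number of clusters.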

SelectionOnGrid.py

Builds the list of selected structures based on the G values on a grid.

Input file: functions.h5 (default), required. Use --infile=otherfile to change the name of the input file.
Output file: numStructs.csv (default). Use --outfile=otherfile to change the name of the output file.
The default dimensionality reduction method is PCA. Change it with --method=TSNE or --method=None (no reduction).
By default, G is not scaled. Change this with --scaling=Standard, --scaling=MinMax, --scaling=AbsMax, or --scaling=None (default).
The number of dimensions after reduction is 1. To change it, use --k=value, where value is the number of dimensions for t-SNE (from 1 to 3) or PCA. For PCA, k can also be a real number between 0 and 1.0. In this case, the number of dimensions is chosen automatically so that the explained variance exceeds the fraction given as n_components (see the scikit-learn documentation).
By default, one structure is selected at random from each grid cell. Other options: --minmax=1 selects, in each grid cell, the structure nearest to the box borders; --minmax=2 selects one structure per cell, the one nearest to xmin or xmax using normalized distances (x from xmin to xmax is mapped to 0 to 1); --minmax=3 is the same as 2, but the sum of the minimum distances is used to select the structure.
The percentage used to set the number of grid points is 0.20% by default. To change it, use --p=newValue. The number of grid points per direction is m = int((N/100*p)**(1.0/n_components)), where N is the size of the dataset and p the percentage. If p<0, m = int(-p) for each direction.
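
The grid rule above can be sketched as follows: reduce G with PCA, compute m from p, assign each row to a grid cell, and pick one row at random per occupied cell (the default behaviour described for SelectionOnGrid.py). The function name `select_on_grid`, the equal-width cells between the minimum and maximum of each component, and the clamp of m to at least 1 are assumptions of this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA


def select_on_grid(G, k=1, p=0.20, seed=111):
    """Pick one random row per occupied grid cell in PCA space.

    Hypothetical helper; returns the selected row indices and m,
    the number of grid points per direction.
    """
    rng = np.random.default_rng(seed)
    X = PCA(n_components=k, random_state=seed).fit_transform(G)
    n_components = X.shape[1]
    if p < 0:
        m = int(-p)
    else:
        m = int((len(G) / 100.0 * p) ** (1.0 / n_components))
    m = max(m, 1)  # assumption: never fewer than one cell per direction

    # Map every coordinate to an integer cell index in [0, m - 1].
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum((m * (X - lo) / span).astype(int), m - 1)

    # Group rows by cell and draw one row from each occupied cell.
    _, cell_id = np.unique(cells, axis=0, return_inverse=True)
    cell_id = np.ravel(cell_id)
    selected = [rng.choice(np.flatnonzero(cell_id == c))
                for c in range(cell_id.max() + 1)]
    return np.sort(np.array(selected)), m
```

For example, `select_on_grid(G, k=2, p=-4)` uses a 4 x 4 grid and therefore returns at most 16 rows.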

KDE.py

Computes the KDE (kernel density estimate) for each structure of the database.

Input file: functions.h5 (default), required. Use --databasefile=otherfile to change the name of the input file. The data in this file are used to fit the KDE.
Second input file (optional): None (default). Set it with --descfile=othername (an .h5 file). The KDE is then computed, using the G values from this file, for the structures it contains. If None, the KDE is computed for the structures in the databasefile input file.
By default, the maximum KDE is searched in the databasefile data. Otherwise, the values given in maxfile are used: --maxfile=namecontainingmaxKDE
Output file: resultsKDE.csv (default). Use --outfile=otherfile to change the name of the output file.
By default, no dimensionality reduction is applied. If needed, add --reddim=PCA.
To set the dimension after reduction, use --k=value, where value is an integer or a real number between 0 and 1.0 (see the scikit-learn documentation for PCA).
By default, the data are not scaled. To scale them, use --scaling=MinMax, --scaling=Standard or --scaling=MaxAbs.
The default seed is 111. Change it with --seed=OtherInteger.
scott is the default bandwidth method for the KDE. Change it with --bw=othermethod (see the KernelDensity documentation in scikit-learn).
0 is the default value of rtol. To change it, use --rtol=otherpositivevalue (see the KernelDensity documentation in scikit-learn).
40 is the default value of leaf_size. To change it, use --leaf_size=otherinteger (see the KernelDensity documentation in scikit-learn).
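
A minimal sketch of the fit-then-score step with scikit-learn's KernelDensity is given below, using the defaults listed above (scott bandwidth, rtol of 0, leaf_size of 40). The function name `kde_scores` and the division by the maximum density of the database are assumptions about how the "max KDE" is used; the string bandwidth "scott" needs scikit-learn 1.2 or later.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def kde_scores(G_database, G_query=None, bw="scott", rtol=0.0, leaf_size=40):
    """Fit a KDE on G_database and evaluate it on G_query (or on itself).

    Hypothetical helper; returns the density relative to the largest
    density found in the database (assumed normalisation).
    """
    kde = KernelDensity(bandwidth=bw, rtol=rtol, leaf_size=leaf_size)
    kde.fit(G_database)
    # score_samples returns the log-density of each row.
    max_density = np.exp(kde.score_samples(G_database)).max()
    target = G_database if G_query is None else G_query
    return np.exp(kde.score_samples(target)) / max_density
```

Rows with a small relative density lie in sparsely populated regions of G space, which is the quantity a later selection step can work from.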

buildListFromClustersFile.py

Builds a numStructs.csv file from the clusters.csv file produced by Clustering.py.

Input file: clusters.csv (default), required. Use --infile=otherfile to change the name of the input file.
Output file: numStructs.csv (default). Use --outfile=otherfile to change the name of the output file.
The percentage of structures selected per cluster is 0.20% by default. To change it, use --p=newValue. If newValue<0, int(-newValue) rows are selected per Z (not a percentage).
The default seed is 111. Change it with --seed=OtherInteger.
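
The selection from a clusters file might look like the sketch below, which groups rows by (Z, cluster) and samples a percentage of the structure numbers in each group. The column order (structure number, Z, cluster number), the handling of a header line and the function name `select_from_clusters` are assumptions; the real layout of clusters.csv should be checked before reuse. Only the p > 0 case is covered.

```python
import csv
import io
import random
from collections import defaultdict


def select_from_clusters(csv_text, p=0.20, seed=111):
    """Return a sorted list of structure numbers sampled per (Z, cluster).

    Hypothetical helper: expects rows of the form structure,Z,cluster.
    """
    rng = random.Random(seed)
    groups = defaultdict(set)
    for row in csv.reader(io.StringIO(csv_text)):
        try:
            structure, z, cluster = (int(value) for value in row[:3])
        except ValueError:
            continue  # skip a header or an incomplete line
        groups[(z, cluster)].add(structure)

    selected = set()
    for members in groups.values():
        # Keep p percent of the group, and at least one structure.
        n_keep = min(len(members), max(1, int(len(members) * p / 100.0)))
        selected.update(rng.sample(sorted(members), n_keep))
    return sorted(selected)
```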

SelectionByKDE.py

Builds a numStructs.csv file from the resultsKDE.csv file produced by KDE.py.

Input file: resultsKDE.csv (default), required. Use --infile=otherfile to change the name of the input file.
Output file: numStructs.csv (default). Use --numfile=othernumfile to change the name of numfile.
The selection method can be Regular, Logarithmic or Smallest KDE. Use --method=yourmethod to set it.
The percentage of selected structures is 10.0% by default. To change it, use --p=newValue.
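
The README does not define the three selection methods, so the sketch below shows only one plausible reading: rows are sorted by increasing KDE, then taken at evenly spaced positions (Regular), at geometrically spaced positions that favour low densities (Logarithmic), or from the low-density end only (Smallest). The function name `select_by_kde` and the method strings are hypothetical.

```python
import numpy as np


def select_by_kde(kde_values, p=10.0, method="Regular"):
    """Select about p percent of the rows from their KDE values.

    Hypothetical helper; the meaning given to each method is an assumption.
    """
    kde_values = np.asarray(kde_values, dtype=float)
    order = np.argsort(kde_values)  # lowest density first
    n_total = len(order)
    n_keep = min(n_total, max(1, int(n_total * p / 100.0)))
    if method == "Smallest":
        positions = np.arange(n_keep)
    elif method == "Regular":
        positions = np.linspace(0, n_total - 1, n_keep).round().astype(int)
    elif method == "Logarithmic":
        positions = np.geomspace(1, n_total, n_keep).round().astype(int) - 1
    else:
        raise ValueError("unknown method: %s" % method)
    # Rounding can produce repeated positions, so duplicates are dropped.
    return np.sort(order[np.unique(positions)])
```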

buildSelectedData.py

Builds a selInput.data file from input.data (the database for nnp-train) and numStructs.csv.

Input file: input.data (default), required. Use --infile=otherfile to change the name of the input file.
Input file: numStructs.csv (default), required. Use --numfile=othernumfile to change the name of numfile.
Output file: selInput.data (default). Use --outfile=otherfile to change the name of the output file.
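
To show what this filtering step amounts to, the sketch below keeps only the requested structures from the lines of an input.data file, where each structure is delimited by a `begin` line and an `end` line. The function names and the `first_index` parameter are hypothetical; whether numStructs.csv counts structures from 0 or from 1 is not stated here, so the offset is left to the caller.

```python
def iter_structures(lines):
    """Yield each structure as a list of lines, from 'begin' to 'end'."""
    block = None
    for line in lines:
        keyword = line.strip()
        if keyword == "begin":
            block = [line]
        elif block is not None:
            block.append(line)
            if keyword == "end":
                yield block
                block = None


def filter_structures(lines, wanted, first_index=0):
    """Return the lines of the structures whose number is in wanted.

    first_index is the number given to the first structure (assumption).
    """
    wanted = set(wanted)
    kept = []
    for number, block in enumerate(iter_structures(lines), start=first_index):
        if number in wanted:
            kept.extend(block)
    return kept
```

A caller would read input.data with `readlines()`, pass the structure numbers read from numStructs.csv, and write the returned lines to selInput.data.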

Build the reduced data

To reduce the database, run the following scripts in this order:

python buildGh5.py
python Clustering.py
#  and, if needed, python buildListFromClustersFile.py
python buildSelectedData.py

Or

python buildGh5.py
python SelectionOnGrid.py
python buildSelectedData.py

Or

python buildGh5.py
python KDE.py
python SelectionByKDE.py
python buildSelectedData.py

As examples, see xAllKMeans, xAllDBSCAN, xAllSelOnGrid, xAllKDE and xAllHDBSCAN in testH2O folder
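
The three sequences above can also be driven from Python. The sketch below only assembles and runs the commands in the listed order; the dictionary keys, the function names and the per-script option mapping are hypothetical, and the scripts are assumed to be in the current directory. The optional buildListFromClustersFile.py step is left out.

```python
import subprocess
import sys

# Script order for each route, as listed in this section.
PIPELINES = {
    "clustering": ["buildGh5.py", "Clustering.py", "buildSelectedData.py"],
    "grid": ["buildGh5.py", "SelectionOnGrid.py", "buildSelectedData.py"],
    "kde": ["buildGh5.py", "KDE.py", "SelectionByKDE.py",
            "buildSelectedData.py"],
}


def build_commands(pipeline, options=None):
    """Return one command (a list of arguments) per script of the pipeline.

    options maps a script name to a list of extra flags for that script.
    """
    options = options or {}
    return [[sys.executable, script, *options.get(script, [])]
            for script in PIPELINES[pipeline]]


def run_pipeline(pipeline, options=None):
    """Run the scripts in order and stop at the first failure."""
    for command in build_commands(pipeline, options):
        subprocess.run(command, check=True)
```

For instance, `run_pipeline("clustering", {"Clustering.py": ["--method=DBSCAN"]})` would run the first sequence with DBSCAN instead of KMeans.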

metrics.py

Builds several metrics (min, max, mean, Chi2, ...) for several .h5 data files.

Input files: funct*h5 (default), the list of .h5 files to read, containing G values from several databases. Change it with --infiles=list of files. Example: --infiles=file1.h5,file2.h5,fileT*.h5
Output files: metrics (default). Use --outfile=otherprefix to change it. outfile is used as the prefix of the output files: one .h file for all Z and one .csv for each Z, with data sorted by Chi2, ...
bins: 100 by default. To change it, use --bins=newNumberOfBins
To compute the Chi2 in N dimensions (N = number of G), a histogram is built with nbins = int(-p) for each dimension if p<0 (default -10). If p>0, the total number of bins is int(total number of G*p/100).
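
The exact Chi2 definition used by metrics.py is not given here. As an illustration of a histogram-based comparison between two databases, the sketch below computes a symmetric chi-square distance for a single G column with shared bin edges. The function name `chi2_distance`, the normalisation of the histograms and the one-dimensional setting are assumptions of this example.

```python
import numpy as np


def chi2_distance(a, b, bins=100):
    """Symmetric chi-square distance between the histograms of a and b.

    Hypothetical helper: returns 0 for identical distributions and 1 when
    the two histograms do not overlap.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Shared range so that both histograms use the same bin edges.
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    hist_a, edges = np.histogram(a, bins=bins, range=(lo, hi))
    hist_b, _ = np.histogram(b, bins=edges)
    pa = hist_a / hist_a.sum()
    pb = hist_b / hist_b.sum()
    total = pa + pb
    occupied = total > 0  # empty bins contribute nothing
    return 0.5 * np.sum((pa[occupied] - pb[occupied]) ** 2 / total[occupied])
```

A reduced database whose G columns give small distances to the full database reproduces its distribution well under this measure.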

Authors

Abdulrahman Allouche (Lyon 1 University)

License

This software is licensed under the GNU General Public License version 3 or any later version (GPL-3.0-or-later).
