-
Notifications
You must be signed in to change notification settings - Fork 34
Helper codes for databases #174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: development
Are you sure you want to change the base?
Conversation
| from hippynn.databases.xyz_database_tools import load_base_database, write_extxyz | ||
|
|
||
| def main(): | ||
| if len(sys.argv) < 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would strongly prefer that we use argparse, it is easy to set up and contains a number of quality of life features with very little work. (such as python script.py --help)
| from hippynn.databases.h5_pyanitools import PyAniFileDB | ||
|
|
||
|
|
||
| def load_base_database( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good new feature that maybe we should put in a more central location. It would be highly convenient if we could attempt to auto-detect other kinds as well, such as SNAP or ASE databases, so that code changes were minimized when moving between database formats.
| # Define base_database by the file extension | ||
| if filetype == ".npz": | ||
| inputs = ['coordinates', 'species'] | ||
| targets = ['energy', 'forces'] | ||
| energies_key_sel = 'energy' | ||
| base_database = NPZDatabase( | ||
| file=DATA_FILE, | ||
| seed=101, | ||
| allow_unfound=True, | ||
| inputs=inputs, | ||
| targets=targets, | ||
| quiet=False | ||
| ) | ||
|
|
||
| elif filetype in (".h5", ".hdf5"): | ||
| inputs = ['coordinates', 'species'] | ||
| targets = ['energies', 'forces'] | ||
| db_info = {"inputs": inputs , "targets": targets} | ||
| energies_key_sel = 'energies' | ||
| base_database = load_db( | ||
| db_info, | ||
| en_name, | ||
| force_name, | ||
| seed=101, | ||
| location=DATA_FILE, | ||
| n_workers=2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something funny is happening here where this appears to duplicate the intent of your load_base_database in xyz_database_tools ? Is it possible to use that instead?
| return db, energies_key | ||
|
|
||
|
|
||
| def write_extxyz( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice function, similar to write_h5 and write_npz, I think we should add a method to the base database which allows the user to call this from the database object. It would also be best if the expected keys were encoded as default arguments. For example, coordinates='coordinates' by default, but user can pass coordinates='positions' to use that db_name instead.
| if out_path.exists(): | ||
| if not overwrite: | ||
| raise FileExistsError(f"Path exists: {out_path}") | ||
| out_path.unlink() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not unlink the existing file until everything else has succeeded.
|
|
||
| # ─── Plotting ────────────────────────────────────────────────────────────── | ||
|
|
||
| def plot_distributions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some hippynn idoms to use with plotting commands that can be helpful here. The typical arguments are shown:bool=False, saved:Union[bool,str]=False. If saved is a str, use that as a file name for saving. If saved is true, use hippynn.settings.DEFAULT_PLOT_FILETYPE along with a default name for the plot. Adding these arguments enables running systems without windowing.
| out["cell_offsets"] = offs @ cell_np | ||
| return out | ||
|
|
||
| def calculate_min_distance(self, periodic=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a routine for this in hippynn.pretraining.calc_min_distances which takes advantage of the graph system, fully GPU, batchable, etc. With small modifications it could return info for find_pairs as well. I don't know how relevant this is in the intended use case for metadata, but a feature which is important in the MLIAP calculations is that if the cells become small enough then you can have the same two particles have multiple sets of pairs. (similarly, consider what happens if we send a 1-atom cell into this code--i might be mistaken but it would actually return no pairs, right? Is that a problem?)
This pull request consists of codes for two main functionalities:
It currently supports databases with numpy arrays (.npz extension) and HDF5 format (.h5 or .hdf5 extensions).
Codes were originally written by @JaredKeithAveritt and @bnebgen-LANL. Polished by @yingwaili.