
Conversation

@yingwaili (Collaborator):

This pull request adds code for two main functionalities:

  1. Generate metadata from databases.
  2. Generate xyz files from databases.

It currently supports databases stored as numpy arrays (.npz extension) and in HDF5 format (.h5 or .hdf5 extensions).

The code was originally written by @JaredKeithAveritt and @bnebgen-LANL and polished by @yingwaili.

import sys

from hippynn.databases.xyz_database_tools import load_base_database, write_extxyz


def main():
    if len(sys.argv) < 2:
Collaborator:

I would strongly prefer that we use argparse; it is easy to set up and provides a number of quality-of-life features with very little work (such as python script.py --help).
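A minimal argparse sketch of that suggestion (the argument names data_file, --output, and --overwrite are placeholders for illustration, not the script's actual interface):

import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Generate metadata and/or extxyz files from a database."
    )
    parser.add_argument("data_file", help="Path to the .npz, .h5, or .hdf5 database")
    parser.add_argument("--output", default="database.extxyz", help="Output extxyz file name")
    parser.add_argument("--overwrite", action="store_true", help="Overwrite an existing output file")
    return parser.parse_args()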

from hippynn.databases.h5_pyanitools import PyAniFileDB


def load_base_database(
Collaborator:

This looks like a good new feature that maybe we should put in a more central location. It would be highly convenient if we could also attempt to auto-detect other kinds of databases, such as SNAP or ASE databases, so that code changes are minimized when moving between database formats.
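A rough sketch of extension-based auto-detection (the mapping is illustrative; the SNAP and ASE branches are placeholders rather than confirmed hippynn loaders):

import os

def detect_database_format(path):
    # Dispatch purely on the file extension; unknown extensions raise early.
    ext = os.path.splitext(path)[1].lower()
    if ext == ".npz":
        return "npz"
    if ext in (".h5", ".hdf5"):
        return "h5"
    if ext == ".json":
        return "snap"   # placeholder for SNAP-style JSON databases
    if ext in (".xyz", ".extxyz", ".traj"):
        return "ase"    # formats readable with ase.io.read
    raise ValueError(f"Unrecognized database extension: {ext}")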

Comment on lines +60 to +85
# Define base_database by the file extension
if filetype == ".npz":
    inputs = ['coordinates', 'species']
    targets = ['energy', 'forces']
    energies_key_sel = 'energy'
    base_database = NPZDatabase(
        file=DATA_FILE,
        seed=101,
        allow_unfound=True,
        inputs=inputs,
        targets=targets,
        quiet=False
    )

elif filetype in (".h5", ".hdf5"):
    inputs = ['coordinates', 'species']
    targets = ['energies', 'forces']
    db_info = {"inputs": inputs, "targets": targets}
    energies_key_sel = 'energies'
    base_database = load_db(
        db_info,
        en_name,
        force_name,
        seed=101,
        location=DATA_FILE,
        n_workers=2
Collaborator:

Something funny is happening here: this appears to duplicate the intent of your load_base_database in xyz_database_tools. Is it possible to use that instead?
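A hedged sketch of that reuse; the exact signature of load_base_database is not visible in this excerpt, so the arguments below (DATA_FILE, en_name, force_name) are assumptions for illustration only:

from hippynn.databases.xyz_database_tools import load_base_database

# Reuse the shared loader instead of re-implementing the extension dispatch here.
base_database, energies_key_sel = load_base_database(DATA_FILE, en_name, force_name)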

    return db, energies_key


def write_extxyz(
Collaborator:

This is a nice function. As with write_h5 and write_npz, I think we should add a method to the base database that allows the user to call this from the database object. It would also be best if the expected keys were encoded as default arguments: for example, coordinates='coordinates' by default, but the user can pass coordinates='positions' to use that db_name instead.
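A sketch of the suggested signature, assuming the default db_names used elsewhere in this PR (the parameter list and body are illustrative, not the final API):

def write_extxyz(self, path, coordinates="coordinates", species="species",
                 energies="energies", forces="forces", overwrite=False):
    # Each keyword maps an extxyz field to the db_name holding it, so a user
    # whose positions are stored under 'positions' can pass coordinates='positions'.
    ...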

    if out_path.exists():
        if not overwrite:
            raise FileExistsError(f"Path exists: {out_path}")
        out_path.unlink()
Collaborator:

I would not unlink the existing file until everything else has succeeded.
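One way to do that is to write to a temporary path first and only replace the existing file once the write succeeds; a minimal sketch (the helper name and callback style are illustrative):

from pathlib import Path

def safe_overwrite(out_path: Path, write_fn, overwrite: bool = False):
    if out_path.exists() and not overwrite:
        raise FileExistsError(f"Path exists: {out_path}")
    tmp_path = out_path.with_suffix(out_path.suffix + ".tmp")
    write_fn(tmp_path)          # write the new contents first
    tmp_path.replace(out_path)  # only replace the original after success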


# ─── Plotting ──────────────────────────────────────────────────────────────

def plot_distributions(
Collaborator:

There are some hippynn idioms for plotting commands that can be helpful here. The typical arguments are shown: bool = False and saved: Union[bool, str] = False. If saved is a str, use that as a file name for saving. If saved is True, use hippynn.settings.DEFAULT_PLOT_FILETYPE along with a default name for the plot. Adding these arguments enables running on systems without windowing.
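A minimal sketch of that idiom (the default file name and the exact handling of hippynn.settings.DEFAULT_PLOT_FILETYPE are assumptions based on the description above):

from typing import Union
import matplotlib.pyplot as plt
import hippynn

def plot_distributions(data, shown: bool = False, saved: Union[bool, str] = False):
    fig, ax = plt.subplots()
    ax.hist(data)
    if saved:
        # A string selects the file name; True falls back to a default name
        # with the configured default plot file type.
        fname = saved if isinstance(saved, str) else "distributions" + hippynn.settings.DEFAULT_PLOT_FILETYPE
        fig.savefig(fname)
    if shown:
        plt.show()
    plt.close(fig)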

out["cell_offsets"] = offs @ cell_np
return out

def calculate_min_distance(self, periodic=True):
Collaborator:

We have a routine for this in hippynn.pretraining.calc_min_distances, which takes advantage of the graph system and is fully GPU, batchable, etc. With small modifications it could return info for find_pairs as well. I don't know how relevant this is in the intended use case for metadata, but a feature which is important in the MLIAP calculations is that if the cells become small enough, the same two particles can have multiple sets of pairs. (Similarly, consider what happens if we send a 1-atom cell into this code: I might be mistaken, but it would actually return no pairs, right? Is that a problem?)
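To illustrate the 1-atom-cell concern, here is a small standalone example (not hippynn code) showing that a single atom in a cell shorter than the cutoff still has pairs with its own periodic images, which a search that only excludes same-index pairs in the home cell would miss:

import itertools
import numpy as np

cell = 3.0         # cubic cell edge, shorter than the cutoff
cutoff = 5.0
pos = np.zeros(3)  # one atom at the origin

pairs = []
for offset in itertools.product((-1, 0, 1), repeat=3):
    if offset == (0, 0, 0):
        continue  # skip the self-pair in the home cell
    shift = cell * np.array(offset, dtype=float)
    dist = np.linalg.norm((pos + shift) - pos)
    if dist < cutoff:
        pairs.append((offset, dist))

print(f"{len(pairs)} image pairs within the cutoff")  # nonzero, not zero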
