
Conversation

@yingwaili (Collaborator):

This pull request adds code for two main functionalities:

  1. Generate metadata from databases.
  2. Generate xyz files from databases.

It currently supports databases stored as numpy arrays (.npz extension) and in HDF5 format (.h5 or .hdf5 extensions).

The code was originally written by @JaredKeithAveritt and @bnebgen-LANL and polished by @yingwaili.

import sys

from hippynn.databases.xyz_database_tools import load_base_database, write_extxyz


def main():
    if len(sys.argv) < 2:
Collaborator:

I would strongly prefer that we use argparse; it is easy to set up and provides a number of quality-of-life features with very little work (such as python script.py --help).
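A minimal argparse sketch of that suggestion (the argument names data_file, --output, and --overwrite are placeholders for illustration, not the script's actual interface):

import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Generate metadata and/or extxyz files from a database."
    )
    parser.add_argument("data_file", help="Path to the .npz, .h5, or .hdf5 database")
    parser.add_argument("--output", default="database.extxyz", help="Output extxyz file name")
    parser.add_argument("--overwrite", action="store_true", help="Overwrite an existing output file")
    return parser.parse_args()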

from hippynn.databases.h5_pyanitools import PyAniFileDB


def load_base_database(
Collaborator:

This looks like a good new feature that maybe we should put in a more central location. It would be highly convenient if we could also attempt to auto-detect other kinds of databases, such as SNAP or ASE databases, so that code changes are minimized when moving between database formats.
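A rough sketch of extension-based auto-detection (the mapping is illustrative; the SNAP and ASE branches are placeholders rather than confirmed hippynn loaders):

import os

def detect_database_format(path):
    # Dispatch purely on the file extension; unknown extensions raise early.
    ext = os.path.splitext(path)[1].lower()
    if ext == ".npz":
        return "npz"
    if ext in (".h5", ".hdf5"):
        return "h5"
    if ext == ".json":
        return "snap"   # placeholder for SNAP-style JSON databases
    if ext in (".xyz", ".extxyz", ".traj"):
        return "ase"    # formats readable with ase.io.read
    raise ValueError(f"Unrecognized database extension: {ext}")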

Comment on lines +60 to +85
# Define base_database by the file extension
if filetype == ".npz":
    inputs = ['coordinates', 'species']
    targets = ['energy', 'forces']
    energies_key_sel = 'energy'
    base_database = NPZDatabase(
        file=DATA_FILE,
        seed=101,
        allow_unfound=True,
        inputs=inputs,
        targets=targets,
        quiet=False
    )

elif filetype in (".h5", ".hdf5"):
    inputs = ['coordinates', 'species']
    targets = ['energies', 'forces']
    db_info = {"inputs": inputs, "targets": targets}
    energies_key_sel = 'energies'
    base_database = load_db(
        db_info,
        en_name,
        force_name,
        seed=101,
        location=DATA_FILE,
        n_workers=2
Collaborator:

Something funny is happening here: this appears to duplicate the intent of your load_base_database in xyz_database_tools. Is it possible to use that instead?
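A hedged sketch of that reuse; the exact signature of load_base_database is not visible in this excerpt, so the arguments below (DATA_FILE, en_name, force_name) are assumptions for illustration only:

from hippynn.databases.xyz_database_tools import load_base_database

# Reuse the shared loader instead of re-implementing the extension dispatch here.
base_database, energies_key_sel = load_base_database(DATA_FILE, en_name, force_name)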

    return db, energies_key


def write_extxyz(
Collaborator:

This is a nice function. As with write_h5 and write_npz, I think we should add a method to the base database that allows the user to call this from the database object. It would also be best if the expected keys were encoded as default arguments: for example, coordinates='coordinates' by default, but the user can pass coordinates='positions' to use that db_name instead.
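A sketch of the suggested signature, assuming the default db_names used elsewhere in this PR (the parameter list and body are illustrative, not the final API):

def write_extxyz(self, path, coordinates="coordinates", species="species",
                 energies="energies", forces="forces", overwrite=False):
    # Each keyword maps an extxyz field to the db_name holding it, so a user
    # whose positions are stored under 'positions' can pass coordinates='positions'.
    ...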

    if out_path.exists():
        if not overwrite:
            raise FileExistsError(f"Path exists: {out_path}")
        out_path.unlink()
Collaborator:

I would not unlink the existing file until everything else has succeeded.
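One way to do that is to write to a temporary path first and only replace the existing file once the write succeeds; a minimal sketch (the helper name and callback style are illustrative):

from pathlib import Path

def safe_overwrite(out_path: Path, write_fn, overwrite: bool = False):
    if out_path.exists() and not overwrite:
        raise FileExistsError(f"Path exists: {out_path}")
    tmp_path = out_path.with_suffix(out_path.suffix + ".tmp")
    write_fn(tmp_path)          # write the new contents first
    tmp_path.replace(out_path)  # only replace the original after success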


# ─── Plotting ──────────────────────────────────────────────────────────────

def plot_distributions(
Collaborator:

There are some hippynn idioms for plotting commands that can be helpful here. The typical arguments are shown: bool = False and saved: Union[bool, str] = False. If saved is a str, use that as a file name for saving. If saved is True, use hippynn.settings.DEFAULT_PLOT_FILETYPE along with a default name for the plot. Adding these arguments enables running on systems without windowing.
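A minimal sketch of that idiom (the default file name and the exact handling of hippynn.settings.DEFAULT_PLOT_FILETYPE are assumptions based on the description above):

from typing import Union
import matplotlib.pyplot as plt
import hippynn

def plot_distributions(data, shown: bool = False, saved: Union[bool, str] = False):
    fig, ax = plt.subplots()
    ax.hist(data)
    if saved:
        # A string selects the file name; True falls back to a default name
        # with the configured default plot file type.
        fname = saved if isinstance(saved, str) else "distributions" + hippynn.settings.DEFAULT_PLOT_FILETYPE
        fig.savefig(fname)
    if shown:
        plt.show()
    plt.close(fig)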

out["cell_offsets"] = offs @ cell_np
return out

def calculate_min_distance(self, periodic=True):
Collaborator:

We have a routine for this in hippynn.pretraining.calc_min_distances, which takes advantage of the graph system and is fully GPU, batchable, etc. With small modifications it could return info for find_pairs as well. I don't know how relevant this is in the intended use case for metadata, but a feature which is important in the MLIAP calculations is that if the cells become small enough, the same two particles can have multiple sets of pairs. (Similarly, consider what happens if we send a 1-atom cell into this code: I might be mistaken, but it would actually return no pairs, right? Is that a problem?)
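To illustrate the 1-atom-cell concern, here is a small standalone example (not hippynn code) showing that a single atom in a cell shorter than the cutoff still has pairs with its own periodic images, which a search that only excludes same-index pairs in the home cell would miss:

import itertools
import numpy as np

cell = 3.0         # cubic cell edge, shorter than the cutoff
cutoff = 5.0
pos = np.zeros(3)  # one atom at the origin

pairs = []
for offset in itertools.product((-1, 0, 1), repeat=3):
    if offset == (0, 0, 0):
        continue  # skip the self-pair in the home cell
    shift = cell * np.array(offset, dtype=float)
    dist = np.linalg.norm((pos + shift) - pos)
    if dist < cutoff:
        pairs.append((offset, dist))

print(f"{len(pairs)} image pairs within the cutoff")  # nonzero, not zero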
