Increasingly slow to add events to ASDF dataset #72

@stepholinger

I am using pyasdf to store catalogs of detected events, which can include thousands of ObsPy events and waveforms.

When adding these objects to an ASDF dataset with ds.add_quakeml(), I found that each call takes longer as the number of events already in the dataset grows. I suspect this may be due to checking for redundancy between the new event and all the events already in the dataset.
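To illustrate why that suspicion would explain the slowdown, here is a toy cost model. It is not pyasdf code and makes no claim about pyasdf internals: it simply assumes, hypothetically, that every single-event add has to visit all events already stored. The function names are made up for this sketch.

```python
# Toy cost model, NOT pyasdf code. Assumption: each single-event add
# visits every event already stored, plus the new event itself.


def work_adding_one_at_a_time(num_events):
    """Count event visits when events are added one call at a time."""
    stored = 0
    visits = 0
    for _ in range(num_events):
        visits += stored + 1  # check against stored events, then write the new one
        stored += 1
    return visits


def work_adding_as_catalog(num_events):
    """Count event visits when all events are written in a single call."""
    return num_events


for n in (100, 1000, 10000):
    print(n, work_adding_one_at_a_time(n), work_adding_as_catalog(n))
```

Under that assumption the per-event loop does n(n+1)/2 visits in total, so the time per add grows linearly with the dataset size and the total grows quadratically, while a single catalog add stays linear.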

I had initially designed a workflow that looped through a list of detections and, for each detection, added an ObsPy event together with its waveforms, which were linked to the event via the event_id. This makes some intuitive sense, but becomes very slow when adding thousands of waveforms and events. The slowdown appears to be avoided by first collecting the events in an ObsPy catalog and then adding the entire catalog to the ASDF dataset in one call. This is a fine solution, but it should probably be mentioned in the documentation as the preferred method.

The following snippet demonstrates the issue.

import obspy
from obspy.core.event import Event
from obspy.core.event import Catalog
from obspy.core.event import Origin
import time
import pyasdf
import matplotlib.pyplot as plt

# open ASDF dataset
ds = pyasdf.ASDFDataSet("test_dataset.h5")

# set up run
num_it = 100
times = []

# add each event individually
for i in range(num_it):

    # start timer
    timer = time.time()

    # make obspy event
    event = Event()
    event.event_type = "ice quake"
    origin = Origin()
    origin.time = obspy.UTCDateTime(2000+i,1,1)
    event.origins = [origin]

    # add event to ASDF dataset
    ds.add_quakeml(event)

    # stop timer
    runtime = time.time() - timer
    times.append(runtime)

# plot results of adding individual events
plt.plot(range(num_it), times)
ax = plt.gca()
ax.set_ylabel("Time to add event (seconds)")
ax.set_xlabel("Number of events in dataset")
plt.show()
print("Total time to add " + str(num_it) + " events individually: " + str(sum(times)) + " seconds")

# put in an obspy catalog first
event_list = []

# start timer
timer = time.time()

for i in range(num_it):

    # make obspy event
    event = Event()
    event.event_type = "ice quake"
    origin = Origin()
    origin.time = obspy.UTCDateTime(2000+i,1,1)
    event.origins = [origin]

    # add event to list
    event_list.append(event)

catalog = Catalog(event_list)
ds.add_quakeml(catalog)

# stop timer
runtime = time.time() - timer

# print result of adding catalog
print("Total time to add " + str(num_it) + " events in catalog: " + str(runtime) + " seconds")
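For detections that arrive in a stream and cannot all be held in one catalog, a middle ground might be to buffer events and add them in chunks. The sketch below is only an illustration: add_in_batches, add_batch, and batch_size are hypothetical names, not part of pyasdf or ObsPy, and the demo uses plain integers in place of events so that it runs without either library.

```python
def add_in_batches(events, add_batch, batch_size=500):
    """Pass events to add_batch in lists of at most batch_size items.

    add_batch is any callable that accepts a list of events.
    Returns the number of times add_batch was called.
    """
    batch = []
    calls = 0
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            add_batch(batch)
            calls += 1
            batch = []  # start a fresh list so the flushed one is left untouched
    if batch:
        # flush whatever is left over after the loop
        add_batch(batch)
        calls += 1
    return calls


# Demo with stand-in data: integers instead of events, a list instead of a dataset.
received = []
calls = add_in_batches(range(1200), received.append, batch_size=500)
print(calls, [len(b) for b in received])
```

With the names from the snippet above, the callback could be something like lambda batch: ds.add_quakeml(Catalog(batch)), so the dataset is written once per chunk instead of once per event. I have not timed this variant.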

[Image: Screenshot from 2021-08-11 16-38-26]
