Hi,
I'm trying to run some YCSB workloads on Viper and I'm hitting an error (copied below) when running Workload F (https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadf) with multiple threads. Workload F is a 50% read / 50% read-modify-write mix, so it issues many concurrent gets alongside updates. I'm curious whether you encountered this issue when evaluating Viper, or whether you know what might be happening. It happens on almost every run with 16 threads and less often with fewer threads; I haven't seen it at all with 1 thread. I think I've also seen it occasionally on other workloads with multiple threads, but workload F consistently triggers the error.
I'm running workload F with 600K operations and records. I run Load E first, which is always successful. I'm using 32B keys and 1140B values, which roughly match the sizes used by YCSB; when keys are smaller than 32B, they are padded with spaces. I wrote a small wrapper around Viper that I compile to a shared library (libviper_wrapper.so) and link to a Java YCSB client for Viper; I don't think the error stems from my wrapper because most of the workloads run correctly, and workload F always runs for a while before the error occurs.
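For context, the wrapper is essentially a thin C shim around Viper's client calls, roughly along the lines of the sketch below. This is a simplified approximation, not the real code: the `viper_get`/`viper_put` names and the `std::array` key/value types are stand-ins (the real wrapper uses fixed-size record types like the `BMRecord` ones visible in the crash report), and the Viper calls mirror the usage shown in the Viper README.

```cpp
// Simplified sketch of the wrapper, not the exact code. std::array stands in
// for the fixed-size record types the real wrapper uses.
#include <algorithm>
#include <array>
#include <cstring>
#include <memory>
#include <string>

#include "viper/viper.hpp"

using Key = std::array<char, 32>;      // 32B keys
using Value = std::array<char, 1140>;  // 1140B values
using DB = viper::Viper<Key, Value>;

// Created once at startup via viper::Viper<...>::create(...) (omitted here).
static std::unique_ptr<DB> db;

// YCSB keys are often shorter than 32B, so they are padded with spaces.
static Key make_key(const std::string& ycsb_key) {
  Key k;
  k.fill(' ');
  std::memcpy(k.data(), ycsb_key.data(), std::min(ycsb_key.size(), k.size()));
  return k;
}

extern "C" bool viper_get(const char* key, char* out_value) {
  auto client = db->get_client();  // the real wrapper reuses one client per YCSB thread
  Value v;
  const bool found = client.get(make_key(key), &v);
  if (found) std::memcpy(out_value, v.data(), v.size());
  return found;
}

extern "C" void viper_put(const char* key, const char* value) {
  auto client = db->get_client();
  Value v;
  std::memcpy(v.data(), value, v.size());  // caller passes a 1140B buffer
  client.put(make_key(key), v);
}
```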
When I execute the run phase of workload F, it runs for a while before a segmentation fault eventually occurs in Viper, resulting in output like the following. The crash is reported through the JVM via YCSB, but the faulting frame is in Viper code.
```
... # truncated
Start resizing.
Added data file "/mnt/pmem/viper/data27"
Allocated 43690 blocks in 1 GiB.
End resizing.
Start resizing.
Added data file "/mnt/pmem/viper/data28"
Allocated 43690 blocks in 1 GiB.
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007ff85365f82d, pid=6129, tid=6162
#
# JRE version: OpenJDK Runtime Environment (21.0.6+7) (build 21.0.6+7-Debian-1)
# Java VM: OpenJDK 64-Bit Server VM (21.0.6+7-Debian-1, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [libviper_wrapper.so+0x5282d] viper::Viper<viper::kv_bm::BMRecord<unsigned char, 32ul>, viper::kv_bm::BMRecord<unsigned char, 1140ul> >::ReadOnlyClient::get_const_entry_from_offset(viper::KeyValueOffset) const+0x6d
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as: ... # truncated
End resizing.
[7.225s][warning][os] Loading hsdis library failed
# ... truncated
```
Running `addr2line -e libviper_wrapper.so 0x5282d` points to line 1523 in 79ebf6e:

```cpp
const auto& entry = this->viper_.v_blocks_[block]->v_pages[page].data[slot];
```
The error also occasionally happens in `get_value_from_offset`, on a similar line (line 1563 in 79ebf6e):

```cpp
const VPage& v_page = this->viper_.v_blocks_[block]->v_pages[page];
```
I tried to isolate the specific part of those lines where the segmentation fault occurs, and it appears to be `this->viper_.v_blocks_[block]`.
I have a theory about why this may be happening (although I'm not very familiar with the Viper source code, so this could be completely off base). The crash always seems to occur during resizing, between an `Allocated 43690 blocks in 1 GiB.` line and an `End resizing.` line. I also noticed that the resizing code updates `v_blocks_`, and while a compare-and-swap prevents multiple threads from attempting to resize at the same time, reads of `v_blocks_` don't appear to be protected by a lock. Is it possible there's a race condition here?
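To make the suspected pattern concrete, here is a generic, self-contained sketch (not Viper's actual code, and `v_blocks_` may well be implemented differently): if the block table is a `std::vector` that the resizing thread grows with `push_back` while reader threads index into it without synchronization, a reallocation can free the backing array out from under a concurrent `v_blocks_[block]` access.

```cpp
// Generic illustration of the suspected race; not Viper code.
// A writer grows a vector of block pointers while readers index into it
// without synchronization. When push_back reallocates, readers may
// dereference the old, freed backing array (undefined behavior / SIGSEGV).
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct FakeBlock { int data[1024] = {}; };

std::vector<FakeBlock*> blocks;   // stands in for v_blocks_
std::atomic<bool> done{false};

void reader() {
  while (!done.load(std::memory_order_relaxed)) {
    const std::size_t n = blocks.size();
    if (n == 0) continue;
    // Unsynchronized access: if the writer reallocates right now,
    // this indexes into freed memory.
    volatile int x = blocks[n - 1]->data[0];
    (void)x;
  }
}

void resizer() {
  for (int i = 0; i < 200000; ++i) {
    blocks.push_back(new FakeBlock{});  // may reallocate and free the old array
  }
  done.store(true);
}

int main() {
  std::thread r1(reader), r2(reader);
  std::thread w(resizer);
  w.join();
  r1.join();
  r2.join();
  return 0;
}
```

Running a sketch like this under ThreadSanitizer should flag the read-during-reallocation race, which is roughly the interaction I suspect between the reader threads and the resizing thread.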
I'm running experiments on Debian Trixie, using Optane PM (the bug manifests with both 128GiB non-interleaved and 512GiB interleaved PM) and ext4-DAX as the file system. I'm happy to provide additional details about my environment.