See details in eloqdata/eloqkv#224.
Crash Diagnosis
- The core dump happens while ScanCommand is tearing down its scanner. When the std::unique_ptr<txservice::store::DataStoreScanner> in
ExecuteCommand goes out of scope (src/redis_service.cpp:6388), RocksDBScanner is destroyed and releases its rocksdb::Iterator.
- During iterator cleanup, RocksDB runs CleanupSuperVersionHandle, which tries to lock the DB’s internal mutex. Because the mutex has already been
destroyed, pthread_mutex_lock returns EINVAL, so rocksdb::port::PthreadCall("lock") calls abort() (frames 3–7 in the stack trace).
- This means the underlying rocksdb::DBImpl was shut down while the iterator was still alive. RocksDBHandlerImpl::Shutdown() deletes db_ and destroys its
mutex (store_handler/rocksdb_handler.cpp:3115), while ScanForward simply hands the raw db_ pointer to RocksDBScanner without holding db_mux_
or any lifetime guard (store_handler/rocksdb_handler.cpp:1223).
Why It Happens
If another thread runs Shutdown() (for example during node failover) while a scan is still active, the DB object is deleted underneath the scan; the
iterator held by the scan then crashes when it is destroyed.
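The ordering problem can be modeled without RocksDB. The sketch below is only an illustration: FakeDb, FakeIterator and CleanupSucceeds are hypothetical names, and a boolean flag stands in for the DB's internal mutex so that the model itself never touches freed memory.

```cpp
#include <cassert>

// Hypothetical stand-in for the DB; not a RocksDB or EloqKV type.
struct FakeDb {
    bool mutex_alive = true;  // models the DB's internal mutex
};

// Hypothetical stand-in for the iterator owned by the scanner.
class FakeIterator {
public:
    FakeIterator(const FakeDb* db, bool* cleanup_ok)
        : db_(db), cleanup_ok_(cleanup_ok) {
        assert(db_ != nullptr && cleanup_ok_ != nullptr);
    }

    // Models iterator cleanup needing the DB mutex: records whether
    // the mutex was still usable at destruction time.
    ~FakeIterator() { *cleanup_ok_ = db_->mutex_alive; }

private:
    const FakeDb* db_;
    bool* cleanup_ok_;
};

// Returns true if iterator cleanup found a live mutex.
bool CleanupSucceeds(bool shutdown_before_scanner_dies) {
    FakeDb db;  // storage outlives the iterator, so the model has no UB
    bool cleanup_ok = false;
    {
        FakeIterator it(&db, &cleanup_ok);
        if (shutdown_before_scanner_dies) {
            db.mutex_alive = false;  // models Shutdown() deleting db_
        }
    }  // iterator destroyed here, as when the scanner goes out of scope
    return cleanup_ok;
}
```

In this model, CleanupSucceeds(false) corresponds to the normal order (scanner destroyed first), and CleanupSucceeds(true) corresponds to the order seen in the core dump. In the real code the second case is a use-after-free, not a flag check.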
Fix Ideas
- Keep db_mux_ locked (or otherwise reference-count the DB) for the full lifetime of each RocksDBScanner, preventing shutdown until scanners
finish.
- Or have Shutdown() wait until all scanners/iterators complete before deleting db_.
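- At minimum, check whether GetDBPtr() is still valid before creating the scanner and abort the scan early if the DB is closing. This only narrows
the race window; it does not protect a scanner that already exists when Shutdown() runs.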
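A minimal sketch of how the ideas above could fit together, assuming a small lifetime helper guarded by its own mutex. DbLifetime, ScannerGuard and FakeDb are hypothetical names used for illustration only; they are not part of RocksDB or the existing handler, and the real change would need to fit RocksDBHandlerImpl's existing locking around db_mux_.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>

// Hypothetical stand-in for rocksdb::DB.
struct FakeDb {};

// Hypothetical helper: counts live scanners and makes Shutdown() wait for them.
class DbLifetime {
public:
    DbLifetime() : db_(std::make_unique<FakeDb>()) {}

    // Registers one scanner and returns the DB pointer,
    // or returns nullptr if the DB is closing or already closed.
    FakeDb* TryAcquire() {
        std::lock_guard<std::mutex> lk(mu_);
        if (closing_ || !db_) return nullptr;
        ++active_;
        return db_.get();
    }

    // Called when a scanner (and its iterator) has been destroyed.
    void Release() {
        std::lock_guard<std::mutex> lk(mu_);
        assert(active_ > 0);
        if (--active_ == 0) cv_.notify_all();
    }

    // Rejects new scanners, waits for live ones, then deletes the DB.
    void Shutdown() {
        std::unique_lock<std::mutex> lk(mu_);
        closing_ = true;
        cv_.wait(lk, [this] { return active_ == 0; });
        db_.reset();
    }

    bool IsOpen() {
        std::lock_guard<std::mutex> lk(mu_);
        return db_ != nullptr;
    }

    int ActiveScanners() {
        std::lock_guard<std::mutex> lk(mu_);
        return active_;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::unique_ptr<FakeDb> db_;
    int active_ = 0;
    bool closing_ = false;
};

// RAII guard a scanner would hold for its whole lifetime.
class ScannerGuard {
public:
    explicit ScannerGuard(DbLifetime& owner)
        : owner_(owner), db_(owner.TryAcquire()) {}
    ~ScannerGuard() {
        if (db_ != nullptr) owner_.Release();
    }
    ScannerGuard(const ScannerGuard&) = delete;
    ScannerGuard& operator=(const ScannerGuard&) = delete;

    bool ok() const { return db_ != nullptr; }  // false: abort the scan early
    FakeDb* db() const { return db_; }

private:
    DbLifetime& owner_;
    FakeDb* db_;
};
```

With this shape, a scan would construct a ScannerGuard first, return an error if ok() is false, and destroy the iterator before the guard. One trade-off to weigh: Shutdown() blocks for as long as the slowest scan runs, and it would deadlock if called from a thread that itself holds a guard, so a failover path may need a timeout or a way to cancel in-flight scans.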
Logs around shutdown or DB restarts will likely confirm this sequence.