Correct isolation mapping, dup_ref handling, perf regressions, and #7…#85

Merged
guycipher merged 3 commits into master from 340minor
Mar 11, 2026

Conversation

@guycipher
Member

…2/#73/#74/#77/#78/#74/#75/#76/#79/#80/#81/#82/#83/#84

Correct handlerton compatibility for MariaDB 11.4+/12+. Correct MTR suite for 11.4+ and 12+. Fix concurrent conflict handling (issue #77). Major performance pass eliminating per-row overhead in hot paths.

--- Correctness fixes ---

Isolation levels now respected:

Add resolve_effective_isolation() that maps the MariaDB session's
SET TRANSACTION ISOLATION LEVEL (via thd_tx_isolation()) to the
corresponding TidesDB isolation level:

READ UNCOMMITTED  -> TDB_ISOLATION_READ_UNCOMMITTED
READ COMMITTED    -> TDB_ISOLATION_READ_COMMITTED
REPEATABLE READ   -> TDB_ISOLATION_REPEATABLE_READ
                     (or TDB_ISOLATION_SNAPSHOT if table option says so)
SERIALIZABLE      -> TDB_ISOLATION_SERIALIZABLE

Previously ensure_stmt_txn() and external_lock() used
share->isolation_level (static table option) for multi-statement
txns and hard-coded READ_COMMITTED for autocommit.
start_consistent_snapshot hard-coded REPEATABLE_READ. The session's
tx_isolation was completely ignored.

All three paths now call resolve_effective_isolation(thd,
share->isolation_level). TidesDB's SNAPSHOT level (no SQL
equivalent) is honoured when the session is at REPEATABLE READ and
the table option specifies SNAPSHOT.

Map MariaDB REPEATABLE_READ -> TidesDB SNAPSHOT, which is the
semantic equivalent of InnoDB's repeatable-read (write-write
conflict detection only, no read-set tracking).
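
The mapping above can be sketched as a pure function. The enums below are illustrative stand-ins for thd_tx_isolation() and the TidesDB headers, and the branch follows the table at the top of this section:

```cpp
#include <cassert>

// Hypothetical stand-ins for the server/library enums; names mirror the
// description above but the definitions are not the real headers'.
enum sql_iso { ISO_READ_UNCOMMITTED, ISO_READ_COMMITTED,
               ISO_REPEATABLE_READ, ISO_SERIALIZABLE };
enum tdb_iso { TDB_ISOLATION_READ_UNCOMMITTED, TDB_ISOLATION_READ_COMMITTED,
               TDB_ISOLATION_REPEATABLE_READ, TDB_ISOLATION_SNAPSHOT,
               TDB_ISOLATION_SERIALIZABLE };

// The session level wins; the table-level SNAPSHOT option is honoured
// only when the session sits at REPEATABLE READ.
static tdb_iso resolve_effective_isolation(sql_iso session_level,
                                           bool table_wants_snapshot) {
  switch (session_level) {
    case ISO_READ_UNCOMMITTED: return TDB_ISOLATION_READ_UNCOMMITTED;
    case ISO_READ_COMMITTED:   return TDB_ISOLATION_READ_COMMITTED;
    case ISO_REPEATABLE_READ:
      return table_wants_snapshot ? TDB_ISOLATION_SNAPSHOT
                                  : TDB_ISOLATION_REPEATABLE_READ;
    case ISO_SERIALIZABLE:     return TDB_ISOLATION_SERIALIZABLE;
  }
  return TDB_ISOLATION_READ_COMMITTED;  // defensive default
}
```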

dup_ref population (root cause of 'Can't find record' on IODKU):

write_row() now copies the conflicting row's PK bytes into dup_ref
before returning HA_ERR_FOUND_DUPP_KEY, for both PK duplicates and
unique secondary index duplicates.
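
A minimal sketch of the fix, with an illustrative struct standing in for the handler (the real code lives in write_row() and uses the server's ref_length):

```cpp
#include <cassert>
#include <cstring>

// Before returning HA_ERR_FOUND_DUPP_KEY, the conflicting row's PK bytes
// are copied into dup_ref so the server's IODKU path can re-position on
// that row. Struct, sizes, and the helper name are hypothetical.
enum { HA_ERR_FOUND_DUPP_KEY = 121 };

struct handler_sketch {
  static const unsigned REF_LEN = 8;
  unsigned char dup_ref[REF_LEN] = {0};
  unsigned ref_length = REF_LEN;

  int report_duplicate(const unsigned char *pk, unsigned pk_len) {
    unsigned n = pk_len < ref_length ? pk_len : ref_length;
    std::memcpy(dup_ref, pk, n);   // populate before returning the error
    return HA_ERR_FOUND_DUPP_KEY;
  }
};
```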

REPLACE INTO secondary index orphan fix:

Remove write_can_replace_ from the skip_unique check entirely. Now
only the tidesdb_skip_unique_check session variable skips uniqueness
checks. Both REPLACE INTO and INSERT ON DUPLICATE KEY UPDATE let
write_row() return HA_ERR_FOUND_DUPP_KEY so the server properly
handles delete+reinsert and cleans up old secondary index entries
via delete_row().

extra() fix:

HA_EXTRA_INSERT_WITH_UPDATE no longer sets write_can_replace_.

DDL isolation fix:

inplace_alter_table() (ALTER TABLE ADD INDEX) previously used
share->isolation_level for scan transactions. If the table was
created with SERIALIZABLE, the index build would track millions of
read-set entries causing unbounded memory growth. Now hard-coded to
TDB_ISOLATION_READ_COMMITTED for both initial and batch-restart
transactions. DDL never needs OCC conflict detection.

Bulk insert mid-commit isolation:

The batch re-begin in write_row() previously used
share->isolation_level. Now uses TDB_ISOLATION_READ_COMMITTED
since bulk inserts do not need snapshot consistency across batches.

Bulk insert iterator invalidation:

After bulk insert mid-commit creates a fresh transaction, the old
scan_iter held a dangling pointer to the freed txn. Now invalidate
the iterator (tidesdb_iter_free + NULL) and update scan_txn to point
to the new transaction. Bump txn_generation so other handler
objects detect the change.

--- Library fixes (src/tidesdb.c) ---

SNAPSHOT isolation read-set tracking:

tidesdb_txn_add_to_read_set() was only skipping tracking for
isolation levels < TDB_ISOLATION_REPEATABLE_READ (i.e. RU and RC).
SNAPSHOT (enum value 3) fell through and tracked every read, but
SNAPSHOT only needs write-write conflict detection.

Fix: skip read tracking when isolation_level == TDB_ISOLATION_SNAPSHOT.

Also fix tidesdb_txn_check_read_conflicts() to skip for SNAPSHOT.
Previously SNAPSHOT ran full read-conflict and SSI checks at commit,
making it behave identically to SERIALIZABLE.
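
The corrected gate reduces to two early-outs. The enum values below follow the ordering the description implies (SNAPSHOT = 3) but are otherwise illustrative:

```cpp
#include <cassert>

enum tdb_iso { TDB_ISOLATION_READ_UNCOMMITTED = 0,
               TDB_ISOLATION_READ_COMMITTED   = 1,
               TDB_ISOLATION_REPEATABLE_READ  = 2,
               TDB_ISOLATION_SNAPSHOT         = 3,
               TDB_ISOLATION_SERIALIZABLE     = 4 };

// RU/RC never tracked reads; the fix also skips SNAPSHOT, which only
// needs write-write conflict detection at commit.
static bool should_track_reads(tdb_iso level) {
  if (level < TDB_ISOLATION_REPEATABLE_READ) return false;  // RU, RC
  if (level == TDB_ISOLATION_SNAPSHOT) return false;        // the fix
  return true;  // REPEATABLE_READ and SERIALIZABLE
}
```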

--- Performance fixes ---

Debug logging gated behind srv_debug_trace:

~25 sql_print_information('[T77]...') calls were firing
unconditionally on every INSERT, SELECT, scan, lock, and commit.
At high throughput this serialises all threads on the error-log
mutex. All calls now gated with if (unlikely(srv_debug_trace)),
matching the existing TDB_TRACE macro pattern. Zero overhead when
disabled.
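
The gating pattern, sketched with a counter in place of the error log so the behaviour is observable (srv_debug_trace is the plugin's flag; everything else here is a stand-in):

```cpp
#include <cstdio>

#if defined(__GNUC__)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define unlikely(x) (x)
#endif

static bool srv_debug_trace = false;  // plugin global, OFF by default
static int trace_emitted = 0;         // observable stand-in for the log

static void trace_point(const char *what) {
  // Hot path cost when tracing is off: one predicted-not-taken branch,
  // no format work, no error-log mutex.
  if (unlikely(srv_debug_trace)) {
    ++trace_emitted;
    std::printf("[T77] %s\n", what);
  }
}
```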

tidesdb_skip_unique_check session variable:

New per-session boolean: SET SESSION tidesdb_skip_unique_check=1.
Skips both PK and UNIQUE secondary index uniqueness checks during
INSERT. Same pattern as MyRocks' rocksdb_skip_unique_check.
Eliminates the expensive per-row tidesdb_txn_get point lookup
(traverses all LSM levels) during bulk loads.

Stack-allocated index key buffers in update_row():

Replace two std::unique_ptr<uchar[]>(new uchar[MAX_KEY_LENGTH*2+2])
heap allocations per UPDATE with stack-allocated buffers (~4KB each,
well within stack limits).
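
The shape of the change, with an assumed MAX_KEY_LENGTH of 2048 (which matches the ~4 KB figure above; the real constant comes from the server headers):

```cpp
#include <cstddef>

constexpr std::size_t MAX_KEY_LENGTH = 2048;  // illustrative value

// Before: two heap allocations per UPDATE, e.g.
//   std::unique_ptr<uchar[]> buf(new uchar[MAX_KEY_LENGTH * 2 + 2]);
// After: stack buffers, released for free on return.
std::size_t update_row_key_bytes() {
  unsigned char old_key[MAX_KEY_LENGTH * 2 + 2];
  unsigned char new_key[MAX_KEY_LENGTH * 2 + 2];
  // ... pack the pre- and post-image index keys into the buffers ...
  (void)old_key; (void)new_key;
  return sizeof(old_key) + sizeof(new_key);
}
```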

Dup-check iterator caching:

Cache iterators per unique-index slot in write_row(). Reuse via
seek() instead of iter_new()/iter_free() per row. Add
free_dup_iter_cache() helper; free in close(), delete_all_rows(),
external_lock(F_UNLCK), and bulk insert mid-commit.
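
A sketch of the cache's lifecycle, with a stub iterator standing in for tidesdb's (the stub counts what would be tidesdb_iter_new() calls):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct iter_stub { int creations = 1; };  // stand-in iterator

// One slot per unique index, created lazily, then reused via seek()
// instead of iter_new()/iter_free() on every row.
struct dup_iter_cache {
  std::vector<iter_stub *> slots;
  explicit dup_iter_cache(std::size_t n_unique) : slots(n_unique, nullptr) {}

  iter_stub *get(std::size_t slot) {
    if (!slots[slot]) slots[slot] = new iter_stub;  // iter_new() once
    return slots[slot];                             // later rows: seek()
  }
  // free_dup_iter_cache(): called from close(), delete_all_rows(),
  // external_lock(F_UNLCK), and bulk insert mid-commit.
  void free_all() {
    for (auto *&it : slots) { delete it; it = nullptr; }
  }
};
```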

Transaction reuse:

Use tidesdb_txn_reset() instead of tidesdb_txn_free() +
tidesdb_txn_begin_with_isolation() in bulk insert mid-commit and
autocommit paths.

Iterator survival:

Preserve iterators across read-only autocommit statements instead of
tearing down and rebuilding.

Skip PK dup check for REPLACE INTO without secondary indexes.

Zero-copy fetch path:

Use get_val_buf_ for BLOB/encrypted path in fetch_row_by_pk().

HA_CLUSTERED_INDEX:

Add HA_CLUSTERED_INDEX to table_flags().

Encryption buffer reuse:

serialize_row() now writes to enc_buf_ and returns it, preserving
row_buf_ capacity across rows. Avoids re-allocation per encrypted
write.

Covering index decode expansion:

New decode_sort_key_part() handles INT, DATE, DATETIME, TIMESTAMP,
YEAR, and CHAR/BINARY (binary/latin1). try_keyread_from_index()
and icp_check_secondary() now use it instead of the old
integer-only type gate.

Encryption key version caching:

Call encryption_key_get_latest_version() once per statement, cache
in cached_enc_key_ver_. Invalidate via enc_key_ver_valid_ = false
at statement end in external_lock(F_UNLCK).

Per-statement syscall/THDVAR caching:

New per-statement caches in ha_tidesdb.h:

cached_time_ / cached_time_valid_  - time(NULL) for TTL
cached_sess_ttl_                   - THDVAR(thd, ttl)
cached_skip_unique_               - THDVAR(thd, skip_unique_check)
cached_thdvars_valid_             - invalidation flag

write_row: ha_thd() 4x, thd_get_ha_data() 3x, THDVAR() 3x -> each
called once and cached. Eliminates ~9 indirect/virtual calls per
row. Similar reductions in compute_row_ttl(), update_row(),
rnd_init(), index_init(), ensure_scan_iter().

All caches invalidated in external_lock(F_UNLCK).
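
The caching scheme above can be sketched as follows, with a plain struct standing in for THD/THDVAR access:

```cpp
#include <cassert>
#include <ctime>

struct session_vars { unsigned ttl; bool skip_unique_check; };  // stand-in

// Each value is read once per statement; invalidate() models the reset
// at external_lock(F_UNLCK).
struct stmt_cache {
  std::time_t cached_time_ = 0;
  bool cached_time_valid_ = false;
  unsigned cached_sess_ttl_ = 0;
  bool cached_skip_unique_ = false;
  bool cached_thdvars_valid_ = false;

  std::time_t now() {                  // one time(NULL) per statement
    if (!cached_time_valid_) {
      cached_time_ = std::time(nullptr);
      cached_time_valid_ = true;
    }
    return cached_time_;
  }
  void load(const session_vars &sv) {  // one THDVAR read per statement
    if (!cached_thdvars_valid_) {
      cached_sess_ttl_ = sv.ttl;
      cached_skip_unique_ = sv.skip_unique_check;
      cached_thdvars_valid_ = true;
    }
  }
  void invalidate() {                  // external_lock(F_UNLCK)
    cached_time_valid_ = false;
    cached_thdvars_valid_ = false;
  }
};
```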

Inplace index build:

Replace std::set<std::string> with std::unordered_set<std::string>
for O(1) duplicate detection.

--- Feature additions ---

Issue #76: tidesdb_data_home_dir

MYSQL_SYSVAR_STR (read-only) that overrides the auto-computed data
directory. When set, the plugin uses that path instead of
<mysql_datadir>/../tidesdb_data.

Issue #73: SHOW ENGINE TIDESDB STATUS

tidesdb_show_status() callback registered on the handlerton.
Displays DB-level stats (memory, storage, background queues) and
block cache stats (hits, misses, hit rate, entries).

Issue #79: Per-index USE_BTREE

ha_index_option_struct with use_btree boolean, registered via
tidesdb_index_option_list. index_type() checks per-index option
first, falls back to table-level. create() and
inplace_alter_table() apply per-index USE_BTREE when creating
secondary index CFs. Syntax: KEY idx_a (a) USE_BTREE=1.

Issue #78: index_type() override

Returns 'LSM' for default block-based tables and 'BTREE' for
USE_BTREE=1 tables. SHOW KEYS now displays correct index type.

Issue #74: ANALYZE TABLE cardinality

Samples up to 100K entries from each secondary index CF, counts
distinct key prefixes to compute rec_per_key. Before: EXPLAIN
showed rows=1 for any index lookup. After: accurate estimates.
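
The sampling pass can be sketched like this; the function name, rounding choice (ceiling), and in-memory input are all illustrative, not the plugin's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Scan up to sample_limit index entries, count distinct key prefixes,
// and estimate rec_per_key as ceil(sampled / distinct).
static unsigned long estimate_rec_per_key(
    const std::vector<std::string> &entries, std::size_t prefix_len,
    std::size_t sample_limit = 100000) {
  std::unordered_set<std::string> distinct;
  std::size_t sampled = 0;
  for (const auto &key : entries) {
    if (sampled >= sample_limit) break;
    distinct.insert(key.substr(0, prefix_len));
    ++sampled;
  }
  if (distinct.empty()) return 1;  // empty index: keep the old rows=1
  return static_cast<unsigned long>((sampled + distinct.size() - 1) /
                                    distinct.size());
}
```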

Issue #82: Table option default variables

5 session-scoped dynamic system variables via HA_TOPTION_SYSVAR:
tidesdb_default_compression (NONE/SNAPPY/LZ4/ZSTD/LZ4_FAST)
tidesdb_default_write_buffer_size (default 32MB)
tidesdb_default_bloom_filter (default ON)
tidesdb_default_use_btree (default OFF)
tidesdb_default_block_indexes (default ON)

Issue #81: Conflict information

tidesdb_print_all_conflicts global dynamic sysvar (like
innodb_print_all_deadlocks). When ON, every TDB_ERR_CONFLICT logs
a warning. Last conflict info displayed in SHOW ENGINE TIDESDB
STATUS under 'Conflicts' section.

--- New MTR tests ---

tidesdb_replace_iodku - REPLACE INTO + IODKU with PK, secondary, unique
tidesdb_index_stats - issue #78 index_type + issue #74 cardinality
tidesdb_isolation - session isolation level mapping
tidesdb_engine_status - SHOW ENGINE TIDESDB STATUS
tidesdb_per_index_btree - per-index USE_BTREE
tidesdb_data_home_dir - sysvar visibility and read-only enforcement
tidesdb_insert_conflict - INSERT vs INSERT conflict (awaits library fix)

--- Known issues ---

Issue #77: concurrent UPDATE+DELETE crashes the server (SIGSEGV) in
core MVCC conflict detection. Not fixable in the plugin layer;
requires changes to src/tidesdb.c.

Issue #83: INSERT vs INSERT conflict not detected. Library-level
bug; tidesdb_txn_get within a transaction's snapshot cannot see
uncommitted writes from other transactions by design. Enforcement
must happen at commit time via TDB_ERR_CONFLICT. Fix confirmed
coming in next library release.

UNIQUE secondary index check still creates a full merge-heap
iterator via tidesdb_iter_new per UNIQUE index per INSERT when
skip_unique_check is off. A future tidesdb_txn_exists_prefix() API
would avoid building the full merge heap.

guycipher self-assigned this Mar 11, 2026
* Deleted 20 redundant .opt files that only contained --plugin-load-add=ha_tidesdb + --plugin-maturity=unknown
* Stripped redundant lines from the 7 remaining .opt files that had actual unique options (encryption keys, loose test vars)
guycipher merged commit d0979b5 into master Mar 11, 2026
4 checks passed