Correct isolation mapping, dup_ref handling, perf regressions, and #7…#85

Merged
guycipher merged 3 commits into master from 340minor
Mar 11, 2026

Conversation

@guycipher
Member

…2/#73/#74/#77/#78/#74/#75/#76/#79/#80/#81/#82/#83/#84

Correct handlerton compatibility for MariaDB 11.4+/12+. Correct MTR suite for 11.4+ and 12+. Fix concurrent conflict handling (issue #77). Major performance pass eliminating per-row overhead in hot paths.

--- Correctness fixes ---

Isolation levels now respected:

Add resolve_effective_isolation() that maps the MariaDB session's
SET TRANSACTION ISOLATION LEVEL (via thd_tx_isolation()) to the
corresponding TidesDB isolation level:

READ UNCOMMITTED  -> TDB_ISOLATION_READ_UNCOMMITTED
READ COMMITTED    -> TDB_ISOLATION_READ_COMMITTED
REPEATABLE READ   -> TDB_ISOLATION_REPEATABLE_READ
                     (or TDB_ISOLATION_SNAPSHOT if table option says so)
SERIALIZABLE      -> TDB_ISOLATION_SERIALIZABLE

Previously ensure_stmt_txn() and external_lock() used
share->isolation_level (static table option) for multi-statement
txns and hard-coded READ_COMMITTED for autocommit.
start_consistent_snapshot hard-coded REPEATABLE_READ. The session's
tx_isolation was completely ignored.

All three paths now call resolve_effective_isolation(thd,
share->isolation_level). TidesDB's SNAPSHOT level (no SQL
equivalent) is honoured when the session is at REPEATABLE READ and
the table option specifies SNAPSHOT.

Map MariaDB REPEATABLE_READ -> TidesDB SNAPSHOT, which is the
semantic equivalent of InnoDB's repeatable-read (write-write
conflict detection only, no read-set tracking).
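
The mapping above can be sketched as a pure function. The enums below are illustrative stand-ins for thd_tx_isolation() and the TidesDB headers, and the branch follows the table at the top of this section:

```cpp
#include <cassert>

// Hypothetical stand-ins for the server/library enums; names mirror the
// description above but the definitions are not the real headers'.
enum sql_iso { ISO_READ_UNCOMMITTED, ISO_READ_COMMITTED,
               ISO_REPEATABLE_READ, ISO_SERIALIZABLE };
enum tdb_iso { TDB_ISOLATION_READ_UNCOMMITTED, TDB_ISOLATION_READ_COMMITTED,
               TDB_ISOLATION_REPEATABLE_READ, TDB_ISOLATION_SNAPSHOT,
               TDB_ISOLATION_SERIALIZABLE };

// The session level wins; the table-level SNAPSHOT option is honoured
// only when the session sits at REPEATABLE READ.
static tdb_iso resolve_effective_isolation(sql_iso session_level,
                                           bool table_wants_snapshot) {
  switch (session_level) {
    case ISO_READ_UNCOMMITTED: return TDB_ISOLATION_READ_UNCOMMITTED;
    case ISO_READ_COMMITTED:   return TDB_ISOLATION_READ_COMMITTED;
    case ISO_REPEATABLE_READ:
      return table_wants_snapshot ? TDB_ISOLATION_SNAPSHOT
                                  : TDB_ISOLATION_REPEATABLE_READ;
    case ISO_SERIALIZABLE:     return TDB_ISOLATION_SERIALIZABLE;
  }
  return TDB_ISOLATION_READ_COMMITTED;  // defensive default
}
```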

dup_ref population (root cause of 'Can't find record' on IODKU):

write_row() now copies the conflicting row's PK bytes into dup_ref
before returning HA_ERR_FOUND_DUPP_KEY, for both PK duplicates and
unique secondary index duplicates.
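
A minimal sketch of the fix, with an illustrative struct standing in for the handler (the real code lives in write_row() and uses the server's ref_length):

```cpp
#include <cassert>
#include <cstring>

// Before returning HA_ERR_FOUND_DUPP_KEY, the conflicting row's PK bytes
// are copied into dup_ref so the server's IODKU path can re-position on
// that row. Struct, sizes, and the helper name are hypothetical.
enum { HA_ERR_FOUND_DUPP_KEY = 121 };

struct handler_sketch {
  static const unsigned REF_LEN = 8;
  unsigned char dup_ref[REF_LEN] = {0};
  unsigned ref_length = REF_LEN;

  int report_duplicate(const unsigned char *pk, unsigned pk_len) {
    unsigned n = pk_len < ref_length ? pk_len : ref_length;
    std::memcpy(dup_ref, pk, n);   // populate before returning the error
    return HA_ERR_FOUND_DUPP_KEY;
  }
};
```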

REPLACE INTO secondary index orphan fix:

Remove write_can_replace_ from the skip_unique check entirely. Now
only the tidesdb_skip_unique_check session variable skips uniqueness
checks. Both REPLACE INTO and INSERT ON DUPLICATE KEY UPDATE let
write_row() return HA_ERR_FOUND_DUPP_KEY so the server properly
handles delete+reinsert and cleans up old secondary index entries
via delete_row().

extra() fix:

HA_EXTRA_INSERT_WITH_UPDATE no longer sets write_can_replace_.

DDL isolation fix:

inplace_alter_table() (ALTER TABLE ADD INDEX) previously used
share->isolation_level for scan transactions. If the table was
created with SERIALIZABLE, the index build would track millions of
read-set entries causing unbounded memory growth. Now hard-coded to
TDB_ISOLATION_READ_COMMITTED for both initial and batch-restart
transactions. DDL never needs OCC conflict detection.

Bulk insert mid-commit isolation:

The batch re-begin in write_row() previously used
share->isolation_level. Now uses TDB_ISOLATION_READ_COMMITTED
since bulk inserts do not need snapshot consistency across batches.

Bulk insert iterator invalidation:

After bulk insert mid-commit creates a fresh transaction, the old
scan_iter held a dangling pointer to the freed txn. Now invalidate
the iterator (tidesdb_iter_free + NULL) and update scan_txn to point
to the new transaction. Bump txn_generation so other handler
objects detect the change.

--- Library fixes (src/tidesdb.c) ---

SNAPSHOT isolation read-set tracking:

tidesdb_txn_add_to_read_set() was only skipping tracking for
isolation levels < TDB_ISOLATION_REPEATABLE_READ (i.e. RU and RC).
SNAPSHOT (enum value 3) fell through and tracked every read, but
SNAPSHOT only needs write-write conflict detection.

Fix: skip read tracking when isolation_level == TDB_ISOLATION_SNAPSHOT.

Also fix tidesdb_txn_check_read_conflicts() to skip for SNAPSHOT.
Previously SNAPSHOT ran full read-conflict and SSI checks at commit,
making it behave identically to SERIALIZABLE.
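
The corrected gate reduces to two early-outs. The enum values below follow the ordering the description implies (SNAPSHOT = 3) but are otherwise illustrative:

```cpp
#include <cassert>

enum tdb_iso { TDB_ISOLATION_READ_UNCOMMITTED = 0,
               TDB_ISOLATION_READ_COMMITTED   = 1,
               TDB_ISOLATION_REPEATABLE_READ  = 2,
               TDB_ISOLATION_SNAPSHOT         = 3,
               TDB_ISOLATION_SERIALIZABLE     = 4 };

// RU/RC never tracked reads; the fix also skips SNAPSHOT, which only
// needs write-write conflict detection at commit.
static bool should_track_reads(tdb_iso level) {
  if (level < TDB_ISOLATION_REPEATABLE_READ) return false;  // RU, RC
  if (level == TDB_ISOLATION_SNAPSHOT) return false;        // the fix
  return true;  // REPEATABLE_READ and SERIALIZABLE
}
```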

--- Performance fixes ---

Debug logging gated behind srv_debug_trace:

~25 sql_print_information('[T77]...') calls were firing
unconditionally on every INSERT, SELECT, scan, lock, and commit.
At high throughput this serialises all threads on the error-log
mutex. All calls now gated with if (unlikely(srv_debug_trace)),
matching the existing TDB_TRACE macro pattern. Zero overhead when
disabled.
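
The gating pattern, sketched with a counter in place of the error log so the behaviour is observable (srv_debug_trace is the plugin's flag; everything else here is a stand-in):

```cpp
#include <cstdio>

#if defined(__GNUC__)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define unlikely(x) (x)
#endif

static bool srv_debug_trace = false;  // plugin global, OFF by default
static int trace_emitted = 0;         // observable stand-in for the log

static void trace_point(const char *what) {
  // Hot path cost when tracing is off: one predicted-not-taken branch,
  // no format work, no error-log mutex.
  if (unlikely(srv_debug_trace)) {
    ++trace_emitted;
    std::printf("[T77] %s\n", what);
  }
}
```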

tidesdb_skip_unique_check session variable:

New per-session boolean: SET SESSION tidesdb_skip_unique_check=1.
Skips both PK and UNIQUE secondary index uniqueness checks during
INSERT. Same pattern as MyRocks' rocksdb_skip_unique_check.
Eliminates the expensive per-row tidesdb_txn_get point lookup
(traverses all LSM levels) during bulk loads.

Stack-allocated index key buffers in update_row():

Replace two std::unique_ptr<uchar[]>(new uchar[MAX_KEY_LENGTH*2+2])
heap allocations per UPDATE with stack-allocated buffers (~4KB each,
well within stack limits).
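
The shape of the change, with an assumed MAX_KEY_LENGTH of 2048 (which matches the ~4 KB figure above; the real constant comes from the server headers):

```cpp
#include <cstddef>

constexpr std::size_t MAX_KEY_LENGTH = 2048;  // illustrative value

// Before: two heap allocations per UPDATE, e.g.
//   std::unique_ptr<uchar[]> buf(new uchar[MAX_KEY_LENGTH * 2 + 2]);
// After: stack buffers, released for free on return.
std::size_t update_row_key_bytes() {
  unsigned char old_key[MAX_KEY_LENGTH * 2 + 2];
  unsigned char new_key[MAX_KEY_LENGTH * 2 + 2];
  // ... pack the pre- and post-image index keys into the buffers ...
  (void)old_key; (void)new_key;
  return sizeof(old_key) + sizeof(new_key);
}
```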

Dup-check iterator caching:

Cache iterators per unique-index slot in write_row(). Reuse via
seek() instead of iter_new()/iter_free() per row. Add
free_dup_iter_cache() helper; free in close(), delete_all_rows(),
external_lock(F_UNLCK), and bulk insert mid-commit.
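
A sketch of the cache's lifecycle, with a stub iterator standing in for tidesdb's (the stub counts what would be tidesdb_iter_new() calls):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct iter_stub { int creations = 1; };  // stand-in iterator

// One slot per unique index, created lazily, then reused via seek()
// instead of iter_new()/iter_free() on every row.
struct dup_iter_cache {
  std::vector<iter_stub *> slots;
  explicit dup_iter_cache(std::size_t n_unique) : slots(n_unique, nullptr) {}

  iter_stub *get(std::size_t slot) {
    if (!slots[slot]) slots[slot] = new iter_stub;  // iter_new() once
    return slots[slot];                             // later rows: seek()
  }
  // free_dup_iter_cache(): called from close(), delete_all_rows(),
  // external_lock(F_UNLCK), and bulk insert mid-commit.
  void free_all() {
    for (auto *&it : slots) { delete it; it = nullptr; }
  }
};
```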

Transaction reuse:

Use tidesdb_txn_reset() instead of tidesdb_txn_free() +
tidesdb_txn_begin_with_isolation() in bulk insert mid-commit and
autocommit paths.

Iterator survival:

Preserve iterators across read-only autocommit statements instead of
tearing down and rebuilding.

Skip PK dup check for REPLACE INTO without secondary indexes.

Zero-copy fetch path:

Use get_val_buf_ for BLOB/encrypted path in fetch_row_by_pk().

HA_CLUSTERED_INDEX:

Add HA_CLUSTERED_INDEX to table_flags().

Encryption buffer reuse:

serialize_row() now writes to enc_buf_ and returns it, preserving
row_buf_ capacity across rows. Avoids re-allocation per encrypted
write.

Covering index decode expansion:

New decode_sort_key_part() handles INT, DATE, DATETIME, TIMESTAMP,
YEAR, and CHAR/BINARY (binary/latin1). try_keyread_from_index()
and icp_check_secondary() now use it instead of the old
integer-only type gate.

Encryption key version caching:

Call encryption_key_get_latest_version() once per statement, cache
in cached_enc_key_ver_. Invalidate via enc_key_ver_valid_ = false
at statement end in external_lock(F_UNLCK).

Per-statement syscall/THDVAR caching:

New per-statement caches in ha_tidesdb.h:

cached_time_ / cached_time_valid_  - time(NULL) for TTL
cached_sess_ttl_                   - THDVAR(thd, ttl)
cached_skip_unique_               - THDVAR(thd, skip_unique_check)
cached_thdvars_valid_             - invalidation flag

write_row: ha_thd() 4x, thd_get_ha_data() 3x, THDVAR() 3x -> each
called once and cached. Eliminates ~9 indirect/virtual calls per
row. Similar reductions in compute_row_ttl(), update_row(),
rnd_init(), index_init(), ensure_scan_iter().

All caches invalidated in external_lock(F_UNLCK).
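
The caching scheme above can be sketched as follows, with a plain struct standing in for THD/THDVAR access:

```cpp
#include <cassert>
#include <ctime>

struct session_vars { unsigned ttl; bool skip_unique_check; };  // stand-in

// Each value is read once per statement; invalidate() models the reset
// at external_lock(F_UNLCK).
struct stmt_cache {
  std::time_t cached_time_ = 0;
  bool cached_time_valid_ = false;
  unsigned cached_sess_ttl_ = 0;
  bool cached_skip_unique_ = false;
  bool cached_thdvars_valid_ = false;

  std::time_t now() {                  // one time(NULL) per statement
    if (!cached_time_valid_) {
      cached_time_ = std::time(nullptr);
      cached_time_valid_ = true;
    }
    return cached_time_;
  }
  void load(const session_vars &sv) {  // one THDVAR read per statement
    if (!cached_thdvars_valid_) {
      cached_sess_ttl_ = sv.ttl;
      cached_skip_unique_ = sv.skip_unique_check;
      cached_thdvars_valid_ = true;
    }
  }
  void invalidate() {                  // external_lock(F_UNLCK)
    cached_time_valid_ = false;
    cached_thdvars_valid_ = false;
  }
};
```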

Inplace index build:

Replace std::set<std::string> with std::unordered_set<std::string>
for O(1) duplicate detection.

--- Feature additions ---

Issue #76: tidesdb_data_home_dir

MYSQL_SYSVAR_STR (read-only) that overrides the auto-computed data
directory. When set, the plugin uses that path instead of
<mysql_datadir>/../tidesdb_data.

Issue #73: SHOW ENGINE TIDESDB STATUS

tidesdb_show_status() callback registered on the handlerton.
Displays DB-level stats (memory, storage, background queues) and
block cache stats (hits, misses, hit rate, entries).

Issue #79: Per-index USE_BTREE

ha_index_option_struct with use_btree boolean, registered via
tidesdb_index_option_list. index_type() checks per-index option
first, falls back to table-level. create() and
inplace_alter_table() apply per-index USE_BTREE when creating
secondary index CFs. Syntax: KEY idx_a (a) USE_BTREE=1.

Issue #78: index_type() override

Returns 'LSM' for default block-based tables and 'BTREE' for
USE_BTREE=1 tables. SHOW KEYS now displays correct index type.

Issue #74: ANALYZE TABLE cardinality

Samples up to 100K entries from each secondary index CF, counts
distinct key prefixes to compute rec_per_key. Before: EXPLAIN
showed rows=1 for any index lookup. After: accurate estimates.
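
The sampling pass can be sketched like this; the function name, rounding choice (ceiling), and in-memory input are all illustrative, not the plugin's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Scan up to sample_limit index entries, count distinct key prefixes,
// and estimate rec_per_key as ceil(sampled / distinct).
static unsigned long estimate_rec_per_key(
    const std::vector<std::string> &entries, std::size_t prefix_len,
    std::size_t sample_limit = 100000) {
  std::unordered_set<std::string> distinct;
  std::size_t sampled = 0;
  for (const auto &key : entries) {
    if (sampled >= sample_limit) break;
    distinct.insert(key.substr(0, prefix_len));
    ++sampled;
  }
  if (distinct.empty()) return 1;  // empty index: keep the old rows=1
  return static_cast<unsigned long>((sampled + distinct.size() - 1) /
                                    distinct.size());
}
```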

Issue #82: Table option default variables

5 session-scoped dynamic system variables via HA_TOPTION_SYSVAR:
tidesdb_default_compression (NONE/SNAPPY/LZ4/ZSTD/LZ4_FAST)
tidesdb_default_write_buffer_size (default 32MB)
tidesdb_default_bloom_filter (default ON)
tidesdb_default_use_btree (default OFF)
tidesdb_default_block_indexes (default ON)

Issue #81: Conflict information

tidesdb_print_all_conflicts global dynamic sysvar (like
innodb_print_all_deadlocks). When ON, every TDB_ERR_CONFLICT logs
a warning. Last conflict info displayed in SHOW ENGINE TIDESDB
STATUS under 'Conflicts' section.

--- New MTR tests ---

tidesdb_replace_iodku - REPLACE INTO + IODKU with PK, secondary, unique
tidesdb_index_stats - issue #78 index_type + issue #74 cardinality
tidesdb_isolation - session isolation level mapping
tidesdb_engine_status - SHOW ENGINE TIDESDB STATUS
tidesdb_per_index_btree - per-index USE_BTREE
tidesdb_data_home_dir - sysvar visibility and read-only enforcement
tidesdb_insert_conflict - INSERT vs INSERT conflict (awaits library fix)

--- Known issues ---

Issue #77: concurrent UPDATE+DELETE crashes the server (SIGSEGV) in
core MVCC conflict detection. Not fixable in the plugin layer;
requires changes to src/tidesdb.c.

Issue #83: INSERT vs INSERT conflict not detected. Library-level
bug; tidesdb_txn_get within a transaction's snapshot cannot see
uncommitted writes from other transactions by design. Enforcement
must happen at commit time via TDB_ERR_CONFLICT. Fix confirmed
coming in next library release.

UNIQUE secondary index check still creates a full merge-heap
iterator via tidesdb_iter_new per UNIQUE index per INSERT when
skip_unique_check is off. A future tidesdb_txn_exists_prefix() API
would avoid building the full merge heap.

guycipher self-assigned this Mar 11, 2026
* Deleted 20 redundant .opt files that only contained --plugin-load-add=ha_tidesdb + --plugin-maturity=unknown
* Stripped redundant lines from the 7 remaining .opt files that had actual unique options (encryption keys, loose test vars)
guycipher merged commit d0979b5 into master Mar 11, 2026
4 checks passed