feat: add 'replication_lag' support for PostgreSQL monitoring #5422

renecannao merged 19 commits into v3.0 from
Conversation
This TAP test requires infra not currently present in the branch. The infra is yet to be adapted to work as a TAP group.
The feature has been disabled until 'pt-heartbeat' support for PostgreSQL is fixed. See https://perconadev.atlassian.net/browse/PT-2030.
📝 Walkthrough

Adds PostgreSQL replication-lag monitoring: new repl_lag task type, scheduling and worker integration, query handling (including pt-heartbeat), SQLite logging table and metrics, thread-local replication-lag variables, shunning-logic adjustments, and TAP test infra for replication-lag scenarios.
Sequence Diagram(s)

sequenceDiagram
participant Scheduler as Scheduler/Worker
participant Monitor as PgSQL_Monitor
participant Async as Async Task Handler
participant PgDB as PostgreSQL
participant LogDB as SQLite3
participant Manager as HostGroups_Manager
Scheduler->>Monitor: compute_next_intvs()/build_intv_batches -> schedule repl_lag
Scheduler->>Async: start async repl_lag task
Async->>PgDB: execute repl_lag query (regular or pt-heartbeat)
PgDB-->>Async: return lag_value or error/timeout
Async->>Monitor: handle_async_check_cont() with result
Monitor->>Monitor: proc_task_state() -> perf_repl_lag_actions()
Monitor->>LogDB: update_repl_lag_table() (insert log row)
Monitor->>Manager: replication_lag_action_inner() evaluate counts/thresholds
alt lag exceeds threshold count
Manager->>Manager: mark server SHUNNED
else
Manager->>Manager: clear SHUNNED if needed
end
Monitor->>Monitor: increment repl_lag_check_OK/ERR counters
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Summary of Changes

Hello @JavierJF, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances PostgreSQL monitoring capabilities by integrating replication-lag detection. It enables ProxySQL to actively monitor the replication status of PostgreSQL instances, apply configurable thresholds, and perform shunning actions when replication lag exceeds defined limits. This feature aims to improve the reliability and availability of PostgreSQL deployments managed by ProxySQL, ensuring that client connections are not routed to severely lagged replicas.
Code Review
This pull request introduces comprehensive PostgreSQL replication lag monitoring capabilities. Key changes include a new SQLite table (pgsql_server_replication_lag_log) to store replication lag data, new counters (repl_lag_check_ERR, repl_lag_check_OK), and a new error code (ER_PROXYSQL_REPL_LAG_TIMEOUT). The monitoring logic now includes specific SQL queries for replication lag detection, with an option to use Percona Heartbeat.

New configuration variables (monitor_replication_lag_interval_window, monitor_replication_lag_use_percona_heartbeat) have been added and integrated into the thread handler and global status reporting. The code also includes mechanisms for scheduling replication lag checks, processing results, updating the SQLite log table, and performing hostgroup actions based on the detected lag.

A new test group and test case (test_pgsql_replication_lag-t) have been added to validate this functionality, along with Docker Compose infrastructure for a PostgreSQL primary-replica setup. Review comments highlighted the need for const-correctness in error counter updates, clarified the incomplete REPLICATION_LAG_QUERY_PT_HEARTBEAT query, addressed inconsistencies in handling the monitor_replication_lag_use_percona_heartbeat variable, and suggested avoiding system() calls in tests for security reasons.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
lib/PgSQL_HostGroups_Manager.cpp (1)
2719-2731: ⚠️ Potential issue | 🟠 Major

PgSQL implementation ignores the monitor_replication_lag_group_by_host configuration variable.

The flag is registered as a configuration variable in PgSQL_Thread.cpp (line 2176, initialized and loaded from config at line 4035), but PgSQL_HostGroups_Manager::replication_lag_action() hard-codes the condition to false with the variable reference commented out. This diverges from the MySQL implementation (MySQL_HostGroups_Manager.cpp:2751), which correctly evaluates the variable. Either fix the PgSQL version to use the actual variable like MySQL does, or remove the flag registration and document the intentional divergence.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@lib/PgSQL_HostGroups_Manager.cpp` around lines 2719-2731: In PgSQL_HostGroups_Manager::replication_lag_action(), restore use of the actual configuration flag instead of the hard-coded "false": replace the commented-out condition with the real symbol (monitor_replication_lag_group_by_host, or the exact variable name registered in PgSQL_Thread.cpp) so the code branches like the MySQL implementation; keep the existing calls to MyHGC_find(...) and the loop over MyHostGroups with replication_lag_action_inner(...) intact. Alternatively, if the flag is intentionally unsupported, remove its registration in PgSQL_Thread.cpp and document the divergence; do not leave the flag registered while the branch is forced to false.

lib/PgSQL_Monitor.cpp (1)
654-715: ⚠️ Potential issue | 🟠 Major

Add NULL and precision checks for replication lag values.

The code currently lacks a NULL check before parsing replication lag. When PostgreSQL is not in standby recovery, pg_last_xact_replay_timestamp() returns NULL, and the entire CASE expression evaluates to NULL. Using PQgetvalue() without PQgetisnull(), followed by std::atoi(), treats NULL as 0, preventing distinction between "no lag" and "not in recovery"; this masks problematic replicas and affects shunning decisions. Additionally, fractional seconds are truncated by int32_t and atoi(), losing precision.

Add a PQgetisnull() check and parse with strtod() (or treat NULL as an error) to properly handle these cases before passing results to replication_lag_action().

Suggested fix for NULL/precision handling:

    -    } else if (st.task.type == task_type_t::repl_lag) {
    -        const char* value_str { PQgetvalue(res, 0, 0) };
    -        int32_t value { std::atoi(value_str) };
    -
    -        set_finish_st(st, ASYNC_QUERY_END,
    -            op_result_t {
    -                new repl_lag_res_t { value },
    -                [] (void* v) { delete static_cast<repl_lag_res_t*>(v); }
    -            }
    -        );
    +    } else if (st.task.type == task_type_t::repl_lag) {
    +        if (PQgetisnull(res, 0, 0)) {
    +            set_failed_st(
    +                st, ASYNC_QUERY_FAILED,
    +                mf_unique_ptr<char>(strdup("replication_lag is NULL"))
    +            );
    +        } else {
    +            const char* value_str { PQgetvalue(res, 0, 0) };
    +            char* end = nullptr;
    +            double value_d = std::strtod(value_str, &end);
    +            if (end == value_str) {
    +                set_failed_st(
    +                    st, ASYNC_QUERY_FAILED,
    +                    mf_unique_ptr<char>(strdup("invalid replication_lag value"))
    +                );
    +            } else {
    +                int32_t value { static_cast<int32_t>(value_d) };
    +                set_finish_st(st, ASYNC_QUERY_END,
    +                    op_result_t {
    +                        new repl_lag_res_t { value },
    +                        [] (void* v) { delete static_cast<repl_lag_res_t*>(v); }
    +                    }
    +                );
    +            }
    +        }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/PgSQL_Monitor.cpp` around lines 654 - 715, The repl_lag branch currently calls PQgetvalue + std::atoi without NULL checking, losing NULL-vs-zero semantics and fractional precision; update the repl_lag handling (the block under "else if (st.task.type == task_type_t::repl_lag)") to first call PQgetisnull(res, 0, 0) and treat NULL as an error case (call set_failed_st with a clear error message via mf_unique_ptr<char> / strdup) or otherwise propagate a distinct "not in recovery" result, and replace std::atoi with strtod to parse fractional seconds into a floating type (update repl_lag_res_t to hold double if needed) before calling set_finish_st so replication_lag_action receives accurate non-NULL, high‑precision values.
🧹 Nitpick comments (3)
test/tap/groups/pgsql-repl/Dockerfile (1)
3-3: Trim the apt-get footprint.

Line 3 should use --no-install-recommends and clean the apt lists to keep the image lean.

♻️ Suggested change

    -RUN apt-get update && apt-get install -y iproute2
    +RUN apt-get update && apt-get install -y --no-install-recommends iproute2 \
    +    && rm -rf /var/lib/apt/lists/*

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/tap/groups/pgsql-repl/Dockerfile` at line 3: Update the RUN line that installs iproute2 to reduce image size: modify the apt-get install invocation (RUN apt-get update && apt-get install -y iproute2) to include --no-install-recommends and then clean up apt caches (e.g., apt-get clean and remove /var/lib/apt/lists/*) in the same RUN layer so the image stays lean.

test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash (1)
7-7: Quote the realpath/dirname expansion to avoid word-splitting.

Unquoted command substitutions can mishandle paths containing spaces and are flagged by shellcheck.

🔧 Suggested fix

    -PROXY_CONF_DIR=$(dirname $(realpath $0))/../conf/proxysql
    +PROXY_CONF_DIR="$(dirname "$(realpath "$0")")/../conf/proxysql"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash` at line 7: The assignment to PROXY_CONF_DIR uses unquoted command substitution, which can break on paths with spaces; quote the command-substitution results and the $0 expansion so the dirname/realpath output is treated as a single word (i.e., use "$(dirname "$(realpath "$0")")" and quote the overall substitution when assigning PROXY_CONF_DIR).

lib/PgSQL_Thread.cpp (1)
1305-1306: Avoid a silent no-op for the disabled pt-heartbeat config.

Since internal retrieval is disabled, the admin setter still accepts non-empty values that won't propagate to worker threads. Consider rejecting non-empty values (or warning) while PT-2030 is unresolved, to avoid confusing operators.

♻️ Suggested guard in set_variable()

    - if (!strcasecmp(name, "monitor_replication_lag_use_percona_heartbeat")) {
    + if (!strcasecmp(name, "monitor_replication_lag_use_percona_heartbeat")) {
    +     if (vallen > 0) {
    +         proxy_warning("pgsql-monitor_replication_lag_use_percona_heartbeat is disabled until PT-2030 is resolved\n");
    +         return false;
    +     }
          if (vallen == 0) { // empty string

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/PgSQL_Thread.cpp` around lines 1305 - 1306, The disabled pt-heartbeat config is a silent no-op: update set_variable() to guard against assigning non-empty values to "monitor_replication_lag_use_percona_heartbeat" by detecting when name == "monitor_replication_lag_use_percona_heartbeat" and value is non-empty, and then reject the change (return an error/validation failure) or at minimum log/warn and force it to empty; reference the variables.monitor_replication_lag_use_percona_heartbeat symbol and ensure the setter does not accept/retain non-empty inputs while PT-2030 remains unresolved so operators get immediate feedback.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash`:
- Line 7: The assignment for POSTGRE_SETUP_DIR uses unquoted command
substitutions and $0 which can break on paths with spaces; update the derivation
to quote the inner expansions so that realpath and dirname receive "$0" and the
command substitutions are quoted (i.e., use dirname on the quoted realpath of
"$0" and quote the whole substitution when assigning to POSTGRE_SETUP_DIR), and
ensure any later uses of POSTGRE_SETUP_DIR are also quoted to avoid word
splitting.
- Around line 15-23: The probe loop is failing because the psql invocation uses
invalid SQL and the timeout message references the wrong DB; update the psql
probe command (the line invoking PGPASSWORD=$PGSQL_PWD ON_ERROR_STOP=1 psql
-h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c"SELECT" > /dev/null 2>&1) to execute
a valid probe such as -c "SELECT 1;" (or another simple, valid SQL) and change
the timeout error echo (the line checking WAITED vs TIMEOUT) to say "PostgreSQL"
instead of "MySQL"; keep the existing variables RC, WAITED, TIMEOUT, PGSQL_PWD,
PGSQL_HOST, PGSQL_PORT, PGSQL_DB and redirecting output as before.
In `@test/tap/tap/utils.h`:
- Around line 264-270: The CHECK_EXT_VAL macro logs the wrong error reporter for
SQLite-derived ext_val_t values; add a parallel macro (e.g.,
CHECK_EXT_VAL_SQLITE(conn, val)) that mirrors CHECK_EXT_VAL but calls
sq3_get_ext_val_err(val) (and uses sqlite3_errstr) when val originates from
sq3_query_ext_val, then update the two test call sites that use
CHECK_EXT_VAL(admin, last_id) for SQLite results to use
CHECK_EXT_VAL_SQLITE(admin, last_id) so diagnostics show the real SQLite3 error.
In `@test/tap/tests/test_pgsql_replication_lag-t.cpp`:
- Around line 93-124: The loop waiting for the replica to become ONLINE can time
out and let the test continue, causing flaky assertions; after using
mysql_query_ext_val (q_st_check) and CHECK_EXT_VAL to read is_online, if the
wait loop exceeds the timeout (j > timeout) or the final is_online.val is false,
fail the test immediately with a clear failure message instead of breaking and
continuing; update the code around q_st_check / is_online / mysql_query_ext_val
/ CHECK_EXT_VAL to check post-loop whether the replica is ONLINE and call the
test failure routine (or ASSERT/FAIL equivalent) with a descriptive message so
the test fails fast when replication never becomes ONLINE.
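The fail-fast pattern recommended above can be sketched as a minimal shell stand-in for the C++ TAP loop (names and the ONLINE probe are illustrative, not from the test):

```shell
# Simulated probe: the "replica" becomes ONLINE after 3 polls.
check_online() { [ "$j" -ge 3 ]; }

timeout=10
j=0
is_online=false
while [ $j -le $timeout ]; do
    if check_online; then is_online=true; break; fi
    j=$((j + 1))                  # the real loop would also sleep between polls
done

# Post-loop check: fail immediately instead of silently continuing.
if [ "$is_online" != true ]; then
    echo "FAIL: replica never became ONLINE within ${timeout} polls"
    exit 1
fi
echo "replica ONLINE after $j polls"
```

The key point is the explicit post-loop check: a `break` on timeout alone would let later assertions run against a replica that was never ONLINE.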
In `@test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh`:
- Around line 7-8: The vintf resolution fails due to nested/unescaped quoting
inside the docker exec --user root ... sh -c command; update the command that
sets vintf so the container receives the literal $rsubnet and the internal
grep/echo quotes are preserved (e.g., stop using an outer double-quoted sh -c
string that interpolates $rsubnet locally). Specifically, modify the docker exec
invocation that assigns vintf to escape the dollar sign (use \$rsubnet) or
switch to a single-quoted sh -c argument and keep the inner "$rsubnet" and
quoted regex intact so echo, xargs, ip route show to match, grep -E "$rsubnet",
and cut -d' ' -f3 run inside the container correctly.
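The nested-quoting pitfall above can be reduced to a docker-free sketch, with an inner `sh -c` standing in for the container (variable names are illustrative):

```shell
# $rsubnet exists only in the "remote" shell, as in the container scenario.
unset rsubnet

# Outer double quotes: the host expands $rsubnet (empty) before sh -c runs.
broken=$(sh -c "rsubnet=10.1.0; echo iface-for:$rsubnet")

# Escaped \$ (or a single-quoted string): the literal $rsubnet reaches the
# inner shell, which expands it after the assignment has happened.
fixed=$(sh -c "rsubnet=10.1.0; echo iface-for:\$rsubnet")

echo "broken=$broken"
echo "fixed=$fixed"
```

The same deferral applies to `docker exec ... sh -c`: anything meant to expand inside the container must not be expanded by the host shell first.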
In `@test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh`:
- Around line 7-8: The quoted sh -c string for vintf is broken by nested double
quotes around $rsubnet; fix by escaping the inner double quotes so the command
passed to docker remains a single quoted string (e.g. change grep -E "$rsubnet"
to grep -E \"$rsubnet\" within the sh -c argument) or alternatively restructure
the quoting (use single quotes for the sh -c wrapper and concatenate the
expanded $rsubnet) so the docker exec sh -c "..." invocation, the vintf
assignment, and the variables ($rsubnet) are preserved and the pipeline (echo |
xargs ip route show to match | grep -E ... | cut -d' ' -f3) returns the expected
interface.
In `@test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp`:
- Line 203: The test uses CHECK_EXT_VAL(admin, last_id) after sq3_query_ext_val
which may fail with SQLite-specific errors, but CHECK_EXT_VAL currently logs
mysql_error(admin) only; update the check to detect SQLite failures from last_id
(the return from sq3_query_ext_val) and log the SQLite error message instead:
either enhance the CHECK_EXT_VAL macro/utility in utils.h to branch when the DB
backend is SQLite (call sqlite3_errmsg on the sqlite handle stored in admin or
the relevant sqlite DB pointer) or replace this test call with an explicit check
that inspects last_id and, on non-MySQL errors, calls
sqlite3_errmsg(admin->sqlite_db) (or the correct sqlite handle) so the actual
SQLite error is logged when sq3_query_ext_val fails. Ensure references to
CHECK_EXT_VAL, last_id, sq3_query_ext_val and utils.h are updated accordingly.
---
Outside diff comments:
In `@lib/PgSQL_HostGroups_Manager.cpp`:
- Around line 2719-2731: In PgSQL_HostGroups_Manager::replication_lag_action(),
restore use of the actual configuration flag instead of the hard-coded "false":
replace the commented-out condition with the real symbol
(monitor_replication_lag_group_by_host or the exact variable name registered in
PgSQL_Thread.cpp) so the code branches like the MySQL implementation; keep the
existing calls to MyHGC_find(...) and the loop over MyHostGroups with
replication_lag_action_inner(...) intact, or if the flag is intentionally
unsupported, remove its registration in PgSQL_Thread.cpp and document the
divergence—do not leave the flag registered while the branch is forced to false.
In `@lib/PgSQL_Monitor.cpp`:
- Around line 654-715: The repl_lag branch currently calls PQgetvalue +
std::atoi without NULL checking, losing NULL-vs-zero semantics and fractional
precision; update the repl_lag handling (the block under "else if (st.task.type
== task_type_t::repl_lag)") to first call PQgetisnull(res, 0, 0) and treat NULL
as an error case (call set_failed_st with a clear error message via
mf_unique_ptr<char> / strdup) or otherwise propagate a distinct "not in
recovery" result, and replace std::atoi with strtod to parse fractional seconds
into a floating type (update repl_lag_res_t to hold double if needed) before
calling set_finish_st so replication_lag_action receives accurate non-NULL,
high‑precision values.
---
Nitpick comments:
In `@lib/PgSQL_Thread.cpp`:
- Around line 1305-1306: The disabled pt-heartbeat config is a silent no-op:
update set_variable() to guard against assigning non-empty values to
"monitor_replication_lag_use_percona_heartbeat" by detecting when name ==
"monitor_replication_lag_use_percona_heartbeat" and value is non-empty, and then
reject the change (return an error/validation failure) or at minimum log/warn
and force it to empty; reference the
variables.monitor_replication_lag_use_percona_heartbeat symbol and ensure the
setter does not accept/retain non-empty inputs while PT-2030 remains unresolved
so operators get immediate feedback.
In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash`:
- Line 7: The assignment to PROXY_CONF_DIR uses unquoted command substitution
which can break on paths with spaces; update the line that computes
PROXY_CONF_DIR (the $(realpath $0) and dirname usage) to quote the command
substitution results and the $0 expansion so the dirname/realpath output is
treated as a single word (i.e., use quoted "$(realpath "$0")" and quote the
overall substitution when assigning PROXY_CONF_DIR).
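The quoting fix above can be demonstrated with a self-contained sketch (hypothetical paths, not from the repo):

```shell
# A directory whose name contains a space.
dir="/tmp/conf dir"
mkdir -p "$dir"
script="$dir/post.bash"
: > "$script"

# Unquoted expansion word-splits: "$dir" becomes two words here.
unquoted_count=$(printf '%s\n' $dir | wc -l)

# Fully quoted form, as suggested for PROXY_CONF_DIR / POSTGRE_SETUP_DIR:
# every substitution and $-expansion is quoted, so the path stays one word.
CONF_DIR="$(dirname "$(realpath "$script")")/conf"

echo "words from unquoted expansion: $unquoted_count"
echo "CONF_DIR=$CONF_DIR"
```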
In `@test/tap/groups/pgsql-repl/Dockerfile`:
- Line 3: Update the RUN line that installs iproute2 to reduce image size:
modify the apt-get install invocation in the Dockerfile (the RUN apt-get update
&& apt-get install -y iproute2 command) to include --no-install-recommends and
then clean up apt caches (e.g., apt-get clean and remove /var/lib/apt/lists/*)
in the same RUN layer so the image stays lean.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (28)
include/PgSQL_Monitor.hpp
include/PgSQL_Thread.h
include/proxysql_structs.h
lib/PgSQL_HostGroups_Manager.cpp
lib/PgSQL_Monitor.cpp
lib/PgSQL_Thread.cpp
test/tap/groups/groups.json
test/tap/groups/pgsql-repl/Dockerfile
test/tap/groups/pgsql-repl/README.md
test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash
test/tap/groups/pgsql-repl/conf/postgres/00_init.sql
test/tap/groups/pgsql-repl/conf/proxysql/config.sql
test/tap/groups/pgsql-repl/constants
test/tap/groups/pgsql-repl/docker-compose-destroy.bash
test/tap/groups/pgsql-repl/docker-compose-init.bash
test/tap/groups/pgsql-repl/docker-compose.yaml
test/tap/groups/pgsql-repl/post-proxysql.bash
test/tap/groups/pgsql-repl/pre-proxysql.bash
test/tap/groups/pgsql-repl/scripts/create_test_tables.sql
test/tap/tap/Makefile
test/tap/tap/utils.h
test/tap/tests/test_match_eof_conn_cap.cpp
test/tap/tests/test_pgsql_replication_lag-t.cpp
test/tap/tests/test_pgsql_replication_lag-t.env
test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh
test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp
Actionable comments posted: 2
♻️ Duplicate comments (2)
test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash (1)
20-20: ⚠️ Potential issue | 🔴 Critical

Invalid SELECT probe still causes the wait loop to always time out.

-c"SELECT" is not valid SQL: psql will error on every iteration, RC never becomes 0, and the script exits after 300 s unconditionally. The error-message fix from the earlier review was applied, but this part of the same issue was not.

🐛 Proposed fix

    - PGPASSWORD=$PGSQL_PWD ON_ERROR_STOP=1 psql -h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c"SELECT" > /dev/null 2>&1
    + PGPASSWORD=$PGSQL_PWD ON_ERROR_STOP=1 psql -h"$PGSQL_HOST" -p"$PGSQL_PORT" -U"$PGSQL_DB" -c "SELECT 1" > /dev/null 2>&1

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash` at line 20: The probe command currently runs -c"SELECT", which is not valid SQL and always fails; update the probe (the PGPASSWORD... psql -h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c"SELECT" invocation) to use a valid, lightweight query such as -c "SELECT 1" (or "SELECT 1;") so the return code becomes 0 when the server is ready and the wait loop can succeed.

test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp (1)
203-203: Correct macro now used for SQLite3-originated ext_val_t.

last_id is produced by sq3_query_ext_val, so SQ3_CHECK_EXT_VAL correctly routes to sqlite3_errstr for error reporting. The same fix is applied at line 288. This resolves the diagnostic mismatch flagged in the previous review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp` at line 203, Replace the incorrect ext_val check macro with the SQLite3-specific one: use SQ3_CHECK_EXT_VAL when validating the result stored in last_id produced by sq3_query_ext_val so error reporting uses sqlite3_errstr; update the same usage at the other occurrence (previously fixed at line ~288) to ensure consistent SQLite3-originated ext_val_t handling.
🧹 Nitpick comments (3)
test/tap/tap/utils.h (2)
259-270: @param conn is missing from the CHECK_EXT_VAL doc block.

The updated macro signature accepts conn, but the Doxygen block has no @param conn entry, leaving the parameter undocumented.

📝 Proposed fix

     /**
      * @brief Utility one-liner macro to check for query failure on a 'ext_val_t<T>'.
    + * @param conn An already opened MYSQL connection used to retrieve the error string.
      * @param val The 'ext_val_t<T>' to be checked.
      * @return In case of failure, 'EXIT_FAILURE' after logging the error, continues otherwise.
      */

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/tap/tap/utils.h` around lines 259 - 270, The Doxygen block for the CHECK_EXT_VAL macro is missing documentation for the new conn parameter; update the comment above the macro to add an "@param conn" entry describing the connection/context parameter passed to get_ext_val_err (used inside CHECK_EXT_VAL) so all parameters (conn and val) are documented and consistent with the macro signature.
272-283: SQ3_CHECK_EXT_VAL doc is an unedited copy of CHECK_EXT_VAL's.

The brief and description give no indication that this macro is intended for sq3_query_ext_val-originated values, making it easy to misuse.

📝 Proposed fix

     /**
    - * @brief Utility one-liner macro to check for query failure on a 'ext_val_t<T>'.
    + * @brief Utility one-liner macro to check for query failure on a SQLite3-originated 'ext_val_t<T>'.
    + * @details Use this macro when 'val' was produced by 'sq3_query_ext_val'. For MySQL-backed values
    + *   use 'CHECK_EXT_VAL(conn, val)' instead.
      * @param val The 'ext_val_t<T>' to be checked.
      * @return In case of failure, 'EXIT_FAILURE' after logging the error, continues otherwise.
      */

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/tap/tap/utils.h` around lines 272-283: The macro SQ3_CHECK_EXT_VAL has copied documentation from CHECK_EXT_VAL and doesn't state it's intended only for ext_val_t values produced by sq3_query_ext_val, which can cause misuse; update the comment for SQ3_CHECK_EXT_VAL to explicitly state it should be used only with ext_val_t results originating from sq3_query_ext_val (or other sq3_* query helpers), describe the behavior (logs via diag and returns EXIT_FAILURE on error), and mention that it uses sq3_get_ext_val_err to format the error; ensure the @brief/@param/@return wording differentiates this macro from CHECK_EXT_VAL so callers know the precondition.

lib/PgSQL_Thread.cpp (1)
1305-1306: Disabled heartbeat variable still has an active setter path.

These changes disable getter/refresh exposure, but set_variable() still accepts monitor_replication_lag_use_percona_heartbeat (line 1715+), making it effectively write-only and a no-op. Prefer rejecting updates explicitly until PT-2030 is fixed.

♻️ Suggested guard in setter

    - if (!strcasecmp(name, "monitor_replication_lag_use_percona_heartbeat")) {
    -     if (vallen == 0) { // empty string
    -         free(variables.monitor_replication_lag_use_percona_heartbeat);
    -         variables.monitor_replication_lag_use_percona_heartbeat = strdup((value));
    -         return true;
    -     }
    -     else {
    -         re2::RE2::Options* opt2 = new re2::RE2::Options(RE2::Quiet);
    -         opt2->set_case_sensitive(false);
    -         char* patt = (char*)"`?([a-z\\d_]+)`?\\.`?([a-z\\d_]+)`?";
    -         RE2* re = new RE2(patt, *opt2);
    -         bool rc = false;
    -         rc = RE2::FullMatch(value, *re);
    -         delete re;
    -         delete opt2;
    -         if (rc) {
    -             free(variables.monitor_replication_lag_use_percona_heartbeat);
    -             variables.monitor_replication_lag_use_percona_heartbeat = strdup(value);
    -             return true;
    -         }
    -         else {
    -             proxy_error("%s is an invalid value for %s, not matching regex \"%s\"\n", value, name, patt);
    -         }
    -     }
    -     return false;
    - }
    + if (!strcasecmp(name, "monitor_replication_lag_use_percona_heartbeat")) {
    +     proxy_error(
    +         "%s is currently disabled until pt-heartbeat PostgreSQL support is fixed (PT-2030)\n",
    +         name
    +     );
    +     return false;
    + }

Also applies to: 1600-1601, 4023-4026
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/PgSQL_Thread.cpp` around lines 1305 - 1306, The disabled heartbeat variable is still accepted by the setter, so modify set_variable (the function that handles variable updates and currently accepts "monitor_replication_lag_use_percona_heartbeat") to explicitly reject or return an error for that name when PT-2030 is unresolved; locate the branch in set_variable that checks variable names and add a guard that denies updates for "monitor_replication_lag_use_percona_heartbeat" (and the other disabled variable occurrences noted around the earlier mentions) with a clear error/return path so the variable is not writable while its getter/refresh is disabled.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash`:
- Line 32: The psql invocation uses unquoted variables causing word-splitting
(SC2086); update the command that runs create_test_tables.sql so environment and
flags are properly quoted: quote $PGSQL_PWD in PGPASSWORD assignment if needed,
and wrap $PGSQL_HOST, $PGSQL_PORT, $PGSQL_DB and especially $POSTGRE_SETUP_DIR
in double quotes in the psql command and the input redirection (the line
invoking psql and redirecting < $POSTGRE_SETUP_DIR/create_test_tables.sql) so
paths with spaces are handled safely.
In `@test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh`:
- Line 8: The pipeline that finds the interface uses grep -E with the variable
rsubnet ("ip route show to match \"$rsubnet\" | grep -E \"$rsubnet\" | cut -d' '
-f3"), which treats dots as regex wildcards and can overmatch; replace grep -E
"$rsubnet" with fixed-string matching (grep -F "$rsubnet" or fgrep "$rsubnet")
so the route filtering uses a literal subnet match while leaving the surrounding
command and cut -d' ' -f3 unchanged.
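The literal-match point above can be shown with synthetic route lines (not from the test environment): under `grep -E` the dots in the subnet act as "any character" wildcards, so a lookalike line also matches, while `grep -F` matches the subnet literally.

```shell
rsubnet="10.1.0"
routes='10.1.0.0/24 dev veth0 scope link
10x1x0.0/24 dev veth1 scope link'

e_matches=$(printf '%s\n' "$routes" | grep -Ec "$rsubnet")   # regex: . = any char
f_matches=$(printf '%s\n' "$routes" | grep -Fc "$rsubnet")   # fixed-string match

echo "grep -E matched $e_matches lines; grep -F matched $f_matches line"
```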
---
Duplicate comments:
In `@test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash`:
- Line 20: The probe command using psql currently runs -c"SELECT" which is not
valid SQL and always fails; update the probe in the docker-pgsql-post script
(the PGPASSWORD... psql -h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c"SELECT"
invocation) to use a valid, lightweight query such as -c "SELECT 1" (or "SELECT
1;") so the return code becomes 0 when the server is ready and the wait loop can
succeed.
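The corrected readiness loop can be sketched as follows. The `probe` function is a stand-in for the real psql invocation (assumed, not verified): `PGPASSWORD="$PGSQL_PWD" psql -h"$PGSQL_HOST" -p"$PGSQL_PORT" -U"$PGSQL_DB" -c "SELECT 1" >/dev/null 2>&1`. The essential fix is that the probe runs valid SQL, so RC can actually reach 0.

```shell
# Simulated probe: "server" becomes ready after 2 attempts.
probe() { [ "$WAITED" -ge 2 ]; }

RC=1
WAITED=0
TIMEOUT=300
while [ $RC -ne 0 ] && [ $WAITED -lt $TIMEOUT ]; do
    if probe; then
        RC=0
    else
        WAITED=$((WAITED + 1))    # the real script would sleep 1 here
    fi
done

if [ $RC -ne 0 ]; then
    echo "Timeout while waiting for PostgreSQL"   # "PostgreSQL", not "MySQL"
    exit 1
fi
echo "PostgreSQL ready after ${WAITED} attempts"
```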
In `@test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp`:
- Line 203: Replace the incorrect ext_val check macro with the SQLite3-specific
one: use SQ3_CHECK_EXT_VAL when validating the result stored in last_id produced
by sq3_query_ext_val so error reporting uses sqlite3_errstr; update the same
usage at the other occurrence (previously fixed at line ~288) to ensure
consistent SQLite3-originated ext_val_t handling.
---
Nitpick comments:
In `@lib/PgSQL_Thread.cpp`:
- Around line 1305-1306: The disabled heartbeat variable is still accepted by
the setter, so modify set_variable (the function that handles variable updates
and currently accepts "monitor_replication_lag_use_percona_heartbeat") to
explicitly reject or return an error for that name when PT-2030 is unresolved;
locate the branch in set_variable that checks variable names and add a guard
that denies updates for "monitor_replication_lag_use_percona_heartbeat" (and the
other disabled variable occurrences noted around the earlier mentions) with a
clear error/return path so the variable is not writable while its getter/refresh
is disabled.
In `@test/tap/tap/utils.h`:
- Around line 259-270: The Doxygen block for the CHECK_EXT_VAL macro is missing
documentation for the new conn parameter; update the comment above the macro to
add an "@param conn" entry describing the connection/context parameter passed to
get_ext_val_err (used inside CHECK_EXT_VAL) so all parameters (conn and val) are
documented and consistent with the macro signature.
- Around line 272-283: The macro SQ3_CHECK_EXT_VAL has copied documentation from
CHECK_EXT_VAL and doesn't state it's intended only for ext_val_t values produced
by sq3_query_ext_val, which can cause misuse; update the comment for
SQ3_CHECK_EXT_VAL to explicitly state it should be used only with ext_val_t
results originating from sq3_query_ext_val (or other sq3_* query helpers),
describe the behavior (logs via diag and returns EXIT_FAILURE on err), and
mention it uses sq3_get_ext_val_err to format the error; ensure the
`@brief/`@param/@return wording differentiates this macro from CHECK_EXT_VAL so
callers know the precondition.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- test/tap/proxysql-ca.pem is excluded by !**/*.pem
- test/tap/proxysql-cert.pem is excluded by !**/*.pem
- test/tap/proxysql-key.pem is excluded by !**/*.pem
📒 Files selected for processing (7)
- lib/PgSQL_Thread.cpp
- test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
- test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash
- test/tap/tap/utils.h
- test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh
- test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
- test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash
- test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh
0b0b679 to 70bf127
Actionable comments posted: 3
🧹 Nitpick comments (2)
test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash (1)
15-15: Off-by-one: timeout fires after 301 seconds, not 300.
`$WAITED -gt $TIMEOUT` with `TIMEOUT=300` allows WAITED to reach 301 before the guard triggers (iterations 0 through 300 inclusive = 301 sleeps).
♻️ Proposed fix

- if [ $WAITED -gt $TIMEOUT ]; then
+ if [ $WAITED -ge $TIMEOUT ]; then

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash` at line 15, The timeout guard using the test "[ $WAITED -gt $TIMEOUT ]" fires one second late (allows WAITED==TIMEOUT+1); update the condition to use greater-or-equal so the timeout fires exactly at TIMEOUT (replace the check in the if statement that references WAITED and TIMEOUT with a check using -ge), i.e., change the conditional controlling the timeout guard in docker-proxy-post.bash from -gt to -ge.

lib/PgSQL_Thread.cpp (1)
333-335: Make disabled heartbeat handling consistent in the setter path too.
`monitor_replication_lag_use_percona_heartbeat` is disabled in exposure/getters, but `set_variable()` still accepts and validates it. Consider explicitly rejecting writes to avoid partial/ambiguous behavior.
♻️ Proposed change

@@
- if (!strcasecmp(name, "monitor_replication_lag_use_percona_heartbeat")) {
-   if (vallen == 0) { // empty string
-     free(variables.monitor_replication_lag_use_percona_heartbeat);
-     variables.monitor_replication_lag_use_percona_heartbeat = strdup((value));
-     return true;
-   }
-   else {
-     re2::RE2::Options* opt2 = new re2::RE2::Options(RE2::Quiet);
-     opt2->set_case_sensitive(false);
-     char* patt = (char*)"`?([a-z\\d_]+)`?\\.`?([a-z\\d_]+)`?";
-     RE2* re = new RE2(patt, *opt2);
-     bool rc = false;
-     rc = RE2::FullMatch(value, *re);
-     delete re;
-     delete opt2;
-     if (rc) {
-       free(variables.monitor_replication_lag_use_percona_heartbeat);
-       variables.monitor_replication_lag_use_percona_heartbeat = strdup(value);
-       return true;
-     }
-     else {
-       proxy_error("%s is an invalid value for %s, not matching regex \"%s\"\n", value, name, patt);
-     }
-   }
-   return false;
- }
+ if (!strcasecmp(name, "monitor_replication_lag_use_percona_heartbeat")) {
+   proxy_error(
+     "%s is currently disabled until pt-heartbeat PostgreSQL support is fixed (PT-2030)\n",
+     name
+   );
+   return false;
+ }

Also applies to: 1305-1306, 1600-1601, 4023-4026
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/PgSQL_Thread.cpp` around lines 333 - 335, The getter/exposure for monitor_replication_lag_use_percona_heartbeat was disabled but set_variable() still accepts/validates it; update set_variable() to explicitly reject writes for the disabled symbol by adding a top-level check for "monitor_replication_lag_use_percona_heartbeat" that returns an error/false (and logs/returns a clear message) instead of validating or applying the value, and mirror the same explicit rejection logic in the other setter/validation spots identified (the similar validation paths around the symbols on the other occurrences) so writes are consistently refused across set_variable() and related validation functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash`:
- Line 20: Quote all variable expansions and stop passing the password on the
argv: replace unquoted uses like mysql -h$ADMIN_HOST -P$ADMIN_PORT -u$ADMIN_USER
-p$ADMIN_PWD -e"\s" with a quoted-invocation that supplies the password via an
environment variable (e.g. MYSQL_PWD="$ADMIN_PWD" mysql -h"$ADMIN_HOST"
-P"$ADMIN_PORT" -u"$ADMIN_USER" -e "\s") and quote any file paths such as
"$PROXY_CONF_DIR/config.sql"; ensure every $VAR is wrapped in double quotes to
prevent word-splitting/globbing and move the password into MYSQL_PWD (or use
--password and omit the value to prompt) so the secret is not exposed in the
process table.
- Around line 14-25: The loop currently checks while [ $RC -eq 1 ] which exits
on any non-1 failure; change the loop condition to treat every non-zero exit as
"not ready" (e.g., while [ $RC -ne 0 ]) so the loop keeps retrying until mysql
(the mysql -h$ADMIN_HOST -P$ADMIN_PORT ... -e"\s") returns 0 or TIMEOUT is
reached; keep the existing RC, WAITED and TIMEOUT variables and update the
condition to use numeric not-equal to zero semantics and quote variables where
appropriate.
In `@test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh`:
- Around line 5-16: The script computes rsubnet and vintf then runs tc against
$vintf but does not validate that vintf resolved to a single non-empty
interface, so empty or multi-line values cause late/noisy failures; update the
block around vintf, tbf_count and the tc invocation to validate rsubnet and
vintf immediately after resolving them (ensure rsubnet is non-empty and vintf
contains exactly one non-empty line/no whitespace/newlines), log a clear error
and exit non‑zero if validation fails, and only then proceed to compute
tbf_count and run tc qdisc del on the validated $vintf.
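A hedged sketch of such a validation step; the helper name and interface values are illustrative, not taken from the script:

```shell
#!/bin/sh
# Illustrative helper (not from the script): reject empty or multi-line
# lookup results before passing them on to tc.
validate_single_line() {
    [ -n "$1" ] || { echo "error: empty value" >&2; return 1; }
    [ "$(printf '%s\n' "$1" | wc -l)" -eq 1 ] || {
        echo "error: expected exactly one line" >&2
        return 1
    }
}

vintf="eth0"
validate_single_line "$vintf" && echo "ok: $vintf"    # prints: ok: eth0

vintf="eth0
eth1"
validate_single_line "$vintf" || echo "rejected multi-line result"
```

Failing fast here turns a late, noisy `tc` error into an immediate, self-explanatory one.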
---
Nitpick comments:
In `@lib/PgSQL_Thread.cpp`:
- Around line 333-335: The getter/exposure for
monitor_replication_lag_use_percona_heartbeat was disabled but set_variable()
still accepts/validates it; update set_variable() to explicitly reject writes
for the disabled symbol by adding a top-level check for
"monitor_replication_lag_use_percona_heartbeat" that returns an error/false (and
logs/returns a clear message) instead of validating or applying the value, and
mirror the same explicit rejection logic in the other setter/validation spots
identified (the similar validation paths around the symbols on the other
occurrences) so writes are consistently refused across set_variable() and
related validation functions.
In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash`:
- Line 15: The timeout guard using the test "[ $WAITED -gt $TIMEOUT ]" fires one
second late (allows WAITED==TIMEOUT+1); update the condition to use
greater-or-equal so the timeout fires exactly at TIMEOUT (replace the check in
the if statement that references WAITED and TIMEOUT with a check using -ge),
i.e., change the conditional controlling the timeout guard in
docker-proxy-post.bash from -gt to -ge.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- test/tap/proxysql-ca.pem is excluded by !**/*.pem
- test/tap/proxysql-cert.pem is excluded by !**/*.pem
- test/tap/proxysql-key.pem is excluded by !**/*.pem
📒 Files selected for processing (7)
- lib/PgSQL_Thread.cpp
- test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
- test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash
- test/tap/tap/utils.h
- test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh
- test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
- test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
- test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh
This isn't for functional purposes, just for consistency in reporting the motivation for disabling the feature in the codebase.
70bf127 to 26c6eab
♻️ Duplicate comments (3)
test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash (2)
14-25: ⚠️ Potential issue | 🟠 Major
Fix readiness-loop exit condition and timeout boundary.
On lines 14-25, the loop currently retries only when RC == 1. Other non-zero failures (e.g., bad options, missing client, transient errors) break out early and continue before readiness is confirmed. Also, -gt causes an off-by-one timeout.
Suggested patch

-while [ $RC -eq 1 ]; do
-  if [ $WAITED -gt $TIMEOUT ]; then
+while [ "$RC" -ne 0 ]; do
+  if [ "$WAITED" -ge "$TIMEOUT" ]; then
     echo "[ERROR] Timeout of $TIMEOUT seconds reached while connecting to ProxySQL"
     exit 1
   else
     printf "."
-    mysql -h$ADMIN_HOST -P$ADMIN_PORT -u$ADMIN_USER -p$ADMIN_PWD -e"\s" > /dev/null 2>&1
+    mysql -h"$ADMIN_HOST" -P"$ADMIN_PORT" -u"$ADMIN_USER" -p"$ADMIN_PWD" -e "\s" > /dev/null 2>&1
     RC=$?
     WAITED=$((WAITED+1))
     sleep 1
   fi
 done

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash` around lines 14 - 25, The readiness loop only retries when RC == 1 and uses -gt for timeout, which misses other non-zero failures and has an off-by-one; change the loop to retry while the mysql command exit code is non-zero (e.g., loop on RC -ne 0 or use an infinite loop and break on RC == 0) and change the timeout check to use -ge against WAITED and TIMEOUT so the loop exits exactly when WAITED reaches TIMEOUT; update the block that sets RC (the mysql call) and the conditional that tests WAITED to use these new checks (variables: RC, WAITED, TIMEOUT, ADMIN_HOST/PORT/USER/PWD) so readiness only returns on success and timeout boundary is correct.
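Both fixes (retrying on any non-zero exit, and `-ge` on the boundary) can be sketched with a stand-in probe in place of the real mysql invocation:

```shell
#!/bin/sh
# `probe` stands in for the real mysql readiness check (hypothetical):
# it fails twice with exit code 2 (not 1), then succeeds.
tries=0
probe() {
    tries=$((tries + 1))
    if [ "$tries" -lt 3 ]; then return 2; fi
    return 0
}

TIMEOUT=5
WAITED=0
RC=1
while [ "$RC" -ne 0 ]; do                 # any non-zero exit means "not ready"
    if [ "$WAITED" -ge "$TIMEOUT" ]; then # -ge: guard fires exactly at TIMEOUT
        echo "timeout after $WAITED attempts"
        exit 1
    fi
    if probe; then RC=0; else RC=$?; fi
    WAITED=$((WAITED + 1))                # sleep omitted for the sketch
done
echo "ready after $WAITED attempts"       # prints: ready after 3 attempts
```

With the original `-eq 1` condition, the first exit code of 2 would have ended the loop before the server was ready.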
20-32: ⚠️ Potential issue | 🟡 Minor
Avoid exposing credentials in argv/logs and quote all variable expansions.
Lines 20 and 32 still pass the password via -p..., and line 31 enables set -x, which can leak secrets into CI logs. Variable expansions are also unquoted in several places.
Suggested patch

-  mysql -h$ADMIN_HOST -P$ADMIN_PORT -u$ADMIN_USER -p$ADMIN_PWD -e"\s" > /dev/null 2>&1
+  MYSQL_PWD="$ADMIN_PWD" mysql -h"$ADMIN_HOST" -P"$ADMIN_PORT" -u"$ADMIN_USER" -e "\s" > /dev/null 2>&1
@@
-set -x
-mysql --prompt="admin> " -u$ADMIN_USER -p$ADMIN_PWD --table -h$ADMIN_HOST -P$ADMIN_PORT < "$PROXY_CONF_DIR"/config.sql
-set +x
+MYSQL_PWD="$ADMIN_PWD" mysql --prompt="admin> " -u"$ADMIN_USER" --table -h"$ADMIN_HOST" -P"$ADMIN_PORT" < "$PROXY_CONF_DIR/config.sql"
Verify each finding against the current code and only fix it if needed. In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash` around lines 20 - 32, The script is leaking credentials via argv and shell tracing and leaves variables unquoted; update the two mysql calls to avoid passing the password on the command line by injecting it via the MYSQL_PWD environment variable (e.g. MYSQL_PWD="$ADMIN_PWD" mysql -h"$ADMIN_HOST" -P"$ADMIN_PORT" -u"$ADMIN_USER" -e "\s" ...) and similarly for the config apply (MYSQL_PWD="$ADMIN_PWD" mysql --prompt="admin> " -u"$ADMIN_USER" -h"$ADMIN_HOST" -P"$ADMIN_PORT" --table < "$PROXY_CONF_DIR/config.sql"), quote all variable expansions ("$ADMIN_HOST", "$ADMIN_PORT", "$ADMIN_USER", "$ADMIN_PWD", "$PROXY_CONF_DIR"), and disable shell tracing around sensitive commands (remove global set -x or wrap with set +x before the sensitive mysql invocations and restore set -x after if needed) so secrets are not exposed in CI logs.

test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash (1)
20-20: ⚠️ Potential issue | 🔴 Critical
Probe query is still invalid and will force timeout.
Line 20 uses -c"SELECT", which is invalid SQL for psql. The readiness loop can never succeed and will eventually hit timeout.
🔧 Proposed fix

- PGPASSWORD=$PGSQL_PWD ON_ERROR_STOP=1 psql -h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c"SELECT" > /dev/null 2>&1
+ PGPASSWORD=$PGSQL_PWD ON_ERROR_STOP=1 psql -h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c "SELECT 1;" > /dev/null 2>&1

#!/bin/bash
# Verify whether the invalid readiness probe is still present.
# Expected result after fix: first command returns no matches; second command shows SELECT 1 probe.
rg -nP 'psql\b.*-c\s*"SELECT"\b' test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
rg -nP 'psql\b.*-c\s*"SELECT 1;?"' test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
Verify each finding against the current code and only fix it if needed. In `@test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash` at line 20, The psql readiness probe uses an invalid SQL string (-c"SELECT") so the loop never succeeds; update the command that contains PGPASSWORD, ON_ERROR_STOP and the psql invocation (psql -h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c"SELECT") to use a valid probe such as -c"SELECT 1" (or -c"SELECT 1;") while preserving the existing environment vars and redirection, so the readiness check can succeed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash`:
- Line 20: The psql readiness probe uses an invalid SQL string (-c"SELECT") so
the loop never succeeds; update the command that contains PGPASSWORD,
ON_ERROR_STOP and the psql invocation (psql -h$PGSQL_HOST -p$PGSQL_PORT
-U$PGSQL_DB -c"SELECT") to use a valid probe such as -c"SELECT 1" (or -c"SELECT
1;") while preserving the existing environment vars and redirection, so the
readiness check can succeed.
In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash`:
- Around line 14-25: The readiness loop only retries when RC == 1 and uses -gt
for timeout, which misses other non-zero failures and has an off-by-one; change
the loop to retry while the mysql command exit code is non-zero (e.g., loop on
RC -ne 0 or use an infinite loop and break on RC == 0) and change the timeout
check to use -ge against WAITED and TIMEOUT so the loop exits exactly when
WAITED reaches TIMEOUT; update the block that sets RC (the mysql call) and the
conditional that tests WAITED to use these new checks (variables: RC, WAITED,
TIMEOUT, ADMIN_HOST/PORT/USER/PWD) so readiness only returns on success and
timeout boundary is correct.
- Around line 20-32: The script is leaking credentials via argv and shell
tracing and leaves variables unquoted; update the two mysql calls to avoid
passing the password on the command line by injecting it via the MYSQL_PWD
environment variable (e.g. MYSQL_PWD="$ADMIN_PWD" mysql -h"$ADMIN_HOST"
-P"$ADMIN_PORT" -u"$ADMIN_USER" -e "\s" ...) and similarly for the config apply
(MYSQL_PWD="$ADMIN_PWD" mysql --prompt="admin> " -u"$ADMIN_USER" -h"$ADMIN_HOST"
-P"$ADMIN_PORT" --table < "$PROXY_CONF_DIR/config.sql"), quote all variable
expansions ("$ADMIN_HOST", "$ADMIN_PORT", "$ADMIN_USER", "$ADMIN_PWD",
"$PROXY_CONF_DIR"), and disable shell tracing around sensitive commands (remove
global set -x or wrap with set +x before the sensitive mysql invocations and
restore set -x after if needed) so secrets are not exposed in CI logs.
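That the env-var form keeps the secret out of argv can be shown with a plain sh child standing in for mysql (all values are made up):

```shell
#!/bin/sh
ADMIN_PWD="s3cret"   # hypothetical credential for the demo

# The child process receives the secret through its environment...
MYSQL_PWD="$ADMIN_PWD" sh -c 'printf "env:  %s\n" "$MYSQL_PWD"'

# ...while its argument list (what `ps` would show) carries no password:
MYSQL_PWD="$ADMIN_PWD" sh -c 'printf "argv: %s\n" "$*"' sh -h127.0.0.1 -uadmin
```

A `VAR=value cmd` prefix exports the variable only for that one command, so the secret never appears in the process table or in `set -x` traces of the command line itself.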
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- test/tap/proxysql-ca.pem is excluded by !**/*.pem
- test/tap/proxysql-cert.pem is excluded by !**/*.pem
- test/tap/proxysql-key.pem is excluded by !**/*.pem
📒 Files selected for processing (7)
- lib/PgSQL_Thread.cpp
- test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
- test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash
- test/tap/tap/utils.h
- test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh
- test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
- test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp
✅ Files skipped from review due to trivial changes (1)
- test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
🚧 Files skipped from review as they are similar to previous changes (2)
- test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh
- test/tap/tap/utils.h
…monitoring
- Fix SQL injection in pt-heartbeat query concatenation.
- Fix copy-paste errors between mysql_thread_ and pgsql_thread_ variables.
- Restore monitor_replication_lag_group_by_host logic.
Signed-off-by: René Cannaò <rene@proxysql.com>
- Fallback to 'cmake' if 'cmake3' is not found.
- Fixes build failures on systems where CMake 3.x is provided as 'cmake' (Ubuntu/Debian/Fedora).
Actionable comments posted: 4
🧹 Nitpick comments (1)
include/proxysql_structs.h (1)
1536-1541: Keep the PgSQL monitor externs in the same order as the definitions.
The extern block now orders these symbols differently than the PROXYSQL_EXTERN definition block around pgsql_thread___monitor_read_only_max_timeout_count. Re-aligning them would make future sync checks less error-prone.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/proxysql_structs.h` around lines 1536 - 1541, Reorder the extern declarations so they match the same order as their PROXYSQL_EXTERN definitions: ensure the sequence of pgsql_thread___monitor_replication_lag_interval, pgsql_thread___monitor_replication_lag_interval_window, pgsql_thread___monitor_replication_lag_timeout, pgsql_thread___monitor_replication_lag_count, pgsql_thread___monitor_replication_lag_use_percona_heartbeat, and pgsql_thread___monitor_read_only_max_timeout_count in include/proxysql_structs.h matches the definition block order; adjust the extern block ordering to be identical to the PROXYSQL_EXTERN definition block to keep symbols (e.g., pgsql_thread___monitor_replication_lag_interval, pgsql_thread___monitor_replication_lag_count, pgsql_thread___monitor_read_only_max_timeout_count) aligned for future sync checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/PgSQL_HostGroups_Manager.cpp`:
- Line 2719: The build fails because
pgsql_thread___monitor_replication_lag_group_by_host is used but not
declared/defined; add an extern declaration for
pgsql_thread___monitor_replication_lag_group_by_host (same type as the MySQL
counterpart mysql_thread___monitor_replication_lag_group_by_host) alongside the
existing thread flags in the header where those flags are declared, then add a
corresponding definition and initialization in the source file where the other
PGSQL thread globals are defined/initialized (follow the pattern used for
mysql_thread___monitor_replication_lag_group_by_host and its initialization in
the PgSQL thread init code) so the symbol is available at link time.
In `@lib/PgSQL_Monitor.cpp`:
- Around line 1849-1853: The replication-lag action currently passes a hardcoded
hostgroup id 0 in the is_task_success branch; update the call site in
PgSQL_Monitor.cpp (inside the is_task_success block that reads op_st.op_result
and repl_lag_res_t) to pass the real hostgroup id instead of 0 by plumbing the
hostgroup_id through the task state/repl_srvs/mon_srv_t or by reading
hostgroup_id from the existing task/mon_srv_t structure available in that
context, then call PgHGM->replication_lag_action with { hostgroup_id, srv.addr,
srv.port, op_result->val, false } so the correct hostgroup is used when
pgsql_thread___monitor_replication_lag_group_by_host == false.
- Around line 665-674: In the repl_lag handling branch (st.task.type ==
task_type_t::repl_lag) avoid calling std::atoi on a possibly NULL field: use
PQgetisnull(res, 0, 0) to detect NULL before calling PQgetvalue, and handle the
NULL case (e.g., set an explicit sentinel like -1 or an "unknown" value) instead
of parsing; then pass that sentinel into set_finish_st via the repl_lag_res_t as
before (so replace the direct std::atoi(PQgetvalue(...)) call with a PQgetisnull
check and conditional assignment to value prior to creating the op_result_t).
In `@test/tap/groups/groups.json`:
- Around line 302-303: There is a duplicate JSON key
"unit-strip_schema_from_query-t" in the groups JSON object; remove one of the
two entries so the key appears only once (keep the intended mapping to
"unit-tests-g1"), ensuring the resulting JSON is still valid (no trailing
commas) and passes linting so TAP group resolution works correctly.
---
Nitpick comments:
In `@include/proxysql_structs.h`:
- Around line 1536-1541: Reorder the extern declarations so they match the same
order as their PROXYSQL_EXTERN definitions: ensure the sequence of
pgsql_thread___monitor_replication_lag_interval,
pgsql_thread___monitor_replication_lag_interval_window,
pgsql_thread___monitor_replication_lag_timeout,
pgsql_thread___monitor_replication_lag_count,
pgsql_thread___monitor_replication_lag_use_percona_heartbeat, and
pgsql_thread___monitor_read_only_max_timeout_count in include/proxysql_structs.h
matches the definition block order; adjust the extern block ordering to be
identical to the PROXYSQL_EXTERN definition block to keep symbols (e.g.,
pgsql_thread___monitor_replication_lag_interval,
pgsql_thread___monitor_replication_lag_count,
pgsql_thread___monitor_read_only_max_timeout_count) aligned for future sync
checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cdc806cb-395a-49ae-8d7a-7fbf7601e40a
📒 Files selected for processing (8)
- include/PgSQL_Thread.h
- include/proxysql_structs.h
- lib/PgSQL_HostGroups_Manager.cpp
- lib/PgSQL_Monitor.cpp
- lib/PgSQL_Thread.cpp
- test/tap/groups/groups.json
- test/tap/tap/Makefile
- test/tap/tap/utils.h
🚧 Files skipped from review as they are similar to previous changes (3)
- test/tap/tap/utils.h
- include/PgSQL_Thread.h
- test/tap/tap/Makefile
lib/PgSQL_HostGroups_Manager.cpp
Outdated
  const int current_replication_lag = std::get<PgSQL_REPLICATION_LAG_SERVER_T::PG_RLS_CURRENT_REPLICATION_LAG>(server);

- if (mysql_thread___monitor_replication_lag_group_by_host == false) {
+ if (pgsql_thread___monitor_replication_lag_group_by_host == false) {
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -e
# Expect at least one declaration and one definition for the new PgSQL thread-local.
rg -nP '(__thread|extern)\s+int\s+pgsql_thread___monitor_replication_lag_group_by_host\b' .
echo
echo "All references:"
rg -nP '\bpgsql_thread___monitor_replication_lag_group_by_host\b' .
🏁 Script executed:
sed -n '2715,2725p' lib/PgSQL_HostGroups_Manager.cpp
🏁 Script executed:
rg -nP '\bmysql_thread___monitor_replication_lag_group_by_host\b' . | head -20
Add declaration and definition for pgsql_thread___monitor_replication_lag_group_by_host.
The PgSQL variant is used at line 2719 but is missing from all declaration and definition locations. The MySQL variant exists in include/proxysql_structs.h (lines 1380, 1708) and is initialized in lib/PgSQL_Thread.cpp. The PgSQL variant must be added to these same locations, or the build will fail with an undefined reference error.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@lib/PgSQL_HostGroups_Manager.cpp` at line 2719, The build fails because
pgsql_thread___monitor_replication_lag_group_by_host is used but not
declared/defined; add an extern declaration for
pgsql_thread___monitor_replication_lag_group_by_host (same type as the MySQL
counterpart mysql_thread___monitor_replication_lag_group_by_host) alongside the
existing thread flags in the header where those flags are declared, then add a
corresponding definition and initialization in the source file where the other
PGSQL thread globals are defined/initialized (follow the pattern used for
mysql_thread___monitor_replication_lag_group_by_host and its initialization in
the PgSQL thread init code) so the symbol is available at link time.
} else if (st.task.type == task_type_t::repl_lag) {
	const char* value_str { PQgetvalue(res, 0, 0) };
	int32_t value { std::atoi(value_str) };

	set_finish_st(st, ASYNC_QUERY_END,
		op_result_t {
			new repl_lag_res_t { value },
			[] (void* v) { delete static_cast<repl_lag_res_t*>(v); }
		}
	);
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the exact code mentioned
sed -n '665,674p' lib/PgSQL_Monitor.cpp
🏁 Script executed:
# Search for REPLICATION_LAG_QUERY definition
rg "REPLICATION_LAG_QUERY" -nRepository: sysown/proxysql
Length of output: 428
🏁 Script executed:
# Check context around lines 665-674
sed -n '660,680p' lib/PgSQL_Monitor.cpp
🏁 Script executed:
# Search for PQgetisnull usage in the codebase to see if it's already used elsewhere
rg "PQgetisnull" -nRepository: sysown/proxysql
Length of output: 1840
🏁 Script executed:
# Search for similar query result handling patterns
rg "PQgetvalue" -n -B2 -A2Repository: sysown/proxysql
Length of output: 40669
🏁 Script executed:
# Check the REPLICATION_LAG_QUERY definition
sed -n '45,70p' lib/PgSQL_Monitor.cpp
🏁 Script executed:
# Check if there's similar NULL handling issue with readonly
sed -n '650,670p' lib/PgSQL_Monitor.cpp
🏁 Script executed:
# Search for set_failed_st function to verify if the proposed fix approach exists
rg "set_failed_st" -n -B2 -A2Repository: sysown/proxysql
Length of output: 3649
Check for NULL replication-lag result before parsing.
REPLICATION_LAG_QUERY can return NULL when pg_last_xact_replay_timestamp() is NULL (standby with no replay history). Calling PQgetvalue(res, 0, 0) on a NULL field and passing it directly to std::atoi() results in undefined behavior—typically converting to 0, which masks an unknown lag as a healthy state and can incorrectly unshun a lagged server. Use PQgetisnull() to check before parsing.
Proposed fix
} else if (st.task.type == task_type_t::repl_lag) {
- const char* value_str { PQgetvalue(res, 0, 0) };
- int32_t value { std::atoi(value_str) };
-
- set_finish_st(st, ASYNC_QUERY_END,
- op_result_t {
- new repl_lag_res_t { value },
- [] (void* v) { delete static_cast<repl_lag_res_t*>(v); }
- }
- );
+ if (PQgetisnull(res, 0, 0)) {
+ set_failed_st(
+ st,
+ ASYNC_QUERY_FAILED,
+ mf_unique_ptr<char>(strdup("NULL replication_lag"))
+ );
+ } else {
+ const char* value_str { PQgetvalue(res, 0, 0) };
+ int32_t value { static_cast<int32_t>(std::strtol(value_str, nullptr, 10)) };
+
+ set_finish_st(st, ASYNC_QUERY_END,
+ op_result_t {
+ new repl_lag_res_t { value },
+ [] (void* v) { delete static_cast<repl_lag_res_t*>(v); }
+ }
+ );
+ }
 } else {

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
} else if (st.task.type == task_type_t::repl_lag) {
	if (PQgetisnull(res, 0, 0)) {
		set_failed_st(
			st,
			ASYNC_QUERY_FAILED,
			mf_unique_ptr<char>(strdup("NULL replication_lag"))
		);
	} else {
		const char* value_str { PQgetvalue(res, 0, 0) };
		int32_t value { static_cast<int32_t>(std::strtol(value_str, nullptr, 10)) };

		set_finish_st(st, ASYNC_QUERY_END,
			op_result_t {
				new repl_lag_res_t { value },
				[] (void* v) { delete static_cast<repl_lag_res_t*>(v); }
			}
		);
	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@lib/PgSQL_Monitor.cpp` around lines 665 - 674, In the repl_lag handling
branch (st.task.type == task_type_t::repl_lag) avoid calling std::atoi on a
possibly NULL field: use PQgetisnull(res, 0, 0) to detect NULL before calling
PQgetvalue, and handle the NULL case (e.g., set an explicit sentinel like -1 or
an "unknown" value) instead of parsing; then pass that sentinel into
set_finish_st via the repl_lag_res_t as before (so replace the direct
std::atoi(PQgetvalue(...)) call with a PQgetisnull check and conditional
assignment to value prior to creating the op_result_t).
| "unit-strip_schema_from_query-t": [ "unit-tests-g1" ], | ||
| "test_pgsql_replication_lag-t": [ "pgsql-repl" ], |
Remove the duplicated unit-strip_schema_from_query-t mapping.
Line 302 re-declares a key that already exists at Line 305. This makes the JSON object ambiguous and currently fails lint, which can block TAP group resolution in CI. Keep only one declaration of unit-strip_schema_from_query-t.
Suggested fix
- "unit-strip_schema_from_query-t": [ "unit-tests-g1" ],
"test_pgsql_replication_lag-t": [ "pgsql-repl" ],
"mcp_query_rules-t": [ "ai-g1" ],
"unit-strip_schema_from_query-t": [ "unit-tests-g1" ]🧰 Tools
🪛 Biome (2.4.4)
[error] 302-302: The key unit-strip_schema_from_query-t was already declared.
(lint/suspicious/noDuplicateObjectKeys)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/tap/groups/groups.json` around lines 302 - 303, There is a duplicate
JSON key "unit-strip_schema_from_query-t" in the groups JSON object; remove one
of the two entries so the key appears only once (keep the intended mapping to
"unit-tests-g1"), ensuring the resulting JSON is still valid (no trailing
commas) and passes linting so TAP group resolution works correctly.
- Add hostgroup_id to mon_srv_t and ext_srvs.
- Update fetch_updated_conf queries to return hostgroup_id.
- Use real hostgroup_id in replication_lag_action call.




This PR adds initial support for replication lag for PostgreSQL monitoring. It can be configured through the following variables:
These variables work the same way as those for other PostgreSQL monitoring checks, and the behavior of the replication checks and actions is analogous to the MySQL feature. One more variable was planned but is currently disabled:
This variable has been disabled until pt-heartbeat support for PostgreSQL is fixed. See https://perconadev.atlassian.net/browse/PT-2030.

Testing
A new TAP group with dedicated infra has been introduced for testing the replication lag checks and actions against real infrastructure. This initial testing provides automated coverage for the feature.
Extra
I'm attaching an extra patch that can be used as a template for implementing connection matching in the monitoring connection pool. It matches connections based on the parameters used when creating the connection itself. Since the connection pool automatically cleans up unused or errored connections, this patch allows smooth config changes without monitoring errors occurring, and leverages the connection cleanup performed during the pool's regular maintenance.
mon_connpool_match.patch.zip
This patch isn't needed right now, and there is no automated testing for it, but it's left here for future use, just in case.
Summary by CodeRabbit
New Features
Tests
Documentation
Chores