
feat: add 'replication_lag' support for PostgreSQL monitoring#5422

Merged
renecannao merged 19 commits into v3.0 from v3.0-pgsql_monitor_repl_lag on Mar 6, 2026

Conversation

@JavierJF (Collaborator) commented Feb 24, 2026

This PR adds initial support for replication lag for PostgreSQL monitoring. It can be configured through the following variables:

  • pgsql-monitor_replication_lag_interval
  • pgsql-monitor_replication_lag_interval_window
  • pgsql-monitor_replication_lag_timeout
  • pgsql-monitor_replication_lag_count

These variables work the same way as the corresponding variables for other PostgreSQL monitoring checks, and the behavior of the replication checks and actions is analogous to the MySQL feature. One additional variable was planned but is currently disabled:

  • pgsql-monitor_replication_lag_use_percona_heartbeat

This variable has been disabled until pt-heartbeat support for PostgreSQL is fixed. See https://perconadev.atlassian.net/browse/PT-2030.
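As a rough mental model of how these variables fit together, here is a minimal, hypothetical C++ sketch. It is not code from this PR: the struct and function names are invented, and the exact semantics of the window (spreading checks so they don't all fire at once) are an assumption based on how analogous monitor settings usually behave.

```cpp
#include <cstdint>

// Mirrors the four 'pgsql-monitor_replication_lag_*' settings (illustrative only).
struct repl_lag_conf_t {
    uint64_t interval_us;        // pgsql-monitor_replication_lag_interval
    uint64_t interval_window_us; // pgsql-monitor_replication_lag_interval_window
    uint64_t timeout_us;         // pgsql-monitor_replication_lag_timeout (bounds each check query)
    uint32_t count;              // pgsql-monitor_replication_lag_count (consecutive checks before acting)
};

// Assumed scheduling: the next check fires one interval ahead, plus a
// per-server offset inside the window so checks against many servers
// do not all land on the same instant.
uint64_t next_check_at(uint64_t now_us, const repl_lag_conf_t& conf, uint64_t jitter_us) {
    uint64_t offset = conf.interval_window_us ? (jitter_us % conf.interval_window_us) : 0;
    return now_us + conf.interval_us + offset;
}
```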

Testing

A new TAP group with dedicated infrastructure has been introduced to test the replication lag checks and actions against real servers. This provides initial automated test coverage for the feature.

Extra

I'm attaching an extra patch that can be used as a template for implementing connection matching in the monitoring connection pool. Connections are matched against the parameters used when they were created. Since the connection pool automatically cleans up unused or errored connections, this approach allows smooth config changes without monitoring errors occurring, leveraging the pool's regular maintenance for connection cleanup.

mon_connpool_match.patch.zip

This patch isn't needed right now, and there is no automated testing for it, but it's left here for future use, just in case.
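The connection-matching idea can be sketched like this: a hypothetical C++ outline of matching pooled connections by their creation parameters. All names here are invented for illustration; the attached patch is the authoritative version.

```cpp
#include <string>
#include <vector>

// Parameters a monitor connection was created with (illustrative subset).
struct conn_params_t {
    std::string host;
    int port;
    std::string user;
    std::string dbname;

    bool operator==(const conn_params_t& o) const {
        return host == o.host && port == o.port && user == o.user && dbname == o.dbname;
    }
};

struct pooled_conn_t {
    conn_params_t params; // parameters recorded at creation time
    // ... the real connection handle would live here
};

// Return the index of a reusable connection, or -1 if none matches the
// current configuration. After a config change, stale connections simply
// stop matching and are left to regular pool maintenance to clean up.
int find_matching_conn(const std::vector<pooled_conn_t>& pool, const conn_params_t& wanted) {
    for (size_t i = 0; i < pool.size(); i++) {
        if (pool[i].params == wanted) return static_cast<int>(i);
    }
    return -1;
}
```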

Summary by CodeRabbit

  • New Features

    • PostgreSQL replication-lag monitoring: periodic checks, logging, OK/ERR counters in global status, and automated shunning when thresholds are exceeded.
    • New runtime setting: replication-lag interval window plus timeout/ping controls.
  • Tests

    • End-to-end replication-lag test suite with Docker-based infra, scripts, and test cases.
  • Documentation

    • README and setup instructions for replication-lag tests.
  • Chores

    • Deprecated legacy replication/topology variables and adjusted related runtime exposure.

@coderabbitai bot commented Feb 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds PostgreSQL replication-lag monitoring: new repl_lag task type, scheduling and worker integration, query handling (including pt-heartbeat), SQLite logging table and metrics, thread-local replication-lag variables, shunning logic adjustments, and TAP test infra for replication-lag scenarios.

Changes

Cohort / File(s) — Summary

  • Core Headers — include/PgSQL_Monitor.hpp, include/PgSQL_Thread.h, include/proxysql_structs.h: Declared the SQLite table constant for pgsql_server_replication_lag_log, added repl_lag counters to public metrics, introduced thread-local replication-lag variables (interval, interval_window, timeout, count, use_percona_heartbeat), and added MySQL error ER_PROXYSQL_REPL_LAG_TIMEOUT (9023).
  • Monitor Implementation — lib/PgSQL_Monitor.cpp: Added task_type_t::repl_lag, a task-type string mapper, repl_lag SQL queries (including pt-heartbeat), new result/params/config structs, config loading, get_task_query, async result parsing, timeout/ping handling, scheduling (next_repl_lag_at), maintenance (update_repl_lag_table, perf_repl_lag_actions), and integrated repl_lag into the monitor lifecycle and scheduler.
  • Thread & Hostgroup Logic — lib/PgSQL_Thread.cpp, lib/PgSQL_HostGroups_Manager.cpp: Introduced monitor_replication_lag_interval_window, exposed replication-lag OK/ERR counters in global status, removed/disabled legacy topology/host-group and Percona heartbeat exposures, and switched shunning checks to use pgsql-specific replication-lag count/grouping variables.
  • TAP Test Infrastructure — test/tap/groups/pgsql-repl/*: Added full PostgreSQL primary/replica test infra: docker-compose.yaml, Dockerfile, init SQL, ProxySQL config.sql, env/constants, init/destroy scripts, post-init hooks, and network throttling scripts to simulate replication lag.
  • End-to-end Test — test/tap/tests/test_pgsql_replication_lag-t.cpp, .../test_pgsql_replication_lag-t.env, test/tap/groups/groups.json: New TAP test exercising replication-lag detection, reads/writes to pgsql_server_replication_lag_log, shunning checks, log inspection, and test-group mapping.
  • Test Utilities / Macros & Call Sites — test/tap/tap/utils.h, test/tap/tests/*.cpp: Changed macro CHECK_EXT_VAL signature to CHECK_EXT_VAL(conn, val), added SQ3_CHECK_EXT_VAL(val), updated call sites to pass connection/admin context, and added sqlite-specific error helper usage.
  • Scripts & Build Adjustments — test/tap/tap/Makefile, various test/tap/tests/* scripts: Added a CMake policy flag to some cmake invocations and numerous orchestration scripts to support the new PGSQL replication tests.

Sequence Diagram(s)

sequenceDiagram
    participant Scheduler as Scheduler/Worker
    participant Monitor as PgSQL_Monitor
    participant Async as Async Task Handler
    participant PgDB as PostgreSQL
    participant LogDB as SQLite3
    participant Manager as HostGroups_Manager

    Scheduler->>Monitor: compute_next_intvs()/build_intv_batches -> schedule repl_lag
    Scheduler->>Async: start async repl_lag task
    Async->>PgDB: execute repl_lag query (regular or pt-heartbeat)
    PgDB-->>Async: return lag_value or error/timeout
    Async->>Monitor: handle_async_check_cont() with result
    Monitor->>Monitor: proc_task_state() -> perf_repl_lag_actions()
    Monitor->>LogDB: update_repl_lag_table() (insert log row)
    Monitor->>Manager: replication_lag_action_inner() evaluate counts/thresholds
    alt lag exceeds threshold count
        Manager->>Manager: mark server SHUNNED
    else
        Manager->>Manager: clear SHUNNED if needed
    end
    Monitor->>Monitor: increment repl_lag_check_OK/ERR counters
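The shunning step in the diagram above can be sketched as follows: a minimal, hypothetical C++ model assuming MySQL-style semantics (a replica is acted on after pgsql-monitor_replication_lag_count consecutive checks above the server's max_replication_lag, and restored once a check falls back under it). None of these names are the PR's actual symbols.

```cpp
#include <cstdint>

enum class server_status_t { ONLINE, SHUNNED_REPLICATION_LAG };

struct lagged_server_t {
    server_status_t status = server_status_t::ONLINE;
    uint32_t consecutive_lagged = 0; // checks above threshold so far
};

// Apply one replication-lag check result to a server's state.
// Shun only after 'count' consecutive over-threshold checks; a single
// under-threshold check clears the streak and restores the server.
void on_repl_lag_check(lagged_server_t& srv, int32_t lag_s,
                       int32_t max_replication_lag, uint32_t count) {
    if (lag_s > max_replication_lag) {
        if (++srv.consecutive_lagged >= count) {
            srv.status = server_status_t::SHUNNED_REPLICATION_LAG;
        }
    } else {
        srv.consecutive_lagged = 0;
        srv.status = server_status_t::ONLINE;
    }
}
```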

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hopped through logs with tiny feet and pen,
Counting heartbeats of replicas now and then,
When lag stacks up I thump and sound the bell,
I note each tick in rows where slow ones dwell,
Happy whiskers — monitoring done again!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 13.33%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title clearly and concisely summarizes the main change: adding replication lag monitoring support for PostgreSQL, which is the primary objective of this PR.



@gemini-code-assist bot commented

Summary of Changes

Hello @JavierJF, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances PostgreSQL monitoring capabilities by integrating replication lag detection. It enables ProxySQL to actively monitor the replication status of PostgreSQL instances, apply configurable thresholds, and perform shunning actions when replication lag exceeds defined limits. This feature aims to improve the reliability and availability of PostgreSQL deployments managed by ProxySQL, ensuring that client connections are not routed to severely lagged replicas.

Highlights

  • PostgreSQL Replication Lag Monitoring: Introduced new functionality to monitor PostgreSQL replication lag, allowing ProxySQL to track and react to replication delays.
  • Configurable Variables: Added several new configurable variables for replication lag monitoring, including pgsql-monitor_replication_lag_interval, _interval_window, _timeout, and _count, providing fine-grained control over the monitoring behavior.
  • Percona Toolkit Heartbeat Support (Disabled): Included initial support for pt-heartbeat for replication lag measurement, although this feature is currently disabled due to an upstream issue (PT-2030).
  • New Testing Infrastructure: A dedicated TAP group with specific Docker-based infrastructure has been added to provide automated testing for the new replication lag checks and shunning actions.
  • Internal Monitoring Logic Updates: Extended the internal PostgreSQL monitor logic to handle replication lag tasks, including new task types, result structures, query generation, and state processing for timeouts and success/failure.


Changelog
  • include/PgSQL_Monitor.hpp
    • Added a new SQLite table definition for logging PostgreSQL server replication lag.
    • Introduced new counters (repl_lag_check_ERR, repl_lag_check_OK) to the PgSQL_Monitor struct for tracking replication lag check outcomes.
    • Registered the new replication lag log table within the internal monitor table definitions.
  • include/PgSQL_Thread.h
    • Added monitor_replication_lag_interval_window to the PgSQL_Threads_Handler variables.
    • Declared new thread-local variables for replication lag monitoring parameters (interval, interval_window, timeout, count, use_percona_heartbeat).
  • include/proxysql_structs.h
    • Defined a new error code ER_PROXYSQL_REPL_LAG_TIMEOUT for replication lag timeouts.
    • Added declarations for thread-local variables related to PostgreSQL replication lag monitoring parameters.
  • lib/PgSQL_HostGroups_Manager.cpp
    • Corrected variable names from mysql_thread___monitor_replication_lag_count to pgsql_thread___monitor_replication_lag_count in replication lag action logic.
    • Temporarily commented out the monitor_replication_lag_group_by_host check in replication_lag_action.
  • lib/PgSQL_Monitor.cpp
    • Defined SQL queries (REPLICATION_LAG_QUERY, REPLICATION_LAG_QUERY_PT_HEARTBEAT) for detecting PostgreSQL replication lag.
    • Extended the task_type_t enum and get_task_type_str function to include repl_lag tasks.
    • Introduced new data structures (repl_lag_res_t, repl_lag_params_t, repl_lag_conf_t) to manage replication lag results and configuration.
    • Integrated replication lag configuration into the tasks_conf_t structure and the fetch_updated_conf function.
    • Updated asynchronous task handling functions (handle_async_check_cont, handle_async_connect_cont, handle_async_connect_end) to process replication lag checks.
    • Implemented get_task_query to dynamically select the appropriate replication lag query, including pt-heartbeat support.
    • Modified functions for connection pool cleanup intervals, task success/finish checks, and task timeouts to account for the new repl_lag task type.
    • Added update_repl_lag_table to persist replication lag data to SQLite and perf_repl_lag_actions to trigger shunning based on lag.
    • Introduced variable synchronization logic within the worker_thread to ensure up-to-date monitoring configurations.
    • Added maintenance queries (MAINT_REPLICATION_LAG_LOG_QUERY) and logic to maint_mon_tables for cleaning up replication lag log entries.
    • Updated task scheduling functions (build_intv_batches, compute_next_intvs, PgSQL_monitor_scheduler_thread) to incorporate replication lag monitoring intervals.
  • lib/PgSQL_Thread.cpp
    • Updated the list of PostgreSQL monitor variables to include monitor_replication_lag_interval_window and enable monitor_replication_lag_use_percona_heartbeat.
    • Initialized monitor_replication_lag_interval_window and monitor_replication_lag_use_percona_heartbeat in the PgSQL_Threads_Handler constructor.
    • Modified variable retrieval functions (get_variable_string, get_variable) to correctly handle the new pt-heartbeat variable.
    • Added monitor_replication_lag_interval_window to the integer variable pointers map.
    • Ensured proper memory management for pgsql_thread___monitor_replication_lag_use_percona_heartbeat in the destructor.
    • Updated refresh_variables to load all new replication lag related configuration parameters.
    • Added global status variables (PgSQL_Monitor_replication_lag_check_OK, PgSQL_Monitor_replication_lag_check_ERR) to SQL3_GlobalStatus.
  • test/tap/groups/groups.json
    • Added the test_pgsql_replication_lag-t test to the pgsql-repl test group.
  • test/tap/groups/pgsql-repl/Dockerfile
    • Added a Dockerfile to create a PostgreSQL 17 image with iproute2 for network manipulation in tests.
  • test/tap/groups/pgsql-repl/README.md
    • Provided documentation for setting up and managing the PostgreSQL primary-replica testing infrastructure.
  • test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
    • Added a script to wait for the PostgreSQL service to initialize and create necessary test table structures.
  • test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash
    • Added a script to wait for the ProxySQL admin interface to be ready and apply initial configurations for PostgreSQL monitoring.
  • test/tap/groups/pgsql-repl/conf/postgres/00_init.sql
    • Added an SQL script for initial PostgreSQL setup, including user creation (repluser, proxymon), replication slot setup, and database creation.
  • test/tap/groups/pgsql-repl/conf/proxysql/config.sql
    • Configured ProxySQL with PostgreSQL monitor credentials, replication lag interval, and defined PostgreSQL replication hostgroups and servers with max_replication_lag.
  • test/tap/groups/pgsql-repl/constants
    • Defined environment variables for PostgreSQL and ProxySQL admin connection details used in the testing environment.
  • test/tap/groups/pgsql-repl/docker-compose-destroy.bash
    • Added a script to tear down the Docker Compose environment, removing containers and volumes.
  • test/tap/groups/pgsql-repl/docker-compose-init.bash
    • Added a script to build Docker images, start the Docker Compose environment, and execute post-setup scripts for PostgreSQL and ProxySQL.
  • test/tap/groups/pgsql-repl/docker-compose.yaml
    • Provided a Docker Compose configuration for a PostgreSQL primary-replica setup, including network configurations and volume mounts for initialization scripts.
  • test/tap/groups/pgsql-repl/post-proxysql.bash
    • Added a script to shut down the PostgreSQL replication testing infrastructure.
  • test/tap/groups/pgsql-repl/pre-proxysql.bash
    • Added a script to clean up and then start the PostgreSQL replication testing infrastructure.
  • test/tap/groups/pgsql-repl/scripts/create_test_tables.sql
    • Added an SQL script to create a sbtest1 table in the sysbench database for testing purposes.
  • test/tap/tap/Makefile
    • Updated CMake policy version for cpp-dotenv to 3.5 in the Makefile.
  • test/tap/tap/utils.h
    • Modified the CHECK_EXT_VAL macro to accept a connection object, improving error logging context.
  • test/tap/tests/test_match_eof_conn_cap.cpp
    • Updated calls to the CHECK_EXT_VAL macro to pass the admin connection object.
  • test/tap/tests/test_pgsql_replication_lag-t.cpp
    • Added a new C++ test case (test_pgsql_replication_lag-t.cpp) to verify PostgreSQL replication lag monitoring and shunning functionality, including network throttling simulation.
    • Introduced PG_QUERY_T macro for PostgreSQL query execution and error handling within tests.
    • Implemented logic to create PostgreSQL connections, insert data to induce lag, and check ProxySQL's response (shunning, logging).
  • test/tap/tests/test_pgsql_replication_lag-t.env
    • Added environment variables (TAP_PGSQL_USERNAME, TAP_PGSQL_PASSWORD) required for the new PostgreSQL replication lag test.
  • test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh
    • Added a shell script to remove network throttling rules applied to the primary PostgreSQL instance.
  • test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
    • Added a shell script to set up network throttling on the primary PostgreSQL instance to simulate replication lag.
  • test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp
    • Updated calls to the CHECK_EXT_VAL macro to pass the admin connection object.
Activity
  • Initial implementation of PostgreSQL replication lag monitoring feature.
  • New configuration variables for replication lag interval, window, timeout, and count were introduced.
  • A new testing infrastructure using Docker Compose for PostgreSQL primary-replica setup was added to validate the feature.
  • The pt-heartbeat integration for PostgreSQL replication lag is noted as temporarily disabled due to an external bug (PT-2030).

@gemini-code-assist bot left a comment


Code Review

This pull request introduces comprehensive PostgreSQL replication lag monitoring capabilities. Key changes include adding a new SQLite table (pgsql_server_replication_lag_log) to store replication lag data, new counters (repl_lag_check_ERR, repl_lag_check_OK), and a new error code (ER_PROXYSQL_REPL_LAG_TIMEOUT). The monitoring logic now includes specific SQL queries for replication lag detection, with an option to use Percona Heartbeat. New configuration variables (monitor_replication_lag_interval_window, monitor_replication_lag_use_percona_heartbeat) have been added and integrated into the thread handler and global status reporting. The code also includes mechanisms for scheduling replication lag checks, processing results, updating the SQLite log table, and performing hostgroup actions based on the detected lag. A new test group and test case (test_pgsql_replication_lag-t) have been added to validate this functionality, along with Docker Compose infrastructure for PostgreSQL primary-replica setup. Review comments highlighted the need for const-correctness in error counter updates, clarified the incomplete REPLICATION_LAG_QUERY_PT_HEARTBEAT query, addressed inconsistencies in handling the monitor_replication_lag_use_percona_heartbeat variable, and suggested avoiding system() calls in tests for security reasons.

@coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
lib/PgSQL_HostGroups_Manager.cpp (1)

2719-2731: ⚠️ Potential issue | 🟠 Major

PgSQL implementation ignores the monitor_replication_lag_group_by_host configuration variable.

The flag is registered as a configuration variable in PgSQL_Thread.cpp (line 2176, initialized and loaded from config at line 4035), but PgSQL_HostGroups_Manager::replication_lag_action() hard-codes the condition to false with the variable reference commented out. This diverges from the MySQL implementation (MySQL_HostGroups_Manager.cpp:2751), which correctly evaluates the variable. Either fix the PgSQL version to use the actual variable like MySQL does, or remove the flag registration and document the intentional divergence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/PgSQL_HostGroups_Manager.cpp` around lines 2719 - 2731, In
PgSQL_HostGroups_Manager::replication_lag_action(), restore use of the actual
configuration flag instead of the hard-coded "false": replace the commented-out
condition with the real symbol (monitor_replication_lag_group_by_host or the
exact variable name registered in PgSQL_Thread.cpp) so the code branches like
the MySQL implementation; keep the existing calls to MyHGC_find(...) and the
loop over MyHostGroups with replication_lag_action_inner(...) intact, or if the
flag is intentionally unsupported, remove its registration in PgSQL_Thread.cpp
and document the divergence—do not leave the flag registered while the branch is
forced to false.
lib/PgSQL_Monitor.cpp (1)

654-715: ⚠️ Potential issue | 🟠 Major

Add NULL and precision checks for replication lag values.

The code currently lacks a NULL check before parsing replication lag. When PostgreSQL is not in standby recovery, pg_last_xact_replay_timestamp() returns NULL, and the entire CASE expression evaluates to NULL. Using PQgetvalue() without PQgetisnull() followed by std::atoi() treats NULL as 0, preventing distinction between "no lag" and "not in recovery"—this masks problematic replicas and affects shunning decisions. Additionally, fractional seconds are truncated by int32_t and atoi(), losing precision.

Add PQgetisnull() check and parse with strtod() (or treat NULL as an error) to properly handle these cases before passing results to replication_lag_action().

Suggested fix for NULL/precision handling
-                } else if (st.task.type == task_type_t::repl_lag) {
-                    const char* value_str { PQgetvalue(res, 0, 0) };
-                    int32_t value { std::atoi(value_str) };
-
-                    set_finish_st(st, ASYNC_QUERY_END,
-                        op_result_t {
-                            new repl_lag_res_t { value },
-                            [] (void* v) { delete static_cast<repl_lag_res_t*>(v); }
-                        }
-                    );
+                } else if (st.task.type == task_type_t::repl_lag) {
+                    if (PQgetisnull(res, 0, 0)) {
+                        set_failed_st(
+                            st, ASYNC_QUERY_FAILED,
+                            mf_unique_ptr<char>(strdup("replication_lag is NULL"))
+                        );
+                    } else {
+                        const char* value_str { PQgetvalue(res, 0, 0) };
+                        char* end = nullptr;
+                        double value_d = std::strtod(value_str, &end);
+                        if (end == value_str) {
+                            set_failed_st(
+                                st, ASYNC_QUERY_FAILED,
+                                mf_unique_ptr<char>(strdup("invalid replication_lag value"))
+                            );
+                        } else {
+                            int32_t value { static_cast<int32_t>(value_d) };
+                            set_finish_st(st, ASYNC_QUERY_END,
+                                op_result_t {
+                                    new repl_lag_res_t { value },
+                                    [] (void* v) { delete static_cast<repl_lag_res_t*>(v); }
+                                }
+                            );
+                        }
+                    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/PgSQL_Monitor.cpp` around lines 654 - 715, The repl_lag branch currently
calls PQgetvalue + std::atoi without NULL checking, losing NULL-vs-zero
semantics and fractional precision; update the repl_lag handling (the block
under "else if (st.task.type == task_type_t::repl_lag)") to first call
PQgetisnull(res, 0, 0) and treat NULL as an error case (call set_failed_st with
a clear error message via mf_unique_ptr<char> / strdup) or otherwise propagate a
distinct "not in recovery" result, and replace std::atoi with strtod to parse
fractional seconds into a floating type (update repl_lag_res_t to hold double if
needed) before calling set_finish_st so replication_lag_action receives accurate
non-NULL, high‑precision values.
🧹 Nitpick comments (3)
test/tap/groups/pgsql-repl/Dockerfile (1)

3-3: Trim apt-get footprint.

Line 3 should use --no-install-recommends and clean apt lists to keep the image lean.

♻️ Suggested change
-RUN apt-get update && apt-get install -y iproute2
+RUN apt-get update && apt-get install -y --no-install-recommends iproute2 \
+    && rm -rf /var/lib/apt/lists/*
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/groups/pgsql-repl/Dockerfile` at line 3, Update the RUN line that
installs iproute2 to reduce image size: modify the apt-get install invocation in
the Dockerfile (the RUN apt-get update && apt-get install -y iproute2 command)
to include --no-install-recommends and then clean up apt caches (e.g., apt-get
clean and remove /var/lib/apt/lists/*) in the same RUN layer so the image stays
lean.
test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash (1)

7-7: Quote the realpath/dirname expansion to avoid word-splitting.

Unquoted command substitutions can mis-handle paths containing spaces and are flagged by shellcheck.

🔧 Suggested fix
-PROXY_CONF_DIR=$(dirname $(realpath $0))/../conf/proxysql
+PROXY_CONF_DIR="$(dirname "$(realpath "$0")")/../conf/proxysql"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash` at line 7, The
assignment to PROXY_CONF_DIR uses unquoted command substitution which can break
on paths with spaces; update the line that computes PROXY_CONF_DIR (the
$(realpath $0) and dirname usage) to quote the command substitution results and
the $0 expansion so the dirname/realpath output is treated as a single word
(i.e., use quoted "$(realpath "$0")" and quote the overall substitution when
assigning PROXY_CONF_DIR).
lib/PgSQL_Thread.cpp (1)

1305-1306: Avoid a silent no-op for disabled pt-heartbeat config.
Since internal retrieval is disabled, the admin setter still accepts non-empty values that won’t propagate to worker threads. Consider rejecting non-empty values (or warning) while PT-2030 is unresolved to avoid confusing operators.

♻️ Suggested guard in set_variable()
-        if (!strcasecmp(name, "monitor_replication_lag_use_percona_heartbeat")) {
+        if (!strcasecmp(name, "monitor_replication_lag_use_percona_heartbeat")) {
+            if (vallen > 0) {
+                proxy_warning("pgsql-monitor_replication_lag_use_percona_heartbeat is disabled until PT-2030 is resolved\n");
+                return false;
+            }
             if (vallen == 0) { // empty string
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/PgSQL_Thread.cpp` around lines 1305 - 1306, The disabled pt-heartbeat
config is a silent no-op: update set_variable() to guard against assigning
non-empty values to "monitor_replication_lag_use_percona_heartbeat" by detecting
when name == "monitor_replication_lag_use_percona_heartbeat" and value is
non-empty, and then reject the change (return an error/validation failure) or at
minimum log/warn and force it to empty; reference the
variables.monitor_replication_lag_use_percona_heartbeat symbol and ensure the
setter does not accept/retain non-empty inputs while PT-2030 remains unresolved
so operators get immediate feedback.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash`:
- Line 7: The assignment for POSTGRE_SETUP_DIR uses unquoted command
substitutions and $0 which can break on paths with spaces; update the derivation
to quote the inner expansions so that realpath and dirname receive "$0" and the
command substitutions are quoted (i.e., use dirname on the quoted realpath of
"$0" and quote the whole substitution when assigning to POSTGRE_SETUP_DIR), and
ensure any later uses of POSTGRE_SETUP_DIR are also quoted to avoid word
splitting.
- Around line 15-23: The probe loop is failing because the psql invocation uses
invalid SQL and the timeout message references the wrong DB; update the psql
probe command (the line invoking PGPASSWORD=$PGSQL_PWD ON_ERROR_STOP=1 psql
-h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c"SELECT" > /dev/null 2>&1) to execute
a valid probe such as -c "SELECT 1;" (or another simple, valid SQL) and change
the timeout error echo (the line checking WAITED vs TIMEOUT) to say "PostgreSQL"
instead of "MySQL"; keep the existing variables RC, WAITED, TIMEOUT, PGSQL_PWD,
PGSQL_HOST, PGSQL_PORT, PGSQL_DB and redirecting output as before.

In `@test/tap/tap/utils.h`:
- Around line 264-270: The CHECK_EXT_VAL macro logs the wrong error reporter for
SQLite-derived ext_val_t values; add a parallel macro (e.g.,
CHECK_EXT_VAL_SQLITE(conn, val)) that mirrors CHECK_EXT_VAL but calls
sq3_get_ext_val_err(val) (and uses sqlite3_errstr) when val originates from
sq3_query_ext_val, then update the two test call sites that use
CHECK_EXT_VAL(admin, last_id) for SQLite results to use
CHECK_EXT_VAL_SQLITE(admin, last_id) so diagnostics show the real SQLite3 error.

In `@test/tap/tests/test_pgsql_replication_lag-t.cpp`:
- Around line 93-124: The loop waiting for the replica to become ONLINE can time
out and let the test continue, causing flaky assertions; after using
mysql_query_ext_val (q_st_check) and CHECK_EXT_VAL to read is_online, if the
wait loop exceeds the timeout (j > timeout) or the final is_online.val is false,
fail the test immediately with a clear failure message instead of breaking and
continuing; update the code around q_st_check / is_online / mysql_query_ext_val
/ CHECK_EXT_VAL to check post-loop whether the replica is ONLINE and call the
test failure routine (or ASSERT/FAIL equivalent) with a descriptive message so
the test fails fast when replication never becomes ONLINE.

In `@test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh`:
- Around line 7-8: The vintf resolution fails due to nested/unescaped quoting
inside the docker exec --user root ... sh -c command; update the command that
sets vintf so the container receives the literal $rsubnet and the internal
grep/echo quotes are preserved (e.g., stop using an outer double-quoted sh -c
string that interpolates $rsubnet locally). Specifically, modify the docker exec
invocation that assigns vintf to escape the dollar sign (use \$rsubnet) or
switch to a single-quoted sh -c argument and keep the inner "$rsubnet" and
quoted regex intact so echo, xargs, ip route show to match, grep -E "$rsubnet",
and cut -d' ' -f3 run inside the container correctly.

In `@test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh`:
- Around line 7-8: The quoted sh -c string for vintf is broken by nested double
quotes around $rsubnet; fix by escaping the inner double quotes so the command
passed to docker remains a single quoted string (e.g. change grep -E "$rsubnet"
to grep -E \"$rsubnet\" within the sh -c argument) or alternatively restructure
the quoting (use single quotes for the sh -c wrapper and concatenate the
expanded $rsubnet) so the docker exec sh -c "..." invocation, the vintf
assignment, and the variables ($rsubnet) are preserved and the pipeline (echo |
xargs ip route show to match | grep -E ... | cut -d' ' -f3) returns the expected
interface.
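The nested-quoting failure called out for both throttle scripts can be reproduced locally without docker. The subnet value below is a hypothetical stand-in, picked with an embedded space so the breakage shows up in the output:

```shell
rsubnet="172 18"   # hypothetical value; the space makes the quoting bug visible

# Broken: the inner double quotes close the outer string early, so $rsubnet
# is expanded unquoted and word-split before the child shell even starts.
broken=$(sh -c "echo "$rsubnet"")

# Fixed: escaping the inner quotes keeps the whole command one string, and
# the child shell receives the value intact.
fixed=$(sh -c "echo \"$rsubnet\"")

echo "broken: [$broken]"
echo "fixed:  [$fixed]"
```

In the broken form the child shell's script becomes `echo 172` with `18` left over as "$0", which is exactly how the container-side pipeline ends up filtering on a mangled pattern.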

In `@test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp`:
- Line 203: The test uses CHECK_EXT_VAL(admin, last_id) after sq3_query_ext_val
which may fail with SQLite-specific errors, but CHECK_EXT_VAL currently logs
mysql_error(admin) only; update the check to detect SQLite failures from last_id
(the return from sq3_query_ext_val) and log the SQLite error message instead:
either enhance the CHECK_EXT_VAL macro/utility in utils.h to branch when the DB
backend is SQLite (call sqlite3_errmsg on the sqlite handle stored in admin or
the relevant sqlite DB pointer) or replace this test call with an explicit check
that inspects last_id and, on non-MySQL errors, calls
sqlite3_errmsg(admin->sqlite_db) (or the correct sqlite handle) so the actual
SQLite error is logged when sq3_query_ext_val fails. Ensure references to
CHECK_EXT_VAL, last_id, sq3_query_ext_val and utils.h are updated accordingly.

---

Outside diff comments:
In `@lib/PgSQL_HostGroups_Manager.cpp`:
- Around line 2719-2731: In PgSQL_HostGroups_Manager::replication_lag_action(),
restore use of the actual configuration flag instead of the hard-coded "false":
replace the commented-out condition with the real symbol
(monitor_replication_lag_group_by_host or the exact variable name registered in
PgSQL_Thread.cpp) so the code branches like the MySQL implementation; keep the
existing calls to MyHGC_find(...) and the loop over MyHostGroups with
replication_lag_action_inner(...) intact, or if the flag is intentionally
unsupported, remove its registration in PgSQL_Thread.cpp and document the
divergence—do not leave the flag registered while the branch is forced to false.

In `@lib/PgSQL_Monitor.cpp`:
- Around line 654-715: The repl_lag branch currently calls PQgetvalue +
std::atoi without NULL checking, losing NULL-vs-zero semantics and fractional
precision; update the repl_lag handling (the block under "else if (st.task.type
== task_type_t::repl_lag)") to first call PQgetisnull(res, 0, 0) and treat NULL
as an error case (call set_failed_st with a clear error message via
mf_unique_ptr<char> / strdup) or otherwise propagate a distinct "not in
recovery" result, and replace std::atoi with strtod to parse fractional seconds
into a floating type (update repl_lag_res_t to hold double if needed) before
calling set_finish_st so replication_lag_action receives accurate non-NULL,
high‑precision values.

---

Nitpick comments:
In `@lib/PgSQL_Thread.cpp`:
- Around line 1305-1306: The disabled pt-heartbeat config is a silent no-op:
update set_variable() to guard against assigning non-empty values to
"monitor_replication_lag_use_percona_heartbeat" by detecting when name ==
"monitor_replication_lag_use_percona_heartbeat" and value is non-empty, and then
reject the change (return an error/validation failure) or at minimum log/warn
and force it to empty; reference the
variables.monitor_replication_lag_use_percona_heartbeat symbol and ensure the
setter does not accept/retain non-empty inputs while PT-2030 remains unresolved
so operators get immediate feedback.

In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash`:
- Line 7: The assignment to PROXY_CONF_DIR uses unquoted command substitution
which can break on paths with spaces; update the line that computes
PROXY_CONF_DIR (the $(realpath $0) and dirname usage) to quote the command
substitution results and the $0 expansion so the dirname/realpath output is
treated as a single word (i.e., use quoted "$(realpath "$0")" and quote the
overall substitution when assigning PROXY_CONF_DIR).

In `@test/tap/groups/pgsql-repl/Dockerfile`:
- Line 3: Update the RUN line that installs iproute2 to reduce image size:
modify the apt-get install invocation in the Dockerfile (the RUN apt-get update
&& apt-get install -y iproute2 command) to include --no-install-recommends and
then clean up apt caches (e.g., apt-get clean and remove /var/lib/apt/lists/*)
in the same RUN layer so the image stays lean.
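The lean-install pattern the comment asks for is a common Dockerfile idiom; a minimal sketch of the suggested RUN layer:

```dockerfile
RUN apt-get update \
 && apt-get install -y --no-install-recommends iproute2 \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*
```

Doing the cleanup in the same RUN layer matters: a later layer cannot shrink an earlier one, so caches removed in a separate step still ship in the image.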

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b710e5 and a4ea3b2.

📒 Files selected for processing (28)
  • include/PgSQL_Monitor.hpp
  • include/PgSQL_Thread.h
  • include/proxysql_structs.h
  • lib/PgSQL_HostGroups_Manager.cpp
  • lib/PgSQL_Monitor.cpp
  • lib/PgSQL_Thread.cpp
  • test/tap/groups/groups.json
  • test/tap/groups/pgsql-repl/Dockerfile
  • test/tap/groups/pgsql-repl/README.md
  • test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
  • test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash
  • test/tap/groups/pgsql-repl/conf/postgres/00_init.sql
  • test/tap/groups/pgsql-repl/conf/proxysql/config.sql
  • test/tap/groups/pgsql-repl/constants
  • test/tap/groups/pgsql-repl/docker-compose-destroy.bash
  • test/tap/groups/pgsql-repl/docker-compose-init.bash
  • test/tap/groups/pgsql-repl/docker-compose.yaml
  • test/tap/groups/pgsql-repl/post-proxysql.bash
  • test/tap/groups/pgsql-repl/pre-proxysql.bash
  • test/tap/groups/pgsql-repl/scripts/create_test_tables.sql
  • test/tap/tap/Makefile
  • test/tap/tap/utils.h
  • test/tap/tests/test_match_eof_conn_cap.cpp
  • test/tap/tests/test_pgsql_replication_lag-t.cpp
  • test/tap/tests/test_pgsql_replication_lag-t.env
  • test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh
  • test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
  • test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash (1)

20-20: ⚠️ Potential issue | 🔴 Critical

Invalid SELECT probe still causes the wait loop to always timeout.

-c"SELECT" is not valid SQL — psql will error on every iteration, RC never becomes 0, and the script exits after 300 s unconditionally. The error-message fix from the earlier review was applied, but this part of the same issue was not.

🐛 Proposed fix
-		PGPASSWORD=$PGSQL_PWD ON_ERROR_STOP=1 psql -h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c"SELECT" > /dev/null 2>&1
+		PGPASSWORD=$PGSQL_PWD ON_ERROR_STOP=1 psql -h"$PGSQL_HOST" -p"$PGSQL_PORT" -U"$PGSQL_DB" -c "SELECT 1" > /dev/null 2>&1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash` at line 20, The probe
command using psql currently runs -c"SELECT" which is not valid SQL and always
fails; update the probe in the docker-pgsql-post script (the PGPASSWORD... psql
-h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c"SELECT" invocation) to use a valid,
lightweight query such as -c "SELECT 1" (or "SELECT 1;") so the return code
becomes 0 when the server is ready and the wait loop can succeed.
test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp (1)

203-203: Correct macro now used for SQLite3-originated ext_val_t.

last_id is produced by sq3_query_ext_val, so SQ3_CHECK_EXT_VAL correctly routes to sqlite3_errstr for error reporting. Same fix applied at line 288. This resolves the diagnostic mismatch flagged in the previous review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp` at line 203,
Replace the incorrect ext_val check macro with the SQLite3-specific one: use
SQ3_CHECK_EXT_VAL when validating the result stored in last_id produced by
sq3_query_ext_val so error reporting uses sqlite3_errstr; update the same usage
at the other occurrence (previously fixed at line ~288) to ensure consistent
SQLite3-originated ext_val_t handling.
🧹 Nitpick comments (3)
test/tap/tap/utils.h (2)

259-270: @param conn missing from CHECK_EXT_VAL doc block.

The updated macro signature accepts conn but the Doxygen block has no @param conn entry, leaving the parameter undocumented.

📝 Proposed fix
 /**
  * `@brief` Utility one-liner macro to check for query failure on a 'ext_val_t<T>'.
+ * `@param` conn An already opened MYSQL connection used to retrieve the error string.
  * `@param` val The 'ext_val_t<T>' to be checked.
  * `@return` In case of failure, 'EXIT_FAILURE' after logging the error, continues otherwise.
  */
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/tap/utils.h` around lines 259 - 270, The Doxygen block for the
CHECK_EXT_VAL macro is missing documentation for the new conn parameter; update
the comment above the macro to add an "@param conn" entry describing the
connection/context parameter passed to get_ext_val_err (used inside
CHECK_EXT_VAL) so all parameters (conn and val) are documented and consistent
with the macro signature.

272-283: SQ3_CHECK_EXT_VAL doc is an unedited copy of CHECK_EXT_VAL's.

The brief and description give no indication this macro is intended for sq3_query_ext_val-originated values, making it easy to misuse.

📝 Proposed fix
 /**
- * `@brief` Utility one-liner macro to check for query failure on a 'ext_val_t<T>'.
+ * `@brief` Utility one-liner macro to check for query failure on a SQLite3-originated 'ext_val_t<T>'.
+ * `@details` Use this macro when 'val' was produced by 'sq3_query_ext_val'. For MySQL-backed values
+ *  use 'CHECK_EXT_VAL(conn, val)' instead.
  * `@param` val The 'ext_val_t<T>' to be checked.
  * `@return` In case of failure, 'EXIT_FAILURE' after logging the error, continues otherwise.
  */
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/tap/utils.h` around lines 272 - 283, The macro SQ3_CHECK_EXT_VAL has
copied documentation from CHECK_EXT_VAL and doesn't state it's intended only for
ext_val_t values produced by sq3_query_ext_val, which can cause misuse; update
the comment for SQ3_CHECK_EXT_VAL to explicitly state it should be used only
with ext_val_t results originating from sq3_query_ext_val (or other sq3_* query
helpers), describe the behavior (logs via diag and returns EXIT_FAILURE on err),
and mention it uses sq3_get_ext_val_err to format the error; ensure the
@brief/@param/@return wording differentiates this macro from CHECK_EXT_VAL so
callers know the precondition.
lib/PgSQL_Thread.cpp (1)

1305-1306: Disabled heartbeat variable still has an active setter path.

These changes disable getter/refresh exposure, but set_variable() still accepts monitor_replication_lag_use_percona_heartbeat (Line 1715+), making it effectively write-only/no-op. Prefer rejecting updates explicitly until PT-2030 is fixed.

♻️ Suggested guard in setter
-		if (!strcasecmp(name, "monitor_replication_lag_use_percona_heartbeat")) {
-			if (vallen == 0) { // empty string
-				free(variables.monitor_replication_lag_use_percona_heartbeat);
-				variables.monitor_replication_lag_use_percona_heartbeat = strdup((value));
-				return true;
-			}
-			else {
-				re2::RE2::Options* opt2 = new re2::RE2::Options(RE2::Quiet);
-				opt2->set_case_sensitive(false);
-				char* patt = (char*)"`?([a-z\\d_]+)`?\\.`?([a-z\\d_]+)`?";
-				RE2* re = new RE2(patt, *opt2);
-				bool rc = false;
-				rc = RE2::FullMatch(value, *re);
-				delete re;
-				delete opt2;
-				if (rc) {
-					free(variables.monitor_replication_lag_use_percona_heartbeat);
-					variables.monitor_replication_lag_use_percona_heartbeat = strdup(value);
-					return true;
-				}
-				else {
-					proxy_error("%s is an invalid value for %s, not matching regex \"%s\"\n", value, name, patt);
-				}
-			}
-			return false;
-		}
+		if (!strcasecmp(name, "monitor_replication_lag_use_percona_heartbeat")) {
+			proxy_error(
+				"%s is currently disabled until pt-heartbeat PostgreSQL support is fixed (PT-2030)\n",
+				name
+			);
+			return false;
+		}

Also applies to: 1600-1601, 4023-4026

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/PgSQL_Thread.cpp` around lines 1305 - 1306, The disabled heartbeat
variable is still accepted by the setter, so modify set_variable (the function
that handles variable updates and currently accepts
"monitor_replication_lag_use_percona_heartbeat") to explicitly reject or return
an error for that name when PT-2030 is unresolved; locate the branch in
set_variable that checks variable names and add a guard that denies updates for
"monitor_replication_lag_use_percona_heartbeat" (and the other disabled variable
occurrences noted around the earlier mentions) with a clear error/return path so
the variable is not writable while its getter/refresh is disabled.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash`:
- Line 32: The psql invocation uses unquoted variables causing word-splitting
(SC2086); update the command that runs create_test_tables.sql so environment and
flags are properly quoted: quote $PGSQL_PWD in PGPASSWORD assignment if needed,
and wrap $PGSQL_HOST, $PGSQL_PORT, $PGSQL_DB and especially $POSTGRE_SETUP_DIR
in double quotes in the psql command and the input redirection (the line
invoking psql and redirecting < $POSTGRE_SETUP_DIR/create_test_tables.sql) so
paths with spaces are handled safely.

In `@test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh`:
- Line 8: The pipeline that finds the interface uses grep -E with the variable
rsubnet ("ip route show to match \"$rsubnet\" | grep -E \"$rsubnet\" | cut -d' '
-f3"), which treats dots as regex wildcards and can overmatch; replace grep -E
"$rsubnet" with fixed-string matching (grep -F "$rsubnet" or fgrep "$rsubnet")
so the route filtering uses a literal subnet match while leaving the surrounding
command and cut -d' ' -f3 unchanged.
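The overmatch is easy to see with two fake route lines, where only the first contains the literal subnet prefix:

```shell
# Two fake route lines; only the first contains the literal subnet prefix.
routes='10.0.1.0/24 dev eth0
10x0y1z0/24 dev eth1'

# As an ERE, each dot in "10.0.1.0" is a wildcard, so both lines match.
ere_hits=$(printf '%s\n' "$routes" | grep -cE "10.0.1.0")

# As a fixed string (-F), only the literal subnet matches.
fix_hits=$(printf '%s\n' "$routes" | grep -cF "10.0.1.0")

echo "ERE matches: $ere_hits, fixed-string matches: $fix_hits"
```

With -E both lines count as hits; with -F only the literal subnet does, so the interface extraction stays unambiguous.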

---

Duplicate comments:
In `@test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash`:
- Line 20: The probe command using psql currently runs -c"SELECT" which is not
valid SQL and always fails; update the probe in the docker-pgsql-post script
(the PGPASSWORD... psql -h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c"SELECT"
invocation) to use a valid, lightweight query such as -c "SELECT 1" (or "SELECT
1;") so the return code becomes 0 when the server is ready and the wait loop can
succeed.

In `@test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp`:
- Line 203: Replace the incorrect ext_val check macro with the SQLite3-specific
one: use SQ3_CHECK_EXT_VAL when validating the result stored in last_id produced
by sq3_query_ext_val so error reporting uses sqlite3_errstr; update the same
usage at the other occurrence (previously fixed at line ~288) to ensure
consistent SQLite3-originated ext_val_t handling.

---

Nitpick comments:
In `@lib/PgSQL_Thread.cpp`:
- Around line 1305-1306: The disabled heartbeat variable is still accepted by
the setter, so modify set_variable (the function that handles variable updates
and currently accepts "monitor_replication_lag_use_percona_heartbeat") to
explicitly reject or return an error for that name when PT-2030 is unresolved;
locate the branch in set_variable that checks variable names and add a guard
that denies updates for "monitor_replication_lag_use_percona_heartbeat" (and the
other disabled variable occurrences noted around the earlier mentions) with a
clear error/return path so the variable is not writable while its getter/refresh
is disabled.

In `@test/tap/tap/utils.h`:
- Around line 259-270: The Doxygen block for the CHECK_EXT_VAL macro is missing
documentation for the new conn parameter; update the comment above the macro to
add an "@param conn" entry describing the connection/context parameter passed to
get_ext_val_err (used inside CHECK_EXT_VAL) so all parameters (conn and val) are
documented and consistent with the macro signature.
- Around line 272-283: The macro SQ3_CHECK_EXT_VAL has copied documentation from
CHECK_EXT_VAL and doesn't state it's intended only for ext_val_t values produced
by sq3_query_ext_val, which can cause misuse; update the comment for
SQ3_CHECK_EXT_VAL to explicitly state it should be used only with ext_val_t
results originating from sq3_query_ext_val (or other sq3_* query helpers),
describe the behavior (logs via diag and returns EXIT_FAILURE on err), and
mention it uses sq3_get_ext_val_err to format the error; ensure the
@brief/@param/@return wording differentiates this macro from CHECK_EXT_VAL so
callers know the precondition.


📥 Commits

Reviewing files that changed from the base of the PR and between a4ea3b2 and 0b0b679.

⛔ Files ignored due to path filters (3)
  • test/tap/proxysql-ca.pem is excluded by !**/*.pem
  • test/tap/proxysql-cert.pem is excluded by !**/*.pem
  • test/tap/proxysql-key.pem is excluded by !**/*.pem
📒 Files selected for processing (7)
  • lib/PgSQL_Thread.cpp
  • test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
  • test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash
  • test/tap/tap/utils.h
  • test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh
  • test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
  • test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash
  • test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh

@JavierJF JavierJF force-pushed the v3.0-pgsql_monitor_repl_lag branch from 0b0b679 to 70bf127 on February 25, 2026 16:47
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash (1)

15-15: Off-by-one: timeout fires after 301 seconds, not 300.

$WAITED -gt $TIMEOUT with TIMEOUT=300 allows WAITED to reach 301 before the guard triggers (iterations 0 through 300 inclusive = 301 sleeps).

♻️ Proposed fix
-	if [ $WAITED -gt $TIMEOUT ]; then
+	if [ $WAITED -ge $TIMEOUT ]; then
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash` at line 15, The
timeout guard using the test "[ $WAITED -gt $TIMEOUT ]" fires one second late
(allows WAITED==TIMEOUT+1); update the condition to use greater-or-equal so the
timeout fires exactly at TIMEOUT (replace the check in the if statement that
references WAITED and TIMEOUT with a check using -ge), i.e., change the
conditional controlling the timeout guard in docker-proxy-post.bash from -gt to
-ge.
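The boundary behavior is easy to count with a scaled-down guard; count_sleeps is an illustrative helper, not code from the script:

```shell
TIMEOUT=3   # small value so the off-by-one is easy to count

count_sleeps() {   # $1: test(1) operator used as the timeout guard
  w=0
  sleeps=0
  while true; do
    if [ "$w" "$1" "$TIMEOUT" ]; then break; fi
    sleeps=$((sleeps + 1))   # stands in for one probe attempt + 'sleep 1'
    w=$((w + 1))
  done
  echo "$sleeps"
}

echo "with -gt: $(count_sleeps -gt) sleeps"   # TIMEOUT + 1 iterations
echo "with -ge: $(count_sleeps -ge) sleeps"   # exactly TIMEOUT iterations
```

With -gt the loop runs TIMEOUT + 1 times before the guard fires; -ge stops it at exactly TIMEOUT, matching the advertised limit.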
lib/PgSQL_Thread.cpp (1)

333-335: Make disabled heartbeat handling consistent in the setter path too.

monitor_replication_lag_use_percona_heartbeat is disabled in exposure/getters, but set_variable() still accepts and validates it. Consider explicitly rejecting writes to avoid partial/ambiguous behavior.

♻️ Proposed change
@@
-		if (!strcasecmp(name, "monitor_replication_lag_use_percona_heartbeat")) {
-			if (vallen == 0) { // empty string
-				free(variables.monitor_replication_lag_use_percona_heartbeat);
-				variables.monitor_replication_lag_use_percona_heartbeat = strdup((value));
-				return true;
-			}
-			else {
-				re2::RE2::Options* opt2 = new re2::RE2::Options(RE2::Quiet);
-				opt2->set_case_sensitive(false);
-				char* patt = (char*)"`?([a-z\\d_]+)`?\\.`?([a-z\\d_]+)`?";
-				RE2* re = new RE2(patt, *opt2);
-				bool rc = false;
-				rc = RE2::FullMatch(value, *re);
-				delete re;
-				delete opt2;
-				if (rc) {
-					free(variables.monitor_replication_lag_use_percona_heartbeat);
-					variables.monitor_replication_lag_use_percona_heartbeat = strdup(value);
-					return true;
-				}
-				else {
-					proxy_error("%s is an invalid value for %s, not matching regex \"%s\"\n", value, name, patt);
-				}
-			}
-			return false;
-		}
+		if (!strcasecmp(name, "monitor_replication_lag_use_percona_heartbeat")) {
+			proxy_error(
+				"%s is currently disabled until pt-heartbeat PostgreSQL support is fixed (PT-2030)\n",
+				name
+			);
+			return false;
+		}

Also applies to: 1305-1306, 1600-1601, 4023-4026

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/PgSQL_Thread.cpp` around lines 333 - 335, The getter/exposure for
monitor_replication_lag_use_percona_heartbeat was disabled but set_variable()
still accepts/validates it; update set_variable() to explicitly reject writes
for the disabled symbol by adding a top-level check for
"monitor_replication_lag_use_percona_heartbeat" that returns an error/false (and
logs/returns a clear message) instead of validating or applying the value, and
mirror the same explicit rejection logic in the other setter/validation spots
identified (the similar validation paths around the symbols on the other
occurrences) so writes are consistently refused across set_variable() and
related validation functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash`:
- Line 20: Quote all variable expansions and stop passing the password on the
argv: replace unquoted uses like mysql -h$ADMIN_HOST -P$ADMIN_PORT -u$ADMIN_USER
-p$ADMIN_PWD -e"\s" with a quoted-invocation that supplies the password via an
environment variable (e.g. MYSQL_PWD="$ADMIN_PWD" mysql -h"$ADMIN_HOST"
-P"$ADMIN_PORT" -u"$ADMIN_USER" -e "\s") and quote any file paths such as
"$PROXY_CONF_DIR/config.sql"; ensure every $VAR is wrapped in double quotes to
prevent word-splitting/globbing and move the password into MYSQL_PWD (or use
--password and omit the value to prompt) so the secret is not exposed in the
process table.
- Around line 14-25: The loop currently checks while [ $RC -eq 1 ] which exits
on any non-1 failure; change the loop condition to treat every non-zero exit as
"not ready" (e.g., while [ $RC -ne 0 ]) so the loop keeps retrying until mysql
(the mysql -h$ADMIN_HOST -P$ADMIN_PORT ... -e"\s") returns 0 or TIMEOUT is
reached; keep the existing RC, WAITED and TIMEOUT variables and update the
condition to use numeric not-equal to zero semantics and quote variables where
appropriate.

In `@test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh`:
- Around line 5-16: The script computes rsubnet and vintf then runs tc against
$vintf but does not validate that vintf resolved to a single non-empty
interface, so empty or multi-line values cause late/noisy failures; update the
block around vintf, tbf_count and the tc invocation to validate rsubnet and
vintf immediately after resolving them (ensure rsubnet is non-empty and vintf
contains exactly one non-empty line/no whitespace/newlines), log a clear error
and exit non‑zero if validation fails, and only then proceed to compute
tbf_count and run tc qdisc del on the validated $vintf.
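A minimal sketch of the validation the comment asks for; the function name and sample values are illustrative, not taken from the script:

```shell
# Validate that an interface lookup resolved to exactly one non-empty line
# before handing it to tc.
require_single_iface() {
  lines=$(printf '%s' "$1" | grep -c .)
  if [ -z "$1" ] || [ "$lines" -ne 1 ]; then
    echo "[ERROR] expected exactly one interface, got: '$1'" >&2
    return 1
  fi
}

require_single_iface "eth0" && echo "ok: eth0"
require_single_iface "" || echo "rejected: empty result"
require_single_iface "eth0
eth1" || echo "rejected: multi-line result"
```

Failing fast here turns a confusing late tc error ("Cannot find device") into an immediate, attributable one.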

---

Nitpick comments:
In `@lib/PgSQL_Thread.cpp`:
- Around line 333-335: The getter/exposure for
monitor_replication_lag_use_percona_heartbeat was disabled but set_variable()
still accepts/validates it; update set_variable() to explicitly reject writes
for the disabled symbol by adding a top-level check for
"monitor_replication_lag_use_percona_heartbeat" that returns an error/false (and
logs/returns a clear message) instead of validating or applying the value, and
mirror the same explicit rejection logic in the other setter/validation spots
identified (the similar validation paths around the symbols on the other
occurrences) so writes are consistently refused across set_variable() and
related validation functions.

In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash`:
- Line 15: The timeout guard using the test "[ $WAITED -gt $TIMEOUT ]" fires one
second late (allows WAITED==TIMEOUT+1); update the condition to use
greater-or-equal so the timeout fires exactly at TIMEOUT (replace the check in
the if statement that references WAITED and TIMEOUT with a check using -ge),
i.e., change the conditional controlling the timeout guard in
docker-proxy-post.bash from -gt to -ge.


📥 Commits

Reviewing files that changed from the base of the PR and between 0b0b679 and 70bf127.

⛔ Files ignored due to path filters (3)
  • test/tap/proxysql-ca.pem is excluded by !**/*.pem
  • test/tap/proxysql-cert.pem is excluded by !**/*.pem
  • test/tap/proxysql-key.pem is excluded by !**/*.pem
📒 Files selected for processing (7)
  • lib/PgSQL_Thread.cpp
  • test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
  • test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash
  • test/tap/tap/utils.h
  • test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh
  • test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
  • test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
  • test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash

@JavierJF JavierJF force-pushed the v3.0-pgsql_monitor_repl_lag branch from 70bf127 to 26c6eab on February 27, 2026 09:47
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (3)
test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash (2)

14-25: ⚠️ Potential issue | 🟠 Major

Fix readiness-loop exit condition and timeout boundary.

On Lines 14-25, the loop currently retries only when RC == 1. Other non-zero failures (e.g., bad options, missing client, transient errors) break out early and continue before readiness is confirmed. Also, -gt causes an off-by-one timeout.

Suggested patch
-while [ $RC -eq 1 ]; do
-	if [ $WAITED -gt $TIMEOUT ]; then
+while [ "$RC" -ne 0 ]; do
+	if [ "$WAITED" -ge "$TIMEOUT" ]; then
 		echo "[ERROR] Timeout of $TIMEOUT seconds reached while connecting to ProxySQL"
 		exit 1
 	else
 		printf "."
-		mysql -h$ADMIN_HOST -P$ADMIN_PORT -u$ADMIN_USER -p$ADMIN_PWD -e"\s" > /dev/null 2>&1
+		mysql -h"$ADMIN_HOST" -P"$ADMIN_PORT" -u"$ADMIN_USER" -p"$ADMIN_PWD" -e "\s" > /dev/null 2>&1
 		RC=$?
 		WAITED=$((WAITED+1))
 		sleep 1
 	fi
 done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash` around lines 14 - 25,
The readiness loop only retries when RC == 1 and uses -gt for timeout, which
misses other non-zero failures and has an off-by-one; change the loop to retry
while the mysql command exit code is non-zero (e.g., loop on RC -ne 0 or use an
infinite loop and break on RC == 0) and change the timeout check to use -ge
against WAITED and TIMEOUT so the loop exits exactly when WAITED reaches
TIMEOUT; update the block that sets RC (the mysql call) and the conditional that
tests WAITED to use these new checks (variables: RC, WAITED, TIMEOUT,
ADMIN_HOST/PORT/USER/PWD) so readiness only returns on success and timeout
boundary is correct.

20-32: ⚠️ Potential issue | 🟡 Minor

Avoid exposing credentials in argv/logs and quote all variable expansions.

Lines 20 and 32 still pass password via -p..., and Line 31 enables set -x, which can leak secrets into CI logs. Variable expansions are also unquoted in several places.

Suggested patch
-		mysql -h$ADMIN_HOST -P$ADMIN_PORT -u$ADMIN_USER -p$ADMIN_PWD -e"\s" > /dev/null 2>&1
+		MYSQL_PWD="$ADMIN_PWD" mysql -h"$ADMIN_HOST" -P"$ADMIN_PORT" -u"$ADMIN_USER" -e "\s" > /dev/null 2>&1
@@
-set -x
-mysql --prompt="admin> " -u$ADMIN_USER -p$ADMIN_PWD --table -h$ADMIN_HOST -P$ADMIN_PORT < "$PROXY_CONF_DIR"/config.sql
-set +x
+MYSQL_PWD="$ADMIN_PWD" mysql --prompt="admin> " -u"$ADMIN_USER" --table -h"$ADMIN_HOST" -P"$ADMIN_PORT" < "$PROXY_CONF_DIR/config.sql"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash` around lines 20 - 32,
The script is leaking credentials via argv and shell tracing and leaves
variables unquoted; update the two mysql calls to avoid passing the password on
the command line by injecting it via the MYSQL_PWD environment variable (e.g.
MYSQL_PWD="$ADMIN_PWD" mysql -h"$ADMIN_HOST" -P"$ADMIN_PORT" -u"$ADMIN_USER" -e
"\s" ...) and similarly for the config apply (MYSQL_PWD="$ADMIN_PWD" mysql
--prompt="admin> " -u"$ADMIN_USER" -h"$ADMIN_HOST" -P"$ADMIN_PORT" --table <
"$PROXY_CONF_DIR/config.sql"), quote all variable expansions ("$ADMIN_HOST",
"$ADMIN_PORT", "$ADMIN_USER", "$ADMIN_PWD", "$PROXY_CONF_DIR"), and disable
shell tracing around sensitive commands (remove global set -x or wrap with set
+x before the sensitive mysql invocations and restore set -x after if needed) so
secrets are not exposed in CI logs.
test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash (1)

20-20: ⚠️ Potential issue | 🔴 Critical

Probe query is still invalid and will force timeout.

Line 20 uses -c"SELECT", which is invalid SQL for psql. The readiness loop can never succeed and will eventually hit timeout.

🔧 Proposed fix
-		PGPASSWORD=$PGSQL_PWD ON_ERROR_STOP=1 psql -h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c"SELECT" > /dev/null 2>&1
+		PGPASSWORD=$PGSQL_PWD psql -v ON_ERROR_STOP=1 -h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c "SELECT 1;" > /dev/null 2>&1
#!/bin/bash
# Verify whether the invalid readiness probe is still present.
# Expected result after fix: first command returns no matches; second command shows SELECT 1 probe.
rg -nP 'psql\b.*-c\s*"SELECT"\b' test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
rg -nP 'psql\b.*-c\s*"SELECT 1;?"' test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash` at line 20, The psql
readiness probe uses an invalid SQL string (-c"SELECT") so the loop never
succeeds; update the command that contains PGPASSWORD, ON_ERROR_STOP and the
psql invocation (psql -h$PGSQL_HOST -p$PGSQL_PORT -U$PGSQL_DB -c"SELECT") to use
a valid probe such as -c"SELECT 1" (or -c"SELECT 1;") while preserving the
existing environment vars and redirection, so the readiness check can succeed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash`:
- Line 20: The psql readiness probe uses an invalid SQL string (-c"SELECT") so
the loop never succeeds; update the command that contains PGPASSWORD,
ON_ERROR_STOP and the psql invocation (psql -h$PGSQL_HOST -p$PGSQL_PORT
-U$PGSQL_DB -c"SELECT") to use a valid probe such as -c"SELECT 1" (or -c"SELECT
1;") while preserving the existing environment vars and redirection, so the
readiness check can succeed.

In `@test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash`:
- Around line 14-25: The readiness loop only retries when RC == 1 and uses -gt
for timeout, which misses other non-zero failures and has an off-by-one; change
the loop to retry while the mysql command exit code is non-zero (e.g., loop on
RC -ne 0 or use an infinite loop and break on RC == 0) and change the timeout
check to use -ge against WAITED and TIMEOUT so the loop exits exactly when
WAITED reaches TIMEOUT; update the block that sets RC (the mysql call) and the
conditional that tests WAITED to use these new checks (variables: RC, WAITED,
TIMEOUT, ADMIN_HOST/PORT/USER/PWD) so readiness only returns on success and
timeout boundary is correct.
- Around line 20-32: The script is leaking credentials via argv and shell
tracing and leaves variables unquoted; update the two mysql calls to avoid
passing the password on the command line by injecting it via the MYSQL_PWD
environment variable (e.g. MYSQL_PWD="$ADMIN_PWD" mysql -h"$ADMIN_HOST"
-P"$ADMIN_PORT" -u"$ADMIN_USER" -e "\s" ...) and similarly for the config apply
(MYSQL_PWD="$ADMIN_PWD" mysql --prompt="admin> " -u"$ADMIN_USER" -h"$ADMIN_HOST"
-P"$ADMIN_PORT" --table < "$PROXY_CONF_DIR/config.sql"), quote all variable
expansions ("$ADMIN_HOST", "$ADMIN_PORT", "$ADMIN_USER", "$ADMIN_PWD",
"$PROXY_CONF_DIR"), and disable shell tracing around sensitive commands (remove
global set -x or wrap with set +x before the sensitive mysql invocations and
restore set -x after if needed) so secrets are not exposed in CI logs.
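The readiness-loop fix described above (retry on any non-zero exit code, `-ge` timeout bound) can be sketched as follows; `probe` is a hypothetical stand-in for the real mysql call, simulated here to succeed on its third attempt:

```shell
#!/bin/bash
# Sketch: retry while the probe exits non-zero (not only RC == 1), and
# time out when WAITED reaches TIMEOUT (-ge, so the boundary is exact).
TIMEOUT=5
WAITED=0
ATTEMPTS=0
probe() {
	ATTEMPTS=$((ATTEMPTS + 1))
	[ "$ATTEMPTS" -ge 3 ]   # simulate the server becoming ready
}
while true; do
	probe; RC=$?
	[ "$RC" -eq 0 ] && break
	if [ "$WAITED" -ge "$TIMEOUT" ]; then
		echo "readiness timeout after ${WAITED}s" >&2
		exit 1
	fi
	sleep 1
	WAITED=$((WAITED + 1))
done
echo "ready after $ATTEMPTS attempts"
```

Looping on `RC -ne 0` also covers probe failures such as connection refused or authentication errors, which the original `RC == 1` check would have treated as success.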

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70bf127 and 26c6eab.

⛔ Files ignored due to path filters (3)
  • test/tap/proxysql-ca.pem is excluded by !**/*.pem
  • test/tap/proxysql-cert.pem is excluded by !**/*.pem
  • test/tap/proxysql-key.pem is excluded by !**/*.pem
📒 Files selected for processing (7)
  • lib/PgSQL_Thread.cpp
  • test/tap/groups/pgsql-repl/bin/docker-pgsql-post.bash
  • test/tap/groups/pgsql-repl/bin/docker-proxy-post.bash
  • test/tap/tap/utils.h
  • test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh
  • test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
  • test/tap/tests/test_query_rules_fast_routing_algorithm-t.cpp
✅ Files skipped from review due to trivial changes (1)
  • test/tap/tests/test_pgsql_replication_lag/setup_primary_vintf_throttle.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/tap/tests/test_pgsql_replication_lag/delete_primary_vintf_throttle.sh
  • test/tap/tap/utils.h

…monitoring

- Fix SQL injection in pt-heartbeat query concatenation.
- Fix copy-paste errors between mysql_thread_ and pgsql_thread_ variables.
- Restore monitor_replication_lag_group_by_host logic.
Signed-off-by: René Cannaò <rene@proxysql.com>
- Fallback to 'cmake' if 'cmake3' is not found.
- Fixes build failures on systems where CMake 3.x is provided as 'cmake' (Ubuntu/Debian/Fedora).
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
include/proxysql_structs.h (1)

1536-1541: Keep the PgSQL monitor externs in the same order as the definitions.

The extern block now orders these symbols differently than the PROXYSQL_EXTERN definition block around pgsql_thread___monitor_read_only_max_timeout_count. Re-aligning them would make future sync checks less error-prone.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/proxysql_structs.h` around lines 1536 - 1541, Reorder the extern
declarations so they match the same order as their PROXYSQL_EXTERN definitions:
ensure the sequence of pgsql_thread___monitor_replication_lag_interval,
pgsql_thread___monitor_replication_lag_interval_window,
pgsql_thread___monitor_replication_lag_timeout,
pgsql_thread___monitor_replication_lag_count,
pgsql_thread___monitor_replication_lag_use_percona_heartbeat, and
pgsql_thread___monitor_read_only_max_timeout_count in include/proxysql_structs.h
matches the definition block order; adjust the extern block ordering to be
identical to the PROXYSQL_EXTERN definition block to keep symbols (e.g.,
pgsql_thread___monitor_replication_lag_interval,
pgsql_thread___monitor_replication_lag_count,
pgsql_thread___monitor_read_only_max_timeout_count) aligned for future sync
checks.
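The extern/definition ordering check above can be done mechanically by diffing the two symbol sequences. The miniature header below is hypothetical (its externs are deliberately reordered); for the real check both commands would point at include/proxysql_structs.h:

```shell
#!/bin/bash
# Sketch: diff the symbol order of an extern block against the
# PROXYSQL_EXTERN definition block in a sample header.
cat > /tmp/structs_sample.h <<'EOF'
PROXYSQL_EXTERN int pgsql_thread___monitor_replication_lag_interval;
PROXYSQL_EXTERN int pgsql_thread___monitor_replication_lag_timeout;
extern int pgsql_thread___monitor_replication_lag_timeout;
extern int pgsql_thread___monitor_replication_lag_interval;
EOF
if diff <(awk '/^extern int/ {print $3}' /tmp/structs_sample.h) \
        <(awk '/^PROXYSQL_EXTERN int/ {print $3}' /tmp/structs_sample.h) > /dev/null
then
	ORDER_OK=1
else
	ORDER_OK=0
fi
echo "ORDER_OK=$ORDER_OK"   # 0: the sample externs are out of order
```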
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/PgSQL_HostGroups_Manager.cpp`:
- Line 2719: The build fails because
pgsql_thread___monitor_replication_lag_group_by_host is used but not
declared/defined; add an extern declaration for
pgsql_thread___monitor_replication_lag_group_by_host (same type as the MySQL
counterpart mysql_thread___monitor_replication_lag_group_by_host) alongside the
existing thread flags in the header where those flags are declared, then add a
corresponding definition and initialization in the source file where the other
PGSQL thread globals are defined/initialized (follow the pattern used for
mysql_thread___monitor_replication_lag_group_by_host and its initialization in
the PgSQL thread init code) so the symbol is available at link time.

In `@lib/PgSQL_Monitor.cpp`:
- Around line 1849-1853: The replication-lag action currently passes a hardcoded
hostgroup id 0 in the is_task_success branch; update the call site in
PgSQL_Monitor.cpp (inside the is_task_success block that reads op_st.op_result
and repl_lag_res_t) to pass the real hostgroup id instead of 0 by plumbing the
hostgroup_id through the task state/repl_srvs/mon_srv_t or by reading
hostgroup_id from the existing task/mon_srv_t structure available in that
context, then call PgHGM->replication_lag_action with { hostgroup_id, srv.addr,
srv.port, op_result->val, false } so the correct hostgroup is used when
pgsql_thread___monitor_replication_lag_group_by_host == false.
- Around line 665-674: In the repl_lag handling branch (st.task.type ==
task_type_t::repl_lag) avoid calling std::atoi on a possibly NULL field: use
PQgetisnull(res, 0, 0) to detect NULL before calling PQgetvalue, and handle the
NULL case (e.g., set an explicit sentinel like -1 or an "unknown" value) instead
of parsing; then pass that sentinel into set_finish_st via the repl_lag_res_t as
before (so replace the direct std::atoi(PQgetvalue(...)) call with a PQgetisnull
check and conditional assignment to value prior to creating the op_result_t).

In `@test/tap/groups/groups.json`:
- Around line 302-303: There is a duplicate JSON key
"unit-strip_schema_from_query-t" in the groups JSON object; remove one of the
two entries so the key appears only once (keep the intended mapping to
"unit-tests-g1"), ensuring the resulting JSON is still valid (no trailing
commas) and passes linting so TAP group resolution works correctly.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cdc806cb-395a-49ae-8d7a-7fbf7601e40a

📥 Commits

Reviewing files that changed from the base of the PR and between 26c6eab and 16f8837.

📒 Files selected for processing (8)
  • include/PgSQL_Thread.h
  • include/proxysql_structs.h
  • lib/PgSQL_HostGroups_Manager.cpp
  • lib/PgSQL_Monitor.cpp
  • lib/PgSQL_Thread.cpp
  • test/tap/groups/groups.json
  • test/tap/tap/Makefile
  • test/tap/tap/utils.h
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/tap/tap/utils.h
  • include/PgSQL_Thread.h
  • test/tap/tap/Makefile

const int current_replication_lag = std::get<PgSQL_REPLICATION_LAG_SERVER_T::PG_RLS_CURRENT_REPLICATION_LAG>(server);

if (mysql_thread___monitor_replication_lag_group_by_host == false) {
if (pgsql_thread___monitor_replication_lag_group_by_host == false) {
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -e

# Expect at least one declaration and one definition for the new PgSQL thread-local.
rg -nP '(__thread|extern)\s+int\s+pgsql_thread___monitor_replication_lag_group_by_host\b' .

echo
echo "All references:"
rg -nP '\bpgsql_thread___monitor_replication_lag_group_by_host\b' .

Repository: sysown/proxysql

Length of output: 41


🏁 Script executed:

sed -n '2715,2725p' lib/PgSQL_HostGroups_Manager.cpp



🏁 Script executed:

rg -nP '\bmysql_thread___monitor_replication_lag_group_by_host\b' . | head -20



Add declaration and definition for pgsql_thread___monitor_replication_lag_group_by_host.

The PgSQL variant is used at line 2719 but is missing from all declaration and definition locations. The MySQL variant exists in include/proxysql_structs.h (lines 1380, 1708) and is initialized in lib/PgSQL_Thread.cpp. The PgSQL variant must be added to these same locations, or the build will fail with an undefined reference error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/PgSQL_HostGroups_Manager.cpp` at line 2719, The build fails because
pgsql_thread___monitor_replication_lag_group_by_host is used but not
declared/defined; add an extern declaration for
pgsql_thread___monitor_replication_lag_group_by_host (same type as the MySQL
counterpart mysql_thread___monitor_replication_lag_group_by_host) alongside the
existing thread flags in the header where those flags are declared, then add a
corresponding definition and initialization in the source file where the other
PGSQL thread globals are defined/initialized (follow the pattern used for
mysql_thread___monitor_replication_lag_group_by_host and its initialization in
the PgSQL thread init code) so the symbol is available at link time.

Comment on lines +665 to +674
} else if (st.task.type == task_type_t::repl_lag) {
const char* value_str { PQgetvalue(res, 0, 0) };
int32_t value { std::atoi(value_str) };

set_finish_st(st, ASYNC_QUERY_END,
op_result_t {
new repl_lag_res_t { value },
[] (void* v) { delete static_cast<repl_lag_res_t*>(v); }
}
);
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the exact code mentioned
sed -n '665,674p' lib/PgSQL_Monitor.cpp



🏁 Script executed:

# Search for REPLICATION_LAG_QUERY definition
rg "REPLICATION_LAG_QUERY" -n



🏁 Script executed:

# Check context around lines 665-674
sed -n '660,680p' lib/PgSQL_Monitor.cpp



🏁 Script executed:

# Search for PQgetisnull usage in the codebase to see if it's already used elsewhere
rg "PQgetisnull" -n



🏁 Script executed:

# Search for similar query result handling patterns
rg "PQgetvalue" -n -B2 -A2



🏁 Script executed:

# Check the REPLICATION_LAG_QUERY definition
sed -n '45,70p' lib/PgSQL_Monitor.cpp



🏁 Script executed:

# Check if there's similar NULL handling issue with readonly
sed -n '650,670p' lib/PgSQL_Monitor.cpp



🏁 Script executed:

# Search for set_failed_st function to verify if the proposed fix approach exists
rg "set_failed_st" -n -B2 -A2



Check for NULL replication-lag result before parsing.

REPLICATION_LAG_QUERY can return NULL when pg_last_xact_replay_timestamp() is NULL (standby with no replay history). Calling PQgetvalue(res, 0, 0) on a NULL field and passing it directly to std::atoi() results in undefined behavior—typically converting to 0, which masks an unknown lag as a healthy state and can incorrectly unshun a lagged server. Use PQgetisnull() to check before parsing.

Proposed fix
 				} else if (st.task.type == task_type_t::repl_lag) {
-					const char* value_str { PQgetvalue(res, 0, 0) };
-					int32_t value { std::atoi(value_str) };
-
-					set_finish_st(st, ASYNC_QUERY_END,
-						op_result_t {
-							new repl_lag_res_t { value },
-							[] (void* v) { delete static_cast<repl_lag_res_t*>(v); }
-						}
-					);
+					if (PQgetisnull(res, 0, 0)) {
+						set_failed_st(
+							st,
+							ASYNC_QUERY_FAILED,
+							mf_unique_ptr<char>(strdup("NULL replication_lag"))
+						);
+					} else {
+						const char* value_str { PQgetvalue(res, 0, 0) };
+						int32_t value { static_cast<int32_t>(std::strtol(value_str, nullptr, 10)) };
+
+						set_finish_st(st, ASYNC_QUERY_END,
+							op_result_t {
+								new repl_lag_res_t { value },
+								[] (void* v) { delete static_cast<repl_lag_res_t*>(v); }
+							}
+						);
+					}
 				} else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/PgSQL_Monitor.cpp` around lines 665 - 674, In the repl_lag handling
branch (st.task.type == task_type_t::repl_lag) avoid calling std::atoi on a
possibly NULL field: use PQgetisnull(res, 0, 0) to detect NULL before calling
PQgetvalue, and handle the NULL case (e.g., set an explicit sentinel like -1 or
an "unknown" value) instead of parsing; then pass that sentinel into
set_finish_st via the repl_lag_res_t as before (so replace the direct
std::atoi(PQgetvalue(...)) call with a PQgetisnull check and conditional
assignment to value prior to creating the op_result_t).

Comment on lines +302 to +303
"unit-strip_schema_from_query-t": [ "unit-tests-g1" ],
"test_pgsql_replication_lag-t": [ "pgsql-repl" ],
⚠️ Potential issue | 🔴 Critical

Remove the duplicated unit-strip_schema_from_query-t mapping.

Line 302 re-declares a key that already exists at Line 305. This makes the JSON object ambiguous and currently fails lint, which can block TAP group resolution in CI. Keep only one declaration of unit-strip_schema_from_query-t.

Suggested fix
-  "unit-strip_schema_from_query-t": [ "unit-tests-g1" ],
   "test_pgsql_replication_lag-t": [ "pgsql-repl" ],
   "mcp_query_rules-t": [ "ai-g1" ],
   "unit-strip_schema_from_query-t": [ "unit-tests-g1" ]
🧰 Tools
🪛 Biome (2.4.4)

[error] 302-302: The key unit-strip_schema_from_query-t was already declared.

(lint/suspicious/noDuplicateObjectKeys)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/groups/groups.json` around lines 302 - 303, There is a duplicate
JSON key "unit-strip_schema_from_query-t" in the groups JSON object; remove one
of the two entries so the key appears only once (keep the intended mapping to
"unit-tests-g1"), ensuring the resulting JSON is still valid (no trailing
commas) and passes linting so TAP group resolution works correctly.
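A quick way to catch duplicate keys like this before CI is a textual scan, since most JSON parsers silently keep the last value. The sample file below is hypothetical; in practice the grep would run against test/tap/groups/groups.json:

```shell
#!/bin/bash
# Sketch: flag duplicate JSON keys by extracting every quoted key
# (a quoted string followed by a colon) and reporting repeats.
cat > /tmp/groups_sample.json <<'EOF'
{
  "unit-strip_schema_from_query-t": [ "unit-tests-g1" ],
  "test_pgsql_replication_lag-t": [ "pgsql-repl" ],
  "unit-strip_schema_from_query-t": [ "unit-tests-g1" ]
}
EOF
DUPES=$(grep -o '"[^"]*"[[:space:]]*:' /tmp/groups_sample.json | sort | uniq -d)
echo "$DUPES"
```

An empty result means no key appears twice; any output names the offending keys.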

- Add hostgroup_id to mon_srv_t and ext_srvs.
- Update fetch_updated_conf queries to return hostgroup_id.
- Use real hostgroup_id in replication_lag_action call.
@sonarqubecloud

sonarqubecloud bot commented Mar 6, 2026

Quality Gate failed

Failed conditions
3 Security Hotspots
E Security Rating on New Code (required ≥ A)
E Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud


@renecannao renecannao merged commit c60f390 into v3.0 Mar 6, 2026
11 of 15 checks passed