[Bug] big individuallyDeletedMessages causes message dispatching hangs #25028

@YanshuoH

Search before reporting

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

User environment

  • broker version: 4.0.8
  • broker os: Linux pulsar-broker-1a-0 6.12.40-64.114.amzn2023.aarch64 #1 SMP Tue Aug 26 05:25:54 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
  • java: openjdk version "17.0.12" 2024-07-16
  • client: golang
  • client version: 0.17.0
  • client os: same as broker
  • client java version: N/A

Issue Description

One of our scenarios is checking a user's payment after a variable delay, anywhere from 10s to 1h.
My observation is that when individuallyDeletedMessages becomes large (100,000+ ranges, with managedLedgerMaxUnackedRangesToPersist also set to 100,000), message dispatching degrades: dispatch is very slow and most messages are never dispatched.
In the topic's internal stats, I see the following:

      "numberOfEntriesSinceFirstNotAckedMessage": 751170,
      "totalNonContiguousDeletedMessagesRange": 105911,
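
To make the comparison against the configured limit concrete, here is a minimal Go sketch that parses the two fields above and checks them against managedLedgerMaxUnackedRangesToPersist. It assumes the JSON for a single cursor has already been extracted from the internal-stats output; the type `cursorStats` and the function `exceedsPersistLimit` are hypothetical names for illustration, not part of any Pulsar API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cursorStats holds only the two fields quoted in this issue.
// All other fields of the cursor object are ignored.
type cursorStats struct {
	EntriesSinceFirstNotAcked  int64 `json:"numberOfEntriesSinceFirstNotAckedMessage"`
	NonContiguousDeletedRanges int64 `json:"totalNonContiguousDeletedMessagesRange"`
}

// exceedsPersistLimit reports whether the cursor tracks more ack ranges than
// maxRangesToPersist (the value of managedLedgerMaxUnackedRangesToPersist).
func exceedsPersistLimit(raw []byte, maxRangesToPersist int64) (bool, cursorStats, error) {
	var s cursorStats
	if err := json.Unmarshal(raw, &s); err != nil {
		return false, s, err
	}
	return s.NonContiguousDeletedRanges > maxRangesToPersist, s, nil
}

func main() {
	// Values copied from the stats in this report.
	raw := []byte(`{
		"numberOfEntriesSinceFirstNotAckedMessage": 751170,
		"totalNonContiguousDeletedMessagesRange": 105911
	}`)
	over, s, err := exceedsPersistLimit(raw, 100000)
	if err != nil {
		panic(err)
	}
	fmt.Printf("ranges=%d entriesSinceFirstNotAcked=%d overLimit=%v\n",
		s.NonContiguousDeletedRanges, s.EntriesSinceFirstNotAcked, over)
}
```

With the numbers reported here, the range count (105,911) is above the configured limit of 100,000.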

There are no other error messages on either the client or the server side.

There is a similar issue, #23200, but we are using the Shared subscription type.

Error messages

The only suspicious messages I found are on the client side, where the client repeatedly reconnects to the broker:

INFO[0960] Connecting to broker                          remote_addr="pulsar://pulsar-broker.pulsar1.svc.cluster.local:6650"
INFO[0960] TCP connection established                    local_addr="10.120.147.140:56018" remote_addr="pulsar://pulsar-broker.pulsar1.svc.cluster.local:6650"
INFO[0960] Connection is ready                           local_addr="10.120.147.140:56018" remote_addr="pulsar://pulsar-broker.pulsar1.svc.cluster.local:6650"


On the server side, the broker performed load shedding.

Because turning on DEBUG-level logging is very costly, I have not been able to capture debug-level messages.

Reproducing the issue

I've written two programs that reproduce the issue:
A producer that sends messages with a variable delivery delay (from 10s to 1h).
A consumer that receives the messages.

Wait for messages to accumulate to the expected number; the consumer then hangs and receives very few messages.

Additional information

This may be related to the managedLedgerMaxUnackedRangesToPersist setting, but for our usage pattern we cannot keep increasing it indefinitely, because the message volume keeps growing.
I've also noticed that when individuallyDeletedMessages is large, every consumer reconnect causes a CPU spike on both the broker and ZooKeeper. I assume this is because Pulsar is recomputing which messages should be dispatched.
Is there a way to optimize or tune this? Or is this not the correct way to use Pulsar?

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

    Labels

    type/bug: The PR fixed a bug or issue reported a bug