fix: failover reconnect + configurable max attempts #48

Merged
sparkison merged 2 commits into m3ue:dev from gabelul:fix/failover-input-error-reconnect
Mar 28, 2026

gabelul commented Mar 26, 2026

Summary

Two fixes for failover reliability:

  1. Client disconnects during transcode input error failover — missing is_failover = True flag meant the HTTP response closed instead of seamlessly switching to the backup URL
  2. Hardcoded 3-attempt failover limit — streams with many failover channels (e.g. 11 across 2 providers) would exhaust all attempts on one dead provider without ever reaching the healthy one

What changed

Fix 1: Keep client connected during failover (commit 1)

The input_failed detection path during active streaming breaks out of the inner loop without setting is_failover = True. The outer loop sees it's False, exits the generator, and kills the HTTP response. The failover_event path had the flag set correctly — this just matches that behavior.
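
The control flow described above can be sketched roughly as follows. This is a simplified model, not the proxy's actual code: `stream_with_failover`, the `sources` list, the `set_flag` toggle, and the use of an empty chunk to stand in for an input error are all assumptions made for illustration. Only the `is_failover` flag and the inner/outer loop structure come from the description.

```python
from typing import Iterator, List


def stream_with_failover(sources: List[List[bytes]], set_flag: bool) -> Iterator[bytes]:
    """Yield chunks from sources[0]; on an input error, move to the next source.

    Hypothetical model: each source is a list of chunks, and an empty chunk
    (b"") stands in for a transcode input error. `set_flag` toggles the fix.
    """
    index = 0
    while index < len(sources):           # outer loop: one pass per source URL
        is_failover = False
        for chunk in sources[index]:      # inner loop: active streaming
            input_failed = chunk == b""
            if input_failed:
                if set_flag:
                    is_failover = True    # the assignment that was missing
                break
            yield chunk
        if is_failover:
            index += 1                    # reconnect to the next failover source
            continue
        break                             # generator ends, HTTP response closes
```

With `sources = [[b"a", b""], [b"b", b"c"]]`, calling the function with `set_flag=False` yields only `b"a"` before the generator returns, while `set_flag=True` continues into the second source and yields all three chunks.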

Fix 2: Configurable max failover attempts (commit 2)

New MAX_FAILOVER_ATTEMPTS setting in config (env var):

  • 0 (default): try all available failover URLs before giving up
    • Static failover list: uses the actual list length as the limit
    • Resolver-based: effectively unlimited, lets the resolver decide when to stop (returns null)
  • Any positive number: cap at that many attempts (old behavior was hardcoded to 3)

Applied to both direct streaming and transcoded streaming paths.
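
The rules above could be expressed as a small helper like the one below. It is a sketch under assumptions: the function names, the `uses_resolver` parameter, and the way the env var is parsed are hypothetical, and the "no failovers configured" case from the test plan is not modeled here.

```python
import math
import os
from typing import Optional, Sequence


def configured_max_attempts() -> int:
    """Read MAX_FAILOVER_ATTEMPTS from the environment (assumed parsing; default 0)."""
    return int(os.environ.get("MAX_FAILOVER_ATTEMPTS", "0"))


def effective_max_attempts(configured: int,
                           failover_urls: Optional[Sequence[str]],
                           uses_resolver: bool) -> float:
    """Return the attempt limit implied by the setting.

    - configured > 0: cap at that many attempts
    - configured == 0 with a resolver: effectively unlimited (math.inf)
    - configured == 0 with a static list: the length of the list
    """
    if configured > 0:
        return configured
    if uses_resolver:
        return math.inf
    return len(failover_urls or [])
```

Using `math.inf` for the resolver case keeps an `attempt < limit` comparison valid without a special branch in the calling loop.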

Tested in production

Deployed to a live setup with 2 IPTV providers (Trex + Strong), 3 concurrent users, channels with up to 11 failover URLs across both providers.

Fix 1 — before vs after:

BEFORE:
18:15:37 - Failover triggered for stream ae85426a...
18:15:37 - Last client disconnected  ← connection killed

AFTER:
18:30:28 - Failover triggered for stream ae85426a...
18:30:28 - Starting failover attempt 1/3 for client...  ← stays alive
18:30:51 - Starting failover attempt 2/3  ← still connected
18:31:53 - Starting failover attempt 3/3  ← still connected

Fix 2 — provider outage scenario:
With hardcoded max of 3, when Strong went down the proxy burned all 3 attempts on dead Strong URLs and never reached the working Trex ones. With the new default (try all), it cycles through every failover until it finds a live stream.
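
To make the outage scenario concrete, here is a small simulation. The URLs, the `example` hostnames, the 5/6 split between providers, and the `first_live_url` helper are all hypothetical; the only details taken from the description are the 11 failover URLs across two providers, the old cap of 3, and the new "try all" default.

```python
from typing import List, Optional, Set
from urllib.parse import urlparse


def first_live_url(urls: List[str], dead_hosts: Set[str], max_attempts: int) -> Optional[str]:
    """Walk the failover list in order and return the first URL on a live host.

    max_attempts == 0 means "try every URL"; a positive value caps the walk.
    """
    limit = max_attempts if max_attempts > 0 else len(urls)
    for url in urls[:limit]:
        if urlparse(url).hostname not in dead_hosts:
            return url
    return None


# Hypothetical ordering: the dead provider's URLs come first in the list.
urls = [f"http://strong.example/{i}" for i in range(5)] + \
       [f"http://trex.example/{i}" for i in range(6)]
dead = {"strong.example"}

capped = first_live_url(urls, dead, max_attempts=3)    # None: all 3 attempts hit dead URLs
try_all = first_live_url(urls, dead, max_attempts=0)   # reaches the first Trex URL
```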

Test plan

  • Transcoded stream: kill primary source → verify client stays connected through failover
  • Static failover URLs: verify all URLs are tried before giving up
  • Resolver-based failover: verify proxy keeps trying until resolver returns null
  • MAX_FAILOVER_ATTEMPTS=5: verify it stops at 5
  • No failovers configured: verify default behavior unchanged (max 3)

The input_failed detection path during active streaming (line ~3114)
breaks out of the inner while loop without setting is_failover = True.
The outer loop then sees is_failover is False and breaks entirely,
closing the HTTP response and disconnecting the client.

The failover_event path (line ~3152) correctly sets is_failover = True
before breaking, allowing the outer loop to continue and reconnect
the client to the failover URL seamlessly.

Without this fix, every transcode_runtime_input_error failover kills
the client connection even though the proxy successfully resolves a
failover URL — the client never receives data from the backup stream.

gabelul commented Mar 26, 2026

Tested in production — this one's a real fix ✓

Deployed the patched proxy to my live setup (Hetzner dedicated, 2 IPTV providers, 3 concurrent users) and the difference is night and day.

The problem was brutal. Every time a transcoded stream hit an input error, the proxy would correctly resolve the failover URL, log FAILOVER_TRIGGERED, and then... drop the client anyway. The TV would freeze, the user had to manually switch channels and come back. Completely defeated the purpose of having failovers configured.

Root cause: The input_failed detection path during active streaming was missing is_failover = True before the break. The outer loop saw is_failover was False, hit the else: break, and the generator returned — killing the HTTP response. Meanwhile, the failover_event path (triggered by the API) had the flag set correctly and worked fine. Classic one-liner that's invisible until you trace the exact code path.

Before the fix (from my actual logs):

18:15:36 - Transcoding process encountered input error, triggering failover
18:15:37 - Failover resolver returned URL: http://smarter8k.ru/...
18:15:37 - Failover triggered for stream ae85426a...
18:15:37 - Last client disconnected from stream ae85426a  ← game over
18:15:37 - Cleaned up client: client_d85dc337...

After the fix:

18:30:27 - Transcoding process encountered input error, triggering failover
18:30:28 - Failover resolver returned URL: http://smarter8k.ru/...
18:30:28 - Failover triggered for stream ae85426a...
18:30:28 - Starting failover attempt 1/3 for client client_d85dc337...  ← stays alive!
18:30:51 - input error again → failover attempt 2/3 → client still connected
18:31:53 - input error again → failover attempt 3/3 → client still connected

The stream bounced between both providers three times in under two minutes and the TV never dropped. User saw a brief quality hiccup during switches but the stream kept playing. That's exactly how failover should work.

Tested with both the advanced failover resolver (calling back to m3u-editor for capacity checks) and with the providers flipping between Trex and Strong sources. Solid. I actually discovered this while watching a live football match: I kept hitting the issue, so I started working on it. This seems to have sorted it, so nice.

Hardcoded limit of 3 failover attempts meant streams with many failover
channels (e.g. 11 across 2 providers) would exhaust attempts on one dead
provider without ever reaching the healthy one.

New behavior:
- MAX_FAILOVER_ATTEMPTS=0 (default): try all available failover URLs
  - Static failover list: uses len(failover_urls) as the limit
  - Resolver-based: effectively unlimited, lets the resolver decide
- MAX_FAILOVER_ATTEMPTS=N: cap at N attempts (old behavior with N=3)

Applied to both direct streaming and transcoded streaming paths.
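
For the resolver-based case, "lets the resolver decide" can be pictured as a loop that keeps asking for the next URL until the resolver returns nothing. This is only a sketch: `resolver_attempts` and the `resolve_next` callable are hypothetical stand-ins, not the proxy's or m3u-editor's actual API, and Python's `None` stands in for the resolver returning null.

```python
from typing import Callable, Iterator, Optional


def resolver_attempts(resolve_next: Callable[[], Optional[str]]) -> Iterator[str]:
    """Yield failover URLs until the resolver signals there is nothing left.

    There is no attempt counter here: the resolver returning None is the
    only stop condition, matching the "effectively unlimited" behavior.
    """
    while True:
        url = resolve_next()
        if url is None:
            return
        yield url
```

A caller would try each yielded URL in turn and stop consuming the generator as soon as one of them streams successfully.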
@gabelul changed the title from "fix: keep client connected during transcode input error failover" to "fix: failover reconnect + configurable max attempts" on Mar 26, 2026

gabelul commented Mar 26, 2026

I can break them into commits if you want to, but this is another problem that I have encountered today. A friend of mine was watching a channel that had 11 failover channels, and it was only trying the first three. The first 3 were from the same provider, so it never got to try the rest of them, which were working. I think this is a good addition. Thank you.

@sparkison changed the base branch from master to dev on March 28, 2026 21:29
@sparkison
Member

Makes sense, this was an arbitrary limit placed a while back. As a heads up, if you use the smart failover resolver, the limit is ignored (Settings > Proxy > Enable advanced failover logic).

@sparkison sparkison merged commit 5ed084d into m3ue:dev Mar 28, 2026
4 checks passed