
feat(cli): add multi-instance mode via subprocess spawning#11

Open
paradixe wants to merge 13 commits into ssmirr:master from paradixe:feat/multi-instance-subprocess

Conversation

@paradixe

Spawns multiple conduit processes for higher client capacity.

Usage:

conduit start --multi-instance -m 500        # auto: 5 instances (100 clients each)
conduit start --instances 3 -m 300           # explicit: 3 instances

stats.json output:

{
  "liveInstances": 3,
  "totalInstances": 3,
  "connectedClients": 42,
  "instances": [
    {"id": "instance-0", "connected": 15},
    {"id": "instance-1", "connected": 14},
    {"id": "instance-2", "connected": 13}
  ]
}

How it works:

  • Parent spawns N child processes
  • Each child gets own data dir (~/.conduit/instance-N/)
  • Parent aggregates stats, writes combined stats.json
  • Clients/bandwidth split evenly across instances
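A minimal sketch of that flow, assuming a hypothetical --data-dir flag and standard library process spawning (this is an illustration, not the PR's actual multi.go):

```go
package main

import (
    "context"
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
    "sync"
)

func spawnInstances(ctx context.Context, totalClients, instances int) {
    if instances < 1 {
        instances = 1
    }
    clientsPer := max(totalClients/instances, 1) // at least 1 client per instance (Go 1.21+ max)
    home, _ := os.UserHomeDir()

    var wg sync.WaitGroup
    for i := 0; i < instances; i++ {
        dataDir := filepath.Join(home, ".conduit", fmt.Sprintf("instance-%d", i))

        wg.Add(1)
        go func(idx int, dir string) {
            defer wg.Done()
            if err := os.MkdirAll(dir, 0o700); err != nil {
                fmt.Fprintf(os.Stderr, "instance-%d: %v\n", idx, err)
                return
            }
            // Each child is a separate conduit process, so psiphon's global
            // notice writer is per-process and stats are not N-counted.
            // --data-dir is an assumed flag name for this sketch.
            cmd := exec.CommandContext(ctx, "conduit", "start",
                "-m", fmt.Sprint(clientsPer), "--data-dir", dir)
            cmd.Stdout = os.Stdout
            cmd.Stderr = os.Stderr
            if err := cmd.Run(); err != nil && ctx.Err() == nil {
                fmt.Fprintf(os.Stderr, "instance-%d exited: %v\n", idx, err)
            }
        }(i, dataDir)
    }
    wg.Wait()
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    spawnInstances(ctx, 500, 5) // e.g. -m 500 → 5 instances of 100 clients each
}
```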

Why subprocesses:
psiphon.SetNoticeWriter() is a global singleton, so running multiple instances in-process causes stats to be counted N times.

Solves the psiphon.SetNoticeWriter() global singleton issue by spawning
separate conduit processes instead of running instances in-process.

Changes:
- Add --multi-instance flag to spawn N child processes (1 per 100 clients)
- Add --instances flag for explicit instance count control
- New multi.go: subprocess spawner with stats aggregation
- Export FormatBytes/FormatDuration for use by multi-instance aggregator
- Parent captures child STATS lines and prints AGGREGATE every 10s
- Stats JSON includes per-instance breakdown
- Final stats write on graceful shutdown

Each child process has its own data directory with separate keys,
avoiding the global SetNoticeWriter conflict that caused N-counted stats.

Usage:
  conduit start --multi-instance -m 200  # spawns 2 instances
  conduit start --instances=3 -m 300     # explicit 3 instances
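For illustration, a hypothetical sketch of the STATS capture and aggregation described above: the parent scans each child's stdout for lines prefixed with "STATS ", keeps the latest per-instance counts, and periodically writes the combined stats.json. The line format, field handling, and package layout here are assumptions, not the PR's actual code.

```go
package multi // hypothetical; the PR's multi.go may be organized differently

import (
    "bufio"
    "context"
    "encoding/json"
    "io"
    "os"
    "strings"
    "sync"
    "time"
)

type instanceStats struct {
    ID        string `json:"id"`
    Connected int    `json:"connected"`
}

type aggregator struct {
    mu    sync.Mutex
    stats map[string]instanceStats
}

// captureStats reads one child's stdout and records lines prefixed with "STATS ".
func (a *aggregator) captureStats(id string, stdout io.Reader) {
    sc := bufio.NewScanner(stdout)
    for sc.Scan() {
        line := sc.Text()
        if !strings.HasPrefix(line, "STATS ") {
            continue
        }
        var s instanceStats
        if err := json.Unmarshal([]byte(strings.TrimPrefix(line, "STATS ")), &s); err == nil {
            a.mu.Lock()
            s.ID = id
            a.stats[id] = s
            a.mu.Unlock()
        }
    }
}

// writeCombined sums per-instance counts into the combined stats.json layout.
func (a *aggregator) writeCombined(path string) error {
    a.mu.Lock()
    defer a.mu.Unlock()
    total := 0
    list := make([]instanceStats, 0, len(a.stats))
    for _, s := range a.stats {
        total += s.Connected
        list = append(list, s)
    }
    out := map[string]any{
        "liveInstances":    len(a.stats), // real code would track running vs. configured separately
        "totalInstances":   len(a.stats),
        "connectedClients": total,
        "instances":        list,
    }
    b, err := json.MarshalIndent(out, "", "  ")
    if err != nil {
        return err
    }
    return os.WriteFile(path, b, 0o644)
}

// run rewrites stats.json every 10 seconds and once more on shutdown.
func (a *aggregator) run(ctx context.Context, path string) {
    t := time.NewTicker(10 * time.Second)
    defer t.Stop()
    for {
        select {
        case <-ctx.Done():
            a.writeCombined(path) // final stats write on graceful shutdown
            return
        case <-t.C:
            a.writeCombined(path)
        }
    }
}
```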
@amirhnajafiz

@paradixe Why not just run separate instances as independent processes (or multiple docker containers)? In the proposed solution, if one of the sub-processes goes down (due to a network spike or any issue that causes it to reset), then all instances go down with it. Maybe a better retention policy would help, so clients aren’t lost when one instance fails?

@paradixe
Author

> @paradixe Why not just run separate instances as independent processes (or multiple docker containers)? In the proposed solution, if one of the sub-processes goes down (due to a network spike or any issue that causes it to reset), then all instances go down with it. Maybe a better retention policy would help, so clients aren’t lost when one instance fails?

@amirhnajafiz

  • You can run separate containers/systemd units - that works fine. This just adds convenience (single command, unified stats).

  • Re failure isolation: children are independent - if one crashes, others keep running. Only risk is parent dying (then all children die). Could add auto-restart for failed children if needed.

  • Re "why subprocesses at all": see feat(cli): add run-multi command for parallel instances #7 is a global singleton. In-process multi-instance causes stats to be counted N times. Subprocesses give each instance its own globals.

  • Re "why no docker": overhead and compatibility with forks of this repo and people who are continuously running the stuff. We don't necessarily want to push people who don't want docker to add it, and further, docker itself can be fickle in some setups. IMO, docker comes after this is figured, and running one container with multi-instances > multiple conts with one instance.

  • Re "clients aren't lost": Client's are fairly fickle as is. Crashes are generally rare unless OOM or CPU bottleneck (even then, rare).

I appreciate the comment though, and it's worth thinking about:

  • Auto-restart failed children in the parent
  • Or: just document "use systemd with Restart=always"

But compatibility with previous releases makes a huge/substantial change difficult to communicate, and that's not really a fit for a repo with this many users.

wg.Add(1)
go func(idx int, dataDir string) {
    defer wg.Done()
    if err := m.runInstance(ctx, idx, dataDir, clientsPerInstance, bandwidthPerInstance); err != nil {


@paradixe Thanks for explaining it in detail. I totally understand why you needed subprocesses. My only concern was the restart policy—for example, here, instead of submitting the error to errChan, causing a return and terminating the parent, it might be better to just restart.

That said, I assume if the coordinator goes down, it doesn’t really matter. I’m also assuming the coordinator is just a goroutine that doesn’t do much, but it looks like a failure in one subprocess could stop the parent goroutine, and hence bring down the main process.

If you think this won’t cause any issues, then feel free to ignore my comment.

Thanks
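To make the suggestion concrete, here is a hedged, self-contained sketch of a restart-instead-of-propagate supervisor: rather than sending a child's first error to errChan and returning (which tears down the parent), it restarts the child with a fixed backoff until the parent context is cancelled. All names here are illustrative, not the PR's code.

```go
package main

import (
    "context"
    "log"
    "os/exec"
    "time"
)

// supervise keeps one child process running, restarting it after failures
// instead of reporting the first error upward.
func supervise(ctx context.Context, name string, args ...string) {
    const backoff = 5 * time.Second
    for {
        cmd := exec.CommandContext(ctx, name, args...)
        err := cmd.Run()
        if ctx.Err() != nil {
            return // parent is shutting down; don't restart
        }
        log.Printf("%s exited (%v); restarting in %s", name, err, backoff)
        select {
        case <-ctx.Done():
            return
        case <-time.After(backoff):
        }
    }
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    supervise(ctx, "conduit", "start", "-m", "100")
}
```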

@paradixe
Author

To be completely honest, I'm a noob at Go and I very much appreciate you looking out. @ssmirr knows what he wants and he may have written something already so we'll see if this one even makes the cut


@paradixe This patch is actually pretty good. I really appreciate your effort and look forward to having it merged ASAP.

@aburnap Jan 28, 2026

@amirhnajafiz @paradixe - thank you for the work, I was also beginning on a multi-instance fork but you all have done the work.

One comment and many questions:

  • Comment: each instance in the subprocess uses the same entry/exit network interface
  • Question: Same question as @amirhnajafiz: why not instead spin up multiple separate "instances" (containers, LXCs, VMs, etc.)? What are the tradeoffs?
    • In my head this approach seems to better allocate CPU resources per "instance", which seems to be the bottleneck (we've spun up servers across ASNs on 1G/10G connections, and it looks CPU-bound; is this your experience too?)
  • But this may come at the tradeoff of nodes that are easier to blacklist (FQDN, IP). We've had that happen already, as I'm sure you all have too, when running earlier servers using different protocols.

Thank you all again. This whole conduit project is incredible work, during what must be very personal time.

@ssmirr
Owner

Guys I'm so happy for all the interest in this repo and the feedback I'm getting. Thank y'all!


Separate container instances work totally OK! I actually tested with that as a proof of concept before putting the "multi instance" idea out there. And please note, I'm keeping single-instance mode, so you can use that if it works better for you in a multi-container setup.

Problem is I don't want this CLI to manage containers, because I want to keep it as close as possible to upstream code (the goal is to stay up to date with the features they add, and maybe contribute from this fork upstream if we think a feature fits their philosophy).

But I think there is value in having a multi-instance setup, and the simplest way we can achieve this is with subprocesses. A lot of people really just want to run a binary and not mess with Docker (especially if running on macOS / Windows).

@aburnap It would be great if you could share more about the node blacklisting you experienced (I haven't seen this happen) -- feel free to share here or you can DM me on X, whichever works best.

@aburnap yes from what I've seen, usually the CPU is the bottleneck on most servers.


@ssmirr BTW, the code doesn't need to manage the containers/processes. We could add a small shell script to run multiple instances, or even use docker-compose. No need for a code change.

@ssmirr
Owner

ssmirr commented Jan 28, 2026

Hey guys (@paradixe @amirhnajafiz)! It's awesome to see everyone is helping with the feature developments here. I really appreciate it!

I'm reviewing the PR now and had some (somewhat-more-opinionated) WIP locally. I think I might mix the two and test it. I will keep you posted on how it goes.

@amirhnajafiz

@ssmirr Don't hesitate to ask if you need any help.

- Move regex compilation to package level (fix N-counting perf issue)
- Show per-instance stats only with -v flag
- Properly propagate parent verbosity to child processes
- Always show connection events and errors regardless of verbosity
- Add BytesPerSecondToMbps constant (replaces hardcoded 125000)
- Move byteMultipliers map to package level to avoid repeated allocation

Improves child process management with graceful shutdown, proper I/O handling,
and better concurrency patterns.

Changes:

1. Graceful Shutdown with Timeout (Two-Phase Pattern):
   - Use CommandContext to send SIGTERM when parent context cancelled
   - Child process receives signal and can cleanup gracefully
   - Force-kill with SIGKILL after 2s timeout if child still running
   - Prevents abrupt termination and allows connection cleanup

2. WaitGroup for I/O Goroutines:
   - Promote WaitGroup from local variable to struct field
   - Track stderr/stdout reader goroutines with Add/Done
   - Parent waits for all I/O to complete before exiting
   - Prevents truncated output during shutdown

3. Increased Scanner Buffer Size:
   - Create newLargeBufferScanner() helper function
   - 64KB initial buffer, 1MB maximum (vs 64KB default limit)
   - Prevents scanner failure on long lines (stack traces, verbose JSON)
   - Eliminates duplicate buffer configuration code (DRY)

4. Proper Stream Separation:
   - Write child stderr to os.Stderr (was os.Stdout)
   - Write child stdout to os.Stdout (unchanged)
   - Follows Unix convention: stderr=diagnostics, stdout=data
   - Enables separate redirection: conduit 2>errors.log 1>output.log

5. Concurrent I/O Reading:
   - Move stdout reading to goroutine (was blocking in main flow)
   - Both stderr and stdout now read concurrently
   - Prevents potential deadlock if child writes to both streams
   - Symmetric design improves code clarity

Shutdown Flow:
  User signals shutdown (Ctrl+C)
     ↓
  Context cancelled → CommandContext sends SIGTERM to child
     ↓
  Child receives signal → starts graceful cleanup
     ↓
  Monitor goroutine waits 2 seconds
     ↓
  If child still alive → Force kill (SIGKILL)
     ↓
  cmd.Wait() returns → I/O goroutines finish reading
     ↓
  m.wg.Wait() blocks until all I/O complete
     ↓
  Parent exits cleanly

Constants Added:
  - ShutdownTimeout = 2s (grace period before force-kill)
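A self-contained sketch of the same shutdown and I/O handling, using the standard library's exec.Cmd.Cancel/WaitDelay (Go 1.20+) as the two-phase mechanism instead of a separate monitor goroutine; flag values and names are illustrative, and the PR's actual implementation may differ:

```go
package main

import (
    "bufio"
    "context"
    "fmt"
    "io"
    "os"
    "os/exec"
    "os/signal"
    "sync"
    "syscall"
    "time"
)

const ShutdownTimeout = 2 * time.Second // grace period before force-kill

// newLargeBufferScanner returns a scanner with a 64KB initial buffer and a
// 1MB cap, so long lines (stack traces, verbose JSON) don't stop scanning.
func newLargeBufferScanner(r io.Reader) *bufio.Scanner {
    sc := bufio.NewScanner(r)
    sc.Buffer(make([]byte, 64*1024), 1024*1024)
    return sc
}

func runChild(ctx context.Context, args ...string) error {
    cmd := exec.CommandContext(ctx, "conduit", args...)
    // Phase 1: on context cancellation, ask the child to exit gracefully (POSIX signal).
    cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
    // Phase 2: if the child is still running after the grace period, it is force-killed.
    cmd.WaitDelay = ShutdownTimeout

    stdout, _ := cmd.StdoutPipe()
    stderr, _ := cmd.StderrPipe()
    if err := cmd.Start(); err != nil {
        return err
    }

    // Read both streams concurrently so a full pipe can never block the child,
    // keeping child stderr on os.Stderr (diagnostics) and stdout on os.Stdout (data).
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { defer wg.Done(); copyLines(newLargeBufferScanner(stdout), os.Stdout) }()
    go func() { defer wg.Done(); copyLines(newLargeBufferScanner(stderr), os.Stderr) }()

    wg.Wait() // all I/O drained first, so no output is truncated
    return cmd.Wait()
}

func copyLines(sc *bufio.Scanner, w io.Writer) {
    for sc.Scan() {
        fmt.Fprintln(w, sc.Text())
    }
    if err := sc.Err(); err != nil {
        fmt.Fprintln(os.Stderr, "scanner:", err)
    }
}

func main() {
    // Ctrl+C cancels the context, which starts the two-phase shutdown above.
    ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
    defer stop()
    if err := runChild(ctx, "start", "-m", "100"); err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
}
```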
Replace time-based periodic printing (every 10s) with event-driven printing
that only outputs when stats actually change. Eliminates spam during idle
periods and provides immediate feedback when stats update.
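A minimal sketch of that event-driven loop, assuming a statsChanged channel that the parsers signal (names here are illustrative):

```go
package main

import (
    "context"
    "fmt"
    "time"
)

// printLoop prints only when statsChanged fires, instead of on a 10s ticker,
// so idle periods produce no output and updates are reported immediately.
func printLoop(ctx context.Context, statsChanged <-chan struct{}, print func()) {
    for {
        select {
        case <-ctx.Done():
            return
        case <-statsChanged:
            print()
        }
    }
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()

    statsChanged := make(chan struct{}, 1)
    go printLoop(ctx, statsChanged, func() { fmt.Println("AGGREGATE: ...") })

    // A parser would do a non-blocking send after updating shared state,
    // so repeated updates coalesce into a single pending signal.
    select {
    case statsChanged <- struct{}{}:
    default:
    }
    <-ctx.Done()
}
```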
@ssmirr
Owner

ssmirr commented Jan 28, 2026

3 hours later 😅

It would be great if you all can also take a look at my changes. (Warning: it is a bit opinionated in some parts of it).

Will merge later in the day and make a build for a few to test before doing final release.

@ssmirr
Owner

ssmirr commented Jan 28, 2026

Here is an experimental build from this PR; please give multi-instance a try and let me know if you find any issues:

https://github.com/ssmirr/conduit/releases/tag/experimental-pr11

@amirhnajafiz left a comment

@ssmirr Hope it helps.

…ality

This commit addresses critical feedback from code review:

1. Simplify clientsPerInstance calculation using max()
   - Replace 4-line if-statement with idiomatic max() builtin
   - Ensures minimum of 1 client per instance more concisely

2. Add missing scanner error handling
   - Check scanner.Err() after both stdout and stderr scan loops
   - Prevents silent failures on buffer overflow or I/O errors
   - Follows Go scanner best practices

3. Fix mutex/channel race condition in stats processing
   - CRITICAL: Eliminates lock contention under heavy load
   - Root cause: parseInstanceOutput() held m.mu while signaling
     m.statsChanged, causing aggregateAndPrintStats() to block
     waiting for the same mutex
   - Solution: Release mutex BEFORE signaling channel
   - Changed parseStatsLine() to return bool instead of signaling
   - Moved channel signaling to parseInstanceOutput() after unlock

Race condition flow:
  Before:
    parseInstanceOutput() [HOLDS m.mu]
      → parseStatsLine() signals m.statsChanged
      → aggregateAndPrintStats() wakes, blocks on m.mu
      → Lock contention cascade under load

  After:
    parseInstanceOutput() [HOLDS m.mu]
      → parseStatsLine() returns changed flag
      → m.mu.Unlock() [RELEASES LOCK]
      → Signal m.statsChanged (no lock held)
      → aggregateAndPrintStats() acquires m.mu immediately
      → No contention

This follows the critical concurrency pattern: separate data access
(under mutex) from coordination (channel signaling) to prevent
cascading lock contention in high-throughput scenarios.
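In code, that pattern looks roughly like this (illustrative types and names; the PR's actual Manager and parse functions may differ):

```go
package multi

import "sync"

type Manager struct {
    mu           sync.Mutex
    perInstance  map[string]int // latest connected-client count per instance
    statsChanged chan struct{}  // capacity 1 so repeated signals coalesce
}

func NewManager() *Manager {
    return &Manager{
        perInstance:  make(map[string]int),
        statsChanged: make(chan struct{}, 1),
    }
}

// parseStatsLine updates shared state and reports whether anything changed.
// It runs with m.mu held and does NOT signal the channel itself.
func (m *Manager) parseStatsLine(id string, connected int) bool {
    if m.perInstance[id] == connected {
        return false
    }
    m.perInstance[id] = connected
    return true
}

// parseInstanceOutput separates data access (under the mutex) from
// coordination (channel signaling): the lock is released before the signal,
// so the stats printer can acquire m.mu immediately when it wakes.
func (m *Manager) parseInstanceOutput(id string, connected int) {
    m.mu.Lock()
    changed := m.parseStatsLine(id, connected)
    m.mu.Unlock() // release BEFORE signaling

    if changed {
        select {
        case m.statsChanged <- struct{}{}: // non-blocking; updates coalesce
        default:
        }
    }
}
```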
@ssmirr
Owner

ssmirr commented Jan 28, 2026

This is great, thank you so much for the feedback @amirhnajafiz ! Appreciate your help, changes are applied.

Running a build here: https://github.com/ssmirr/conduit/actions/runs/21454769787/job/61792608490

@amirhnajafiz

@ssmirr @paradixe Nice work all.

@ssmirr
Owner

ssmirr commented Jan 28, 2026

Great collaboration from everyone!

@paradixe
Author

paradixe commented Jan 28, 2026

Previous version (2fd31d4) vs experimental-pr11

| Metric                  | Jan 27 (2fd31d4) | Jan 28 (experimental-pr11) |
|-------------------------|------------------|----------------------------|
| Instances               | 14               | 45 (multi-instance)        |
| Peak concurrent clients | 222              | 161                        |
| Cumulative upload       | 1.7 TB           | 3.1 TB                     |
| Cumulative download     | 20.5 TB          | 37.5 TB                    |

Notes:

  • experimental-pr11 tested with -m 50/75/100 mix across 13 servers
  • ~6 hour runtime, stable operation, no crashes
  • Traffic roughly doubled with multi-instance enabled

No crashes, stable operation. Seems to me the broker is choosy and/or we have a lot of nodes, which makes my nodes (all US-based) less desirable. If someone can test EU-based servers, that would give us more data points. Thanks to everyone for their contributions and thoughts.

@ssmirr
Owner

ssmirr commented Jan 28, 2026

@paradixe could you update your comment above to add numbers from previous version on same servers for comparison?
