feat(cli): add multi-instance mode via subprocess spawning #11
paradixe wants to merge 13 commits into ssmirr:master
Conversation
Solves the psiphon.SetNoticeWriter() global singleton issue by spawning separate conduit processes instead of running instances in-process.

Changes:
- Add --multi-instance flag to spawn N child processes (1 per 100 clients)
- Add --instances flag for explicit instance count control
- New multi.go: subprocess spawner with stats aggregation
- Export FormatBytes/FormatDuration for use by multi-instance aggregator
- Parent captures child STATS lines and prints AGGREGATE every 10s
- Stats JSON includes per-instance breakdown
- Final stats write on graceful shutdown

Each child process has its own data directory with separate keys, avoiding the global SetNoticeWriter conflict that caused N-counted stats.

Usage:
  conduit start --multi-instance -m 200   # spawns 2 instances
  conduit start --instances=3 -m 300      # explicit 3 instances
@paradixe Why not just run separate instances as independent processes (or multiple docker containers)? In the proposed solution, if one of the sub-processes goes down (due to a network spike or any issue that causes it to reset), then all instances go down with it. Maybe a better retention policy would help, so clients aren't lost when one instance fails?
I appreciate the comment though, and it's worth thinking through:
But compatibility with previous releases makes a huge/substantial change difficult to communicate, and it's not really a fit for a repo with this many users.
cli/internal/conduit/multi.go
Outdated
wg.Add(1)
go func(idx int, dataDir string) {
    defer wg.Done()
    if err := m.runInstance(ctx, idx, dataDir, clientsPerInstance, bandwidthPerInstance); err != nil {
@paradixe Thanks for explaining it in detail. I totally understand why you needed subprocesses. My only concern was the restart policy—for example, here, instead of submitting the error to errChan, causing a return and terminating the parent, it might be better to just restart.
That said, I assume if the coordinator goes down, it doesn’t really matter. I’m also assuming the coordinator is just a goroutine that doesn’t do much, but it looks like a failure in one subprocess could stop the parent goroutine, and hence bring down the main process.
If you think this won’t cause any issues, then feel free to ignore my comment.
Thanks
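For illustration only, here is a minimal sketch of the restart idea: a hypothetical retry wrapper (runWithRestarts, maxRestarts, and the delay value are not from the PR) that the spawner goroutine could call instead of sending the first error to errChan:

```go
// Sketch of a hypothetical retry wrapper; not part of the PR's multi.go.
package conduit

import (
	"context"
	"time"
)

// runWithRestarts retries run() up to maxRestarts times with a fixed delay,
// stopping early if the parent context is cancelled. The last error is only
// returned once the retry budget is exhausted, so a single subprocess crash
// does not take down the parent.
func runWithRestarts(ctx context.Context, run func(context.Context) error,
	maxRestarts int, delay time.Duration) error {
	var err error
	for attempt := 0; attempt <= maxRestarts; attempt++ {
		err = run(ctx)
		if err == nil || ctx.Err() != nil {
			return err // clean exit, or the parent is shutting down
		}
		select {
		case <-ctx.Done():
			return err
		case <-time.After(delay):
			// back off briefly before respawning the child process
		}
	}
	return err
}
```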
To be completely honest, I'm a noob at Go and I very much appreciate you looking out. @ssmirr knows what he wants and he may have written something already so we'll see if this one even makes the cut
@paradixe This patch is actually pretty good. I really appreciate your effort and look forward to having it merged ASAP.
@amirhnajafiz @paradixe - thank you for the work; I was also starting on a multi-instance fork, but you all have already done it.
One comment and many questions:
- Comment: each instance in the subprocess approach uses the same entry/exit network interface.
- Question: Same question as @amirhnajafiz: why not instead spin up multiple separate "instances" (containers, LXCs, VMs, etc.)? What are the tradeoffs?
  - In my head this approach seems to better allocate CPU resources per "instance", which appears to be the bottleneck (we've spun up servers across ASNs and 1G/10G connections, and it looks CPU-bound; is this your experience too?)
  - But this may come at the cost of nodes that are easier to blacklist (FQDN, IP). We've had that happen already when running servers earlier with different protocols, as I'm sure you all have experienced.
Thank you all again. This whole conduit project is incredible work, during what must be very personal time.
Guys I'm so happy for all the interest in this repo and the feedback I'm getting. Thank y'all!
Separate container instances work totally ok! I actually tested with that as a proof of concept before getting the "multi instance" idea out there. And please note, I'm keeping single-instance mode, so you can use that if it works better for you in a multi-container setup.
Problem is I don't want this CLI to manage containers, because I want to keep it as close as possible to the upstream code (the goal is to stay up to date with the features they add, and maybe contribute upstream from this fork if we think a feature fits their philosophy).
But I think there is value in having a multi-instance setup, and the simplest way we can achieve this is with subprocesses. A lot of people really just want to run a binary and not mess with Docker (especially if running on macOS / Windows).
@aburnap It would be great if you could share more about the blacklisted nodes you experienced (I haven't seen this happen) -- feel free to share here, or you can DM me on X, whichever works best.
@aburnap yes from what I've seen, usually the CPU is the bottleneck on most servers.
@ssmirr BTW, the code doesn't need to manage the containers/processes. We could add a small shell script to run multiple instances, or even use docker-compose. No need for a code change.
Hey guys (@paradixe @amirhnajafiz)! It's awesome to see everyone helping with the feature development here. I really appreciate it! I'm reviewing the PR now and had some (somewhat-more-opinionated) WIP locally. I think I might mix the two and test it. I will keep you posted on how it goes.
@ssmirr Don't hesitate to ask if you need any help.
- Move regex compilation to package level (fix N-counting perf issue)
- Show per-instance stats only with -v flag
- Properly propagate parent verbosity to child processes
- Always show connection events and errors regardless of verbosity
- Add BytesPerSecondToMbps constant (replaces hardcoded 125000)
- Move byteMultipliers map to package level to avoid repeated allocation
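A minimal sketch of the package-level declarations these two commits describe; the regex pattern and multiplier values below are assumptions, not copied from multi.go:

```go
// Sketch of package-level state so nothing is recompiled or reallocated per line.
package conduit

import "regexp"

// Compiled once at package init instead of on every parsed output line.
// The exact pattern used by multi.go may differ.
var statsLineRe = regexp.MustCompile(`^STATS\s+(\{.*\})$`)

// BytesPerSecondToMbps is the divisor for converting bytes/s to megabits/s
// (1 Mbps = 125,000 bytes per second), replacing the hardcoded 125000.
const BytesPerSecondToMbps = 125000

// byteMultipliers lives at package level so the map is allocated once.
// Whether the real code uses 1000- or 1024-based units is an assumption here.
var byteMultipliers = map[string]float64{
	"B": 1, "KB": 1 << 10, "MB": 1 << 20, "GB": 1 << 30, "TB": 1 << 40,
}
```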
Improves child process management with graceful shutdown, proper I/O handling,
and better concurrency patterns.
Changes:
1. Graceful Shutdown with Timeout (Two-Phase Pattern):
- Use CommandContext to send SIGTERM when parent context cancelled
- Child process receives signal and can cleanup gracefully
- Force-kill with SIGKILL after 2s timeout if child still running
- Prevents abrupt termination and allows connection cleanup
2. WaitGroup for I/O Goroutines:
- Promote WaitGroup from local variable to struct field
- Track stderr/stdout reader goroutines with Add/Done
- Parent waits for all I/O to complete before exiting
- Prevents truncated output during shutdown
3. Increased Scanner Buffer Size:
- Create newLargeBufferScanner() helper function (see the sketch after this list)
- 64KB initial buffer, 1MB maximum (vs 64KB default limit)
- Prevents scanner failure on long lines (stack traces, verbose JSON)
- Eliminates duplicate buffer configuration code (DRY)
4. Proper Stream Separation:
- Write child stderr to os.Stderr (was os.Stdout)
- Write child stdout to os.Stdout (unchanged)
- Follows Unix convention: stderr=diagnostics, stdout=data
- Enables separate redirection: conduit 2>errors.log 1>output.log
5. Concurrent I/O Reading:
- Move stdout reading to goroutine (was blocking in main flow)
- Both stderr and stdout now read concurrently
- Prevents potential deadlock if child writes to both streams
- Symmetric design improves code clarity
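As referenced in item 3, a sketch of what the large-buffer scanner helper could look like (the function name comes from the commit message; the package name and the exact implementation details are assumptions):

```go
package conduit

import (
	"bufio"
	"io"
)

// newLargeBufferScanner returns a line scanner with a 64KB initial buffer and
// a 1MB maximum token size, so long lines (stack traces, verbose JSON) do not
// abort the read loop with bufio.ErrTooLong.
func newLargeBufferScanner(r io.Reader) *bufio.Scanner {
	s := bufio.NewScanner(r)
	s.Buffer(make([]byte, 64*1024), 1024*1024)
	return s
}
```

Callers would still check s.Err() after the Scan loop, which a later commit in this PR adds.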
Shutdown Flow:
User signals shutdown (Ctrl+C)
↓
Context cancelled → CommandContext sends SIGTERM to child
↓
Child receives signal → starts graceful cleanup
↓
Monitor goroutine waits 2 seconds
↓
If child still alive → Force kill (SIGKILL)
↓
cmd.Wait() returns → I/O goroutines finish reading
↓
m.wg.Wait() blocks until all I/O complete
↓
Parent exits cleanly
Constants Added:
- ShutdownTimeout = 2s (grace period before force-kill)
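A minimal sketch of the two-phase shutdown described above, assuming Go 1.20+ where exec.Cmd.Cancel and exec.Cmd.WaitDelay are available; the actual wiring in multi.go may differ:

```go
// Sketch: SIGTERM on context cancellation, SIGKILL after the grace period.
package conduit

import (
	"context"
	"os/exec"
	"syscall"
	"time"
)

// ShutdownTimeout is the grace period before the child is force-killed.
const ShutdownTimeout = 2 * time.Second

func startChild(ctx context.Context, bin string, args ...string) (*exec.Cmd, error) {
	cmd := exec.CommandContext(ctx, bin, args...)
	// Phase 1: when the parent context is cancelled, ask the child to shut
	// down gracefully instead of the default hard kill (Unix semantics).
	cmd.Cancel = func() error {
		return cmd.Process.Signal(syscall.SIGTERM)
	}
	// Phase 2: if the child is still running after the grace period,
	// the exec package force-kills it and cmd.Wait returns.
	cmd.WaitDelay = ShutdownTimeout
	return cmd, cmd.Start()
}
```

After cmd.Wait() returns, the parent still blocks on its WaitGroup (m.wg.Wait() in the flow above) until the stdout/stderr reader goroutines finish draining their streams.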
Replace time-based periodic printing (every 10s) with event-driven printing that only outputs when stats actually change. Eliminates spam during idle periods and provides immediate feedback when stats update.
3 hours later 😅 It would be great if you all could also take a look at my changes (warning: it is a bit opinionated in some parts). Will merge later in the day and make a build for a few people to test before doing the final release.
Here is an experimental build from this PR; please give multi-instance a try and let me know if you find any issues: https://github.com/ssmirr/conduit/releases/tag/experimental-pr11
…ality
This commit addresses critical feedback from code review:
1. Simplify clientsPerInstance calculation using max()
- Replace 4-line if-statement with idiomatic max() builtin
- Ensures minimum of 1 client per instance more concisely
2. Add missing scanner error handling
- Check scanner.Err() after both stdout and stderr scan loops
- Prevents silent failures on buffer overflow or I/O errors
- Follows Go scanner best practices
3. Fix mutex/channel race condition in stats processing
- CRITICAL: Eliminates lock contention under heavy load
- Root cause: parseInstanceOutput() held m.mu while signaling
m.statsChanged, causing aggregateAndPrintStats() to block
waiting for the same mutex
- Solution: Release mutex BEFORE signaling channel
- Changed parseStatsLine() to return bool instead of signaling
- Moved channel signaling to parseInstanceOutput() after unlock
Race condition flow:
Before:
parseInstanceOutput() [HOLDS m.mu]
→ parseStatsLine() signals m.statsChanged
→ aggregateAndPrintStats() wakes, blocks on m.mu
→ Lock contention cascade under load
After:
parseInstanceOutput() [HOLDS m.mu]
→ parseStatsLine() returns changed flag
→ m.mu.Unlock() [RELEASES LOCK]
→ Signal m.statsChanged (no lock held)
→ aggregateAndPrintStats() acquires m.mu immediately
→ No contention
This follows the critical concurrency pattern: separate data access
(under mutex) from coordination (channel signaling) to prevent
cascading lock contention in high-throughput scenarios.
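A stripped-down sketch of the "mutate under the mutex, signal after unlock" pattern the fix describes; the field names mirror the commit message (mu, statsChanged), but the surrounding type and method are assumptions:

```go
// Sketch of the post-fix signaling order; not the PR's actual types.
package conduit

import "sync"

type multiRunner struct {
	mu           sync.Mutex
	stats        map[string]int64 // latest connected-client count per instance
	statsChanged chan struct{}    // buffered with capacity 1
}

func (m *multiRunner) recordStats(instance string, connected int64) {
	m.mu.Lock()
	changed := m.stats[instance] != connected
	if changed {
		m.stats[instance] = connected
	}
	m.mu.Unlock() // release the lock BEFORE signaling

	if changed {
		select {
		case m.statsChanged <- struct{}{}: // wake the aggregator
		default: // a wake-up is already pending; don't block the parser
		}
	}
}
```

When aggregateAndPrintStats() wakes on statsChanged, it can acquire m.mu immediately because the parser is no longer holding the lock while it signals.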
This is great, thank you so much for the feedback @amirhnajafiz! Appreciate your help; the changes are applied. Running a build here: https://github.com/ssmirr/conduit/actions/runs/21454769787/job/61792608490
Great collaboration from everyone! |
Previous version (2fd31d4) vs experimental-pr11
Notes:
No crashes, stable operation. It seems to me that the broker is choosy and/or we have a lot of nodes, which makes my nodes less desirable (all US-based). If someone can test EU-based, that would give more info. Thanks to everyone for their contributions and thoughts.
@paradixe could you update your comment above to add numbers from the previous version on the same servers, for comparison?
Spawns multiple conduit processes for higher client capacity.
Usage:
  conduit start --multi-instance -m 200   # spawns 2 instances
  conduit start --instances=3 -m 300      # explicit 3 instances
stats.json output:
{ "liveInstances": 3, "totalInstances": 3, "connectedClients": 42, "instances": [ {"id": "instance-0", "connected": 15}, {"id": "instance-1", "connected": 14}, {"id": "instance-2", "connected": 13} ] }How it works:
~/.conduit/instance-N/)Why subprocesses:
psiphon.SetNoticeWriter() is a global singleton. In-process multi-instance causes stats N-counting.
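To make the mechanism above concrete, here is a rough sketch of how a parent could fold one child's STATS lines into its aggregate; the "STATS " line prefix and the JSON field names are assumptions based on the stats.json example above, not the PR's actual output format:

```go
// Sketch: scan one child's stdout, decode lines prefixed with "STATS ",
// and keep the latest connected-client count for that instance.
package conduit

import (
	"bufio"
	"encoding/json"
	"io"
	"strings"
)

// instanceStats mirrors the per-instance entries in the stats.json example.
type instanceStats struct {
	ID        string `json:"id"`
	Connected int    `json:"connected"`
}

func captureStats(id string, stdout io.Reader, latest map[string]int) error {
	sc := bufio.NewScanner(stdout)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "STATS ") {
			continue // non-stats output is handled elsewhere (events, errors)
		}
		var s instanceStats
		if err := json.Unmarshal([]byte(strings.TrimPrefix(line, "STATS ")), &s); err != nil {
			continue // skip malformed stats lines
		}
		latest[id] = s.Connected // in the real code this would be guarded by a mutex
	}
	return sc.Err() // surface read errors instead of failing silently
}
```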