A learning playground where you design software systems and get real feedback on how good they actually are.
Most system design learning stays in theory. Whiteboards. Blog posts. Interview prep guides. You read about caching, replicas, and load balancers, but nothing ever tells you "your design would cost $430/month and fail 0.3% of the time."
That gap is what this project tries to close.
You drag and drop real system components onto a canvas, connect them, and run a simulation. Not vague feedback. Concrete numbers.
The motto: Apply system design learnings. Get judged.
- Drag and drop components onto a canvas (clients, servers, databases, caches, queues, CDNs, and more)
- Connect them with sync (solid) or async (dashed) edges
- Run a simulation that produces:
  - P50 / P95 / P99 latency in milliseconds
  - Throughput ceiling (max RPS your design can handle)
  - System availability (e.g. 99.87%) and expected downtime per month
  - Monthly infrastructure cost estimate
  - Single points of failure, bottlenecks, and a redundancy score
- Get warnings about design mistakes: missing cache layers, direct client-to-DB connections, synchronous fan-out, circular dependencies
- Stress-test with chaos scenarios: Black Friday (10x traffic), DB Primary Failure, Cold Start (zero cache hits)
- Compare designs side by side: "Cost +$73/mo, P99 latency -170ms, availability +0.09%"
Make system design a skill you can practice, not just read about.
A good architect should be able to answer:
- What breaks if my payment service goes down?
- Does this architecture handle 10k RPS or does it fall over at 2k?
- Would I even notice an outage with this observability setup?
- How many "nines" does this topology actually give me?
This tool answers those questions from a diagram.
The platform models your design as a directed graph and runs several algorithms over it. Here is a plain summary of each one.
Everything starts with your client nodes (Web, Mobile, IoT). The engine runs a BFS (Breadth-First Search) from those clients, following edges to find every node a real request can reach. Nodes with no path from a client are flagged as disconnected and skipped.
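A minimal sketch of that reachability pass, assuming simplified node and edge shapes (the real types come from ReactFlow and `simulationValidator.ts` and may differ):

```typescript
// Hypothetical shapes, simplified stand-ins for the real graph model.
interface GraphNode { id: string; type: string }
interface GraphEdge { source: string; target: string }

// BFS from every client node; anything never visited is flagged as disconnected.
function reachableFromClients(nodes: GraphNode[], edges: GraphEdge[]): Set<string> {
  const adjacency = new Map<string, string[]>();
  for (const e of edges) {
    const list = adjacency.get(e.source) ?? [];
    list.push(e.target);
    adjacency.set(e.source, list);
  }
  const queue = nodes.filter(n => n.type === "client").map(n => n.id);
  const visited = new Set<string>(queue);
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const next of adjacency.get(current) ?? []) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push(next);
      }
    }
  }
  return visited;
}
```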
Every component has a base latency (API Gateway = 10ms, SQL DB = 10ms, Cache = 1ms). The engine finds the longest latency path from client to the deepest backend, following sync edges only. That path is your critical path, and it determines your P50.
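A rough sketch of that longest-path walk over sync edges; the function and parameter names here are illustrative, not the actual exports of `simulationValidator.ts`:

```typescript
// Depth-first walk over sync edges, summing base latencies. The heaviest
// root-to-leaf total, taken over all client nodes, is the critical path
// and becomes the P50 estimate.
function criticalPathLatency(
  start: string,
  syncAdjacency: Map<string, string[]>,
  baseLatencyMs: Map<string, number>,
  visiting: Set<string> = new Set()
): number {
  if (visiting.has(start)) return 0; // guard against cycles
  visiting.add(start);
  const own = baseLatencyMs.get(start) ?? 0;
  let deepest = 0;
  for (const child of syncAdjacency.get(start) ?? []) {
    deepest = Math.max(deepest, criticalPathLatency(child, syncAdjacency, baseLatencyMs, visiting));
  }
  visiting.delete(start);
  return own + deepest;
}
```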
P95 = P50 x (1 + 0.4 x serialHops x (1 - redundancyScore / 100))
P99 = P50 x (1 + 0.8 x serialHops x (1 - redundancyScore / 100))
More serial hops plus less redundancy equals fatter tails. This reflects what actually happens in production: variance compounds across every hop.
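The two formulas translate directly into code; a small sketch, assuming `serialHops` and `redundancyScore` come out of the same critical-path analysis:

```typescript
// Tail latency widens with serial hops and narrows with redundancy (0-100).
function tailLatencies(p50: number, serialHops: number, redundancyScore: number) {
  const redundancyFactor = 1 - redundancyScore / 100;
  return {
    p50,
    p95: p50 * (1 + 0.4 * serialHops * redundancyFactor),
    p99: p50 * (1 + 0.8 * serialHops * redundancyFactor),
  };
}
```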
Each component type has a base SLA (a single App Server = 99.5%). Running 3 instances means all 3 must fail at once, so the effective SLA compounds:
effectiveSLA = 1 - (1 - baseSLA)^instanceCount
System availability is the series product of all SLAs along the critical path. Adding a replica measurably improves the number.
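Both steps in a short sketch, per-node compounding followed by the series product along the critical path (names are illustrative):

```typescript
// n independent instances: the node is down only if every instance is down.
function effectiveSLA(baseSLA: number, instanceCount: number): number {
  return 1 - Math.pow(1 - baseSLA, instanceCount);
}

// Series availability: every node on the critical path must be up.
function systemAvailability(criticalPath: { baseSLA: number; instanceCount: number }[]): number {
  return criticalPath.reduce((acc, node) => acc * effectiveSLA(node.baseSLA, node.instanceCount), 1);
}

// e.g. three 99.5% instances: 1 - 0.005^3, comfortably past six nines.
// Expected downtime per month follows from the final availability figure.
const downtimeMinutesPerMonth = (availability: number) => (1 - availability) * 30 * 24 * 60;
```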
Every component has a max RPS (a web server handles ~500 RPS, a microservice handles ~2000). The engine walks the critical path and finds the minimum throughput node. That node is your bottleneck and your system's hard ceiling.
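A sketch of that bottleneck scan, assuming each critical-path node exposes a `maxRPS` capacity scaled by its instance count (an assumption, not necessarily how the engine scales capacity):

```typescript
interface CapacityNode { id: string; maxRPS: number; instanceCount: number }

// The throughput ceiling is set by the weakest node on the critical path.
function throughputCeiling(criticalPath: CapacityNode[]): { ceiling: number; bottleneck: string } {
  let ceiling = Infinity;
  let bottleneck = "";
  for (const node of criticalPath) {
    const capacity = node.maxRPS * node.instanceCount;
    if (capacity < ceiling) {
      ceiling = capacity;
      bottleneck = node.id;
    }
  }
  return { ceiling, bottleneck };
}
```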
The engine counts how many downstream services each compute node calls synchronously. If a service fans out to more than 3 sync dependencies, your actual downstream load is incomingRPS x N. This is one of the most common microservice mistakes and one of the first things a senior engineer checks.
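A sketch of that fan-out check, with hypothetical edge and warning shapes:

```typescript
interface SyncEdge { source: string; target: string; kind: "sync" | "async" }

// Count synchronous downstream calls per node; past the threshold, the
// downstream fleet sees incomingRPS multiplied by the fan-out factor.
function fanOutWarnings(edges: SyncEdge[], incomingRPS: number, threshold = 3): string[] {
  const syncOut = new Map<string, number>();
  for (const e of edges) {
    if (e.kind === "sync") syncOut.set(e.source, (syncOut.get(e.source) ?? 0) + 1);
  }
  const warnings: string[] = [];
  for (const [nodeId, n] of syncOut) {
    if (n > threshold) {
      warnings.push(`${nodeId}: ${n} sync dependencies, downstream load ~${incomingRPS * n} RPS`);
    }
  }
  return warnings;
}
```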
For each node, the engine runs a reverse BFS on sync edges to count how many upstream nodes depend on it directly or through a chain. If that node fails, every ancestor loses the ability to serve requests. This gives each node a blastRadius score. The top 3 highest-risk nodes are surfaced in the results.
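A sketch of that reverse walk (field names are assumptions; the real implementation lives in `simulationValidator.ts`):

```typescript
// Reverse BFS over sync edges: how many upstream nodes lose service if `target` fails?
function blastRadius(
  target: string,
  edges: { source: string; target: string; kind: "sync" | "async" }[]
): number {
  const reverse = new Map<string, string[]>();
  for (const e of edges) {
    if (e.kind !== "sync") continue;
    const list = reverse.get(e.target) ?? [];
    list.push(e.source);
    reverse.set(e.target, list);
  }
  const visited = new Set<string>([target]);
  const queue = [target];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const upstream of reverse.get(current) ?? []) {
      if (!visited.has(upstream)) {
        visited.add(upstream);
        queue.push(upstream);
      }
    }
  }
  return visited.size - 1; // exclude the failing node itself
}
```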
The engine checks whether your compute nodes are reachable from at least one tracing node, and whether your databases are reachable from at least one metrics node (using undirected BFS). It outputs a coverage score from 0 to 100%. Below 80%, you get a warning. Blind spots in observability mean silent outages.
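A possible shape for that coverage score, assuming the required IDs (compute nodes for tracing, databases for metrics) are collected beforehand:

```typescript
// Undirected BFS outward from the observability nodes, then score the
// fraction of required nodes that were reached (below 80% => warning).
function coverageScore(
  requiredIds: string[],
  observabilityIds: string[],
  edges: { source: string; target: string }[]
): number {
  const neighbors = new Map<string, string[]>();
  const link = (a: string, b: string) => {
    const list = neighbors.get(a) ?? [];
    list.push(b);
    neighbors.set(a, list);
  };
  for (const e of edges) {
    link(e.source, e.target);
    link(e.target, e.source);
  }
  const reachable = new Set<string>(observabilityIds);
  const queue = [...observabilityIds];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const next of neighbors.get(current) ?? []) {
      if (!reachable.has(next)) {
        reachable.add(next);
        queue.push(next);
      }
    }
  }
  if (requiredIds.length === 0) return 100;
  const covered = requiredIds.filter(id => reachable.has(id)).length;
  return Math.round((covered / requiredIds.length) * 100);
}
```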
When you connect two SQL databases with an edge, the engine checks whether it is async or sync:
- Async replication is flagged as AP (Available, Partition-tolerant): reads may lag behind writes
- Sync replication is flagged as CP (Consistent, Partition-tolerant): write latency goes up
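A tiny sketch of that classification (the function is hypothetical; only the AP/CP labels come from the engine):

```typescript
type ReplicationMode = "sync" | "async";

// CAP trade-off flag for a primary-replica SQL edge.
function classifyReplication(mode: ReplicationMode): { label: "CP" | "AP"; note: string } {
  return mode === "sync"
    ? { label: "CP", note: "Consistent under partition; writes pay the replication latency." }
    : { label: "AP", note: "Available under partition; replicas may serve stale reads." };
}
```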
Every node contributes to a monthly cost estimate based on its configuration (instance count, storage, throughput). Costs are modeled after real cloud pricing (AWS/GCP approximations). The compareCostVsPerf() function can run two configurations back-to-back and output the delta: what you gain and what you pay for it.
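A sketch of what that delta might look like, in the spirit of compareCostVsPerf() (the summary shape and function names here are assumptions):

```typescript
interface SimulationSummary { monthlyCost: number; p99Ms: number; availability: number }

// 73 -> "+73", -170 -> "-170"
const signed = (n: number, digits = 0) => `${n >= 0 ? "+" : "-"}${Math.abs(n).toFixed(digits)}`;

// Diff two simulation runs into the one-line trade-off shown in the UI,
// e.g. "Cost +$73/mo, P99 latency -170ms, availability +0.09%".
function diffSummaries(a: SimulationSummary, b: SimulationSummary): string {
  const cost = b.monthlyCost - a.monthlyCost;
  const p99 = b.p99Ms - a.p99Ms;
  const avail = (b.availability - a.availability) * 100;
  const costSign = cost >= 0 ? "+" : "-";
  return `Cost ${costSign}$${Math.abs(cost).toFixed(0)}/mo, ` +
    `P99 latency ${signed(p99)}ms, availability ${signed(avail, 2)}%`;
}
```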
Before running the simulation, you can select a preset that temporarily changes the graph:
- Black Friday: 10x traffic, fatter tail latency
- DB Primary Failure: removes primary SQL DB nodes, traffic routes to replicas only
- Cold Start: zeroes out all cache hit rates, every request hits the database
Combined with Diff Mode, you see exactly how your architecture behaves under stress compared to the happy path.
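One way to think about a preset is as a pure transform over the simulation inputs, so the baseline graph stays untouched and Diff Mode can compare both runs. A sketch under that assumption (the preset names and input shape are illustrative; the real union lives on `SimulationConfig.preset`):

```typescript
type Preset = "blackFriday" | "dbPrimaryFailure" | "coldStart" | "none";

interface ScenarioInput { incomingRPS: number; cacheHitRate: number; primariesRemoved: boolean }

// Apply a chaos preset without mutating the baseline inputs.
function applyPreset(input: ScenarioInput, preset: Preset): ScenarioInput {
  switch (preset) {
    case "blackFriday":
      return { ...input, incomingRPS: input.incomingRPS * 10 };
    case "coldStart":
      return { ...input, cacheHitRate: 0 };
    case "dbPrimaryFailure":
      return { ...input, primariesRemoved: true };
    default:
      return input;
  }
}
```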
| Rule | Why |
|---|---|
| No client node | Nothing to generate traffic from |
| No reachable compute | Requests have nowhere to go |
| Client directly wired to database | No app layer means no auth, no business logic |
| Circular synchronous dependency | Infinite blocking, deadlock |
| Load balancer with only one backend | Load balancer has no purpose |
| Message queue with no consumers | Messages pile up indefinitely |
| Warning | What It Catches |
|---|---|
| High read traffic, no cache | DB takes the hit on every read |
| Single database, no replica | One node failing means full downtime |
| No rate limiter on public entry | Open to traffic floods |
| Fan-out over 3 sync downstream calls | Downstream fleet gets amplified RPS |
| Async replication to a replica | Eventual consistency risk |
| Sync replication to a replica | Write latency penalty |
| Observability coverage under 80% | Blind spots in production monitoring |
| High blast radius nodes | Risk of cascading failure |
| Layer | Technology |
|---|---|
| UI Framework | React + TypeScript |
| Graph Canvas | ReactFlow |
| State Management | Zustand |
| Charts | Recharts |
| Styling | Tailwind CSS |
| Build Tool | Vite |
The codebase is organized around these key files:
    src/
      utils/
        simulationValidator.ts    <- All graph algorithms and simulation math
        costEstimator.ts          <- Cost models and compareCostVsPerf()
        validation.ts             <- Static design rule checks
      store/
        useStore.ts               <- Global state (SimulationConfig, SimulationResult)
      components/
        panels/RightPanel.tsx     <- Simulation controls, results UI, diff mode
        canvas/                   <- The drag-and-drop node canvas
      data/
        nodeTypes.ts              <- Component definitions and categories
- Add a new component type: add it to `nodeTypes.ts`, give it a base latency in `simulationValidator.ts`, and a cost model in `costEstimator.ts`
- Add a new soft validation rule: extend `runSoftValidation()` in `simulationValidator.ts`
- Add a new chaos preset: add it to the `SimulationConfig.preset` union type and handle it in `RightPanel.tsx`'s `handleRunSimulation`
- Improve cost accuracy: update the pricing in `costEstimator.ts` for a specific cloud provider
- Add a new UI section: the right panel renders collapsible sections; follow the existing `SectionHeader` and `MetricCard` pattern
- All simulation logic lives in `utils/`. No simulation math in React components.
- `SimulationResult` is the single output type. If you compute a new metric, add it there.
- Sync edges = blocking calls. Async edges (dashed) = fire-and-forget. This distinction is central to every algorithm.
- Use seeded randomness only (`seededRandom(hashGraph(nodes, edges))`). Never use `Math.random()` in simulation output.
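A minimal sketch of that convention: a deterministic PRNG seeded by a hash of the graph, so identical designs always produce identical results. Both function bodies below are stand-ins for the real implementations:

```typescript
// Deterministic PRNG (mulberry32-style): same seed, same sequence, every run.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Stable string hash over the graph contents; any deterministic hash works as a seed.
function hashGraph(nodes: { id: string }[], edges: { source: string; target: string }[]): number {
  const text = nodes.map(n => n.id).join("|") + "#" + edges.map(e => `${e.source}->${e.target}`).join("|");
  let hash = 0;
  for (let i = 0; i < text.length; i++) hash = (Math.imul(hash, 31) + text.charCodeAt(i)) | 0;
  return hash >>> 0;
}
```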
| Algorithm | Time | Space |
|---|---|---|
| BFS reachability | O(n + e) | O(n) |
| All-paths enumeration | O(paths x depth) | O(n) |
| Critical path | O(paths x n) | O(n) |
| Cycle detection (DFS) | O(n + e) | O(n) |
| Blast radius (reverse BFS) | O(n x (n + e)) | O(n) |
| Cost estimation | O(n) | O(n) |
where n = number of nodes, e = number of edges.
MIT. Build on it, fork it, use it for teaching, use it to practice for your next system design interview.