A distributed system for training Large Language Models across multiple machines using data parallelism, gRPC communication, and synchronous gradient aggregation.

                 ┌───────────────────┐
                 │      MASTER       │
                 │    (Go Server)    │
                 │                   │
                 │  Gradient Server  │
                 │  Worker Manager   │
                 │     Scheduler     │
                 └─────────┬─────────┘
                           │
                   gRPC (port 50051)
                           │
        ┌──────────┬───────┴───┬──────────┐
     Worker1    Worker2     Worker3    Worker4
       (Go)       (Go)        (Go)       (Go)

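
In the architecture above, the master's Gradient Server combines the gradients sent by all workers before the next training step. As a rough illustration of synchronous aggregation, the sketch below averages one gradient vector per worker element by element. It assumes gradients are flat `[]float64` slices and that plain averaging is the aggregation rule; `averageGradients` is a hypothetical name, not a function taken from this repository.

```go
package main

import (
	"errors"
	"fmt"
)

// averageGradients returns the element-wise mean of one gradient vector per
// worker. Hypothetical helper: the repository's real aggregation code may differ.
func averageGradients(grads [][]float64) ([]float64, error) {
	if len(grads) == 0 {
		return nil, errors.New("no gradients to aggregate")
	}
	n := len(grads[0])
	avg := make([]float64, n)
	for _, g := range grads {
		if len(g) != n {
			return nil, errors.New("gradient length mismatch")
		}
		for i, v := range g {
			avg[i] += v
		}
	}
	// Divide the per-parameter sums by the number of workers.
	for i := range avg {
		avg[i] /= float64(len(grads))
	}
	return avg, nil
}

func main() {
	// Two hypothetical workers, each reporting gradients for three parameters.
	avg, err := averageGradients([][]float64{
		{1, 2, 3},
		{3, 4, 5},
	})
	if err != nil {
		fmt.Println("aggregation failed:", err)
		return
	}
	fmt.Println(avg) // prints [2 3 4]
}
```

In a synchronous scheme the master would wait until every live worker has reported for the current step before calling a function like this one.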

    proto/                  gRPC protocol definition
      trainer.proto         Service + message definitions
      trainer.pb.go         Auto-generated message code
      trainer_grpc.pb.go    Auto-generated service code
    master/                 Master node (runs on one machine)
      gradient_server.go    Main entry point + gRPC handlers
      worker_manager.go     Worker health monitoring helpers
      schedule.go           Task scheduling logic
    worker/                 Worker node (runs on each machine)
      worker.go             Main entry point + registration
      heartbeat.go          Heartbeat/keep-alive system
    training/               Python training scripts (Step 7)
    dataset/                Training data shards
    storage/                Model checkpoints (Step 9)

Download and install Go (1.25+): https://go.dev/dl/
Verify:

    go version

Install the Protocol Buffers compiler (protoc).

Windows:

    # Using Chocolatey
    choco install protobuf
    # OR download manually from:
    # https://github.com/protocolbuffers/protobuf/releases
    # Add protoc.exe to your PATH

macOS:

    brew install protobuf

Linux (Ubuntu/Debian):

    sudo apt install -y protobuf-compiler

Verify:

    protoc --version

Install the Go plugins for protoc:

    go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
    go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest

Make sure `$GOPATH/bin` (or `%GOPATH%\bin` on Windows) is in your PATH:

    # Linux/macOS: add to ~/.bashrc or ~/.zshrc
    export PATH="$PATH:$(go env GOPATH)/bin"
    # Windows (PowerShell)
    $env:PATH += ";$(go env GOPATH)\bin"

Download Python 3.10+: https://www.python.org/downloads/
Clone the repo and install dependencies:

    git clone https://github.com/ShapeToFashion/Distributed-training-system.git
    cd Distributed-training-system
    go mod tidy

If you modify `proto/trainer.proto`, regenerate the Go bindings:

    protoc --go_out=. --go_opt=paths=source_relative \
      --go-grpc_out=. --go-grpc_opt=paths=source_relative \
      proto/trainer.proto

This generates:

- `proto/trainer.pb.go`: message serialization code
- `proto/trainer_grpc.pb.go`: gRPC client/server stubs

Terminal 1 (start the master):

    go run master/gradient_server.go master/worker_manager.go master/schedule.go

Master listens on port 50051.

Terminal 2 (start worker 1):

    go run worker/worker.go worker/heartbeat.go -id=worker1

Terminal 3 (start worker 2):

    go run worker/worker.go worker/heartbeat.go -id=worker2

On the master machine:

- Master IP (Kartik's machine):

      IPv4 Address: 10.56.89.116
      Port: 50051

- Start the master:

      go run master/gradient_server.go master/worker_manager.go master/schedule.go

  Master will listen on `0.0.0.0:50051` (all interfaces).

- Open firewall for port 50051:

      # Windows (run as Administrator)
      netsh advfirewall firewall add rule name="gRPC Master" dir=in action=allow protocol=TCP localport=50051
      # Linux
      sudo ufw allow 50051/tcp

On each worker machine:

- Clone the repo and install dependencies:

      git clone https://github.com/ShapeToFashion/Distributed-training-system.git
      cd Distributed-training-system
      go mod tidy

- Connect worker to master:

      go run worker/worker.go worker/heartbeat.go -id=worker1 -master=10.56.89.116:50051

To test the connection only, add the `-test` flag:

    go run worker/worker.go worker/heartbeat.go -id=test-worker -master=10.56.89.116:50051 -test

This registers with the master and exits, which is useful to verify that the gRPC link works.

Defined in proto/trainer.proto:
| RPC Method | Direction | Purpose |
|---|---|---|
| `RegisterWorker` | Worker → Master | Worker joins the cluster |
| `SendHeartbeat` | Worker → Master | Keep-alive ping (every 5s) |
| `GetTask` | Worker → Master | Worker polls for a training task (shard) |
| `GetWeights` | Worker → Master | Worker fetches latest model weights |
| `SendGradients` | Worker → Master | Worker sends computed gradients |
| `SaveCheckpoint` | Internal | Master saves model checkpoint to disk |

Default port: 50051
| Problem | Fix |
|---|---|
| `connection refused` | Check that the master is running and the IP/port are correct |
| `context deadline exceeded` | Firewall may be blocking port 50051 |
| `protoc-gen-go: program not found` | Run the `go install` commands above and add `GOPATH/bin` to PATH |
| `module not found` errors | Run `go mod tidy` in the project root |
| Workers disconnect | Check heartbeat logs; master marks workers dead after 15s of silence |

- Step 1: gRPC protocol definition
- Step 2: Master server
- Step 3: Worker node
- Step 4: Worker registration
- Step 5: Heartbeat system
- Step 6: Dataset distribution
- Step 7: Training execution
- Step 8: Gradient aggregation
- Step 9: Checkpointing