Build Swarm Architecture
Production architecture for the distributed Gentoo package build system with fault tolerance and atomic releases
Gentoo Build Swarm - Production Architecture
Version: 0.4.0 Status: Production Ready
Overview
A distributed, fault-tolerant package build system for Gentoo Linux with atomic releases and automatic failover.
Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│ CLIENT (Driver) │
│ 10.42.0.100 │
│ │
│ $ emerge -puDN @world | parse │
│ $ POST http://10.42.0.199:8090/build │
│ $ GET http://10.42.0.199:8090/binhost → get URL │
│ $ emerge -uDN @world (uses returned binhost) │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ COORDINATOR/GATEWAY (Altair-Link) │
│ 10.42.0.199:8090 (Always On) │
│ │
│ Role: Stateless Router │
│ • Health-check orchestrators (30s interval) │
│ • Route /build requests to active orchestrator │
│ • Return dynamic binhost URL │
│ • NO package storage (lightweight) │
└────────────────────────┬────────────────────────────────────┘
│
┌─────────────┴─────────────┐
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ orch-Izar-Host │ │ orch-Tarn-Host │
│ 10.42.0.201 (Primary)│ │ 100.64.0.16.118 │
│ Proxmox Izar-Host LXC 121 │ │ Proxmox Tarn-Host LXC102 │
│ │ │ (Backup/Failover) │
│ swarm-coordinator.py │ │ swarm-coordinator.py │
│ │ │ │
│ State Tracking: │ │ Same code, passive │
│ • needed[] │ │ until Izar-Host fails │
│ • delegated{} │ │ │
│ • received[] │ │ │
│ • failed{} │ │ │
│ │ │ │
│ Storage (2-tier): │ │ Storage (2-tier): │
│ STAGING: │ │ STAGING: │
│ /var/cache/ │ │ /var/cache/ │
│ binpkgs-staging/ │ │ binpkgs-staging/ │
│ (builds land here) │ │ (builds land here) │
│ │ │ │
│ PRODUCTION: │ │ PRODUCTION: │
│ /var/cache/binpkgs/ │ │ /var/cache/binpkgs/ │
│ (nginx serves this)│ │ (nginx serves this)│
│ http://.../packages│ │ http://.../packages│
└──────────┬───────────┘ └──────────────────────┘
│
│ Delegates Work
│
┌──────┴──────┬──────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ DRONE │ │ DRONE │ │ DRONE │
│drone- │ │ drone- │ │ (future)│
│ Izar-Host │ │ Tarn-Host │ │ │
│10.42.0. │ │100.64.0. │ │ │
│ 203 │ │27.91 │ │ │
│16 cores │ │14 cores │ │ │
└─────────┘ └─────────┘ └─────────┘
All drones:
• Poll orchestrator for work
• emerge --buildpkg --oneshot <pkg>
• rsync to orchestrator STAGING
• Delete local copy
Package Flow (Build → Release)
┌─────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ DRONE │───▶│ STAGING │───▶│ PRODUCTION │
│ │ │ │ │ │
│ emerge │ │ /var/cache/ │ │ /var/cache/ │
│ --buildpkg │ │ binpkgs-staging/ │ │ binpkgs/ │
│ │ │ │ │ │
│ rsync ──────│───▶│ validation │───▶│ nginx serves │
│ │ │ (size check) │ │ Packages index │
└─────────────┘ └──────────────────┘ └─────────────────┘
│
▼
Orchestrator validates:
1. File exists (GPKG flat format)
2. Size > 1KB (not junk)
3. Move to production
4. Mark package as received
Binary Format: Modern Portage uses flat GPKG files:
- Path: /var/cache/binpkgs/{category}/{name}-{version}.gpkg.tar
- Example: /var/cache/binpkgs/kde-frameworks/kconfig-6.22.0.gpkg.tar
Release Logic (v0.4.0+):
- On successful validation, the binary auto-moves from staging → production
- shutil.move() is an atomic rename when staging and production share a filesystem, so no partial transfers reach production
- Packages index regenerated by nginx or emaint binhost --fix
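The validate-then-release step can be sketched in Python. This is an illustration, not the actual swarm-coordinator.py code: the function name `promote` and the exact 1 KB threshold are assumptions based on the checklist above.

```python
import os
import shutil

def promote(relpath, staging, production, min_size=1024):
    """Validate a staged GPKG and move it into production.

    Returns True if the package passed validation and was released.
    """
    src = os.path.join(staging, relpath)
    # 1. File exists (GPKG flat format)
    if not os.path.isfile(src):
        return False
    # 2. Size > 1KB (reject junk/truncated uploads)
    if os.path.getsize(src) <= min_size:
        return False
    # 3. Move to production -- an atomic rename when both
    #    directories live on the same filesystem
    dst = os.path.join(production, relpath)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.move(src, dst)
    return True
```

Step 4 (marking the package as received) is orchestrator state, so it stays outside this helper.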
Component Details
1. Coordinator (Gateway) - 10.42.0.199
Hardware: Ubuntu LXC on Altair-Link
Software: swarm-gateway.py
Port: 8090
Responsibilities:
- Health-check orchestrators every 30 seconds
- Route build requests to active orchestrator
- Provide dynamic binhost URL resolution
- NO state tracking (orchestrator handles this)
- NO package storage (serves as pointer only)
Endpoints:
- GET /status - Gateway health and orchestrator status
- GET /binhost - Returns URL of active orchestrator’s binhost
- POST /build - Submit package list (routes to active orch)
- GET /swarm - Fetch swarm status from active orch
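A minimal sketch of how /binhost could select the active orchestrator from the config below, assuming a lower priority number wins (100 beats 200) and a health map keyed by orchestrator name. Function names are illustrative, not the gateway’s actual API:

```python
def pick_active(orchestrators, healthy):
    """Return the healthy orchestrator with the best (lowest) priority."""
    candidates = [o for o in orchestrators if healthy.get(o["name"], False)]
    if not candidates:
        return None  # all orchestrators down; /binhost should report an error
    return min(candidates, key=lambda o: o["priority"])

def binhost_url(orchestrators, healthy):
    """Resolve the URL the client should use as PORTAGE_BINHOST."""
    active = pick_active(orchestrators, healthy)
    return None if active is None else f"http://{active['ip']}/packages"
```

Because the gateway is stateless, this resolution can run fresh on every request using the latest health-check results.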
Config: /opt/swarm-gateway/config.json
{
"orchestrators": [
{"name": "orch-Izar-Host", "ip": "10.42.0.201", "priority": 100},
{"name": "orch-Tarn-Host", "ip": "100.64.0.16.118", "priority": 200}
],
"check_interval": 30
}
2. Orchestrators - 10.42.0.201 (Primary) & 100.64.0.16.118 (Backup)
Hardware: Gentoo LXC containers
Software: swarm-coordinator.py
State File: /var/lib/build-swarm/state.json
Responsibilities:
- Accept package lists via init-from-list command
- Maintain build queue state
- Delegate work to drones (load-balanced by cores)
- Track package status (needed/delegated/received/failed)
- Collect built packages in STAGING directory
- Release to PRODUCTION on finalize command
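The four state buckets and cores-weighted delegation might look like the sketch below. The data layout is an assumption inferred from the field names above, not the real state.json schema:

```python
def delegate(state, drones):
    """Assign queued packages to the drone with the most free cores.

    state:  {"needed": [pkg, ...], "delegated": {pkg: drone},
             "received": [...], "failed": {}}
    drones: {name: {"cores": int}}
    """
    def free_cores(name):
        # Free cores = total cores minus packages already assigned there
        busy = sum(1 for d in state["delegated"].values() if d == name)
        return drones[name]["cores"] - busy

    while state["needed"]:
        name = max(drones, key=free_cores)
        if free_cores(name) <= 0:
            break  # every drone is saturated; leave the rest queued
        pkg = state["needed"].pop(0)
        state["delegated"][pkg] = name
    return state
```

On a drone’s success report, the package would move from `delegated` to `received`; on failure, to `failed`.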
Two-Tier Storage Model:
STAGING (/var/cache/binpkgs-staging/):
- Drones upload here
- NOT served by nginx
- Work-in-progress, potentially incomplete builds
- Safe zone for active builds
PRODUCTION (/var/cache/binpkgs/):
- Nginx serves from here
- Only receives packages via finalize command
- Atomic release (all or nothing)
- Source of truth for clients
Commands:
# Initialize build from package list
swarm-coordinator.py init-from-list packages.txt
# Check status
swarm-coordinator.py status
# Redistribute work
swarm-coordinator.py balance
# ATOMIC RELEASE: Move staging → production
swarm-coordinator.py finalize
# Retry failed packages
swarm-coordinator.py unblock
3. Drones - Multiple Builders
Software: build-worker.sh
Behavior:
- Poll orchestrator every 10 seconds
- Request next package
- Build with emerge --buildpkg --oneshot <pkg>
- Upload to orchestrator’s STAGING directory
- Delete local binary
- Report success/failure
- Repeat
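The loop above, expressed as a Python sketch with the network and emerge steps injected as callables. build-worker.sh is a shell script; this only illustrates the control flow, and every parameter name here is hypothetical:

```python
import time

def worker_loop(fetch_job, build, upload, report,
                poll_interval=10, max_jobs=None):
    """Poll -> build -> upload -> report, forever (or for max_jobs packages)."""
    done = 0
    while max_jobs is None or done < max_jobs:
        pkg = fetch_job()              # ask orchestrator for the next package
        if pkg is None:
            time.sleep(poll_interval)  # queue empty; poll again in 10s
            continue
        ok = build(pkg)                # emerge --buildpkg --oneshot <pkg>
        if ok:
            upload(pkg)                # rsync to orchestrator STAGING,
                                       # then delete the local copy
        report(pkg, ok)                # success/failure back to orchestrator
        done += 1
```

Injecting the callables keeps the loop testable without a live orchestrator.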
Current Drones:
- drone-Izar-Host (10.42.0.203) - 16 cores, local network
- drone-Tarn (100.64.0.27.91) - 14 cores, via Tailscale
Safety Mechanisms
1. Atomic Updates (Staging/Production Split)
Problem: Partial builds reaching clients can brick systems.
Solution:
- Drones upload to STAGING (hidden from nginx)
- Admin manually runs finalize when 100% complete
- finalize atomically syncs staging → production
- Clients only ever see complete, verified builds
Workflow:
# 1. Start build
swarm-coordinator.py init-from-list packages.txt
# 2. Monitor progress
swarm-monitor.py
# 3. When complete (needed=0, delegated=0)
swarm-coordinator.py finalize
# 4. Packages now available to clients
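Step 3’s "complete" condition can be checked mechanically. A one-line sketch, assuming the needed/delegated state layout described earlier:

```python
def ready_to_finalize(state):
    # Safe to release only when nothing is queued (needed)
    # or in flight (delegated)
    return len(state["needed"]) == 0 and len(state["delegated"]) == 0
```

Gating finalize on this check prevents releasing a partially built package set to production.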
2. Dynamic Binhost Resolution
Problem: If primary orchestrator dies, clients have hardcoded broken URL.
Solution:
- Client asks Coordinator: GET http://10.42.0.199:8090/binhost
- Coordinator checks health of orchestrators
- Returns URL of active orchestrator:
  - orch-Izar-Host up: http://10.42.0.201/packages
  - orch-Izar-Host down: http://100.64.0.16.118/packages
3. Failover Behavior
Scenario: orch-Izar-Host crashes mid-build
Auto-Recovery:
- Coordinator detects orch-Izar-Host offline (health check)
- Routes new requests to orch-Tarn-Host
- Returns orch-Tarn-Host’s binhost URL to clients
- Drones automatically connect to orch-Tarn-Host
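Offline detection usually tolerates a few missed health checks before failing over, to avoid flapping on a single dropped request. A hedged sketch; the three-miss threshold is an assumption, not documented gateway behavior:

```python
def record_check(misses, name, ok, threshold=3):
    """Update the consecutive-miss count for one orchestrator.

    misses: {orchestrator_name: consecutive_failed_checks}
    Returns True while the orchestrator is still considered up.
    """
    misses[name] = 0 if ok else misses.get(name, 0) + 1
    return misses[name] < threshold
```

With the 30-second check interval, a threshold of 3 means failover triggers roughly 90 seconds after the primary stops responding.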
Data Considerations:
- Packages built on orch-Izar-Host are not automatically synced to orch-Tarn-Host
- Option A: Accept data loss, rebuild on orch-Tarn-Host (simpler)
- Option B: Implement periodic rsync orch-Izar-Host → orch-Tarn-Host (complex)
- Current Implementation: Option A (manual recovery)
Multi-Network Architecture
The swarm operates across two physical networks connected via Tailscale:
┌──────────────────────────────────────────┐
│ Milky Way (10.42.0.0/24) │
│ │
│ Capella-Outpost (Driver) 10.42.0.100 │
│ gateway-Altair 10.42.0.199 │
│ orch-Izar-Host (Primary) 10.42.0.201 │
│ drone-Izar-Host 10.42.0.203 │
│ drone-Tau-Host (LXC) 10.42.0.184 │
└──────────────────────────────────────────┘
│
│ Tailscale VPN
▼
┌──────────────────────────────────────────┐
│ Andromeda (192.168.20.0/24) │
│ │
│ Tarn-Host (Subnet Rtr) 192.168.20.100│
│ orch-Tarn-Host (Backup) 192.168.20.25 │
│ drone-Tarn 192.168.20.196│
│ drone-Meridian-Host (VM) 192.168.20.77 │
└──────────────────────────────────────────┘
Monitoring
Interactive Monitor (TUI)
cd ~/Development/gentoo-build-swarm
python3 scripts/swarm-monitor.py
Keybindings:
- b - Balance workload
- u - Unblock failed packages
- R - Reset swarm (WARNING: destructive)
CLI Status
# Coordinator status
curl http://10.42.0.199:8090/status | jq
# Orchestrator status
ssh [email protected] 'python3 /var/lib/build-swarm/swarm-coordinator.py status'
# Drone logs
ssh [email protected] 'tail -f /var/log/build-worker.log'
Known Limitations
1. No automatic sync between orchestrators
   - If primary builds packages and crashes, backup starts fresh
   - Workaround: Manual rsync between orchestrators
2. Profile consistency required
   - All drones must use the same Gentoo profile as the driver
   - Mismatched profiles cause build failures
3. Single active orchestrator
   - Only one orchestrator manages state at a time
   - Cannot parallelize across orchestrators
Future Enhancements
- Automatic orchestrator synchronization (rsync daemon)
- Multi-client job queuing
- Package versioning/snapshots
- Web dashboard for monitoring
- Prometheus metrics export