
Build Swarm Architecture

Production architecture for the distributed Gentoo package build system with fault tolerance and atomic releases

January 28, 2026

Gentoo Build Swarm - Production Architecture

Version: 0.4.0 · Status: Production Ready

Overview

A distributed, fault-tolerant package build system for Gentoo Linux with atomic releases and automatic failover.

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                    CLIENT (Driver)                          │
│                   10.42.0.100                               │
│                                                             │
│  $ emerge -puDN @world | parse                             │
│  $ POST http://10.42.0.199:8090/build                      │
│  $ GET http://10.42.0.199:8090/binhost → get URL           │
│  $ emerge -uDN @world (uses returned binhost)              │
└────────────────────────┬────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│              COORDINATOR/GATEWAY (Altair-Link)              │
│              10.42.0.199:8090 (Always On)                   │
│                                                             │
│  Role: Stateless Router                                    │
│  • Health-check orchestrators (30s interval)               │
│  • Route /build requests to active orchestrator            │
│  • Return dynamic binhost URL                              │
│  • NO package storage (lightweight)                        │
└────────────────────────┬────────────────────────────────────┘

           ┌─────────────┴─────────────┐
           ▼                           ▼
┌──────────────────────┐    ┌──────────────────────┐
│ orch-Izar-Host       │    │ orch-Tarn-Host       │
│ 10.42.0.201 (Primary)│    │ 100.64.0.16.118      │
│ Proxmox Izar-Host    │    │ Proxmox Tarn-Host    │
│ LXC 121              │    │ LXC 102              │
│                      │    │ (Backup/Failover)    │
│ swarm-coordinator.py │    │ swarm-coordinator.py │
│                      │    │                      │
│ State Tracking:      │    │ Same code, passive   │
│ • needed[]           │    │ until Izar-Host fails│
│ • delegated{}        │    │                      │
│ • received[]         │    │                      │
│ • failed{}           │    │                      │
│                      │    │                      │
│ Storage (2-tier):    │    │ Storage (2-tier):    │
│ STAGING:             │    │ STAGING:             │
│ /var/cache/          │    │ /var/cache/          │
│   binpkgs-staging/   │    │   binpkgs-staging/   │
│   (builds land here) │    │   (builds land here) │
│                      │    │                      │
│ PRODUCTION:          │    │ PRODUCTION:          │
│ /var/cache/binpkgs/  │    │ /var/cache/binpkgs/  │
│   (nginx serves this)│    │   (nginx serves this)│
│   http://.../packages│    │   http://.../packages│
└──────────┬───────────┘    └──────────────────────┘

           │ Delegates Work

         ┌─┴──────────────────┬────────────────────┐
         ▼                    ▼                    ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ DRONE           │  │ DRONE           │  │ DRONE           │
│ drone-Izar-Host │  │ drone-Tarn-Host │  │ (future)        │
│ 10.42.0.203     │  │ 100.64.0.27.91  │  │                 │
│ 16 cores        │  │ 14 cores        │  │                 │
└─────────────────┘  └─────────────────┘  └─────────────────┘

All drones:
• Poll orchestrator for work
• emerge --buildpkg --oneshot <pkg>
• rsync to orchestrator STAGING
• Delete local copy

Package Flow (Build → Release)

┌─────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   DRONE     │───▶│     STAGING      │───▶│   PRODUCTION    │
│             │    │                  │    │                 │
│ emerge      │    │ /var/cache/      │    │ /var/cache/     │
│ --buildpkg  │    │ binpkgs-staging/ │    │ binpkgs/        │
│             │    │                  │    │                 │
│ rsync ──────│───▶│ validation       │───▶│ nginx serves    │
│             │    │ (size check)     │    │ Packages index  │
└─────────────┘    └──────────────────┘    └─────────────────┘


                   Orchestrator validates:
                   1. File exists (GPKG flat format)
                   2. Size > 1KB (not junk)
                   3. Move to production
                   4. Mark package as received
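
The four validation steps above can be sketched in Python (a minimal illustration; the function name, the `MIN_SIZE` constant, and the keyword arguments are assumptions, not the actual swarm-coordinator.py code):

```python
import shutil
from pathlib import Path

STAGING = Path("/var/cache/binpkgs-staging")
PRODUCTION = Path("/var/cache/binpkgs")
MIN_SIZE = 1024  # packages at or below 1 KB are treated as junk

def validate_and_release(rel_path, staging=STAGING, production=PRODUCTION):
    """Validate a staged GPKG file and move it to production.

    rel_path is e.g. "kde-frameworks/kconfig-6.22.0.gpkg.tar".
    Returns True if the package was released, False otherwise.
    """
    src = staging / rel_path
    # 1. File exists (GPKG flat format)
    if not src.is_file():
        return False
    # 2. Size > 1KB (not junk)
    if src.stat().st_size <= MIN_SIZE:
        return False
    # 3. Move to production (a rename when both dirs share a filesystem)
    dst = production / rel_path
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    # 4. Caller then marks the package as received in state.json
    return True
```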

Binary Format: Modern Portage uses flat GPKG files:

  • Path: /var/cache/binpkgs/{category}/{name}-{version}.gpkg.tar
  • Example: /var/cache/binpkgs/kde-frameworks/kconfig-6.22.0.gpkg.tar

Release Logic (v0.4.0+):

  • On successful validation, binary auto-moves from staging → production
  • shutil.move() is an atomic rename when staging and production share a filesystem, so no partial transfers reach production
  • Packages index regenerated by nginx or emaint binhost --fix

Component Details

1. Coordinator (Gateway) - 10.42.0.199

Hardware: Ubuntu LXC on Altair-Link
Software: swarm-gateway.py
Port: 8090

Responsibilities:

  • Health-check orchestrators every 30 seconds
  • Route build requests to active orchestrator
  • Provide dynamic binhost URL resolution
  • NO state tracking (orchestrator handles this)
  • NO package storage (serves as pointer only)

Endpoints:

  • GET /status - Gateway health and orchestrator status
  • GET /binhost - Returns URL of active orchestrator’s binhost
  • POST /build - Submit package list (routes to active orch)
  • GET /swarm - Fetch swarm status from active orch

Config: /opt/swarm-gateway/config.json

{
  "orchestrators": [
    {"name": "orch-Izar-Host", "ip": "10.42.0.201", "priority": 100},
    {"name": "orch-Tarn-Host", "ip": "100.64.0.16.118", "priority": 200}
  ],
  "check_interval": 30
}
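
Given that config, routing reduces to "lowest priority value among healthy orchestrators". A minimal sketch (the `is_healthy` probe is injected here for illustration; the real gateway performs an HTTP health check every `check_interval` seconds):

```python
def pick_active_orchestrator(orchestrators, is_healthy):
    """Return the healthy orchestrator with the lowest priority value,
    or None if none respond. `is_healthy` is an injected callable so the
    probe (an HTTP GET in practice) can be swapped out for testing."""
    for orch in sorted(orchestrators, key=lambda o: o["priority"]):
        if is_healthy(orch["ip"]):
            return orch
    return None

def binhost_url(orchestrators, is_healthy):
    """Resolve the dynamic binhost URL that GET /binhost would return."""
    orch = pick_active_orchestrator(orchestrators, is_healthy)
    return f"http://{orch['ip']}/packages" if orch else None
```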

2. Orchestrators - 10.42.0.201 (Primary) & 100.64.0.16.118 (Backup)

Hardware: Gentoo LXC containers
Software: swarm-coordinator.py
State File: /var/lib/build-swarm/state.json

Responsibilities:

  • Accept package lists via init-from-list command
  • Maintain build queue state
  • Delegate work to drones (load-balanced by cores)
  • Track package status (needed/delegated/received/failed)
  • Collect built packages in STAGING directory
  • Release to PRODUCTION on finalize command
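
The four-bucket state model from the diagram (needed/delegated/received/failed) can be sketched like this; the field names follow the diagram, but the methods are illustrative, not the actual swarm-coordinator.py API:

```python
import json

class SwarmState:
    """Tracks each package through needed -> delegated -> received,
    with a failed bucket recording the error per package."""

    def __init__(self):
        self.needed = []     # queued, not yet assigned
        self.delegated = {}  # pkg -> drone currently building it
        self.received = []   # validated and staged
        self.failed = {}     # pkg -> failure reason

    def delegate(self, drone):
        """Hand the next queued package to a drone."""
        if not self.needed:
            return None
        pkg = self.needed.pop(0)
        self.delegated[pkg] = drone
        return pkg

    def complete(self, pkg):
        """Called once the binary passes validation in staging."""
        self.delegated.pop(pkg, None)
        self.received.append(pkg)

    def fail(self, pkg, reason):
        self.delegated.pop(pkg, None)
        self.failed[pkg] = reason

    def unblock(self):
        """Requeue failed packages (the `unblock` command)."""
        self.needed.extend(self.failed)
        self.failed.clear()

    def save(self, path):
        """Persist to the state file (e.g. /var/lib/build-swarm/state.json)."""
        with open(path, "w") as f:
            json.dump(vars(self), f)
```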

Two-Tier Storage Model:

STAGING (/var/cache/binpkgs-staging/):

  • Drones upload here
  • NOT served by nginx
  • Work-in-progress, potentially incomplete builds
  • Safe zone for active builds

PRODUCTION (/var/cache/binpkgs/):

  • Nginx serves from here
  • Only receives packages via finalize command
  • Atomic release (all or nothing)
  • Source of truth for clients

Commands:

# Initialize build from package list
swarm-coordinator.py init-from-list packages.txt

# Check status
swarm-coordinator.py status

# Redistribute work
swarm-coordinator.py balance

# ATOMIC RELEASE: Move staging → production
swarm-coordinator.py finalize

# Retry failed packages
swarm-coordinator.py unblock

3. Drones - Multiple Builders

Software: build-worker.sh

Behavior:

  1. Poll orchestrator every 10 seconds
  2. Request next package
  3. Build with emerge --buildpkg --oneshot <pkg>
  4. Upload to orchestrator’s STAGING directory
  5. Delete local binary
  6. Report success/failure
  7. Repeat
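
The seven steps above amount to a simple poll-build-upload loop. Sketched here in Python for illustration (the real build-worker.sh is a shell script), with the network and emerge calls stubbed out as injected callables:

```python
import time

def worker_loop(request_next, build, upload, report,
                poll_interval=10, max_iters=None):
    """Drone loop: poll for a package, build it, upload the binary to
    staging, report the result, repeat. `max_iters` bounds the loop for
    testing; a real worker runs until stopped."""
    iters = 0
    while max_iters is None or iters < max_iters:
        iters += 1
        pkg = request_next()           # 1-2. poll orchestrator for work
        if pkg is None:
            time.sleep(poll_interval)  # nothing queued; back off
            continue
        try:
            path = build(pkg)          # 3. emerge --buildpkg --oneshot <pkg>
            upload(pkg, path)          # 4-5. rsync to STAGING, delete local copy
            report(pkg, ok=True)       # 6. report success
        except Exception as e:
            report(pkg, ok=False, error=str(e))
```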

Current Drones:

  • drone-Izar-Host (10.42.0.203) - 16 cores, local network
  • drone-Tarn (100.64.0.27.91) - 14 cores, via Tailscale

Safety Mechanisms

1. Atomic Updates (Staging/Production Split)

Problem: Partial builds reaching clients can brick systems.

Solution:

  • Drones upload to STAGING (hidden from nginx)
  • Admin manually runs finalize when 100% complete
  • finalize atomically syncs staging → production
  • Clients only ever see complete, verified builds

Workflow:

# 1. Start build
swarm-coordinator.py init-from-list packages.txt

# 2. Monitor progress
swarm-monitor.py

# 3. When complete (needed=0, delegated=0)
swarm-coordinator.py finalize

# 4. Packages now available to clients

2. Dynamic Binhost Resolution

Problem: If primary orchestrator dies, clients have hardcoded broken URL.

Solution:

  • Client asks Coordinator: GET http://10.42.0.199:8090/binhost
  • Coordinator checks health of orchestrators
  • Returns URL of active orchestrator:
    • orch-Izar-Host up: http://10.42.0.201/packages
    • orch-Izar-Host down: http://100.64.0.16.118/packages
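
Client-side, resolution is one HTTP call before emerge runs. A standard-library sketch; the gateway address comes from the diagram, but the exact JSON shape of the /binhost response is an assumption:

```python
import json
from urllib.request import urlopen

GATEWAY = "http://10.42.0.199:8090"

def parse_binhost(payload):
    """Extract the binhost URL from the gateway's JSON response,
    assumed here to look like {"binhost": "http://.../packages"}."""
    return json.loads(payload)["binhost"]

def resolve_binhost(gateway=GATEWAY, timeout=5):
    """Ask the coordinator which orchestrator's binhost is live."""
    with urlopen(f"{gateway}/binhost", timeout=timeout) as resp:
        return parse_binhost(resp.read())
```

The returned URL would then feed emerge, e.g. `PORTAGE_BINHOST=$(resolve-binhost) emerge -uDN @world`.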

3. Failover Behavior

Scenario: orch-Izar-Host crashes mid-build

Auto-Recovery:

  1. Coordinator detects orch-Izar-Host offline (health check)
  2. Routes new requests to orch-Tarn-Host
  3. Returns orch-Tarn-Host’s binhost URL to clients
  4. Drones automatically connect to orch-Tarn-Host

Data Considerations:

  • Packages built on orch-Izar-Host are not automatically synced to orch-Tarn-Host
  • Option A: Accept data loss, rebuild on orch-Tarn-Host (simpler)
  • Option B: Implement periodic rsync orch-Izar-Host→orch-Tarn-Host (complex)
  • Current Implementation: Option A (manual recovery)

Multi-Network Architecture

The swarm operates across two physical networks connected via Tailscale:

┌──────────────────────────────────────────┐
│        Milky Way (10.42.0.0/24)          │
│                                          │
│  Capella-Outpost (Driver)    10.42.0.100 │
│  gateway-Altair              10.42.0.199 │
│  orch-Izar-Host (Primary)    10.42.0.201 │
│  drone-Izar-Host             10.42.0.203 │
│  drone-Tau-Host (LXC)        10.42.0.184 │
└──────────────────────────────────────────┘

               │ Tailscale VPN

┌──────────────────────────────────────────┐
│       Andromeda (192.168.20.0/24)        │
│                                          │
│  Tarn-Host (Subnet Rtr)   192.168.20.100 │
│  orch-Tarn-Host (Backup)   192.168.20.25 │
│  drone-Tarn               192.168.20.196 │
│  drone-Meridian-Host (VM)  192.168.20.77 │
└──────────────────────────────────────────┘

Monitoring

Interactive Monitor (TUI)

cd ~/Development/gentoo-build-swarm
python3 scripts/swarm-monitor.py

Keybindings:

  • b - Balance workload
  • u - Unblock failed packages
  • R - Reset swarm (WARNING: destructive)

CLI Status

# Coordinator status
curl http://10.42.0.199:8090/status | jq

# Orchestrator status
ssh [email protected] 'python3 /var/lib/build-swarm/swarm-coordinator.py status'

# Drone logs
ssh [email protected] 'tail -f /var/log/build-worker.log'

Known Limitations

  1. No automatic sync between orchestrators

    • If primary builds packages and crashes, backup starts fresh
    • Workaround: Manual rsync between orchestrators
  2. Profile consistency required

    • All drones must use same Gentoo profile as driver
    • Mismatched profiles cause build failures
  3. Single active orchestrator

    • Only one orchestrator manages state at a time
    • Cannot parallelize across orchestrators

Future Enhancements

  1. Automatic orchestrator synchronization (rsync daemon)
  2. Multi-client job queuing
  3. Package versioning/snapshots
  4. Web dashboard for monitoring
  5. Prometheus metrics export
Tags: build-swarm, architecture, distributed-systems, gentoo