Building an Audiobook Factory on a Homelab GPU
I wanted to turn ebooks into audiobooks on my own hardware. Not through Audible. Not through a cloud service that charges per character. On my RTX 4070 Ti, in my living room, using open-source voice models.
ArgoVox v2 does exactly that. Upload an EPUB, MOBI, AZW3, DOCX, or plain text file. Pick a voice. Hit generate. Walk away. Come back to a finished audiobook with chapters.
Eight voices. Emotion-aware synthesis. LLM-based character identification. GPU-accelerated on the homelab. The whole pipeline runs local.
What ArgoVox Does
The pitch is simple. Text goes in, audiobooks come out. The engineering underneath is anything but simple.
Here's the full pipeline:
- Parse -- Extract raw text from whatever format you throw at it (EPUB, MOBI, AZW3, FB2, DOCX, RTF, HTML, plain text)
- Clean -- Strip table of contents, page numbers, headers, copyright blocks, and other book cruft
- Analyze -- Identify characters and speakers using an LLM
- Prepare -- Add emotion cues, rewrite for pacing, apply narration optimization
- Chunk -- Split into paragraph-aware segments sized for the TTS engine
- Synthesize -- Feed each chunk to the voice model
- Package -- Assemble into a finished M4B with chapter markers
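The seven stages above can be sketched as a list of composable steps. Everything here is illustrative stand-in code, not ArgoVox's actual internals; the stage names come from the list, the function bodies are placeholders.

```python
# Toy sketch of the seven-stage pipeline. Each stage takes a job dict
# and returns it enriched; the real implementations are far bigger.

def parse(job):      job["text"] = job.pop("raw"); return job          # format extraction
def clean(job):      job["text"] = job["text"].strip(); return job     # strip book cruft
def analyze(job):    job["characters"] = []; return job                # LLM speaker pass
def prepare(job):    return job                                        # emotion cues, rewriting
def chunk(job):      job["chunks"] = job["text"].split("\n\n"); return job
def synthesize(job): job["audio"] = [f"wav:{c[:10]}" for c in job["chunks"]]; return job
def package(job):    job["output"] = "book.m4b"; return job            # M4B assembly

PIPELINE = [parse, clean, analyze, prepare, chunk, synthesize, package]

def run(raw_text: str) -> dict:
    job = {"raw": raw_text}
    for stage in PIPELINE:
        job = stage(job)
    return job
```

The dict-threading shape makes it easy to skip stages: drop `prepare` from the list and you get the minimal-processing path.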
Each step is configurable. Want minimal processing? Skip the rewriting and emotion detection. Want the full treatment? Turn everything on and let the LLM do its work.
Orpheus TTS: The Voice Engine
The star of the show is Orpheus TTS. Open-source voice model. 8 distinct voices. And the key differentiator -- it understands emotion.
Most TTS engines produce flat, robotic narration. Orpheus does something different. It accepts inline emotion tags in the text. When a character is angry, the voice sounds angry. When there's a whisper, the voice drops. When there's laughter, you hear it in the tone.
The model runs as a GGUF on llama.cpp, backed by a FastAPI wrapper that exposes an OpenAI-compatible /v1/audio/speech endpoint. That means any tool that speaks OpenAI TTS can point at my local server and get Orpheus quality instead.
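Because the endpoint is OpenAI-compatible, the request shape is the familiar one. A minimal sketch, assuming the wrapper listens on localhost port 5005 and that the voice name, model field, and inline tag syntax look roughly like this (all assumptions, not confirmed ArgoVox config):

```python
# Build an OpenAI-style /v1/audio/speech request for the local Orpheus
# server. Port, model name, voice name, and tag syntax are assumptions.

ORPHEUS_URL = "http://localhost:5005/v1/audio/speech"  # assumed port

def build_speech_request(text: str, voice: str = "tara") -> dict:
    return {
        "model": "orpheus",       # model id is whatever the server exposes
        "voice": voice,
        "input": text,            # may carry inline emotion tags, e.g. <sigh>
        "response_format": "wav",
    }

# Actual call (requires the server to be running), e.g. with requests:
#   audio = requests.post(ORPHEUS_URL,
#                         json=build_speech_request("Hello <sigh> there.")).content
#   open("out.wav", "wb").write(audio)
```

Any OpenAI SDK works the same way: point `base_url` at the local server and call its speech method unchanged.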
Eight voices cover the range you need for narration:
- Male and female options
- Different ages and tonal qualities
- Each voice responds to the same emotion tags
The GPU acceleration matters here. On the RTX 4070 Ti with 12GB VRAM, synthesis runs significantly faster than real-time. A 10-minute chapter doesn't take 10 minutes to generate.
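"Faster than real-time" has a simple measure: the real-time factor, synthesis time divided by audio duration. The timings below are placeholders for illustration, not measured ArgoVox numbers.

```python
# RTF = time spent synthesizing / duration of the audio produced.
# RTF < 1.0 means faster than real-time.

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

# e.g. a 600 s (10-minute) chapter synthesized in 240 s:
rtf = real_time_factor(240, 600)   # 0.4 -> 2.5x faster than real time
```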
Character Identification
This is the feature that makes ArgoVox more than a text-to-speech wrapper. When you enable character analysis, ArgoVox sends the text through an LLM to identify who's speaking.
The current model is qwen3:8b running on Ollama. It reads the text and returns:
- Character names
- Approximate dialogue attribution
- Speaker context (protagonist, antagonist, narrator, supporting)
Right now, character identification is informational. The system detects characters and shows you who's in the book. Full multi-voice routing -- where each character gets their own distinct voice automatically -- is the next phase.
But even in its current state, the analysis is useful. You can see the character breakdown before generating, verify the LLM caught the right speakers, and plan your voice assignments. When multi-voice routing ships, the character map will drive the entire synthesis pipeline.
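A character-analysis call against a local Ollama server could be sketched like this. The `/api/chat` endpoint, `format`, and `stream` fields are Ollama's real interface; the prompt wording and the JSON schema it asks for are illustrative assumptions.

```python
# Build an Ollama /api/chat request asking qwen3:8b for a character map.
# Prompt text and response schema are assumptions for illustration.

OLLAMA_URL = "http://localhost:11434/api/chat"

PROMPT = (
    "Identify every character in the excerpt below. Return JSON: "
    '{"characters": [{"name": "...", "role": '
    '"protagonist|antagonist|narrator|supporting"}]}\n\n'
)

def build_analysis_request(excerpt: str) -> dict:
    return {
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": PROMPT + excerpt}],
        "format": "json",   # ask Ollama to constrain output to valid JSON
        "stream": False,
    }

# Actual call sketch (stdlib only):
#   req = urllib.request.Request(
#       OLLAMA_URL, json.dumps(build_analysis_request(text)).encode(),
#       {"Content-Type": "application/json"})
#   body = json.loads(urllib.request.urlopen(req).read())
#   characters = json.loads(body["message"]["content"])["characters"]
```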
Preprocessing: The Secret Weapon
Raw book text sounds terrible when read aloud. ArgoVox preprocessing fixes that.
Auto-Clean Narration Junk
One click strips:
- Table of contents blocks
- Page numbers and headers
- Copyright notices
- Publisher metadata
- Chapter title repetitions
- Formatting artifacts from ebook conversion
The cleaner runs multiple passes until the output stabilizes. Not one pass and hope for the best. It iterates until there's nothing left to clean.
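Iterate-until-stable is just a fixed-point loop. A minimal sketch with two illustrative rules (the real cleaner has many more):

```python
# Apply cleaning rules repeatedly until the text stops changing.
# The two rules here are toy examples, not ArgoVox's full rule set.
import re

RULES = [
    (re.compile(r"(?m)^[ \t]*\d+[ \t]*$\n?"), ""),   # bare page-number lines
    (re.compile(r"\n{3,}"), "\n\n"),                  # runs of blank lines
]

def clean(text: str, max_passes: int = 10) -> str:
    for _ in range(max_passes):
        before = text
        for pattern, repl in RULES:
            text = pattern.sub(repl, text)
        if text == before:   # fixed point: nothing left to clean
            break
    return text
```

The `max_passes` cap guards against rule pairs that oscillate instead of converging.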
Emotion Detection
Three tiers:
- None -- straight text, no emotion processing. Fast and predictable.
- Free tier -- pattern-based detection. Exclamation points, question marks, dialogue context, and keyword triggers map to emotion tags. Zero API cost.
- Local AI tier -- Ollama qwen3:8b analyzes each section and assigns nuanced emotion cues. More accurate, uses local GPU compute. Still free since it runs on my hardware.
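The free tier is roughly this kind of logic. Keyword and punctuation rules here are toy examples, and the tag names are assumptions, not Orpheus's confirmed tag vocabulary.

```python
# Toy free-tier emotion detector: cue words and punctuation map to
# inline tags. Rules and tag names are illustrative assumptions.

KEYWORD_TAGS = {
    "laughed": "<laugh>", "sighed": "<sigh>",
    "gasped": "<gasp>", "whispered": "<whisper>",
}

def tag_emotions(sentence: str) -> str:
    for cue, tag in KEYWORD_TAGS.items():
        if cue in sentence.lower():
            return f"{tag} {sentence}"
    if sentence.rstrip().endswith("!"):
        return f"<excited> {sentence}"
    return sentence
```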
Narrative Rewriting
Optional and off by default. When enabled, the rewriter optimizes text for spoken delivery. It smooths awkward transitions, adjusts sentence length for natural pacing, and removes constructions that read well on paper but sound weird spoken aloud.
The default philosophy is fidelity-first. The book should sound like the book. Rewriting is there for cases where raw text produces bad audio -- dense academic prose, heavily formatted technical content, or text with lots of inline citations.
Smart Chunking
TTS engines have limits on input length. You can't feed an entire novel into Orpheus in one shot. ArgoVox splits text into chunks intelligently.
Paragraph-aware chunking means splits happen at natural boundaries. Not mid-sentence. Not mid-word. At paragraph breaks, scene transitions, and chapter boundaries. Each chunk is sized to stay within the engine's optimal range while preserving narrative flow.
This sounds simple. It isn't. Books have wildly inconsistent formatting. Some paragraphs are one sentence. Some are two pages. Some chapters are 500 words. Some are 15,000. The chunker handles all of it and produces consistent, engine-friendly segments.
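The core idea can be sketched as a greedy packer that only breaks at paragraph boundaries. The character budget and heuristics below are illustrative, not ArgoVox's exact logic.

```python
# Greedily pack whole paragraphs into chunks up to a character budget,
# never splitting mid-paragraph. Budget value is a placeholder.

def chunk_paragraphs(text: str, budget: int = 800) -> list[str]:
    chunks, current, size = [], [], 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        if current and size + len(para) > budget:
            chunks.append("\n\n".join(current))   # flush at a natural boundary
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note the edge case this sketch punts on: a single paragraph longer than the budget still becomes one oversized chunk, which is where a sentence-level fallback would kick in.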
The Output: M4B with Chapters
The final output is an M4B file with chapter markers. M4B is the standard audiobook format -- it's what you'd get from Audible or Apple Books. Chapters are navigable. You can skip forward, jump to a specific chapter, and pick up where you left off.
MP3 output is also available for simpler use cases. The packaging pipeline uses ffmpeg for the final encoding pass, which means the output is high quality and properly tagged with metadata.
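Chapter markers can be injected with ffmpeg's FFMETADATA format, which is the standard route for M4B chapters. The metadata syntax and `-map_metadata` usage are real ffmpeg; the chapter titles, durations, and file names below are placeholders.

```python
# Generate an ffmpeg FFMETADATA chapter file from (title, duration) pairs.

def ffmetadata(chapters: list[tuple[str, float]]) -> str:
    """chapters: (title, duration_seconds) pairs in playback order."""
    lines, start = [";FFMETADATA1"], 0
    for title, seconds in chapters:
        end = start + int(seconds * 1000)
        lines += ["[CHAPTER]", "TIMEBASE=1/1000",
                  f"START={start}", f"END={end}", f"title={title}"]
        start = end   # chapters are contiguous
    return "\n".join(lines) + "\n"

# with open("chapters.txt", "w") as f:
#     f.write(ffmetadata([("Chapter 1", 612.5), ("Chapter 2", 540.0)]))
# then:
#   ffmpeg -i audio.m4a -i chapters.txt -map_metadata 1 -c copy book.m4b
```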
The Benchmark System
I built a benchmark mode for comparing engine quality and preprocessing impact.
Pick a sample length: 1 minute, 2 minutes, 3 minutes, or 5 minutes. ArgoVox extracts one shared excerpt, creates a frozen "control" text and a frozen "processed" text, then synthesizes both through each selected engine.
The result is a direct A/B comparison. Same text, same duration, different processing. You hear exactly what emotion detection adds. You hear exactly what narrative rewriting changes. The benchmark trims output to the exact requested length, so comparisons are apples-to-apples.
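The trim itself is a one-liner with ffmpeg's `-t` output-duration flag. File names here are placeholders; `-t` is the real flag.

```python
# Build an ffmpeg command that trims a clip to an exact duration.
# Re-encoding (no -c copy) keeps the cut sample-accurate.

def trim_cmd(src: str, dst: str, seconds: float) -> list[str]:
    return ["ffmpeg", "-y", "-i", src, "-t", str(seconds), dst]

# subprocess.run(trim_cmd("processed_full.wav", "processed_60s.wav", 60),
#                check=True)
```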
Benchmark modes available:
- Quick -- one engine, one comparison
- Kokoro Matrix -- multiple preprocessing profiles through Kokoro
- Orpheus Matrix -- multiple preprocessing profiles through Orpheus
- Full Matrix -- every engine, every profile
Matrix benchmarks with Minimal, Auto, and Local AI profiles reveal exactly where the preprocessing effort pays off. Orpheus shows the biggest improvement from emotion tags. Edge TTS shows the smallest. Good to know before committing to a full-book render.
Three TTS Engines
ArgoVox supports three engines. Each has a different profile.
Orpheus (Local GPU)
The premium option. 8 voices, emotion-aware, best quality. Runs on llama.cpp with the GGUF model. Requires GPU. This is what I use for anything I'm going to actually listen to.
Edge TTS (Cloud)
Microsoft's cloud TTS service. 33 voices, reliable, zero cost. No GPU needed. Quality is good but flat -- it doesn't respond to emotion tags the way Orpheus does. Useful as a fast preview or for cases where GPU isn't available.
Kokoro (Local GPU)
Kokoro was the primary engine in ArgoVox v1. 54 voices, GPU-accelerated. Currently not installed in the v2 environment, but the integration exists. When re-enabled, it'll provide another local option with a different voice character than Orpheus.
The Web UI
ArgoVox runs as a FastAPI service on port 8090 with a web interface.
The UI shows:
- File upload for ebook formats
- Voice selection with engine indicator
- Preprocessing preset picker (Minimal, Auto, Local AI, Custom)
- Character analysis toggle
- Job queue with progress tracking
- Library of completed audiobooks
- Settings display showing what was used for each generation
The library is a nice touch. Every finished audiobook shows which voice was used, which preprocessing was applied, and which engine generated it. When I'm comparing results across runs, I can see exactly what changed between versions.
The Distributed Vision
ArgoVox v2 is designed as a standalone audiobook workstation. But the architecture is built for something bigger.
The plan: ArgoVox nodes become processing workers that register with ArgoBox. ArgoBox becomes the control plane. You upload a book through argobox.com, ArgoBox routes the job to the best available GPU node, and the finished audiobook appears in your library.
Each node advertises its capabilities:
- Which engines are installed (Orpheus, Kokoro, Edge)
- Available GPU model and VRAM
- Current queue depth
- Supported output formats
- Feature flags (multi-voice, chaptered output, etc.)
ArgoBox routes jobs based on capability matching, queue depth, user access policy, and node health. Want Orpheus quality? Route to a GPU node. Just need a quick preview? Route to an Edge TTS node.
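Capability matching plus queue depth reduces to a filter-then-score step. A toy router under assumed node fields (names, weights, and the node list are all hypothetical):

```python
# Toy capability-matching router for the distributed vision.
# Node records and routing policy are illustrative assumptions.

NODES = [
    {"name": "callisto-odin", "engines": {"orpheus", "edge"},
     "vram_gb": 12, "queue": 2, "healthy": True},
    {"name": "cloud-relay", "engines": {"edge"},
     "vram_gb": 0, "queue": 0, "healthy": True},
]

def route(job: dict, nodes=NODES):
    """Pick the healthy node that has the required engine, shortest queue wins."""
    candidates = [n for n in nodes
                  if n["healthy"] and job["engine"] in n["engines"]]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n["queue"])

# route({"engine": "orpheus"}) -> callisto-odin (only node with Orpheus)
# route({"engine": "edge"})    -> cloud-relay   (shorter queue)
```

The real control plane would fold in user access policy and node health history, but the filter-then-score shape stays the same.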
Nodes can join pools -- Kronos, Jove, or custom pools. Admins control which users can access which pools. A friend gets access to one GPU node. The studio team gets priority on the premium hardware.
That's the vision. Right now, it's one node on one box with one GPU. But the architecture is ready for the fleet.
Hardware
ArgoVox v2 runs on callisto-odin:
- RTX 4070 Ti (12GB VRAM)
- Ollama for LLM inference (qwen3:8b)
- llama.cpp for Orpheus model serving
- Python 3.10, FastAPI, SQLite for job management
The GPU is the bottleneck and the enabler. Without it, you're limited to Edge TTS (cloud) or very slow CPU synthesis. With it, Orpheus runs faster than real-time and the whole pipeline becomes practical for full-length books.
What's Next
Immediate priorities:
- Multi-voice routing -- each detected character speaks in a different voice automatically
- Reliable Orpheus packaging -- smaller MP3 files, fewer encoding artifacts
- Real M4B chapter markers -- chapter-aware generation from the start, not post-processing
- Persistent project state -- save and resume generation jobs
After that:
- ArgoBox node registration -- connect to the control plane
- Remote job submission -- upload from argobox.com, process on the homelab
- Cost-aware routing -- free tier uses Edge, premium tier uses Orpheus
Drop a book in. Get an audiobook out. On my hardware, with my voices, under my control. That's the whole point. And it works.