Building an Audiobook Factory on a Homelab GPU
I wanted to turn ebooks into audiobooks on my own hardware. Not through Audible. Not through a cloud service that charges per character. On my RTX 4070 Ti, in my living room, using open-source voice models.
ArgoVox v2 does exactly that. Upload an EPUB, MOBI, AZW3, DOCX, or plain text file. Pick a voice. Hit generate. Walk away. Come back to a finished audiobook with chapters.
Eight voices. Emotion-aware synthesis. LLM-based character identification. GPU-accelerated on the homelab. The whole pipeline runs local.
What ArgoVox Does
The pitch is simple. Text goes in, audiobooks come out. The engineering underneath is anything but simple.
Here's the full pipeline:
- Parse -- Extract raw text from whatever format you throw at it (EPUB, MOBI, AZW3, FB2, DOCX, RTF, HTML, plain text)
- Clean -- Strip table of contents, page numbers, headers, copyright blocks, and other book cruft
- Analyze -- Identify characters and speakers using an LLM
- Prepare -- Add emotion cues, rewrite for pacing, apply narration optimization
- Chunk -- Split into paragraph-aware segments sized for the TTS engine
- Synthesize -- Feed each chunk to the voice model
- Package -- Assemble into a finished M4B with chapter markers
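The seven stages above can be sketched as a list of composable steps. Everything here is illustrative stand-in code, not ArgoVox's actual internals; the stage names come from the list, the function bodies are placeholders.

```python
# Toy sketch of the seven-stage pipeline. Each stage takes a job dict
# and returns it enriched; the real implementations are far bigger.

def parse(job):      job["text"] = job.pop("raw"); return job          # format extraction
def clean(job):      job["text"] = job["text"].strip(); return job     # strip book cruft
def analyze(job):    job["characters"] = []; return job                # LLM speaker pass
def prepare(job):    return job                                        # emotion cues, rewriting
def chunk(job):      job["chunks"] = job["text"].split("\n\n"); return job
def synthesize(job): job["audio"] = [f"wav:{c[:10]}" for c in job["chunks"]]; return job
def package(job):    job["output"] = "book.m4b"; return job            # M4B assembly

PIPELINE = [parse, clean, analyze, prepare, chunk, synthesize, package]

def run(raw_text: str) -> dict:
    job = {"raw": raw_text}
    for stage in PIPELINE:
        job = stage(job)
    return job
```

The dict-threading shape makes it easy to skip stages: drop `prepare` from the list and you get the minimal-processing path.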
Each step is configurable. Want minimal processing? Skip the rewriting and emotion detection. Want the full treatment? Turn everything on and let the LLM do its work.
Orpheus TTS: The Voice Engine
The star of the show is Orpheus TTS. Open-source voice model. 8 distinct voices. And the key differentiator -- it understands emotion.
Most TTS engines produce flat, robotic narration. Orpheus does something different. It accepts inline emotion tags in the text. When a character is angry, the voice sounds angry. When there's a whisper, the voice drops. When there's laughter, you hear it in the tone.
The model runs as a GGUF on llama.cpp, backed by a FastAPI wrapper that exposes an OpenAI-compatible /v1/audio/speech endpoint. That means any tool that speaks OpenAI TTS can point at my local server and get Orpheus quality instead.
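Because the endpoint is OpenAI-compatible, the request shape is the familiar one. A minimal sketch, assuming the wrapper listens on localhost port 5005 and that the voice name, model field, and inline tag syntax look roughly like this (all assumptions, not confirmed ArgoVox config):

```python
# Build an OpenAI-style /v1/audio/speech request for the local Orpheus
# server. Port, model name, voice name, and tag syntax are assumptions.

ORPHEUS_URL = "http://localhost:5005/v1/audio/speech"  # assumed port

def build_speech_request(text: str, voice: str = "tara") -> dict:
    return {
        "model": "orpheus",       # model id is whatever the server exposes
        "voice": voice,
        "input": text,            # may carry inline emotion tags, e.g. <sigh>
        "response_format": "wav",
    }

# Actual call (requires the server to be running), e.g. with requests:
#   audio = requests.post(ORPHEUS_URL,
#                         json=build_speech_request("Hello <sigh> there.")).content
#   open("out.wav", "wb").write(audio)
```

Any OpenAI SDK works the same way: point `base_url` at the local server and call its speech method unchanged.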
Eight voices cover the range you need for narration:
- Male and female options
- Different ages and tonal qualities
- Each voice responds to the same emotion tags
The GPU acceleration matters here. On the RTX 4070 Ti with 12GB VRAM, synthesis runs significantly faster than real-time. A 10-minute chapter doesn't take 10 minutes to generate.
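"Faster than real-time" has a simple measure: the real-time factor, synthesis time divided by audio duration. The timings below are placeholders for illustration, not measured ArgoVox numbers.

```python
# RTF = time spent synthesizing / duration of the audio produced.
# RTF < 1.0 means faster than real-time.

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

# e.g. a 600 s (10-minute) chapter synthesized in 240 s:
rtf = real_time_factor(240, 600)   # 0.4 -> 2.5x faster than real time
```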
Character Identification
This is the feature that makes ArgoVox more than a text-to-speech wrapper. When you enable character analysis, ArgoVox sends the text through an LLM to identify who's speaking.
The current model is qwen3:8b running on Ollama. It reads the text and returns:
- Character names
- Approximate dialogue attribution
- Speaker context (protagonist, antagonist, narrator, supporting)
Right now, character identification is informational. The system detects characters and shows you who's in the book. Full multi-voice routing -- where each character gets their own distinct voice automatically -- is the next phase.
But even in its current state, the analysis is useful. You can see the character breakdown before generating, verify the LLM caught the right speakers, and plan your voice assignments. When multi-voice routing ships, the character map will drive the entire synthesis pipeline.
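A character-analysis call against a local Ollama server could be sketched like this. The `/api/chat` endpoint, `format`, and `stream` fields are Ollama's real interface; the prompt wording and the JSON schema it asks for are illustrative assumptions.

```python
# Build an Ollama /api/chat request asking qwen3:8b for a character map.
# Prompt text and response schema are assumptions for illustration.

OLLAMA_URL = "http://localhost:11434/api/chat"

PROMPT = (
    "Identify every character in the excerpt below. Return JSON: "
    '{"characters": [{"name": "...", "role": '
    '"protagonist|antagonist|narrator|supporting"}]}\n\n'
)

def build_analysis_request(excerpt: str) -> dict:
    return {
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": PROMPT + excerpt}],
        "format": "json",   # ask Ollama to constrain output to valid JSON
        "stream": False,
    }

# Actual call sketch (stdlib only):
#   req = urllib.request.Request(
#       OLLAMA_URL, json.dumps(build_analysis_request(text)).encode(),
#       {"Content-Type": "application/json"})
#   body = json.loads(urllib.request.urlopen(req).read())
#   characters = json.loads(body["message"]["content"])["characters"]
```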
Preprocessing: The Secret Weapon
Raw book text sounds terrible when read aloud. ArgoVox preprocessing fixes that.
Auto-Clean Narration Junk
One click strips:
- Table of contents blocks
- Page numbers and headers
- Copyright notices
- Publisher metadata
- Chapter title repetitions
- Formatting artifacts from ebook conversion
The cleaner runs multiple passes until the output stabilizes. Not one pass and hope for the best. It iterates until there's nothing left to clean.
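Iterate-until-stable is just a fixed-point loop. A minimal sketch with two illustrative rules (the real cleaner has many more):

```python
# Apply cleaning rules repeatedly until the text stops changing.
# The two rules here are toy examples, not ArgoVox's full rule set.
import re

RULES = [
    (re.compile(r"(?m)^[ \t]*\d+[ \t]*$\n?"), ""),   # bare page-number lines
    (re.compile(r"\n{3,}"), "\n\n"),                  # runs of blank lines
]

def clean(text: str, max_passes: int = 10) -> str:
    for _ in range(max_passes):
        before = text
        for pattern, repl in RULES:
            text = pattern.sub(repl, text)
        if text == before:   # fixed point: nothing left to clean
            break
    return text
```

The `max_passes` cap guards against rule pairs that oscillate instead of converging.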
Emotion Detection
Three tiers:
- None -- straight text, no emotion processing. Fast and predictable.
- Free tier -- pattern-based detection. Exclamation points, question marks, dialogue context, and keyword triggers map to emotion tags. Zero API cost.
- Local AI tier -- Ollama qwen3:8b analyzes each section and assigns nuanced emotion cues. More accurate, uses local GPU compute. Still free since it runs on my hardware.
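The free tier is roughly this kind of logic. Keyword and punctuation rules here are toy examples, and the tag names are assumptions, not Orpheus's confirmed tag vocabulary.

```python
# Toy free-tier emotion detector: cue words and punctuation map to
# inline tags. Rules and tag names are illustrative assumptions.

KEYWORD_TAGS = {
    "laughed": "<laugh>", "sighed": "<sigh>",
    "gasped": "<gasp>", "whispered": "<whisper>",
}

def tag_emotions(sentence: str) -> str:
    for cue, tag in KEYWORD_TAGS.items():
        if cue in sentence.lower():
            return f"{tag} {sentence}"
    if sentence.rstrip().endswith("!"):
        return f"<excited> {sentence}"
    return sentence
```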
Narrative Rewriting
Optional and off by default. When enabled, the rewriter optimizes text for spoken delivery. It smooths awkward transitions, adjusts sentence length for natural pacing, and removes constructions that read well on paper but sound weird spoken aloud.
The default philosophy is fidelity-first. The book should sound like the book. Rewriting is there for cases where raw text produces bad audio -- dense academic prose, heavily formatted technical content, or text with lots of inline citations.
Smart Chunking
TTS engines have limits on input length. You can't feed an entire novel into Orpheus in one shot. ArgoVox splits text into chunks intelligently.
Paragraph-aware chunking means splits happen at natural boundaries. Not mid-sentence. Not mid-word. At paragraph breaks, scene transitions, and chapter boundaries. Each chunk is sized to stay within the engine's optimal range while preserving narrative flow.
This sounds simple. It isn't. Books have wildly inconsistent formatting. Some paragraphs are one sentence. Some are two pages. Some chapters are 500 words. Some are 15,000. The chunker handles all of it and produces consistent, engine-friendly segments.
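The core idea can be sketched as a greedy packer that only breaks at paragraph boundaries. The character budget and heuristics below are illustrative, not ArgoVox's exact logic.

```python
# Greedily pack whole paragraphs into chunks up to a character budget,
# never splitting mid-paragraph. Budget value is a placeholder.

def chunk_paragraphs(text: str, budget: int = 800) -> list[str]:
    chunks, current, size = [], [], 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        if current and size + len(para) > budget:
            chunks.append("\n\n".join(current))   # flush at a natural boundary
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note the edge case this sketch punts on: a single paragraph longer than the budget still becomes one oversized chunk, which is where a sentence-level fallback would kick in.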
The Output: M4B with Chapters
The final output is an M4B file with chapter markers. M4B is the standard audiobook format -- it's what you'd get from Audible or Apple Books. Chapters are navigable. You can skip forward, jump to a specific chapter, and pick up where you left off.
MP3 output is also available for simpler use cases. The packaging pipeline uses ffmpeg for the final encoding pass, which means the output is high quality and properly tagged with metadata.
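Chapter markers can be injected with ffmpeg's FFMETADATA format, which is the standard route for M4B chapters. The metadata syntax and `-map_metadata` usage are real ffmpeg; the chapter titles, durations, and file names below are placeholders.

```python
# Generate an ffmpeg FFMETADATA chapter file from (title, duration) pairs.

def ffmetadata(chapters: list[tuple[str, float]]) -> str:
    """chapters: (title, duration_seconds) pairs in playback order."""
    lines, start = [";FFMETADATA1"], 0
    for title, seconds in chapters:
        end = start + int(seconds * 1000)
        lines += ["[CHAPTER]", "TIMEBASE=1/1000",
                  f"START={start}", f"END={end}", f"title={title}"]
        start = end   # chapters are contiguous
    return "\n".join(lines) + "\n"

# with open("chapters.txt", "w") as f:
#     f.write(ffmetadata([("Chapter 1", 612.5), ("Chapter 2", 540.0)]))
# then:
#   ffmpeg -i audio.m4a -i chapters.txt -map_metadata 1 -c copy book.m4b
```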
The Benchmark System
I built a benchmark mode for comparing engine quality and preprocessing impact.
Pick a sample length: 1 minute, 2 minutes, 3 minutes, or 5 minutes. ArgoVox extracts one shared excerpt, creates a frozen "control" text and a frozen "processed" text, then synthesizes both through each selected engine.
The result is a direct A/B comparison. Same text, same duration, different processing. You hear exactly what emotion detection adds. You hear exactly what narrative rewriting changes. The benchmark trims output to the exact requested length, so comparisons are apples-to-apples.
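The trim itself is a one-liner with ffmpeg's `-t` output-duration flag. File names here are placeholders; `-t` is the real flag.

```python
# Build an ffmpeg command that trims a clip to an exact duration.
# Re-encoding (no -c copy) keeps the cut sample-accurate.

def trim_cmd(src: str, dst: str, seconds: float) -> list[str]:
    return ["ffmpeg", "-y", "-i", src, "-t", str(seconds), dst]

# subprocess.run(trim_cmd("processed_full.wav", "processed_60s.wav", 60),
#                check=True)
```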
Benchmark modes available:
- Quick -- one engine, one comparison
- Kokoro Matrix -- multiple preprocessing profiles through Kokoro
- Orpheus Matrix -- multiple preprocessing profiles through Orpheus
- Full Matrix -- every engine, every profile
Matrix benchmarks with Minimal, Auto, and Local AI profiles reveal exactly where the preprocessing effort pays off. Orpheus shows the biggest improvement from emotion tags. Edge TTS shows the smallest. Good to know before committing to a full-book render.
Three TTS Engines
ArgoVox supports three engines. Each has a different profile.
Orpheus (Local GPU)
The premium option. 8 voices, emotion-aware, best quality. Runs on llama.cpp with the GGUF model. Requires GPU. This is what I use for anything I'm going to actually listen to.
Edge TTS (Cloud)
Microsoft's cloud TTS service. 33 voices, reliable, zero cost. No GPU needed. Quality is good but flat -- it doesn't respond to emotion tags the way Orpheus does. Useful as a fast preview or for cases where GPU isn't available.
Kokoro (Local GPU)
Kokoro was the primary engine in ArgoVox v1. 54 voices, GPU-accelerated. Currently not installed in the v2 environment, but the integration exists. When re-enabled, it'll provide another local option with a different voice character than Orpheus.
The Web UI
ArgoVox runs as a FastAPI service on port 8090 with a web interface.
The UI shows:
- File upload for ebook formats
- Voice selection with engine indicator
- Preprocessing preset picker (Minimal, Auto, Local AI, Custom)
- Character analysis toggle
- Job queue with progress tracking
- Library of completed audiobooks
- Settings display showing what was used for each generation
The library is a nice touch. Every finished audiobook shows which voice was used, which preprocessing was applied, and which engine generated it. When I'm comparing results across runs, I can see exactly what changed between versions.
The Distributed Vision
ArgoVox v2 is designed as a standalone audiobook workstation. But the architecture is built for something bigger.
The plan: ArgoVox nodes become processing workers that register with ArgoBox. ArgoBox becomes the control plane. You upload a book through argobox.com, ArgoBox routes the job to the best available GPU node, and the finished audiobook appears in your library.
Each node advertises its capabilities:
- Which engines are installed (Orpheus, Kokoro, Edge)
- Available GPU model and VRAM
- Current queue depth
- Supported output formats
- Feature flags (multi-voice, chaptered output, etc.)
ArgoBox routes jobs based on capability matching, queue depth, user access policy, and node health. Want Orpheus quality? Route to a GPU node. Just need a quick preview? Route to an Edge TTS node.
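Capability matching plus queue depth reduces to a filter-then-score step. A toy router under assumed node fields (names, weights, and the node list are all hypothetical):

```python
# Toy capability-matching router for the distributed vision.
# Node records and routing policy are illustrative assumptions.

NODES = [
    {"name": "callisto-odin", "engines": {"orpheus", "edge"},
     "vram_gb": 12, "queue": 2, "healthy": True},
    {"name": "cloud-relay", "engines": {"edge"},
     "vram_gb": 0, "queue": 0, "healthy": True},
]

def route(job: dict, nodes=NODES):
    """Pick the healthy node that has the required engine, shortest queue wins."""
    candidates = [n for n in nodes
                  if n["healthy"] and job["engine"] in n["engines"]]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n["queue"])

# route({"engine": "orpheus"}) -> callisto-odin (only node with Orpheus)
# route({"engine": "edge"})    -> cloud-relay   (shorter queue)
```

The real control plane would fold in user access policy and node health history, but the filter-then-score shape stays the same.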
Nodes can join pools -- Kronos, Jove, or custom pools. Admins control which users can access which pools. A friend gets access to one GPU node. The studio team gets priority on the premium hardware.
That's the vision. Right now, it's one node on one box with one GPU. But the architecture is ready for the fleet.
Hardware
ArgoVox v2 runs on callisto-odin:
- RTX 4070 Ti (12GB VRAM)
- Ollama for LLM inference (qwen3:8b)
- llama.cpp for Orpheus model serving
- Python 3.10, FastAPI, SQLite for job management
The GPU is the bottleneck and the enabler. Without it, you're limited to Edge TTS (cloud) or very slow CPU synthesis. With it, Orpheus runs faster than real-time and the whole pipeline becomes practical for full-length books.
What's Next
Immediate priorities:
- Multi-voice routing -- each detected character speaks in a different voice automatically
- Reliable Orpheus packaging -- smaller MP3 files, fewer encoding artifacts
- Real M4B chapter markers -- chapter-aware generation from the start, not post-processing
- Persistent project state -- save and resume generation jobs
After that:
- ArgoBox node registration -- connect to the control plane
- Remote job submission -- upload from argobox.com, process on the homelab
- Cost-aware routing -- free tier uses Edge, premium tier uses Orpheus
Drop a book in. Get an audiobook out. On my hardware, with my voices, under my control. That's the whole point. And it works.