Multi-Provider AI Architecture
Voice engine design, provider configuration, model routing, and streaming architecture for Arcturus-Prime AI services
Arcturus-Prime routes AI requests across four providers: OpenRouter for cost-effective and free-tier access, Anthropic for premium Claude models, Google GenAI for Gemini models, and local Ollama for embedding generation and privacy-sensitive workloads. The voice engine at src/lib/voice-engine.ts acts as the central orchestration layer, managing provider selection, model routing, streaming, and fallback logic.
Voice Engine (src/lib/voice-engine.ts)
The voice engine is the core abstraction that all AI features in Arcturus-Prime use. Rather than calling provider APIs directly, every component calls into the voice engine, which handles provider selection, authentication, request formatting, response streaming, and error recovery.
Architecture
The voice engine exposes a unified interface:
interface VoiceEngine {
  chat(options: ChatOptions): AsyncGenerator<string>;
  generate(options: GenerateOptions): Promise<string>;
  embed(text: string | string[]): Promise<number[][]>;
  score(content: string, profile: string): Promise<VoiceScore>;
}
The chat method returns an async generator that yields tokens as they arrive from the provider’s streaming API. The generate method buffers the full response and returns it as a string. The embed method generates vector embeddings for text input. The score method evaluates content against a voice profile and returns a structured score.
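To make the relationship between the streaming and buffered methods concrete, here is a minimal sketch of consuming a chat() token stream and buffering it the way generate() would. The ChatOptions shape and the mockChat helper are illustrative assumptions, not the actual engine internals.

```typescript
// Illustrative ChatOptions shape (assumed, not the engine's real type).
interface ChatOptions {
  model: string;
  messages: { role: "user" | "assistant" | "system"; content: string }[];
}

// A stand-in for chat(): an async generator yielding tokens as they arrive.
async function* mockChat(options: ChatOptions): AsyncGenerator<string> {
  for (const token of ["Hello", " ", "world"]) yield token;
}

// Buffering a token stream into a full response, as generate() would.
async function collect(stream: AsyncGenerator<string>): Promise<string> {
  let out = "";
  for await (const token of stream) out += token;
  return out;
}

collect(mockChat({ model: "m", messages: [] })).then((s) => console.log(s)); // prints "Hello world"
```

Callers that need progressive rendering iterate the generator directly; callers that need the whole response await the buffered form.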
Provider Registration
Each provider is registered with the voice engine at startup:
engine.registerProvider('openrouter', {
  baseUrl: 'https://openrouter.ai/api/v1',
  apiKey: process.env.OPENROUTER_API_KEY,
  models: ['deepseek/deepseek-chat-v3-0324:free', 'moonshotai/kimi-k2.5:free', ...],
  streaming: true,
  rateLimit: { requests: 60, window: 60000 }
});
The registration includes the base URL, authentication credentials, available models, streaming capability flag, and rate limiting configuration. The engine enforces rate limits per provider and queues requests that exceed the limit.
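A sliding-window limiter matching the `{ requests: 60, window: 60000 }` configuration above could be sketched as follows. The class and method names are illustrative, not the engine's actual internals.

```typescript
// Hypothetical sketch of a per-provider sliding-window rate limiter.
class RateLimiter {
  private timestamps: number[] = [];
  constructor(private maxRequests: number, private windowMs: number) {}

  // True if a request may proceed now; records the request if so.
  tryAcquire(now: number = Date.now()): boolean {
    // Drop timestamps that have aged out of the window.
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length >= this.maxRequests) return false;
    this.timestamps.push(now);
    return true;
  }

  // How long a queued request must wait for the oldest slot to free up.
  msUntilNextSlot(now: number = Date.now()): number {
    if (this.timestamps.length < this.maxRequests) return 0;
    return this.windowMs - (now - this.timestamps[0]);
  }
}
```

A queued request that fails tryAcquire() would sleep for msUntilNextSlot() before retrying, which gives the queueing behavior described above.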
Fallback Logic
When a provider request fails (HTTP 5xx, timeout, rate limit exceeded), the voice engine attempts fallback to an alternative provider. The fallback chain is configurable per use case:
- Primary provider attempt
- Wait 1 second, retry same provider
- Fallback to secondary provider
- Fallback to tertiary provider
- Return error to caller
For example, if an admin chat request to Anthropic Claude fails, the engine falls back to OpenRouter DeepSeek, then to Google Gemini. The fallback chain is transparent to the calling component.
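The chain above can be sketched as a small driver: one retry on the primary after a delay, then each fallback in order. The attempt function and sleep are injected here for testability; the function name and signature are assumptions, not the engine's real API.

```typescript
// Hypothetical sketch of the fallback chain: primary, delayed retry,
// then secondary and tertiary providers, then an error to the caller.
type Attempt = (provider: string) => Promise<string>;

async function withFallback(
  providers: string[], // e.g. ["anthropic", "openrouter", "google"]
  attempt: Attempt,
  retryDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms))
): Promise<string> {
  const [primary, ...fallbacks] = providers;
  try {
    return await attempt(primary);      // primary provider attempt
  } catch {
    await sleep(retryDelayMs);          // wait, then retry same provider
    try {
      return await attempt(primary);
    } catch { /* fall through to fallbacks */ }
  }
  for (const provider of fallbacks) {
    try {
      return await attempt(provider);   // secondary, then tertiary
    } catch { /* try next provider */ }
  }
  throw new Error("all providers failed"); // return error to caller
}
```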
OpenRouter
OpenRouter is the primary provider for most AI interactions in Arcturus-Prime. It provides access to a wide range of models through a single API, with some models available on a free tier.
Configuration
OPENROUTER_API_KEY=sk-or-v1-xxxxxxxxxxxxxxxxxxxxxxxxxxxx
The API key is stored in the environment and injected at build/runtime. OpenRouter uses the OpenAI-compatible API format, making it a drop-in replacement for any OpenAI SDK calls.
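Because OpenRouter speaks the OpenAI chat-completions format, a request can be built like this. The helper name is ours; the endpoint path, headers, and payload shape follow the OpenAI-compatible convention.

```typescript
// Sketch: build an OpenAI-compatible chat completion request for OpenRouter.
function buildChatRequest(apiKey: string, model: string, prompt: string) {
  return {
    url: "https://openrouter.ai/api/v1/chat/completions",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
        stream: true, // SSE streaming, matching the engine's chat() path
      }),
    },
  };
}

// Usage sketch:
// const { url, init } = buildChatRequest(
//   process.env.OPENROUTER_API_KEY!,
//   "deepseek/deepseek-chat-v3-0324:free",
//   "Hello"
// );
// const res = await fetch(url, init);
```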
Free Tier Models
OpenRouter offers several models at zero cost, which Arcturus-Prime uses for public-facing features:
- Kimi K2.5 (moonshotai/kimi-k2.5:free) — the default model for the public chat widget on Arcturus-Prime.com. Selected for its strong conversational ability at zero cost. Visitors can chat with the site’s AI assistant without incurring any API charges.
- DeepSeek V3 (deepseek/deepseek-chat-v3-0324:free) — used as the fallback free model and for bulk content processing tasks where cost matters more than peak quality.
Admin Models
For admin-only features, Arcturus-Prime uses higher-capability models through OpenRouter:
- Llama 3.3 70B — used for general admin chat when cost-effectiveness is important
- Mistral Large — used for structured output tasks (JSON generation, schema validation)
- DeepSeek Coder — used for code-related conversations in the admin workbench
Model availability is checked at startup via the OpenRouter models API, and the model picker in the admin UI only shows models that are currently available.
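The startup check can be sketched as an intersection between the configured model list and the IDs returned by the provider's models endpoint. The `{ data: [{ id }] }` response shape follows the OpenAI-compatible format; the function name is illustrative.

```typescript
// Sketch: keep only configured models the provider currently lists.
interface ModelsResponse {
  data: { id: string }[];
}

function filterAvailable(configured: string[], response: ModelsResponse): string[] {
  const available = new Set(response.data.map((m) => m.id));
  return configured.filter((id) => available.has(id));
}

// Usage sketch against the OpenRouter models API:
// const response: ModelsResponse =
//   await (await fetch("https://openrouter.ai/api/v1/models")).json();
// const pickerModels = filterAvailable(configuredModels, response);
```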
Anthropic Claude
Anthropic Claude models are used for premium AI features in the admin panel where quality matters most.
Configuration
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxxxxxxxxxxxxxxxxx
Models and Use Cases
- Claude Sonnet 4 — the workhorse model for content generation, voice scoring, and writing coach features. It provides the best balance of quality, speed, and cost for writing-intensive tasks. Used by /api/admin/content-gen, /api/admin/voice-check, and /api/admin/ai-coach.
- Claude Opus 4 — reserved for the admin workbench and complex reasoning tasks. Its higher cost means it is only available to admin-role users through explicit model selection.
Speech-to-Text
Anthropic’s API is also used for speech-to-text transcription in the content pipeline. Audio recordings from voice memos or pair programming sessions are transcribed using Claude’s multimodal capabilities, then fed into the transcript-to-post pipeline.
Streaming
Anthropic streaming uses their native SSE format with event: content_block_delta events. The voice engine translates these into the unified token stream format used by all Arcturus-Prime chat interfaces. The translation handles Anthropic-specific event types:
- message_start — initializes the response buffer
- content_block_start — begins a new content block
- content_block_delta — contains the actual token text
- content_block_stop — ends the current content block
- message_delta — contains stop reason and usage metrics
- message_stop — signals completion
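The translation step can be sketched as a generator that maps Anthropic events to unified token events. The event names are Anthropic's; the simplified AnthropicEvent and UnifiedEvent types here are illustrative, not the engine's real definitions.

```typescript
// Sketch: Anthropic stream events in, unified token events out.
type AnthropicEvent =
  | { type: "message_start" }
  | { type: "content_block_start" }
  | { type: "content_block_delta"; delta: { text: string } }
  | { type: "content_block_stop" }
  | { type: "message_delta"; usage?: { output_tokens: number } }
  | { type: "message_stop" };

type UnifiedEvent =
  | { event: "token"; text: string; index: number }
  | { event: "done" };

function* translate(events: Iterable<AnthropicEvent>): Generator<UnifiedEvent> {
  let index = 0;
  for (const e of events) {
    if (e.type === "content_block_delta") {
      // content_block_delta carries the actual token text.
      yield { event: "token", text: e.delta.text, index: index++ };
    } else if (e.type === "message_stop") {
      yield { event: "done" }; // signals completion downstream
    }
    // message_start / content_block_start / content_block_stop /
    // message_delta carry bookkeeping only in this simplified sketch.
  }
}
```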
Google GenAI
Google GenAI provides access to Gemini models, used primarily for fact-checking and testing.
Configuration
GOOGLE_GENAI_API_KEY=AIzaXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Models
- Gemini 2.5 Pro — used for the fact-checking pipeline at /api/admin/fact-check. Gemini’s strong grounding capability and large context window make it effective at verifying factual claims in blog content.
- Gemini 2.5 Flash — used for rapid classification tasks: categorizing content, suggesting tags, and routing queries.
Testing
Google GenAI integration is tested via scripts/test-gemini-models.js, a standalone script that sends test prompts to available Gemini models and reports latency, token usage, and output quality metrics. This script runs as part of the provider health check routine and alerts if Google API connectivity degrades.
Ollama (Local)
Ollama runs locally on Capella-Outpost (10.42.0.100) at http://localhost:11434 and handles workloads that benefit from local execution: embedding generation, privacy-sensitive processing, and development/testing.
Model Auto-Discovery
On startup, the voice engine queries the Ollama API at http://localhost:11434/api/tags to discover which models are currently loaded. The response includes model names, sizes, and modification dates. Discovered models are automatically registered with the voice engine and appear in the admin model picker.
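The discovery call can be sketched as below. Ollama's /api/tags response lists loaded models under a `models` array with `name`, `size`, and `modified_at` fields; the helper name and the engine.registerModel call in the usage note are illustrative assumptions.

```typescript
// Sketch: parse the model list from Ollama's /api/tags response.
interface OllamaTags {
  models: { name: string; size: number; modified_at: string }[];
}

function discoveredModelNames(tags: OllamaTags): string[] {
  return tags.models.map((m) => m.name);
}

// Usage sketch:
// const tags: OllamaTags =
//   await (await fetch("http://localhost:11434/api/tags")).json();
// for (const name of discoveredModelNames(tags)) {
//   engine.registerModel("ollama", name); // hypothetical registration call
// }
```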
Typical models running on the Ollama instance:
- nomic-embed-text — the primary embedding model used for the RAG pipeline. Generates 768-dimensional embeddings locally without any external API calls.
- llama3.2 — a general-purpose chat model for development testing and offline use
- mistral — used for structured output generation during development
- codellama — code-specialized model for development assistance
Embedding Generation
Ollama is the default provider for embedding generation in the RAG pipeline. The scripts/build-embeddings.js script can be configured to use either Ollama or OpenRouter for embeddings. Ollama is preferred when:
- Building embeddings for sensitive content that should not leave the local network
- Running batch embedding jobs where API rate limits would be a bottleneck
- Development and testing of the RAG pipeline
The Ollama embedding endpoint at http://localhost:11434/api/embeddings accepts a model name and text input and returns a vector. The voice engine batches multiple embedding requests to reduce round-trip overhead.
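The batching step can be sketched with a simple chunking helper. The batch size and helper name are illustrative; the usage note assumes Ollama's embeddings endpoint, which takes `{ model, prompt }` and returns `{ embedding }` per request.

```typescript
// Sketch: split a large input list into fixed-size batches to cut
// round-trip overhead on the embeddings endpoint.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Usage sketch against http://localhost:11434/api/embeddings:
// for (const batch of toBatches(texts, 16)) {
//   await Promise.all(batch.map((prompt) =>
//     fetch("http://localhost:11434/api/embeddings", {
//       method: "POST",
//       body: JSON.stringify({ model: "nomic-embed-text", prompt }),
//     }).then((r) => r.json()) // { embedding: number[] } per text
//   ));
// }
```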
RAG Pipeline Integration
The full RAG pipeline uses Ollama for both embedding generation and local inference. When a user query comes in through the RAG-enabled chat endpoint, the pipeline:
- Embeds the query using Ollama’s nomic-embed-text
- Searches the vector index for similar content chunks
- Retrieves the top-K chunks as context
- Sends the query plus context to the selected chat model (which may be any provider)
This hybrid approach keeps embedding costs at zero while allowing the final generation step to use whichever model is best for the task.
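Steps 2 and 3 of the pipeline above can be sketched as cosine-similarity retrieval over an in-memory index. The Chunk shape and function names are illustrative; the real vector index may differ.

```typescript
// Sketch: top-K retrieval by cosine similarity over embedded chunks.
interface Chunk {
  text: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topK(query: number[], index: Chunk[], k: number): Chunk[] {
  return [...index]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```

The retrieved chunks are then concatenated into the context window sent to the selected chat model.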
Model Routing
The voice engine implements a routing table that maps use cases to default models and providers:
| Use Case | Model | Provider | Fallback |
|---|---|---|---|
| Public chat | Kimi K2.5 | OpenRouter (free) | DeepSeek V3 (free) |
| Admin chat | DeepSeek V3 | OpenRouter | Claude Sonnet 4 |
| Content generation | Claude Sonnet 4 | Anthropic | Gemini 2.5 Pro |
| Fact checking | Gemini 2.5 Pro | Google GenAI | Claude Sonnet 4 |
| Voice scoring | Claude Sonnet 4 | Anthropic | Gemini 2.5 Pro |
| Embeddings | nomic-embed-text | Ollama | text-embedding-3-small (OpenRouter) |
| Code tasks | Claude Opus 4 | Anthropic | DeepSeek Coder |
The routing table is defined in src/config/ai-routing.ts and can be overridden via environment variables for each use case.
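A sketch of what src/config/ai-routing.ts might contain, with a per-use-case environment override. The Route shape, the two sample entries, and the AI_ROUTE_* variable naming scheme are assumptions for illustration, not the actual config.

```typescript
// Sketch: routing table mapping use cases to default model/provider,
// with an environment-variable override per use case.
interface Route {
  model: string;
  provider: string;
  fallback?: string;
}

const defaultRoutes: Record<string, Route> = {
  publicChat: {
    model: "moonshotai/kimi-k2.5:free",
    provider: "openrouter",
    fallback: "deepseek/deepseek-chat-v3-0324:free",
  },
  embeddings: {
    model: "nomic-embed-text",
    provider: "ollama",
    fallback: "text-embedding-3-small",
  },
};

// e.g. AI_ROUTE_PUBLICCHAT_MODEL=some/model overrides the default model.
function resolveRoute(useCase: string, env: Record<string, string | undefined>): Route {
  const base = defaultRoutes[useCase];
  const override = env[`AI_ROUTE_${useCase.toUpperCase()}_MODEL`];
  return override ? { ...base, model: override } : base;
}
```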
Streaming via SSE
All chat endpoints in Arcturus-Prime use server-sent events for response streaming. The unified SSE format:
event: token
data: {"text": "Hello", "index": 0}
event: token
data: {"text": " world", "index": 1}
event: done
data: {"usage": {"prompt_tokens": 50, "completion_tokens": 12}, "model": "deepseek-v3", "provider": "openrouter"}
The voice engine translates provider-specific streaming formats into this unified format. Clients connect to the SSE endpoint and process events using the EventSource API or a fetch-based SSE reader for POST requests.
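The parsing half of a fetch-based SSE reader can be sketched as below: split a raw chunk into `{ event, data }` records. For brevity this helper assumes whole frames; real client code would buffer partial frames across network reads.

```typescript
// Sketch: parse an SSE text chunk into event records.
interface SseEvent {
  event: string;
  data: string;
}

function parseSseChunk(chunk: string): SseEvent[] {
  return chunk
    .split("\n\n") // frames are separated by a blank line
    .filter((frame) => frame.trim().length > 0)
    .map((frame) => {
      let event = "message"; // SSE default event name
      let data = "";
      for (const line of frame.split("\n")) {
        if (line.startsWith("event: ")) event = line.slice(7);
        else if (line.startsWith("data: ")) data += line.slice(6);
      }
      return { event, data };
    });
}
```

A reader loop would feed each decoded network chunk through this parser, dispatching token events to the UI and stopping on done.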
Error events are also transmitted via SSE:
event: error
data: {"message": "Rate limit exceeded", "code": 429, "retryAfter": 5000}
The client-side chat components handle error events by showing a user-friendly message and optionally retrying after the indicated delay.