
Multi-Provider AI Architecture

Voice engine design, provider configuration, model routing, and streaming architecture for Arcturus-Prime AI services

February 23, 2026


Arcturus-Prime routes AI requests across four providers: OpenRouter for cost-effective and free-tier access, Anthropic for premium Claude models, Google GenAI for Gemini models, and local Ollama for embedding generation and privacy-sensitive workloads. The voice engine at src/lib/voice-engine.ts acts as the central orchestration layer, managing provider selection, model routing, streaming, and fallback logic.

Voice Engine (src/lib/voice-engine.ts)

The voice engine is the core abstraction that all AI features in Arcturus-Prime use. Rather than calling provider APIs directly, every component calls into the voice engine, which handles provider selection, authentication, request formatting, response streaming, and error recovery.

Architecture

The voice engine exposes a unified interface:

interface VoiceEngine {
  chat(options: ChatOptions): AsyncGenerator<string>;
  generate(options: GenerateOptions): Promise<string>;
  embed(text: string | string[]): Promise<number[][]>;
  score(content: string, profile: string): Promise<VoiceScore>;
}

The chat method returns an async generator that yields tokens as they arrive from the provider’s streaming API. The generate method buffers the full response and returns it as a string. The embed method generates vector embeddings for text input. The score method evaluates content against a voice profile and returns a structured score.
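A minimal sketch of how these methods compose: the toy `chat` generator below stands in for a real provider stream, and `generate` is built by buffering it, mirroring the relationship between the two methods described above. The `ChatOptions` shape and engine internals are illustrative assumptions, not the actual voice-engine implementation.

```typescript
// Illustrative option shape; the real ChatOptions in voice-engine.ts may differ.
interface ChatOptions {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

// Toy chat(): yields tokens as they "arrive", standing in for a provider stream.
async function* chat(_options: ChatOptions): AsyncGenerator<string> {
  for (const token of ["Hello", " ", "world"]) yield token;
}

// generate() can be expressed on top of chat() by buffering the full stream.
async function generate(options: ChatOptions): Promise<string> {
  let out = "";
  for await (const token of chat(options)) out += token;
  return out;
}
```

Consumers that want incremental UI updates iterate the generator directly; consumers that need the whole response (e.g. structured-output parsing) call `generate`.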

Provider Registration

Each provider is registered with the voice engine at startup:

engine.registerProvider('openrouter', {
  baseUrl: 'https://openrouter.ai/api/v1',
  apiKey: process.env.OPENROUTER_API_KEY,
  models: ['deepseek/deepseek-chat-v3-0324:free', 'moonshotai/kimi-k2.5:free', ...],
  streaming: true,
  rateLimit: { requests: 60, window: 60000 }
});

The registration includes the base URL, authentication credentials, available models, streaming capability flag, and rate limiting configuration. The engine enforces rate limits per provider and queues requests that exceed the limit.

Fallback Logic

When a provider request fails (HTTP 5xx, timeout, rate limit exceeded), the voice engine attempts fallback to an alternative provider. The fallback chain is configurable per use case:

  1. Primary provider attempt
  2. Wait 1 second, retry same provider
  3. Fallback to secondary provider
  4. Fallback to tertiary provider
  5. Return error to caller

For example, if an admin chat request to Anthropic Claude fails, the engine falls back to OpenRouter DeepSeek, then to Google Gemini. The fallback chain is transparent to the calling component.
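The five-step chain above can be sketched as a generic helper: retry the primary provider once after a short delay, then walk the remaining providers in order. The single-retry policy comes from the numbered steps; the function names and injectable `sleep` are illustrative.

```typescript
type ProviderCall<T> = () => Promise<T>;

async function withFallback<T>(
  chain: ProviderCall<T>[],
  retryDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = ms => new Promise(r => setTimeout(r, ms)),
): Promise<T> {
  const [primary, ...rest] = chain;
  try {
    return await primary(); // step 1: primary attempt
  } catch {
    await sleep(retryDelayMs); // step 2: wait, retry the same provider
    try {
      return await primary();
    } catch {
      /* fall through to secondary/tertiary */
    }
  }
  for (const provider of rest) { // steps 3–4: fallback providers in order
    try {
      return await provider();
    } catch {
      /* try the next provider in the chain */
    }
  }
  throw new Error("All providers in the fallback chain failed"); // step 5
}
```

Because the helper is generic over the call, the same chain logic serves chat, generation, and embedding requests.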

OpenRouter

OpenRouter is the primary provider for most AI interactions in Arcturus-Prime. It provides access to a wide range of models through a single API, with some models available on a free tier.

Configuration

OPENROUTER_API_KEY=sk-or-v1-xxxxxxxxxxxxxxxxxxxxxxxxxxxx

The API key is stored in the environment and injected at build/runtime. OpenRouter uses the OpenAI-compatible API format, making it a drop-in replacement for any OpenAI SDK calls.
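Because the format is OpenAI-compatible, a raw request is just a POST to `/chat/completions` with a Bearer key. The helper below only assembles the fetch arguments so the shape is visible; the helper itself is an illustrative sketch, not code from the repository.

```typescript
interface ChatMessage { role: string; content: string }

// Builds the URL and fetch init for an OpenRouter chat-completions call.
function buildOpenRouterRequest(
  apiKey: string,
  model: string,
  messages: ChatMessage[],
  stream = false,
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    url: "https://openrouter.ai/api/v1/chat/completions",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      // Same payload shape the OpenAI SDK would send.
      body: JSON.stringify({ model, messages, stream }),
    },
  };
}

// Usage (assumes OPENROUTER_API_KEY is set):
// const { url, init } = buildOpenRouterRequest(
//   process.env.OPENROUTER_API_KEY!,
//   "deepseek/deepseek-chat-v3-0324:free",
//   [{ role: "user", content: "Hello" }],
// );
// const res = await fetch(url, init);
```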

Free Tier Models

OpenRouter offers several models at zero cost, which Arcturus-Prime uses for public-facing features:

  • Kimi K2.5 (moonshotai/kimi-k2.5:free) — the default model for the public chat widget on Arcturus-Prime.com. Selected for its strong conversational ability at zero cost. Visitors can chat with the site’s AI assistant without incurring any API charges.
  • DeepSeek V3 (deepseek/deepseek-chat-v3-0324:free) — used as the fallback free model and for bulk content processing tasks where cost matters more than peak quality.

Admin Models

For admin-only features, Arcturus-Prime uses higher-capability models through OpenRouter:

  • Llama 3.3 70B — used for general admin chat when cost-effectiveness is important
  • Mistral Large — used for structured output tasks (JSON generation, schema validation)
  • DeepSeek Coder — used for code-related conversations in the admin workbench

Model availability is checked at startup via the OpenRouter models API, and the model picker in the admin UI only shows models that are currently available.
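The startup check amounts to intersecting the configured model list with the live list from OpenRouter's models endpoint. The response shape below follows OpenRouter's model-listing format; the function names are illustrative.

```typescript
// Shape of the OpenRouter model listing (fields beyond `id` omitted).
interface OpenRouterModelList {
  data: { id: string }[];
}

// Keeps only configured models that the provider currently reports as available.
function filterAvailable(configured: string[], list: OpenRouterModelList): string[] {
  const available = new Set(list.data.map(m => m.id));
  return configured.filter(id => available.has(id));
}

// At startup (sketch): const list = await (await fetch("https://openrouter.ai/api/v1/models")).json();
// const pickerModels = filterAvailable(configuredModels, list);
```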

Anthropic Claude

Anthropic Claude models are used for premium AI features in the admin panel where quality matters most.

Configuration

ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxxxxxxxxxxxxxxxxx

Models and Use Cases

  • Claude Sonnet 4 — the workhorse model for content generation, voice scoring, and writing coach features. It provides the best balance of quality, speed, and cost for writing-intensive tasks. Used by /api/admin/content-gen, /api/admin/voice-check, and /api/admin/ai-coach.
  • Claude Opus 4 — reserved for the admin workbench and complex reasoning tasks. Its higher cost means it is only available to admin-role users through explicit model selection.

Speech-to-Text

Anthropic’s API is also used for speech-to-text transcription in the content pipeline. Audio recordings from voice memos or pair programming sessions are transcribed using Claude’s multimodal capabilities, then fed into the transcript-to-post pipeline.

Streaming

Anthropic streaming uses their native SSE format with event: content_block_delta events. The voice engine translates these into the unified token stream format used by all Arcturus-Prime chat interfaces. The translation handles Anthropic-specific event types:

  • message_start — initializes the response buffer
  • content_block_start — begins a new content block
  • content_block_delta — contains the actual token text
  • content_block_stop — ends the current content block
  • message_delta — contains stop reason and usage metrics
  • message_stop — signals completion
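Of the event types above, only `content_block_delta` carries token text; the rest drive bookkeeping. A minimal translator over already-parsed events might look like this (the payload shapes follow Anthropic's streaming event types; the translator itself is a sketch, not the voice engine's code):

```typescript
// Parsed Anthropic streaming event (fields beyond those used are omitted).
interface AnthropicEvent {
  type: string;
  delta?: { type?: string; text?: string; stop_reason?: string };
}

// Yields plain token strings, i.e. the unified token stream.
function* translateAnthropicEvents(events: AnthropicEvent[]): Generator<string> {
  for (const event of events) {
    if (event.type === "content_block_delta" && event.delta?.text) {
      yield event.delta.text; // the actual token text
    }
    // message_start / content_block_start / content_block_stop /
    // message_delta / message_stop carry no token text to forward.
  }
}
```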

Google GenAI

Google GenAI provides access to Gemini models, used primarily for fact-checking and testing.

Configuration

GOOGLE_GENAI_API_KEY=AIzaXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Models

  • Gemini 2.5 Pro — used for the fact-checking pipeline at /api/admin/fact-check. Gemini’s strong grounding capability and large context window make it effective at verifying factual claims in blog content.
  • Gemini 2.5 Flash — used for rapid classification tasks: categorizing content, suggesting tags, and routing queries.

Testing

Google GenAI integration is tested via scripts/test-gemini-models.js, a standalone script that sends test prompts to available Gemini models and reports latency, token usage, and output quality metrics. This script runs as part of the provider health check routine and alerts if Google API connectivity degrades.

Ollama (Local)

Ollama runs locally on Capella-Outpost (10.42.0.100) at http://localhost:11434 and handles workloads that benefit from local execution: embedding generation, privacy-sensitive processing, and development/testing.

Model Auto-Discovery

On startup, the voice engine queries the Ollama API at http://localhost:11434/api/tags to discover which models are currently loaded. The response includes model names, sizes, and modification dates. Discovered models are automatically registered with the voice engine and appear in the admin model picker.
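A sketch of that discovery step: fetch `/api/tags`, pull out the model names, and hand them to registration. The response shape matches Ollama's tag listing (`models` with `name`, `size`, `modified_at`); the surrounding function names are illustrative.

```typescript
// Shape of the /api/tags response (fields beyond those used are omitted).
interface OllamaTagsResponse {
  models: { name: string; size: number; modified_at: string }[];
}

function extractModelNames(tags: OllamaTagsResponse): string[] {
  return tags.models.map(m => m.name);
}

// Queries the local Ollama instance and returns the loaded model names.
async function discoverOllamaModels(
  baseUrl = "http://localhost:11434",
): Promise<string[]> {
  const res = await fetch(`${baseUrl}/api/tags`);
  if (!res.ok) throw new Error(`Ollama discovery failed: HTTP ${res.status}`);
  return extractModelNames((await res.json()) as OllamaTagsResponse);
}
```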

Typical models running on the Ollama instance:

  • nomic-embed-text — the primary embedding model used for the RAG pipeline. Generates 768-dimensional embeddings locally without any external API calls.
  • llama3.2 — a general-purpose chat model for development testing and offline use
  • mistral — used for structured output generation during development
  • codellama — code-specialized model for development assistance

Embedding Generation

Ollama is the default provider for embedding generation in the RAG pipeline. The scripts/build-embeddings.js script can be configured to use either Ollama or OpenRouter for embeddings. Ollama is preferred when:

  • Building embeddings for sensitive content that should not leave the local network
  • Running batch embedding jobs where API rate limits would be a bottleneck
  • Development and testing of the RAG pipeline

The Ollama embedding endpoint at http://localhost:11434/api/embeddings accepts a model name and text input and returns a vector. The voice engine batches multiple embedding requests to reduce round-trip overhead.
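Since the endpoint takes one text per call, "batching" here can mean issuing a bounded number of concurrent requests per round trip. The concurrency policy and helper names below are assumptions about how the engine might do this, not its actual implementation.

```typescript
// Splits items into fixed-size groups.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Embeds all texts, running up to `concurrency` requests in parallel per round trip.
async function embedAll(
  texts: string[],
  model = "nomic-embed-text",
  baseUrl = "http://localhost:11434",
  concurrency = 8,
): Promise<number[][]> {
  const results: number[][] = [];
  for (const batch of chunk(texts, concurrency)) {
    const vectors = await Promise.all(batch.map(async text => {
      const res = await fetch(`${baseUrl}/api/embeddings`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model, prompt: text }),
      });
      const data = (await res.json()) as { embedding: number[] };
      return data.embedding;
    }));
    results.push(...vectors);
  }
  return results;
}
```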

RAG Pipeline Integration

The full RAG pipeline uses Ollama for both embedding generation and local inference. When a user query comes in through the RAG-enabled chat endpoint, the pipeline:

  1. Embeds the query using Ollama’s nomic-embed-text
  2. Searches the vector index for similar content chunks
  3. Retrieves the top-K chunks as context
  4. Sends the query plus context to the selected chat model (which may be any provider)

This hybrid approach keeps embedding costs at zero while allowing the final generation step to use whichever model is best for the task.
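Steps 2 and 3 of the pipeline reduce to ranking stored chunks by cosine similarity against the query embedding and keeping the top K. The chunk store shape and function names below are illustrative; the actual vector index may differ.

```typescript
interface Chunk {
  text: string;
  embedding: number[]; // e.g. 768-dim from nomic-embed-text
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Returns the k chunks most similar to the query embedding.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```

The retrieved chunk texts are then concatenated into the context that accompanies the query in step 4.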

Model Routing

The voice engine implements a routing table that maps use cases to default models and providers:

Use Case            | Model            | Provider          | Fallback
--------------------|------------------|-------------------|-------------------------------------
Public chat         | Kimi K2.5        | OpenRouter (free) | DeepSeek V3 (free)
Admin chat          | DeepSeek V3      | OpenRouter        | Claude Sonnet 4
Content generation  | Claude Sonnet 4  | Anthropic         | Gemini 2.5 Pro
Fact checking       | Gemini 2.5 Pro   | Google            | Claude Sonnet 4
Voice scoring       | Claude Sonnet 4  | Anthropic         | Gemini 2.5 Pro
Embeddings          | nomic-embed-text | Ollama            | text-embedding-3-small (OpenRouter)
Code tasks          | Claude Opus 4    | Anthropic         | DeepSeek Coder

The routing table is defined in src/config/ai-routing.ts and can be overridden via environment variables for each use case.
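A sketch of the lookup with an environment override: the `Route` shape mirrors the table's columns, and an `AI_ROUTE_<USE_CASE>` naming convention is assumed for the override variables (the actual contents of src/config/ai-routing.ts and its variable names may differ).

```typescript
interface Route {
  model: string;
  provider: string;
  fallback: string;
}

// Two rows from the routing table, as defaults.
const routes: Record<string, Route> = {
  "public-chat": {
    model: "moonshotai/kimi-k2.5:free",
    provider: "openrouter",
    fallback: "deepseek/deepseek-chat-v3-0324:free",
  },
  "embeddings": {
    model: "nomic-embed-text",
    provider: "ollama",
    fallback: "text-embedding-3-small",
  },
};

// Resolves a use case to a route, honoring e.g. AI_ROUTE_PUBLIC_CHAT=provider:model.
function resolveRoute(useCase: string, env: Record<string, string | undefined> = {}): Route {
  const route = routes[useCase];
  if (!route) throw new Error(`No route for use case: ${useCase}`);
  const override = env[`AI_ROUTE_${useCase.toUpperCase().replace(/-/g, "_")}`];
  if (override) {
    const [provider, model] = override.split(":");
    return { ...route, provider, model };
  }
  return route;
}
```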

Streaming via SSE

All chat endpoints in Arcturus-Prime use server-sent events for response streaming. The unified SSE format:

event: token
data: {"text": "Hello", "index": 0}

event: token
data: {"text": " world", "index": 1}

event: done
data: {"usage": {"prompt_tokens": 50, "completion_tokens": 12}, "model": "deepseek-v3", "provider": "openrouter"}

The voice engine translates provider-specific streaming formats into this unified format. Clients connect to the SSE endpoint and process events using the EventSource API or a fetch-based SSE reader for POST requests.
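On the client side, parsing the unified format comes down to splitting the stream on blank lines and reading the `event:`/`data:` pair in each block. The parser below handles the exact shape shown above; a fetch-based reader would feed it decoded chunks from `res.body.getReader()`. It is a sketch of the format, not the site's actual client code.

```typescript
interface SseEvent {
  event: string;
  data: unknown; // parsed JSON payload
}

// Parses a complete SSE payload (or a flushed buffer of whole blocks) into events.
function parseSse(payload: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const block of payload.split("\n\n")) {
    const eventLine = block.match(/^event: (.+)$/m);
    const dataLine = block.match(/^data: (.+)$/m);
    if (eventLine && dataLine) {
      events.push({ event: eventLine[1], data: JSON.parse(dataLine[1]) });
    }
  }
  return events;
}
```

A streaming reader additionally has to buffer partial blocks across chunks, only flushing up to the last `\n\n` it has seen.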

Error events are also transmitted via SSE:

event: error
data: {"message": "Rate limit exceeded", "code": 429, "retryAfter": 5000}

The client-side chat components handle error events by showing a user-friendly message and optionally retrying after the indicated delay.

Tags: ai, providers, openrouter, anthropic, google, ollama, streaming, voice-engine