OpenRouter embeddings: route to the right model every time

Isha Maggu · June 9, 2026 · 7 min read

OpenRouter embeddings let you access multiple embedding providers — OpenAI, Cohere, Google, Mistral — through a single API endpoint, swapping models with a parameter change instead of rewriting integration code. If you've been hardcoding text-embedding-3-small and wondering whether Cohere's embed-v4 would perform better for your use case, OpenRouter removes the switching cost.

Embeddings power every retrieval pipeline. They turn text into vectors, and the quality of those vectors determines whether your search returns the right document or a vaguely related one. According to Menlo Ventures' 2025 State of Generative AI report, retrieval-augmented generation (RAG) is the most deployed enterprise AI pattern, used by 51% of companies surveyed. The OpenRouter docs show support for 300+ models across providers, including embedding endpoints.

Unlike generic AI automation posts, this guide shows real CodeWords workflows — not just theory.

TL;DR

OpenRouter provides a unified API for accessing embedding models from OpenAI, Cohere, Google, and others — switch providers by changing a model parameter.
Cost differences across embedding providers are significant: OpenAI's two current models alone differ by 6.5x in price, so choosing the wrong model can multiply your bill for little or no gain in retrieval quality on your specific dataset.
CodeWords can orchestrate embedding pipelines with built-in LLM routing, no API key juggling, and serverless execution for batch processing.

What are OpenRouter embeddings and why use a router?

Think of OpenRouter as a switchboard operator for AI models. Instead of wiring a direct line to every provider, you connect once and dial the model you need.

The practical value comes from three angles.

Comparison without commitment. You can benchmark embedding quality across providers on your actual data. Run the same 1,000 documents through OpenAI's text-embedding-3-large, Cohere's embed-v4, and Google's text-embedding-005, then measure retrieval accuracy on your test queries. Without a router, that experiment requires three separate integrations, three API keys, and three billing accounts.

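Fallback resilience. Embedding APIs go down. OpenAI had multiple outages in 2024, some lasting hours. A routing layer lets you fall back to another provider automatically, so your RAG pipeline keeps running while one provider recovers. One caveat: vectors from different models are not comparable, so a fallback only helps if it serves the same model or if you keep a separate index for each model.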
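
A minimal sketch of the fallback idea, assuming a hypothetical `embed_fn(texts, model)` callable that raises when a provider is unavailable. It returns the model that actually served the request, so the resulting vectors can be tagged and never mixed across models.

```python
from typing import Callable, Sequence

Vector = list[float]
# Hypothetical signature: takes texts and a model ID, returns one vector per text.
EmbedFn = Callable[[Sequence[str], str], list[Vector]]


def embed_with_fallback(
    texts: Sequence[str],
    models: Sequence[str],
    embed_fn: EmbedFn,
) -> tuple[str, list[Vector]]:
    """Try each model in order; return (model_used, vectors) from the first success."""
    errors: list[str] = []
    for model in models:
        try:
            return model, embed_fn(texts, model)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all embedding models failed: " + "; ".join(errors))
```

Store `model_used` alongside each vector so queries can later be routed to the matching index.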

Cost optimization. Embedding prices vary dramatically. OpenAI's text-embedding-3-small costs $0.02 per million tokens, while text-embedding-3-large costs $0.13 per million tokens — a 6.5x difference. Cohere and Google sit at different price points again. For batch processing millions of documents, that difference is material.
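
To see what that gap means at batch scale, here is a small worked calculation using the two OpenAI prices quoted above. The corpus size (2 million documents of 500 tokens each) is a made-up figure for illustration.

```python
# Prices in USD per million tokens, as quoted in the text above.
PRICE_PER_MILLION_TOKENS = {
    "openai/text-embedding-3-small": 0.02,
    "openai/text-embedding-3-large": 0.13,
}


def embedding_cost(total_tokens: int, model: str) -> float:
    """Return the one-off cost in USD of embedding total_tokens with the given model."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Hypothetical corpus: 2 million documents x 500 tokens = 1 billion tokens.
corpus_tokens = 2_000_000 * 500
for model in PRICE_PER_MILLION_TOKENS:
    print(f"{model}: ${embedding_cost(corpus_tokens, model):,.2f}")
```

On that hypothetical corpus the small model comes to about $20 and the large model to about $130 per full embedding pass, and the bill repeats every time you re-embed.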

How do you generate embeddings through OpenRouter?

The API follows the OpenAI-compatible format. If you've used the OpenAI SDK, the switch is minimal.

Basic embedding request:

import requests

response = requests.post(
    "https://openrouter.ai/api/v1/embeddings",
    headers={
        "Authorization": "Bearer YOUR_OPENROUTER_KEY",
        "Content-Type": "application/json"
    },
    json={
        "model": "openai/text-embedding-3-small",
        "input": "How does workflow automation reduce operational cost?"
    }
)

response.raise_for_status()  # fail loudly on 4xx/5xx instead of a KeyError below
embedding = response.json()["data"][0]["embedding"]
print(f"Dimensions: {len(embedding)}")

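Swap openai/text-embedding-3-small for another embedding model ID, such as cohere/embed-english-v3.0 or google/text-embedding-004, and the rest of the request stays identical. Confirm exact model IDs on the OpenRouter models page before relying on them, since the roster changes.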
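
The same request can be wrapped so the model is just a parameter. This is a sketch, not an official client: it uses only the Python standard library, the function names are made up for illustration, and it assumes the endpoint accepts a list for `input` and returns an `index` per item, as OpenAI-compatible embedding APIs do.

```python
import json
import urllib.request

OPENROUTER_EMBEDDINGS_URL = "https://openrouter.ai/api/v1/embeddings"


def build_embedding_request(texts, model, api_key):
    """Build (but do not send) an OpenAI-compatible embeddings request."""
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        OPENROUTER_EMBEDDINGS_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def embed_texts(texts, model, api_key, timeout=30.0):
    """Send the request and return one vector per input, in input order."""
    request = build_embedding_request(texts, model, api_key)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Sort by the per-item index (assumed present) so output order matches input order.
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

Switching providers then means calling `embed_texts(texts, "some/other-model", key)` with nothing else changed.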

Batch processing with CodeWords:

In CodeWords, you'd build this as a workflow that reads documents from Google Drive, generates embeddings in batches, and stores vectors in Pinecone or Weaviate. Cody (CodeWords' AI assistant) handles the FastAPI scaffold, chunking logic, and retry handling. The ephemeral E2B sandbox runs each batch in isolation.

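async def embed_documents(
    documents: list[str],
    model: str = "openai/text-embedding-3-small",
) -> list[list[float]]:
    """Embed documents in batches of 100 and return the vectors in input order."""
    # chunk_list and openrouter_embed are helpers assumed to be defined elsewhere:
    # chunk_list yields fixed-size batches; openrouter_embed returns an object with .embeddings.
    results: list[list[float]] = []
    for batch in chunk_list(documents, batch_size=100):
        response = await openrouter_embed(batch, model=model)
        results.extend(response.embeddings)
    return results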
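
The two helpers used in that function, `chunk_list` and `openrouter_embed`, are not defined in this post. One possible shape for them is sketched below. The `EmbedResponse` class, the `send` parameter, and the `OPENROUTER_API_KEY` environment variable name are assumptions made for illustration, not part of CodeWords or OpenRouter.

```python
import asyncio
import json
import os
import urllib.request
from dataclasses import dataclass
from typing import Callable, Iterator, Optional, Sequence


@dataclass
class EmbedResponse:
    embeddings: list[list[float]]


def chunk_list(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def _post_embeddings(texts: list[str], model: str) -> dict:
    """Blocking HTTP call; reads the key from OPENROUTER_API_KEY (assumed name)."""
    request = urllib.request.Request(
        "https://openrouter.ai/api/v1/embeddings",
        data=json.dumps({"model": model, "input": texts}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


async def openrouter_embed(
    batch: Sequence[str],
    model: str,
    *,
    send: Optional[Callable[[list[str], str], dict]] = None,
) -> EmbedResponse:
    """Run the blocking call in a worker thread so the event loop stays free."""
    payload = await asyncio.to_thread(send or _post_embeddings, list(batch), model)
    # Sort by the per-item index so vectors line up with the input order.
    items = sorted(payload["data"], key=lambda item: item["index"])
    return EmbedResponse(embeddings=[item["embedding"] for item in items])
```

The `send` hook exists so the function can be exercised with a stub in tests, without network access or an API key.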

For a deeper look at building AI assistants that use these embeddings for retrieval, see the Pinecone assistant guide.

Which embedding model should you choose?

Model selection depends on three variables: dimension size, retrieval accuracy on your domain, and cost per token.

OpenAI text-embedding-3-small
- 1,536 dimensions (configurable down to 256)
- $0.02 / million tokens
- Strong general-purpose performance
- Best for: cost-sensitive applications, prototyping

OpenAI text-embedding-3-large
- 3,072 dimensions (configurable)
- $0.13 / million tokens
- Top-tier retrieval accuracy on benchmarks
- Best for: production RAG where quality is paramount

Cohere embed-v3 (English)
- 1,024 dimensions
- Supports search_document and search_query input types
- Compression-friendly architecture
- Best for: English-language search, Cohere Rerank pipelines (Cohere's embed-multilingual-v3.0 covers multilingual corpora)

Google text-embedding-005
- 768 dimensions
- Competitive pricing through Google Cloud
- Strong on technical and scientific text
- Best for: Google Cloud-native stacks

The MTEB leaderboard tracks retrieval benchmarks across models. As of early 2026, the top embedding models cluster closely in accuracy — the gap has narrowed. That means cost and operational factors (latency, uptime, rate limits) weigh more heavily in the decision.

How do you build a RAG pipeline with OpenRouter embeddings?

A retrieval-augmented generation pipeline has four stages: ingest, embed, store, retrieve. OpenRouter handles stage two.

Stage 1 — Ingest: Pull source documents from wherever they live. CodeWords supports this through native integrations with Google Drive, Airtable, and web scraping via Firecrawl. See document loaders for patterns.

Stage 2 — Embed: Chunk documents (500-1,000 tokens per chunk works for most use cases), then call OpenRouter's embedding endpoint. Include metadata — source URL, document title, timestamp — with each vector.
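
As a sketch of what "include metadata with each vector" can look like, the snippet below pairs each chunk with its vector and a metadata dictionary. The `VectorRecord` shape and field names are illustrative assumptions; adapt them to whatever your vector database expects.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class VectorRecord:
    id: str
    values: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def make_records(chunks, vectors, *, source_url, title, timestamp, model):
    """Pair each chunk with its vector and attach source metadata."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")
    records = []
    for position, (chunk, vector) in enumerate(zip(chunks, vectors)):
        # Deterministic ID: re-ingesting the same document overwrites rather than duplicates.
        digest = hashlib.sha256(f"{source_url}#{position}".encode("utf-8")).hexdigest()[:16]
        records.append(
            VectorRecord(
                id=digest,
                values=vector,
                metadata={
                    "text": chunk,
                    "source_url": source_url,
                    "title": title,
                    "timestamp": timestamp,
                    "embedding_model": model,  # lets you detect stale vectors later
                },
            )
        )
    return records
```

Keeping the chunk text in the metadata is what allows the retrieval step to rebuild context for the LLM without a second lookup.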

Stage 3 — Store: Push vectors to a vector database. Pinecone, Weaviate, Qdrant, and Chroma are common choices. Each has trade-offs in managed vs. self-hosted, filtering capabilities, and pricing.

Stage 4 — Retrieve: When a query arrives, embed it with the same model used for documents, search the vector store for nearest neighbors, and pass the top results as context to an LLM for answer generation.

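async def rag_query(question: str, index, model: str = "openai/text-embedding-3-small") -> str:
    # Embed the query with the same model used at indexing time.
    # openrouter_embed returns one vector per input, so take the first.
    response = await openrouter_embed([question], model=model)
    query_embedding = response.embeddings[0]
    # Pinecone-style nearest-neighbor search; each match carries its chunk text in metadata.
    results = index.query(vector=query_embedding, top_k=5, include_metadata=True)
    context = "\n\n".join(r.metadata["text"] for r in results.matches)
    # llm_generate is a placeholder for whichever chat-completion call you use.
    answer = await llm_generate(
        prompt=f"Answer based on this context:\n{context}\n\nQuestion: {question}"
    )
    return answer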
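
To make the nearest-neighbor step concrete, here is an in-memory stand-in for what a vector store does on a query: score every stored vector by cosine similarity and keep the best k. It is a brute-force illustration with made-up function names; production stores use approximate indexes instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=5):
    """Return the k (doc_id, score) pairs most similar to the query.

    stored is a list of (doc_id, vector) pairs.
    """
    scored = [(doc_id, cosine_similarity(query_vector, vector)) for doc_id, vector in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The dimension check is also a cheap way to catch a query embedded with a different model than the index.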

For production RAG, add reranking after retrieval to improve precision. Cohere Rerank and cross-encoder models can reorder results before they reach the LLM. See AI-powered development tools for more patterns.

What are common pitfalls with embedding APIs?

Mixing models between indexing and querying. If you embed documents with text-embedding-3-small and queries with text-embedding-3-large, similarity scores are meaningless. The vector spaces are different. Always use the same model for both.
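
One way to enforce this is a guard that runs before every query. The sketch below assumes you record the model ID and dimension count when the index is created; the function and exception names are hypothetical.

```python
class EmbeddingModelMismatch(ValueError):
    """Raised when a query vector does not belong to the index's vector space."""


def check_query_compatibility(index_model, index_dimensions, query_model, query_vector):
    """Refuse to search when the query was embedded differently from the index."""
    if query_model != index_model:
        raise EmbeddingModelMismatch(
            f"index was built with {index_model!r}, query used {query_model!r}"
        )
    if len(query_vector) != index_dimensions:
        raise EmbeddingModelMismatch(
            f"index expects {index_dimensions} dimensions, query has {len(query_vector)}"
        )
```

The dimension check alone is not enough: two models can share a dimension count and still produce incompatible vectors, which is why the model ID is compared first.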

Ignoring chunking strategy. Embedding a 10,000-word document as a single vector buries the signal. Embedding every sentence creates noise. The sweet spot is semantic chunking: splitting at paragraph or section boundaries and keeping each chunk to a few hundred tokens (roughly 300-1,000, depending on your content).
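
A minimal sketch of paragraph-boundary chunking follows. It approximates token counts by whitespace-separated words, which is an assumption; pass a real tokenizer through `count_tokens` for accurate budgets.

```python
def chunk_by_paragraph(text, max_tokens=800, count_tokens=None):
    """Group whole paragraphs into chunks that stay within max_tokens where possible."""
    # Fallback token counter: whitespace word count (a rough approximation).
    count = count_tokens or (lambda s: len(s.split()))
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks = []
    current = []
    current_tokens = 0
    for paragraph in paragraphs:
        n = count(paragraph)
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        # A single paragraph longer than max_tokens becomes its own oversized chunk.
        current.append(paragraph)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Paragraphs are never split, so chunk boundaries always fall where the author already signaled a change of topic.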

Not tracking model versions. When a provider updates their embedding model, your existing vectors become stale. You either re-embed everything or maintain version metadata. OpenRouter's model naming includes versions, which helps.
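
If each vector's metadata records the model that produced it, finding stale vectors becomes a simple filter. The `embedding_model` metadata key below is an assumed convention, not something a vector database sets for you.

```python
def find_stale_ids(records, current_model):
    """Return IDs of records embedded with anything other than current_model.

    records is an iterable of (record_id, metadata_dict) pairs. Records with
    no model recorded are treated as stale, since their origin is unknown.
    """
    return [
        record_id
        for record_id, metadata in records
        if metadata.get("embedding_model") != current_model
    ]
```

The returned IDs are the re-embedding backlog after a model change.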

Rate limit collisions. Batch embedding jobs can hit rate limits fast. OpenAI's documentation has listed limits of around 3,500 RPM for embeddings on entry-level tiers, but limits vary by usage tier and change over time, so check the current values for your account. Build backoff logic or use CodeWords' managed execution, which handles retry patterns automatically. See OpenAI API limits for a detailed breakdown.

How do you monitor embedding quality over time?

Embedding pipelines degrade silently. A model update, a shift in your document corpus, or a subtle data formatting change can erode retrieval accuracy without triggering any errors.

Build a test harness: maintain 50-100 known query-document pairs where you know the correct answer. Run this suite weekly. Track hit@5 and hit@10 rates. When accuracy drops below your threshold, investigate whether the embedding model changed, your chunking shifted, or your corpus evolved.
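
A test harness along those lines can be very small. In this sketch, `search_fn(query, k)` is a hypothetical callable that wraps your embed-and-retrieve path and returns document IDs ranked best first.

```python
def evaluate(test_cases, search_fn, ks=(5, 10)):
    """Compute hit@k rates over known (query, expected_doc_id) pairs.

    A query counts as a hit at k when its expected document appears
    among the first k results.
    """
    if not test_cases:
        raise ValueError("need at least one test case")
    max_k = max(ks)
    hits = {k: 0 for k in ks}
    for query, expected_id in test_cases:
        ranked = search_fn(query, max_k)  # one search per query, reused for every k
        for k in ks:
            if expected_id in ranked[:k]:
                hits[k] += 1
    return {f"hit@{k}": hits[k] / len(test_cases) for k in ks}
```

Store each run's scores with a date and the embedding model ID so that a drop can be traced to a specific change.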

CodeWords supports scheduled workflows that can run this evaluation automatically, store results in Google Sheets or Airtable, and alert via Slack when quality degrades. This monitoring pattern is the same one described in workflow automation tools.

FAQ

Does OpenRouter support all embedding models?

OpenRouter supports embedding models from major providers including OpenAI, Cohere, and Google. The model list updates regularly. Check the OpenRouter models page for the current roster. Not every model on the platform supports embeddings — filter for embedding-capable models specifically.

Is OpenRouter more expensive than calling providers directly?

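OpenRouter adds a small cost over calling providers directly. Check its pricing page for how that is charged (for example, as a fee on credit purchases rather than a per-token markup). For most teams, the operational savings of a single API key, unified billing, and easy model switching outweigh the marginal cost increase. Run the math on your expected volume.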

Can I use OpenRouter embeddings with Pinecone?

Yes. Generate embeddings through OpenRouter, then upsert the vectors into Pinecone with standard client libraries. The vector database doesn't care which API produced the embeddings — it only needs the vectors and metadata. See the Pinecone assistant guide for a complete walkthrough.

How do I handle rate limits when batch embedding?

Implement exponential backoff with jitter. Start with 100ms delays, double on each 429 response, and add random jitter to prevent thundering herd. CodeWords handles this natively in its serverless execution layer, but if you're building from scratch, libraries like tenacity (Python) simplify retry logic.
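
The recipe above can be written without third-party libraries. In this sketch, `RateLimitError` is a stand-in for however your HTTP client reports a 429, and the `sleep` and `rng` parameters exist only so the function can be tested without real waiting.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response from the embedding API."""


def call_with_backoff(
    fn,
    *,
    max_attempts=6,
    base_delay=0.1,   # start at 100 ms
    max_delay=30.0,   # cap so delays do not grow without bound
    sleep=time.sleep,
    rng=random.random,
):
    """Call fn(), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)  # doubles each retry
            # Add up to 100% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + rng() * delay)
```

Wrap each batch call, for example `call_with_backoff(lambda: embed_batch(batch))`, where `embed_batch` is whatever function sends one batch.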

Where embeddings are heading

The router pattern for embeddings reflects a broader shift: AI infrastructure is becoming composable. You pick the best model for each task, swap it when something better arrives, and build workflows that survive individual provider failures.

The implication for your RAG pipeline is that lock-in to a single embedding provider is now an unforced error. Route intelligently, benchmark continuously, and let the infrastructure handle the plumbing.

Start building embedding workflows at CodeWords — connect your data sources, pick your model, and deploy a retrieval pipeline in a single conversation with Cody.