Pinecone assistant: build AI search that understands
A Pinecone assistant is an AI application that uses Pinecone's vector database to find relevant information and an LLM to generate answers grounded in that information. It is retrieval-augmented generation (RAG) with a specific, production-grade vector store underneath. The difference between a Pinecone assistant and a raw ChatGPT wrapper is the difference between a librarian who reads the books and one who guesses.
Vector search has become the default retrieval method for AI applications. Pinecone reported over 100,000 registered developers on their platform as of 2025. According to Menlo Ventures' 2025 enterprise AI survey, 51% of enterprises deploying generative AI use RAG as their primary pattern — more than fine-tuning or prompt engineering alone.
Unlike generic AI automation posts, this guide shows real CodeWords workflows — not just theory.
TL;DR
- A Pinecone assistant combines vector search (finding relevant documents) with LLM generation (producing answers) — grounding responses in your actual data instead of model training data.
- The build involves four stages: ingest documents, generate embeddings, store vectors in Pinecone, and retrieve-then-generate at query time.
- CodeWords handles the entire pipeline as a serverless workflow — document ingestion, embedding generation (with built-in LLM access), Pinecone writes, and query handling.
Why does a Pinecone assistant outperform simple LLM queries?
An LLM without retrieval is working from memory — its training data, frozen at a cutoff date, blended across billions of documents. Ask it about your company's refund policy, your product's API changelog, or last quarter's sales data, and it will either hallucinate or refuse.
A Pinecone assistant flips the model. Instead of asking "what do you know?", it asks "what's in these specific documents?" The LLM becomes a reading comprehension engine rather than a knowledge recall engine.
This matters for three reasons:
Accuracy. Answers are grounded in source documents you control. The model is far less likely to hallucinate when the relevant context is in front of it.
Currency. New documents are available for retrieval immediately after indexing. No retraining, no fine-tuning, no waiting for the next model version.
Auditability. Every answer can cite its sources. You can trace which documents informed which response — critical for compliance-sensitive industries.
The architecture parallels a research assistant. You give them a filing cabinet (Pinecone), a question (user query), and instructions (system prompt). They find the relevant files, read them, and synthesize an answer. The filing cabinet is the differentiator.
How do you set up Pinecone for an assistant?
Start with a Pinecone account and create an index. The index configuration depends on your embedding model.
    from pinecone import Pinecone, ServerlessSpec

    pc = Pinecone(api_key="YOUR_PINECONE_KEY")

    # The dimension must match the embedding model (1,536 for text-embedding-3-small).
    pc.create_index(
        name="assistant-docs",
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

    index = pc.Index("assistant-docs")
Key decisions:
- Dimension: Must match your embedding model. OpenAI's text-embedding-3-small outputs 1,536 dimensions. Cohere's embed-v3 outputs 1,024. See the OpenRouter embeddings guide for model comparison.
- Metric: Cosine similarity is the standard choice for normalized embeddings. Dot product works if you want magnitude to influence results.
- Serverless vs. pod-based: Serverless indexes scale automatically and cost less at lower volumes. Pod-based gives predictable performance for high-throughput workloads.
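To make the metric decision concrete, here is a minimal sketch in plain Python (no Pinecone calls) showing how the two metrics can rank the same documents differently. The vectors and their names are made up for illustration; real embeddings have far more dimensions.

```python
import math

def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: direction only, magnitude is divided out."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Illustrative 2-dimensional vectors (hypothetical values).
query = [1.0, 0.0]
aligned_doc = [0.9, 0.1]  # nearly the same direction as the query, small magnitude
large_doc = [3.0, 3.0]    # 45 degrees away from the query, large magnitude

# Dot product rewards the larger vector; cosine rewards the better-aligned one.
dot_prefers_large = dot(query, large_doc) > dot(query, aligned_doc)
cosine_prefers_aligned = cosine(query, aligned_doc) > cosine(query, large_doc)
print(dot_prefers_large, cosine_prefers_aligned)
```

Once vectors are normalized to unit length, the dot product and cosine similarity give the same score, which is why the choice matters mostly for embeddings that are not normalized.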
How do you ingest and embed documents?
The ingestion pipeline has three steps: load, chunk, embed.
Loading documents: Pull content from wherever it lives. CodeWords supports native integrations with Google Drive, Airtable, and web scraping via Firecrawl. For other sources, the document loaders guide covers common patterns.
Chunking: Split documents into segments that are small enough to embed meaningfully but large enough to retain context. A common heuristic: 500-800 tokens per chunk with 50-100 token overlap between consecutive chunks.
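    def chunk_text(text: str, chunk_size: int = 600, overlap: int = 80) -> list[str]:
        # Sizes are counted in whitespace-separated words, a rough proxy for tokens.
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must be non-negative and smaller than chunk_size")
        words = text.split()
        step = chunk_size - overlap
        chunks = []
        for i in range(0, len(words), step):
            chunks.append(" ".join(words[i:i + chunk_size]))
            if i + chunk_size >= len(words):
                break  # the rest of the text is already covered by this chunk
        return chunks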
Embedding and upserting:
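    import openai  # the module-level client reads OPENAI_API_KEY from the environment

    def embed_and_upsert(chunks: list[str], source_id: str, index) -> None:
        # One request embeds all chunks; response.data keeps the input order.
        response = openai.embeddings.create(
            model="text-embedding-3-small",
            input=chunks,
        )
        vectors = [
            {
                "id": f"{source_id}-{i}",
                "values": item.embedding,
                # Storing the chunk text lets the query step rebuild context.
                "metadata": {"text": chunks[i], "source": source_id},
            }
            for i, item in enumerate(response.data)
        ]
        index.upsert(vectors=vectors, batch_size=100)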
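The upsert step sends every chunk of a document in a single embeddings request, which may run into per-request input limits for long documents. One way around that, sketched here under assumptions, is to embed in batches while carrying an offset so vector IDs stay unique across batches. The helper name, the batch size of 256, and the "handbook" source ID are all hypothetical; the actual embeddings call is left out so the sketch runs on its own.

```python
from typing import Iterator, List, Tuple

def batched_with_offsets(items: List[str], batch_size: int = 256) -> Iterator[Tuple[int, List[str]]]:
    """Yield (offset, batch) pairs so IDs can be computed as offset + position."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

# Stand-in data: 600 short chunks from a hypothetical source called "handbook".
chunks = [f"chunk {n}" for n in range(600)]

ids: List[str] = []
batch_sizes: List[int] = []
for offset, batch in batched_with_offsets(chunks, batch_size=256):
    # In a real pipeline, one embeddings request per batch would go here.
    batch_sizes.append(len(batch))
    ids.extend(f"handbook-{offset + i}" for i in range(len(batch)))
```

Because the offset is added to the position inside each batch, the IDs match what a single unbatched pass would have produced, so re-running the ingestion overwrites vectors rather than duplicating them.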
In CodeWords, you describe the ingestion pipeline to Cody — "pull all PDFs from this Google Drive folder, chunk them, embed with OpenAI, store in Pinecone" — and it generates the serverless workflow. The platform's built-in LLM access means no API key management for the embedding step. See CodeWords templates for pre-built RAG workflows.
How do you build the query and response pipeline?
At query time, the assistant performs three operations in sequence: embed the question, search Pinecone, generate a response.
    def query_assistant(question: str, index, top_k: int = 5) -> str:
        # Plain def: every call below is synchronous, so async would add nothing.
        query_embedding = openai.embeddings.create(
            model="text-embedding-3-small",
            input=question,
        ).data[0].embedding

        results = index.query(
            vector=query_embedding,
            top_k=top_k,
            include_metadata=True,
        )

        # Label each chunk with its source so the model has something to cite.
        context = "\n\n---\n\n".join(
            f"[source: {match.metadata['source']}]\n{match.metadata['text']}"
            for match in results.matches
        )

        system_prompt = (
            "Answer based on the provided context.\n"
            "If the context doesn't contain enough information, say so.\n"
            "Cite the source document when possible.\n\n"
            f"Context:\n{context}"
        )
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content
The system prompt is critical. Instruct the model to stick to the provided context, acknowledge gaps, and cite sources. Without these guardrails, the LLM will fill gaps with its training data — exactly the behavior you're trying to avoid.
For production deployments, add conversation memory so follow-up questions work naturally. CodeWords' Redis-based state persistence stores conversation history per user. See AI-powered code generation tools for related patterns.
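As a rough illustration of conversation memory, the sketch below keeps the last few turns per user and assembles the message list for the chat call. It uses an in-process dictionary as a stand-in for a persistent store such as Redis; the class name, method names, and the three-turn default are assumptions made for this example.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List

Message = Dict[str, str]

class ConversationMemory:
    """Keeps the most recent turns per user (in-memory stand-in for a real store)."""

    def __init__(self, max_turns: int = 3):
        # Each turn is two messages (user + assistant), hence maxlen = 2 * max_turns.
        self._histories: Dict[str, Deque[Message]] = defaultdict(
            lambda: deque(maxlen=max_turns * 2)
        )

    def add_turn(self, user_id: str, question: str, answer: str) -> None:
        history = self._histories[user_id]
        history.append({"role": "user", "content": question})
        history.append({"role": "assistant", "content": answer})

    def build_messages(self, user_id: str, system_prompt: str, question: str) -> List[Message]:
        """System prompt first, then stored history, then the new question."""
        return [
            {"role": "system", "content": system_prompt},
            *self._histories[user_id],
            {"role": "user", "content": question},
        ]

memory = ConversationMemory(max_turns=2)
memory.add_turn("user-1", "What is the refund window?", "30 days, per the policy doc.")
messages = memory.build_messages("user-1", "Answer from the context.", "Does that include sale items?")
```

Capping the history keeps the prompt from growing without bound; older turns simply fall off the front of the queue.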
How do you improve retrieval quality?
Raw vector search returns the mathematically nearest neighbors. That does not always mean the most useful results. Three techniques close the gap.
Hybrid search. Combine vector similarity with keyword matching (BM25). Pinecone supports sparse-dense hybrid search natively. Queries with specific terms ("error code 4032") benefit from keyword matching; conceptual queries ("why is my deployment failing?") benefit from semantic search.
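A common way to balance the two signals is to scale the dense and sparse query vectors by a weight before querying. The sketch below shows that weighting in plain Python; the function name and the `alpha` parameter are illustrative, and the sparse vector is assumed to use an indices/values layout. Pinecone's documentation has described hybrid queries as needing an index that uses the dot product metric, so a cosine index may not accept sparse values; verify this against the current docs.

```python
from typing import Dict, List, Tuple

SparseVector = Dict[str, list]

def weight_hybrid(dense: List[float], sparse: SparseVector, alpha: float) -> Tuple[List[float], SparseVector]:
    """Scale dense values by alpha and sparse values by (1 - alpha).

    alpha = 1.0 means purely semantic search; alpha = 0.0 means purely keyword search.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [value * alpha for value in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [value * (1 - alpha) for value in sparse["values"]],
    }
    return scaled_dense, scaled_sparse

# Hypothetical query vectors: a 2-dimensional dense embedding and two sparse terms.
dense_query = [1.0, 2.0]
sparse_query = {"indices": [3, 7], "values": [0.5, 1.0]}

# Lean toward semantic matching while still crediting exact terms.
dense_weighted, sparse_weighted = weight_hybrid(dense_query, sparse_query, alpha=0.75)
```

A query like "error code 4032" would likely benefit from a lower `alpha`, while a conceptual question would benefit from a higher one; the right value is something to tune against your own queries.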
Metadata filtering. Use Pinecone's metadata filters to narrow the search space before similarity ranking. Filter by document type, date range, department, or access level.
    results = index.query(
        vector=query_embedding,
        top_k=10,
        include_metadata=True,
        filter={"department": {"$eq": "engineering"}, "year": {"$gte": 2025}},
    )
Reranking. After initial retrieval, pass results through a cross-encoder or Cohere Rerank to reorder by relevance. This two-stage approach — fast vector search for recall, slow reranking for precision — is the standard production pattern.
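The two-stage shape can be sketched without any external service. In the example below, `rerank` takes the candidates returned by the first stage and reorders them with a scoring function. The `term_overlap` scorer is a deliberately crude stand-in for a cross-encoder or a hosted reranker, used only so the sketch runs on its own; the candidate texts are invented.

```python
from typing import Callable, List, Tuple

def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[float, str]]:
    """Score every candidate against the query and keep the best top_n."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

def term_overlap(query: str, text: str) -> float:
    """Toy relevance score: the fraction of query words that appear in the text.

    A stand-in for a real reranking model, not a recommendation.
    """
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms) / len(query_terms) if query_terms else 0.0

# Pretend these came back from the vector search stage, in similarity order.
candidates = [
    "Deployments fail when the health check times out.",
    "Error code 4032 means the API key has expired.",
    "Our refund policy allows returns within 30 days.",
]
best = rerank("what does error code 4032 mean", candidates, term_overlap, top_n=1)
```

In practice the first stage would fetch more candidates than the final answer needs (for example a `top_k` several times larger than `top_n`), since the reranker can only promote documents that the vector search already returned.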
For a broader look at how retrieval fits into AI workflows, see AI workflow tools and workflow automation tools.
What are the operational considerations?
Index maintenance. Documents change. Build an update pipeline that detects modifications, re-embeds affected chunks, and upserts new vectors. Delete vectors for removed documents. CodeWords' scheduled workflows handle this as a nightly sync job.
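One way to detect modifications is to store a content hash per document at indexing time and compare it on each sync. The sketch below works on plain dictionaries; where the hashes are kept (a database, a key-value store, vector metadata) is left open, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(
    current_docs: Dict[str, str],
    indexed_hashes: Dict[str, str],
) -> Tuple[List[str], List[str]]:
    """Return (doc IDs to re-embed and upsert, doc IDs whose vectors should be deleted)."""
    to_upsert = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return to_upsert, to_delete

# What was indexed last night (hypothetical documents).
indexed = {
    "a": content_hash("alpha"),
    "b": content_hash("beta"),
    "c": content_hash("gamma"),
}
# What exists in the source today: "b" was edited, "c" was removed, "d" is new.
current = {"a": "alpha", "b": "beta v2", "d": "delta"}

to_upsert, to_delete = plan_sync(current, indexed)
```

When a changed document is re-chunked into fewer chunks than before, the leftover vectors from the old version also need deleting, so it is often simpler to delete all of a document's vectors before upserting the new ones.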
Cost management. Pinecone's serverless tier charges per read/write unit and per GB stored. For a 100,000-document corpus with moderate query volume, expect $50-200/month. Monitor read units — inefficient queries (high top_k, missing metadata filters) cost more.
Latency. Vector search in Pinecone typically returns in 10-50ms. The LLM generation step dominates latency at 500ms-3s. For latency-sensitive applications, cache frequent queries or use a smaller model for common questions.
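Caching frequent queries can be as simple as a dictionary with expiry times, keyed on a normalized form of the question. The sketch below is an in-process example only; the class name, the normalization rule (lowercasing and collapsing whitespace), and the injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple

class TTLCache:
    """Answer cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(question: str) -> str:
        # Treat questions that differ only in case or spacing as the same query.
        return " ".join(question.lower().split())

    def get(self, question: str) -> Optional[str]:
        key = self._key(question)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return answer

    def put(self, question: str, answer: str) -> None:
        self._entries[self._key(question)] = (self._clock(), answer)

cache = TTLCache(ttl_seconds=300)
cache.put("What is the refund policy?", "Returns are accepted within 30 days.")
cached = cache.get("what is the  REFUND policy?")
```

The expiry time is a trade-off against the currency benefit of retrieval: a long TTL saves more LLM calls but can serve answers that predate a document update.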
FAQ
How many documents can a Pinecone assistant handle?
Pinecone's serverless indexes scale to billions of vectors. Practical limits are cost and ingestion throughput rather than capacity. Most assistants work well with anywhere from 10,000 to 1,000,000 documents.
Do I need to fine-tune an LLM for a Pinecone assistant?
Usually not. RAG with a strong base model (GPT-4o, Claude 3.5 Sonnet) outperforms fine-tuned models for most knowledge retrieval tasks. Fine-tuning is better for changing the model's behavior or style, not for adding knowledge. A 2024 study by Databricks found that RAG outperformed fine-tuning for factual Q&A in 85% of evaluated domains.
Can I use Pinecone with open-source embedding models?
Yes. Pinecone stores vectors regardless of their source. Use models from Hugging Face (e.g., BAAI/bge-large-en-v1.5) or through OpenRouter for a unified API. Match the index dimension to the model's output dimension.
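When switching embedding models, a dimension mismatch is an easy mistake to make. A small check before the upsert call, sketched below, gives a clearer error than a failed write. The function name is hypothetical, and the vectors use the same id/values layout as the upsert example earlier in this guide.

```python
from typing import Dict, List

def check_dimensions(vectors: List[Dict], expected_dim: int) -> None:
    """Raise ValueError if any vector's length differs from the index dimension."""
    mismatched = [v["id"] for v in vectors if len(v["values"]) != expected_dim]
    if mismatched:
        raise ValueError(
            f"{len(mismatched)} vector(s) do not have {expected_dim} dimensions, "
            f"first offender: {mismatched[0]}"
        )

# Tiny illustrative vectors; a real index would use the model's full output size.
vectors = [
    {"id": "doc-0", "values": [0.1, 0.2, 0.3]},
    {"id": "doc-1", "values": [0.4, 0.5, 0.6]},
]
check_dimensions(vectors, expected_dim=3)  # passes silently
```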
How do I handle sensitive data in the vector store?
Implement namespace isolation — separate indexes or namespaces for different access levels. Apply metadata filters to enforce access control at query time. Encrypt data at rest (Pinecone handles this by default on paid plans). Never store raw sensitive text in metadata if you can avoid it — store references and fetch from a secured source at response time.
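Query-time access control can be sketched as a helper that builds the metadata filter from the caller's group memberships. The `allowed_group` metadata field and the function name are hypothetical; the `$in` and `$and` operators follow Pinecone's metadata filter syntax. The helper refuses to build a filter for a user with no groups rather than silently returning an unrestricted query.

```python
from typing import Dict, Iterable, Optional

def build_access_filter(user_groups: Iterable[str], extra: Optional[Dict] = None) -> Dict:
    """Restrict results to vectors whose allowed_group matches one of the user's groups."""
    groups = sorted(set(user_groups))
    if not groups:
        raise ValueError("refusing to build a filter for a user with no groups")
    access = {"allowed_group": {"$in": groups}}
    if not extra:
        return access
    # Combine the access rule with any caller-supplied filter.
    return {"$and": [access, extra]}

# A user in two groups, searching only engineering documents.
query_filter = build_access_filter(
    ["engineering", "support"],
    extra={"department": {"$eq": "engineering"}},
)
```

The resulting dictionary would be passed as the `filter` argument of the query. Building it on the server from authenticated group data, never from client input, is what makes it an access control rather than a convenience.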
The assistant pattern at scale
A Pinecone assistant is not a chatbot. It is a retrieval interface for your organization's knowledge. The implication extends past Q&A: the same pattern powers internal search, customer support automation, compliance checking, and document review workflows.
Build your Pinecone assistant on CodeWords — describe what you need to Cody, connect your data sources, and deploy a production RAG pipeline without managing infrastructure.




