May 18, 2026

Self-Hosted AI Starter Kit: Complete Setup Guide

Set up a self-hosted AI starter kit with the right hardware, models, and orchestration. Compare approaches and learn when cloud-hybrid beats full self-hosting.

Self-hosted AI starter kit: complete setup guide for 2026

A self-hosted AI starter kit is a pre-packaged stack that lets you run AI models, vector databases, and orchestration tools on your own hardware or cloud instances — without sending data to third-party APIs. The appeal is obvious: data stays on your machines, latency drops, and per-token costs vanish after the upfront investment.

The reality is more nuanced. Running AI locally means owning compute, maintenance, model updates, security, and uptime. A 2025 a16z survey of enterprise AI adoption found that 42% of companies running self-hosted models spent more on infrastructure management than they saved on API costs within the first year. The question is not whether self-hosting is possible — it is whether the trade-offs match your constraints.

Unlike generic AI automation posts, this guide shows real CodeWords workflows — not just theory. You will see where self-hosting fits, where it does not, and how hybrid architectures solve the hardest trade-offs.

Related reading: AI agents builder, AI automation tools, workflow automation platform, open-source workflow automation platform, CodeWords integrations, pricing, and CodeWords.

TL;DR

  • A self-hosted AI starter kit typically includes an LLM runtime (Ollama, vLLM, or llama.cpp), a vector database (Qdrant, Chroma, or Weaviate), and an orchestration layer (n8n, LangChain, or custom code).
  • Hardware requirements vary dramatically: 7B parameter models run on consumer GPUs, while 70B+ models need enterprise-grade VRAM or multi-GPU setups.
  • For most teams, a hybrid approach — self-host for sensitive data, use managed APIs for everything else — delivers better ROI than full self-hosting.

What is inside a typical self-hosted AI starter kit?

Most kits follow the same three-layer pattern. Think of it as brain, memory, and plumbing.

Layer 1: Model runtime (the brain). This is where inference happens. Popular choices:

  • Ollama: Simplest setup. Run ollama pull llama3 and you have a local model serving API. Great for prototyping, limited for production throughput.
  • vLLM: High-throughput inference engine with PagedAttention. Handles concurrent requests efficiently. Production-ready for teams with GPU infrastructure.
  • llama.cpp: CPU and GPU inference with quantized models. Runs on consumer hardware. The go-to for edge deployments and constrained environments.

Layer 2: Vector database (the memory). Stores embeddings for semantic search and retrieval-augmented generation (RAG). Options include Qdrant, Chroma, Weaviate, and Milvus. For starter kits, Qdrant and Chroma are the most common because they are lightweight and have Docker images ready.
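To make the memory layer concrete, here is a minimal sketch of what a vector database does at query time: it ranks stored embeddings by similarity to a query embedding. The three-dimensional vectors and document names below are made up for illustration. Real embeddings have hundreds or thousands of dimensions, and Qdrant or Chroma use approximate indexes rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, store, k=2):
    """Return the k document ids whose embeddings are closest to the query."""
    ranked = sorted(
        store,
        key=lambda doc_id: cosine_similarity(query, store[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (hypothetical values, not produced by a real model).
store = {
    "refund-policy": [0.9, 0.1, 0.0],
    "gpu-sizing": [0.1, 0.9, 0.2],
    "onboarding": [0.0, 0.2, 0.9],
}
query = [0.2, 0.8, 0.1]
nearest = top_k(query, store, k=2)
print(nearest)
```

In a RAG pipeline the text behind the nearest ids is what gets pasted into the model's prompt as context.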

Layer 3: Orchestration (the plumbing). Connects the model, memory, and external tools into workflows. n8n's self-hosted AI starter kit on GitHub bundles Ollama, Qdrant, and n8n into a single Docker Compose file. Other orchestration options include LangChain, LlamaIndex, and custom FastAPI services.

What hardware do you actually need?

Hardware is the gating factor. Underprovision and inference is painfully slow. Overprovision and you are burning capital on idle GPUs.

For 7B–8B parameter models (Llama 3 8B, Mistral 7B):

  • Minimum: 8GB VRAM GPU (RTX 3070 or equivalent), 16GB system RAM, 50GB storage
  • Recommended: 12GB+ VRAM (RTX 4070 Ti), 32GB RAM, NVMe SSD
  • Token throughput: 30–60 tokens/second on consumer hardware

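For 13B–34B parameter models, plus larger models run quantized (Llama 3 70B quantized, Mixtral):

  • Minimum: 24GB VRAM (RTX 4090 or A5000), 64GB RAM
  • Recommended: 48GB+ VRAM (A6000 or dual GPU), 128GB RAM. A 4-bit Llama 3 70B needs about 35GB for weights alone, so it only fits at this tier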
  • Token throughput: 15–40 tokens/second depending on quantization

For 70B+ parameter models (unquantized):

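  • Minimum: 2× 80GB GPUs (A100 or H100), 256GB RAM. FP16 weights for a 70B model take about 140GB, so a single 80GB card works only with 8-bit or lower quantization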
  • Recommended: Multi-GPU setup with NVLink, enterprise NVMe storage
  • Token throughput: Varies widely by hardware configuration

A 2026 Tom's Hardware GPU benchmark for local LLM inference showed the RTX 4090 delivering 45 tokens/second on Llama 3 8B Q4 — fast enough for interactive use. For batch processing, throughput matters more than latency, and vLLM on an A100 handles 200+ concurrent requests efficiently.
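The VRAM tiers above follow from simple arithmetic: weight memory is roughly parameter count times bytes per weight. The sketch below shows that estimate. The 20% overhead factor for KV cache and activations is an assumption for illustration, not a measured figure, and real usage depends on context length and batch size.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, vram_gb, overhead=1.2):
    """Rough fit check; `overhead` is an assumed allowance for KV cache and activations."""
    return weight_memory_gb(params_billion, bits_per_weight) * overhead <= vram_gb


for name, params, bits in [
    ("8B at 4-bit", 8, 4),
    ("70B at 4-bit", 70, 4),
    ("70B at FP16", 70, 16),
]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.0f} GB of weights")
```

Under these assumptions, `fits(8, 4, 8)` is true for an 8GB consumer card, while `fits(70, 4, 24)` is false: a 4-bit 70B model does not fit on a single 24GB GPU.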

How do you set up a self-hosted AI starter kit step by step?

Here is the minimal viable setup using Docker Compose with Ollama and Qdrant.

Step 1: Install Docker. If you do not have Docker, install Docker Desktop or Docker Engine on your server.

Step 2: Create a Docker Compose file.

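# The top-level "version" key is obsolete in Docker Compose v2 and is omitted here.
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"   # Ollama HTTP API
    volumes:
      - ollama_data:/root/.ollama   # keeps pulled models across restarts
    deploy:
      resources:
        reservations:
          devices:
            # GPU passthrough; needs the NVIDIA Container Toolkit on the host
            - driver: nvidia
              count: all
              capabilities: [gpu]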

  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

volumes:
  ollama_data:
  qdrant_data:

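Step 3: Pull a model. Start the services with docker compose up -d, then pull a model: docker compose exec ollama ollama pull llama3. Compose derives container names from the project directory, so docker exec -it ollama works only if you set container_name: ollama on the service. For smaller hardware, try mistral or phi3.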

Step 4: Test inference. Send a POST request to http://localhost:11434/api/generate with a JSON body containing the model name and your prompt, for example {"model": "llama3", "prompt": "Hello"}. If you get a streamed response, the runtime is working.
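A minimal sketch of that test in Python, using only the standard library. It assumes the Compose stack is running on localhost and that llama3 has already been pulled. The helper names are illustrative, not part of any Ollama client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3"):
    """Build the POST request; Ollama streams newline-delimited JSON by default."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def collect_text(lines):
    """Join the `response` fragments from an iterable of JSON lines (bytes or str)."""
    parts = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(prompt, model="llama3"):
    """Send the prompt to the local Ollama container and return the full answer."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return collect_text(resp)
```

Calling `generate("Why is the sky blue?")` should return the model's complete answer once the stream finishes. A connection error usually means the container is not running or port 11434 is not published.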

Step 5: Add orchestration. Connect an orchestration layer — n8n, a Python script using LangChain, or a custom FastAPI service — that sends user queries to Ollama, retrieves context from Qdrant, and returns enriched answers.
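As one possible shape for a custom orchestration script, the sketch below embeds the question with Ollama, fetches the nearest chunks from Qdrant over its REST API, and asks the model to answer from that context. Several details are assumptions: the collection name docs, the embedding model nomic-embed-text (which you would pull separately), and a payload field called text that holds each chunk. It also presumes documents have already been embedded and stored.

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"
QDRANT = "http://localhost:6333"
COLLECTION = "docs"               # hypothetical collection name
EMBED_MODEL = "nomic-embed-text"  # assumed embedding model, pulled separately


def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


def embed(text):
    """Turn text into a vector with Ollama's embeddings endpoint."""
    reply = post_json(f"{OLLAMA}/api/embeddings", {"model": EMBED_MODEL, "prompt": text})
    return reply["embedding"]


def retrieve(vector, limit=3):
    """Return the stored text of the nearest points (payload key `text` is assumed)."""
    reply = post_json(
        f"{QDRANT}/collections/{COLLECTION}/points/search",
        {"vector": vector, "limit": limit, "with_payload": True},
    )
    return [hit["payload"].get("text", "") for hit in reply["result"]]


def build_prompt(question, chunks):
    """Assemble a grounded prompt from numbered context chunks."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(question, model="llama3"):
    """Retrieve context for the question, then generate a non-streamed answer."""
    chunks = retrieve(embed(question))
    reply = post_json(
        f"{OLLAMA}/api/generate",
        {"model": model, "prompt": build_prompt(question, chunks), "stream": False},
    )
    return reply["response"]
```

Frameworks such as LangChain or n8n wrap these same three calls (embed, search, generate) behind higher-level nodes, so this script is mainly useful for seeing what the orchestration layer does.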

When should you self-host versus use managed APIs?

The decision matrix is simpler than most guides suggest.

Self-host when:

  • Data cannot leave your network (healthcare, legal, finance, government)
  • You need predictable costs at high volume (10,000+ requests/day)
  • Latency requirements demand local inference (sub-100ms for embeddings)
  • You want to fine-tune models on proprietary data

Use managed APIs when:

  • You need frontier model quality (GPT-4o, Claude Opus, Gemini Ultra)
  • Your volume is moderate and pay-per-token is cheaper than GPU leasing
  • You do not have DevOps capacity to maintain GPU infrastructure
  • You need rapid iteration without managing model updates

Go hybrid when (most teams land here):

  • Self-host embeddings and classification (fast, cheap, sensitive data stays local)
  • Use managed APIs for generation and reasoning (better quality, less maintenance)
  • Use a platform like CodeWords for orchestration, so workflows can call both local models and cloud APIs from the same pipeline
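The hybrid rules above reduce to a small routing decision. The sketch below is one way to express it. The Task fields and the two route labels are hypothetical names chosen for illustration, not part of CodeWords or any other platform.

```python
from dataclasses import dataclass

# Task kinds that stay on local models even when the data is not sensitive.
LOCAL_TASKS = {"embedding", "classification"}


@dataclass
class Task:
    kind: str        # e.g. "embedding", "classification", "generation"
    sensitive: bool  # True if the payload must not leave the network


def route(task):
    """Return "local" or "managed" following the hybrid rules above."""
    if task.sensitive or task.kind in LOCAL_TASKS:
        return "local"
    return "managed"


print(route(Task(kind="generation", sensitive=False)))
print(route(Task(kind="generation", sensitive=True)))
```

Putting the sensitivity check first means a sensitive generation request never reaches a managed API, even though generation would otherwise go to the cloud.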

How does CodeWords fit into a self-hosted AI stack?

CodeWords does not replace your self-hosted models — it orchestrates around them. If your data processing and embedding steps run on local Ollama, CodeWords can handle the trigger, routing, integration, and notification layers.

For example, a document processing workflow might:

  1. Trigger when a PDF arrives in Google Drive
  2. Call your local Ollama instance to extract and classify content
  3. Store embeddings in your self-hosted Qdrant
  4. Use CodeWords' native LLM access to generate a summary with a frontier model
  5. Push results to Notion, Slack, or HubSpot via CodeWords integrations

This hybrid pattern keeps sensitive data local while using managed AI for generation tasks where quality matters most.

FAQ

Is a self-hosted AI starter kit free?

The software is typically free and open source. The cost is hardware: a capable GPU ($800–$2,000 for consumer, $10,000+ for enterprise), electricity, and maintenance time. Cloud GPU instances (Lambda Labs, RunPod, AWS) cost $0.50–$4.00/hour depending on GPU type.
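To compare a rented GPU with a managed API, a rough break-even estimate helps. The sketch below uses the hourly range quoted above. The API price of $5 per million tokens is a hypothetical placeholder, so substitute your provider's actual rate.

```python
def monthly_gpu_cost(hourly_rate, hours_per_day=24, days=30):
    """Cost of keeping one cloud GPU instance running for a month."""
    return hourly_rate * hours_per_day * days


def break_even_tokens(gpu_cost, api_price_per_million):
    """Monthly token volume at which API spend equals GPU spend."""
    return gpu_cost / api_price_per_million * 1_000_000


low = monthly_gpu_cost(0.50)   # cheapest instance in the quoted range
high = monthly_gpu_cost(4.00)  # most expensive instance in the quoted range
API_PRICE = 5.0                # hypothetical USD per million tokens

print(f"GPU: ${low:.0f} to ${high:.0f} per month")
print(f"Break-even at the low end: {break_even_tokens(low, API_PRICE):,.0f} tokens/month")
```

Below the break-even volume the API is cheaper on raw spend alone. The estimate ignores maintenance time and assumes the instance runs around the clock.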

Can I fine-tune models in a self-hosted kit?

Yes. Tools like Axolotl and Unsloth support fine-tuning on consumer GPUs using LoRA and QLoRA techniques. Fine-tuning Llama 3 8B on a single RTX 4090 takes 2–6 hours depending on dataset size.

How do I keep self-hosted models updated?

Follow model release channels (Hugging Face, Ollama library). Test new models in a staging environment before swapping production. Automate model pulls with a cron job or a scheduled CodeWords workflow.
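A minimal sketch of a scheduled pull script, assuming the Compose setup from this guide and a service named ollama. It must run from the directory that holds the Compose file. The model list is an example.

```python
import subprocess

MODELS = ["llama3", "mistral"]  # example list; use the models you actually serve


def pull_command(model, service="ollama"):
    """Command that pulls `model` inside the Compose service; -T disables the TTY for cron."""
    return ["docker", "compose", "exec", "-T", service, "ollama", "pull", model]


def refresh(models):
    """Pull each model in turn and return the names that failed."""
    failed = []
    for model in models:
        result = subprocess.run(pull_command(model), capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(model)
    return failed
```

A cron entry would call `refresh(MODELS)` on a schedule and alert on a non-empty result. Point the job at staging first, since a pull replaces the model behind the tag.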

What about security for self-hosted AI?

Self-hosting improves data privacy but introduces infrastructure security obligations. Isolate the AI stack on a dedicated network segment, restrict API access, encrypt data at rest, and audit inference logs. The model itself does not phone home, but your orchestration layer might.

Where self-hosting actually leads

The real destination is not a self-hosted AI stack. It is decision infrastructure — systems that read, reason, and act on your data without leaving your perimeter. The starter kit is the first brick. Production readiness comes from orchestration, monitoring, and the discipline to draw a clear line between what runs locally and what runs in the cloud.

Start the orchestration layer in CodeWords and connect it to your local AI infrastructure through custom API calls.
