BlogResources

OpenAI API limits: rate limits, quotas, and workarounds

Understand OpenAI API limits — rate limits, token quotas, and tier thresholds. Practical strategies to stay under limits and scale production workloads.

Osman RamadanJune 9, 20267 min read

OpenAI API limits: rate limits, quotas, and workarounds

OpenAI API limits are the guardrails between your application and OpenAI's infrastructure — rate limits (requests per minute), token limits (tokens per minute and per day), and model-specific context windows. Every production application hits them eventually. The question is whether you hit them gracefully or catastrophically.

The direct answer: OpenAI enforces limits at three levels — requests per minute (RPM), tokens per minute (TPM), and tokens per day (TPD) — organized into usage tiers that increase with your payment history. According to OpenAI's official documentation, a Tier 1 account (after $5 in payments) gets 500 RPM and 200,000 TPM for GPT-4o. A Tier 5 account gets 10,000 RPM and 30,000,000 TPM. The 2025 Retool State of AI report found that 62% of developers building with LLM APIs reported rate limiting as their top operational challenge.

Unlike generic AI automation posts, this guide shows real CodeWords workflows — not just theory.

TL;DR

OpenAI enforces RPM (requests per minute), TPM (tokens per minute), and TPD (tokens per day) limits organized into five usage tiers based on payment history and account age.
Rate limit errors (HTTP 429) are recoverable with exponential backoff, but architectural patterns — batching, caching, model routing — prevent them in the first place.
CodeWords manages rate limiting natively in its serverless execution layer, with built-in retry logic and multi-model routing through OpenAI, Anthropic, and Google Gemini.

What are the current OpenAI API rate limits?

OpenAI organizes limits into tiers. You advance by accumulating payment history and account age. Here are the key thresholds as of 2026:

Tier 1 (after $5 payment) - GPT-4o: 500 RPM, 200,000 TPM - GPT-4o-mini: 500 RPM, 2,000,000 TPM - text-embedding-3-small: 3,000 RPM, 1,000,000 TPM - Spend limit: $100/month

Tier 2 (after $50 spent, 7+ days) - GPT-4o: 5,000 RPM, 450,000 TPM - GPT-4o-mini: 5,000 RPM, 4,000,000 TPM - Spend limit: $500/month

Tier 3 (after $100 spent, 7+ days) - GPT-4o: 5,000 RPM, 800,000 TPM - Spend limit: $1,000/month

Tier 5 (after $1,000 spent, 30+ days) - GPT-4o: 10,000 RPM, 30,000,000 TPM - Spend limit: $50,000/month

These numbers change. Check OpenAI's rate limits page for current values. The pattern is consistent: more payment history unlocks higher limits.

Think of rate limits as a highway on-ramp meter. The highway (OpenAI's GPU clusters) has finite capacity. The meter regulates how fast each driver (API consumer) enters. Paying more gets you into the express lane, but everyone shares the same road.

How do rate limit errors work?

When you exceed a limit, OpenAI returns HTTP 429 (Too Many Requests) with headers telling you what you hit:

x-ratelimit-limit-requests: Your RPM cap
x-ratelimit-remaining-requests: Requests left in the current window
x-ratelimit-reset-requests: When the request counter resets
x-ratelimit-limit-tokens: Your TPM cap
x-ratelimit-remaining-tokens: Tokens left

A 429 is not an error in the failure sense. It is a flow control signal. Your application should handle it the same way TCP handles congestion — back off, wait, retry.

import time
import openai
from openai import RateLimitError

def call_with_backoff(messages, model="gpt-4o", max_retries=5):
    for attempt in range(max_retries):
        try:
            return openai.chat.completions.create(
                model=model, messages=messages
            )
        except RateLimitError as e:
            wait = min(2 ** attempt + random.uniform(0, 1), 60)
            time.sleep(wait)
    raise Exception("Max retries exceeded")

The exponential backoff with jitter is essential. Without jitter, concurrent processes retry simultaneously, creating a thundering herd that makes the problem worse.

What architectural patterns prevent rate limit issues?

Handling 429s reactively is the minimum. These patterns prevent them proactively.

Pattern 1 — Request batching. Instead of sending 100 individual API calls, batch related work. OpenAI's Batch API processes large volumes at 50% cost reduction with 24-hour turnaround. Ideal for non-real-time workloads like content generation, classification, and data extraction.

Pattern 2 — Response caching. Identical or semantically similar queries should return cached results. A Redis cache with embedding-based similarity matching catches near-duplicates. CodeWords' Redis state persistence layer handles this natively. For embedding-based caching, see the OpenRouter embeddings guide.

Pattern 3 — Model routing. Not every request needs GPT-4o. Route simple classification tasks to GPT-4o-mini (which has 4x the TPM allowance at Tier 1), save GPT-4o for complex reasoning. CodeWords provides access to OpenAI, Anthropic, and Google Gemini through a unified interface — if one provider hits limits, route to another. See AI coding models for model selection guidance.

Pattern 4 — Queue-based processing. Instead of firing API calls synchronously, push requests to a queue and process them at a controlled rate. A token bucket algorithm ensures you use your full allocation without exceeding it.

In CodeWords, the serverless execution layer handles rate limiting automatically. Workflows that call LLMs get retry logic, backoff, and model fallback without explicit configuration. See CodeWords pricing for execution costs.

How do token limits differ from rate limits?

Rate limits cap throughput (how fast). Token limits cap capacity (how much).

Context window limits: Each model has a maximum context length. GPT-4o supports 128,000 tokens per request. GPT-4o-mini also supports 128,000 tokens. These are hard limits — exceed them and the request fails, no retry possible.

Tokens per minute (TPM): This is the aggregate token throughput. A request with a 10,000-token prompt and 2,000-token response consumes 12,000 tokens from your TPM budget. Large prompts (RAG contexts, long documents) deplete TPM faster than many small requests.

Tokens per day (TPD): Some tiers enforce daily caps. Once hit, you wait until the next day or upgrade your tier.

The practical impact: a RAG pipeline that sends 5,000-token contexts with each query uses 25x more TPM than a simple classification prompt with 200 tokens. Design prompts to be efficient. Trim context to what's relevant. Use the OpenAI structured outputs feature to constrain response length.

How do you monitor API usage effectively?

Blind usage leads to surprise 429s and unexpected bills. Monitor both.

OpenAI's dashboard shows daily usage, costs by model, and current tier. Check it weekly minimum.

Application-level monitoring: Log every API call with model, token count (prompt + completion), latency, and status code. Aggregate into dashboards. Alert when usage hits 80% of your tier limit.

Cost tracking: GPT-4o costs $2.50 per million input tokens and $10 per million output tokens (as of 2026). A workflow processing 10,000 documents at 1,000 tokens each costs roughly $25 for embedding or $100+ for GPT-4o processing. Track costs per workflow, per user, per feature.

CodeWords workflows can pipe usage metrics to Slack for alerts or Google Sheets for tracking. See workflow automation examples and AI workflow automation for monitoring patterns.

What happens when limits aren't enough?

If Tier 5 limits still aren't sufficient, you have options:

Contact OpenAI sales for custom rate limits. Enterprise agreements unlock significantly higher thresholds.
Distribute across organizations. Multiple OpenAI organizations with separate billing can multiply your effective limits. Each org gets its own tier allocation.
Use multiple providers. Route overflow traffic to Anthropic (Claude) or Google (Gemini). Different providers, different limits, same capability class. CodeWords makes this trivial since it provides access to all three without separate API keys. See custom AI agent development for multi-provider architectures.

FAQ

How do I check my current OpenAI API tier?

Log into platform.openai.com, go to Settings → Limits. Your current tier and rate limits are displayed there. You can also check programmatically by reading the x-ratelimit-* response headers from any API call.

Do rate limits apply to the Assistants API?

Yes. The Assistants API shares the same rate limit pool as the Chat Completions API. Runs and messages count against your RPM and TPM quotas for the underlying model.

Are embedding API limits separate from chat limits?

Yes. Embedding models (text-embedding-3-small, text-embedding-3-large) have separate RPM and TPM quotas from chat models. You can max out your embedding budget without affecting your GPT-4o availability.

What's the difference between rate limits and spending limits?

Rate limits cap throughput (requests and tokens per minute). Spending limits cap total cost (dollars per month). Both can stop your application — rate limits cause 429 errors, spending limits cause 402 errors. Set spending limits intentionally to prevent runaway costs from bugs or unexpected traffic.

Limits as design constraints

Rate limits are not obstacles. They are design constraints that push you toward better architecture: caching, batching, model routing, efficient prompts. Applications that handle limits well are also more cost-efficient and more resilient to provider outages.

The implication is that rate limit planning is not a DevOps afterthought — it is an architectural decision that shapes how you build every AI-powered feature.

Start building rate-limit-aware AI workflows at CodeWords — multi-provider access, automatic retry handling, and managed serverless execution out of the box.