Rate limits

Understand per-key and per-IP throttling, expected 429 responses, and recommended backoff strategies.

Overview

  • KushRouter enforces rate limits at multiple layers to protect stability and fairness.
  • Limits vary by plan and may adapt based on real-time system conditions.
  • Two primary policies are commonly applied:
    • Per-API key limits
    • Per-IP address limits

Account vs key-level

  • Some plans also enforce account-level budgets separate from individual API key limits.
  • If both apply, the more restrictive limit triggers first (e.g., an exhausted account budget yields 429 even if an individual key still has headroom).

Clients should implement robust retry and backoff to gracefully handle 429 responses.

Rate limits are generally enforced in rolling windows (e.g., per‑minute) and may also include per‑second burst guards on some providers. Treat both as independent caps: hitting either will yield 429.
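As an illustration, a client can guard both caps independently before sending. The sketch below is our own helper, not part of any SDK: it treats a per-second burst limit and a rolling per-minute limit as separate checks, mirroring the point above that hitting either yields 429.

```python
import time

class DualWindowLimiter:
    """Client-side guard for two independent caps: a per-second burst
    limit and a rolling per-minute limit. Illustrative sketch only."""

    def __init__(self, per_second, per_minute, clock=time.monotonic):
        self.per_second = per_second
        self.per_minute = per_minute
        self.clock = clock
        self.timestamps = []  # send times within the last 60 seconds

    def try_acquire(self):
        now = self.clock()
        # Drop entries that have aged out of the rolling minute.
        self.timestamps = [t for t in self.timestamps if now - t < 60]
        in_last_second = [t for t in self.timestamps if now - t < 1]
        # Either cap being full means the request should wait.
        if len(in_last_second) >= self.per_second or len(self.timestamps) >= self.per_minute:
            return False
        self.timestamps.append(now)
        return True
```

A caller would sleep briefly and retry whenever try_acquire() returns False, keeping client-side pacing below both limits.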

429 — Rate limit exceeded

Example error response:

{
  "error": "Rate limit exceeded (key)"
}

Variants:

  • Rate limit exceeded (ip)
  • Rate limit exceeded (key)

Headers

On 2xx and 429 responses, the server may include standard rate-limit headers:

  • X-RateLimit-Limit: maximum requests in the current window
  • X-RateLimit-Remaining: remaining requests in the window
  • X-RateLimit-Reset: epoch seconds when the window resets
  • Retry-After: seconds or HTTP-date to wait before retrying (primarily on 429)

Use these headers to pace requests and avoid bursts.
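For example, a small helper can turn these headers into a wait time before the next attempt. This is an illustrative sketch (the function name and parsing logic are ours): it prefers Retry-After and falls back to X-RateLimit-Reset when the window is exhausted.

```python
def retry_wait_seconds(headers, now_epoch):
    """Derive a wait time (seconds) from the rate-limit headers listed
    above. Sketch only; real servers may omit any of these headers."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form; fall through to the reset header
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is not None and reset is not None and int(remaining) == 0:
        # Window exhausted: wait until the reported reset time.
        return max(0.0, float(reset) - now_epoch)
    return 0.0  # headroom remains; no wait needed
```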

Example curl (-i prints response headers):

curl -i -X POST "https://api.kushrouter.com/api/openai/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5-mini-2025-08-07",
    "messages": [{"role":"user","content":"Hello"}]
  }'

Best practices

  • Use exponential backoff with jitter.
  • Avoid sending large bursts; use small, steady concurrency.
  • Prefer streaming for long responses to reduce end-to-end timeouts.
  • For batch workloads, prefer the Batches API where appropriate.

Burst vs steady-state

  • Short bursts may be allowed, but sustained high QPS should remain below the published limits.
  • Distribute concurrency evenly over time and across workers to avoid synchronized spikes.
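One way to avoid synchronized spikes is to phase-shift workers: give each a different start offset plus a little jitter, so N workers collectively send at an even cadence. A minimal sketch (the helper name and parameters are illustrative, not part of the API):

```python
import random

def worker_schedule(worker_id, num_workers, interval, jitter=0.1):
    """Return (start_offset, per_request_interval) for one worker so
    that num_workers workers spread requests evenly instead of firing
    in lockstep. Illustrative sketch only."""
    offset = (worker_id / num_workers) * interval  # phase-shift each worker
    offset += random.uniform(0, jitter)            # small jitter breaks residual sync
    return offset, interval
```

Each worker would sleep for its offset once at startup, then issue one request per interval.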

Backoff examples

TypeScript (fetch)

function sleep(ms: number) { return new Promise(r => setTimeout(r, ms)); }

async function withBackoff(requestFn: () => Promise<Response>, max = 6) {
  let delay = 500;
  for (let i = 0; i < max; i++) {
    const res = await requestFn();
    if (res.status !== 429) return res;
    // Honor Retry-After when the server provides it; otherwise back off
    // exponentially with jitter.
    const retryAfter = Number(res.headers.get('retry-after'));
    const wait = Number.isFinite(retryAfter) && retryAfter > 0
      ? retryAfter * 1000
      : delay + Math.floor(Math.random() * 250);
    await sleep(wait);
    delay = Math.min(delay * 2, 8000);
  }
  throw new Error('429 retries exhausted');
}
 
// Usage (OpenAI-compatible example)
await withBackoff(() => fetch('https://api.kushrouter.com/api/openai/v1/chat/completions', { /* init */ }));

Python (requests)

import time, random, requests

def with_backoff(do_request, max_attempts=6):
    delay = 0.5
    for _ in range(max_attempts):
        r = do_request()
        if r.status_code != 429:
            return r
        # Honor Retry-After when the server provides it; otherwise back off
        # exponentially with jitter.
        try:
            wait = float(r.headers.get('Retry-After'))
        except (TypeError, ValueError):
            wait = delay + random.uniform(0, 0.25)
        time.sleep(wait)
        delay = min(delay * 2, 8)
    raise RuntimeError('429 retries exhausted')
 
# Usage (OpenAI-compatible example)
with_backoff(lambda: requests.post('https://api.kushrouter.com/api/openai/v1/chat/completions', json={
    'model': 'gpt-5-mini-2025-08-07',
    'messages': [{'role': 'user', 'content': 'Hello'}]
}))

Concurrency guidance

  • Start with 2–5 concurrent requests and increase gradually while monitoring 429s.
  • For sustained high throughput, coordinate concurrency across workers.
  • Avoid holding open connections unnecessarily (close streams when done).

Streaming considerations

  • Streaming can improve responsiveness but still counts toward rate limits.
  • If a stream is aborted, clients should close the connection to free up capacity.
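To ensure an aborted stream releases its connection, wrap consumption in try/finally. The sketch below (the helper name is ours) works with any response object exposing iter_lines() and close(), for example a requests response opened with stream=True:

```python
def read_stream(response, max_lines=None):
    """Consume a streamed response, closing it even if the caller stops
    early or the stream errors. Illustrative sketch; `response` is any
    object with iter_lines() and close()."""
    lines = []
    try:
        for line in response.iter_lines():
            lines.append(line)
            if max_lines is not None and len(lines) >= max_lines:
                break  # early abort: finally still closes the connection
    finally:
        response.close()  # free server-side capacity in every case
    return lines
```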

Contact and upgrades

  • If you routinely hit limits, consider upgrading your plan or contacting support with your x-request-id examples and target throughput.

See also