Rate Limiting APIs in Go: Algorithms & Implementation

Rate limiting is critical for AI-powered APIs — unbounded requests mean unbounded LLM costs. A single bad actor or runaway client can exhaust your monthly API budget in hours. Here are the three rate limiting algorithms and their Go implementations, along with when each is appropriate.

The three algorithms

1. Token Bucket

Tokens fill at a fixed rate; each request costs tokens. Allows short bursts up to bucket capacity.

Capacity: 100 tokens
Refill: 10 tokens/second
Request cost: 1 token per API call

Burst behavior: client can fire 100 requests instantly,
then is limited to 10 requests/second thereafter

Best for: user-facing APIs where occasional bursts are acceptable.

2. Sliding Window Counter

Count the requests made in the last N seconds. Reject a new request if that count has already reached the limit.

Window: 60 seconds
Limit: 100 requests/window

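At t=45: user has made 95 requests in [0, 45]
At t=50: user has sent 5 more → 100 requests in [0, 50]
At t=70: user sends another request → count requests in [70-60, 70] = [10, 70]
→ If only 90 of them fall inside the window, allow
→ If all 100 fall inside the window, reject with 429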

Best for: strict per-period quotas (100 requests per minute, exactly).
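
For a single process, the same idea can be sketched without Redis by keeping a log of accepted timestamps. This is a minimal illustration, not the implementation used later in this article: SlidingWindowLog, NewSlidingWindowLog, and AllowAt are hypothetical names, and timestamps are Unix milliseconds passed in by the caller so the logic can be tested without a real clock.

```go
package main

import (
	"sync"
	"time"
)

// SlidingWindowLog is a hypothetical in-memory limiter. It remembers the
// timestamp of every accepted request and counts those still in the window.
type SlidingWindowLog struct {
	mu       sync.Mutex
	limit    int
	windowMs int64
	stamps   []int64 // Unix-millisecond timestamps of accepted requests, oldest first
}

func NewSlidingWindowLog(limit int, window time.Duration) *SlidingWindowLog {
	return &SlidingWindowLog{limit: limit, windowMs: window.Milliseconds()}
}

// Allow checks a request arriving right now.
func (s *SlidingWindowLog) Allow() bool {
	return s.AllowAt(time.Now().UnixMilli())
}

// AllowAt checks a request arriving at nowMs. Callers are assumed to pass
// non-decreasing timestamps.
func (s *SlidingWindowLog) AllowAt(nowMs int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	// Drop timestamps at or before the start of the window.
	cutoff := nowMs - s.windowMs
	i := 0
	for i < len(s.stamps) && s.stamps[i] <= cutoff {
		i++
	}
	s.stamps = s.stamps[i:]

	if len(s.stamps) >= s.limit {
		return false // window is full; rejected requests are not recorded
	}
	s.stamps = append(s.stamps, nowMs)
	return true
}
```

Memory grows with the limit, since every accepted request in the window is stored. That is usually fine for limits in the hundreds, but it is one reason high-volume limiters often approximate the window with counters instead.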

3. Leaky Bucket (request queue)

Requests queue; queue drains at fixed rate regardless of burst.

Queue capacity: 50 requests
Drain rate: 10 requests/second

Client sends 100 requests instantly:
→ 50 queued, 50 rejected (queue full)
→ Queued requests processed at 10/sec

Best for: smoothing traffic and preventing backend overload (not so useful for user-facing rate limits).
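
A leaky bucket maps naturally onto a buffered channel plus a ticker. The sketch below is one possible shape under stated assumptions: LeakyBucket, Submit, and Stop are hypothetical names, jobs run one at a time on the drain goroutine, and the interval must be greater than zero (time.NewTicker panics otherwise).

```go
package main

import "time"

// LeakyBucket is a hypothetical bounded queue drained at a fixed rate.
type LeakyBucket struct {
	queue chan func()
	stop  chan struct{}
}

// NewLeakyBucket starts a goroutine that runs at most one queued job per
// interval. capacity bounds how many jobs may wait in the queue.
func NewLeakyBucket(capacity int, interval time.Duration) *LeakyBucket {
	b := &LeakyBucket{
		queue: make(chan func(), capacity),
		stop:  make(chan struct{}),
	}
	go b.drain(interval)
	return b
}

func (b *LeakyBucket) drain(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-b.stop:
			return
		case <-ticker.C:
			select {
			case job := <-b.queue:
				job() // a slow job delays the ticks that follow it
			default:
				// nothing queued on this tick
			}
		}
	}
}

// Submit enqueues job, or returns false immediately if the queue is full.
func (b *LeakyBucket) Submit(job func()) bool {
	select {
	case b.queue <- job:
		return true
	default:
		return false
	}
}

// Stop ends the drain goroutine. Call it once; queued jobs are abandoned.
func (b *LeakyBucket) Stop() { close(b.stop) }
```

With NewLeakyBucket(50, 100*time.Millisecond), a burst of 100 Submit calls queues roughly the first 50 and rejects the rest, and the queue then drains at about 10 jobs per second, matching the numbers above.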

Token bucket in Go

// internal/ratelimit/tokenbucket.go
package ratelimit

import (
    "sync"
    "time"
)

type TokenBucket struct {
    capacity float64
    tokens   float64
    refill   float64  // tokens per second
    lastTime time.Time
    mu       sync.Mutex
}

func NewTokenBucket(capacity, ratePerSecond float64) *TokenBucket {
    return &TokenBucket{
        capacity: capacity,
        tokens:   capacity,
        refill:   ratePerSecond,
        lastTime: time.Now(),
    }
}

func (b *TokenBucket) Allow() bool {
    b.mu.Lock()
    defer b.mu.Unlock()

    now := time.Now()
    elapsed := now.Sub(b.lastTime).Seconds()
    b.lastTime = now

    // Refill tokens
    b.tokens = min(b.capacity, b.tokens+elapsed*b.refill)

    if b.tokens < 1 {
        return false
    }
    b.tokens--
    return true
}

func min(a, b float64) float64 {
    if a < b { return a }
    return b
}
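
A single TokenBucket limits all callers together. On one server instance, per-client limits can be approximated by keeping one bucket per key. The following is a minimal sketch, assuming an in-memory map is acceptable: KeyedLimiter and its fields are hypothetical names, and idle keys are never evicted here, so a long-running service would need some cleanup.

```go
package main

import (
	"sync"
	"time"
)

// bucket holds the token-bucket state for one key.
type bucket struct {
	tokens   float64
	lastTime time.Time
}

// KeyedLimiter is a hypothetical per-key token bucket for a single process.
type KeyedLimiter struct {
	mu       sync.Mutex
	capacity float64
	refill   float64 // tokens per second
	buckets  map[string]*bucket
}

func NewKeyedLimiter(capacity, ratePerSecond float64) *KeyedLimiter {
	return &KeyedLimiter{
		capacity: capacity,
		refill:   ratePerSecond,
		buckets:  make(map[string]*bucket),
	}
}

// Allow spends one token from key's bucket, creating a full bucket on first use.
func (l *KeyedLimiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	now := time.Now()
	b, ok := l.buckets[key]
	if !ok {
		b = &bucket{tokens: l.capacity, lastTime: now}
		l.buckets[key] = b
	}

	// Same refill rule as a single bucket: add elapsed*rate, capped at capacity.
	b.tokens += now.Sub(b.lastTime).Seconds() * l.refill
	if b.tokens > l.capacity {
		b.tokens = l.capacity
	}
	b.lastTime = now

	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}
```

One mutex guards every key, which keeps the sketch short but serializes all callers. Sharding the map or using a per-bucket lock are the usual next steps if that becomes a bottleneck.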

Per-user rate limiting with Redis

For distributed rate limiting (multiple server instances), use Redis:

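// internal/ratelimit/redis.go
package ratelimit

import (
    "context"
    "math/rand"
    "strconv"
    "time"

    "github.com/redis/go-redis/v9" // assumed client library; adjust the path for v8
)

type RedisRateLimiter struct {
    rdb    *redis.Client
    limit  int
    window time.Duration
}

// Sliding window using Redis sorted sets
func (r *RedisRateLimiter) Allow(ctx context.Context, key string) (bool, int, error) {
    now := time.Now().UnixMilli()
    windowStart := now - r.window.Milliseconds()
    // Unique member, so two requests in the same millisecond are both counted
    member := strconv.FormatInt(now, 10) + "-" + strconv.FormatInt(rand.Int63(), 36)

    // MULTI/EXEC, so the four commands run as one atomic unit
    pipe := r.rdb.TxPipeline()
    // Remove expired entries
    pipe.ZRemRangeByScore(ctx, key, "-inf", strconv.FormatInt(windowStart, 10))
    // Count remaining in window
    countCmd := pipe.ZCard(ctx, key)
    // Add current request
    pipe.ZAdd(ctx, key, redis.Z{Score: float64(now), Member: member})
    // Set expiry
    pipe.Expire(ctx, key, r.window)

    if _, err := pipe.Exec(ctx); err != nil {
        return true, 0, err  // fail open on Redis error
    }

    count := int(countCmd.Val())
    if count >= r.limit {
        // Undo the add (best effort) so rejected requests don't keep the window full
        r.rdb.ZRem(ctx, key, member)
        return false, 0, nil  // rejected
    }
    return true, r.limit - count - 1, nil  // allowed, remaining count
}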

Usage in Chi middleware:

func RateLimitMiddleware(limiter *RedisRateLimiter) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // Key by user ID (from JWT) or IP as fallback
            key := "rl:" + extractUserKey(r)
            
            allowed, remaining, err := limiter.Allow(r.Context(), key)
            if err != nil {
                // Redis error — fail open (allow request)
                next.ServeHTTP(w, r)
                return
            }
            
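            // Set rate limit headers (X-RateLimit-* is a de facto convention, not an RFC)
            w.Header().Set("X-RateLimit-Limit", strconv.Itoa(limiter.limit))
            w.Header().Set("X-RateLimit-Remaining", strconv.Itoa(remaining))

            if !allowed {
                // 429 Too Many Requests is defined in RFC 6585.
                // Worst case, the client has to wait one full window.
                w.Header().Set("Retry-After", strconv.Itoa(int(limiter.window.Seconds())))
                // http.Error would force Content-Type: text/plain, so write the JSON body directly
                w.Header().Set("Content-Type", "application/json")
                w.WriteHeader(http.StatusTooManyRequests)
                w.Write([]byte(`{"error":"rate limit exceeded"}`))
                return
            }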
            
            next.ServeHTTP(w, r)
        })
    }
}
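
The middleware calls extractUserKey, which is not shown. One way it might look is sketched below, assuming an upstream auth middleware has already validated the JWT and stored the user ID in the request context. WithUserID, userIDKey, and userKey are hypothetical helpers introduced for this example only.

```go
package main

import (
	"context"
	"net"
	"net/http"
)

// userIDKey is an unexported context key type, so no other package can collide with it.
type userIDKey struct{}

// WithUserID is a hypothetical helper that an auth middleware would call
// after validating the JWT.
func WithUserID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, userIDKey{}, id)
}

// extractUserKey prefers the authenticated user ID and falls back to the client IP.
func extractUserKey(r *http.Request) string {
	id, _ := r.Context().Value(userIDKey{}).(string)
	return userKey(id, r.RemoteAddr)
}

// userKey builds the rate-limit key from a user ID (may be empty) and a
// "host:port" remote address.
func userKey(userID, remoteAddr string) string {
	if userID != "" {
		return "user:" + userID
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr // no port present; use the value as-is
	}
	return "ip:" + host
}
```

Behind a load balancer or reverse proxy, RemoteAddr is the proxy's address, so every anonymous client would share one key. In that setup the fallback would need to read a forwarded-for header set by a proxy you trust, which this sketch does not attempt.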

Tiered limits for AI endpoints

LLM-backed endpoints need stricter limits than regular API endpoints:

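// RateLimit is a request quota over a window.
type RateLimit struct {
    Requests int
    Window   time.Duration
}

// Checked in order (map iteration order is random in Go),
// so list more specific prefixes first.
var limits = []struct {
    prefix string
    limit  RateLimit
}{
    {"/v1/generate/video", RateLimit{100, time.Hour}},     // expensive: 100/hour
    {"/v1/generate/script", RateLimit{500, time.Hour}},    // cheaper: 500/hour
    {"/v1/search", RateLimit{1000, time.Minute}},          // fast: 1000/min
    {"/v1/users", RateLimit{5000, time.Minute}},           // cheap: 5000/min
}

func adaptiveLimiter(r *http.Request) RateLimit {
    for _, l := range limits {
        if strings.HasPrefix(r.URL.Path, l.prefix) {
            return l.limit
        }
    }
    return RateLimit{1000, time.Minute} // default
}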

Handling rate limit headers on the client

The server sends a Retry-After header with every 429 response; clients should wait that long before retrying:

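// TypeScript client
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function apiCall(url: string, retriesLeft = 1): Promise<Response> {
    const resp = await fetch(url)

    if (resp.status === 429 && retriesLeft > 0) {
        // Retry-After may also be an HTTP date; fall back to 60s if it isn't a number
        const parsed = parseInt(resp.headers.get("Retry-After") ?? "", 10)
        const retryAfter = Number.isNaN(parsed) ? 60 : parsed
        await sleep(retryAfter * 1000)
        return apiCall(url, retriesLeft - 1)  // retry once after backoff
    }

    return resp
}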

FAQ

What's the difference between token bucket and sliding window rate limiting? Token bucket allows short bursts (up to bucket capacity) before limiting to the refill rate. Sliding window counts requests in the last N seconds and rejects once the limit is reached, so a client can never exceed the quota in any N-second span. Use token bucket for UX-friendly APIs, sliding window for strict per-period quotas.

Should I rate limit by user ID or IP address? By user ID whenever possible (requires authentication). IP-based rate limiting is easier to bypass (VPN, multiple IPs) and can unfairly affect users behind shared NAT. IP is a fallback for unauthenticated endpoints.

What happens if Redis goes down and you use Redis for rate limiting? Fail open (allow requests) — the alternative is failing closed (reject all requests), which is worse. Log the Redis failure and alert. A brief Redis outage causes temporary unthrottled traffic, which is survivable; rejecting all requests due to a Redis outage is not.

How do you rate limit LLM API calls specifically? Two levels: (1) your own API rate limits prevent individual users from abusing your service; (2) provider-level limits need client-side exponential backoff with jitter when you hit them. Implement both.
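
For the second level, exponential backoff with jitter can be sketched as follows. This is an illustration under assumptions, not a provider SDK: ErrRateLimited, backoffCeiling, backoffDelay, and retry are hypothetical names, and the caller is assumed to translate the provider's 429 response into ErrRateLimited. The jitter here is the "full jitter" variant, where each delay is drawn uniformly from zero up to the exponential ceiling.

```go
package main

import (
	"errors"
	"math/rand"
	"time"
)

// ErrRateLimited is a hypothetical sentinel the caller returns when the
// provider responds with 429.
var ErrRateLimited = errors.New("provider rate limit hit")

// backoffCeiling returns min(maxDelay, base * 2^attempt) without overflowing.
func backoffCeiling(attempt int, base, maxDelay time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		if d >= maxDelay/2 {
			return maxDelay
		}
		d *= 2
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

// backoffDelay picks a random delay in [0, ceiling), so that many clients
// hitting the limit together do not all retry at the same instant.
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	ceiling := backoffCeiling(attempt, base, maxDelay)
	if ceiling <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(ceiling)))
}

// retry runs call up to attempts times, sleeping between tries only when
// the provider reported a rate limit. Any other result is returned at once.
func retry(attempts int, base, maxDelay time.Duration, call func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = call()
		if !errors.Is(err, ErrRateLimited) {
			return err // success (nil) or a non-retryable error
		}
		if i < attempts-1 {
			time.Sleep(backoffDelay(i, base, maxDelay))
		}
	}
	return err
}
```

A call such as retry(5, time.Second, 30*time.Second, callProvider) would wait up to 1s, 2s, 4s, and 8s between the five tries, where callProvider stands in for whatever function wraps your provider request. If the provider sends its own Retry-After value, honoring that instead of the computed delay is usually the better choice.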

Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. See also: Go Microservices: Patterns I Use in Production · LLM Cost Optimization: Cut Your AI API Bills 10x.