System Design for AI Products: Patterns & Trade-offs

AI products fail in ways that purely deterministic systems don't: LLM APIs time out under load, model outputs are non-deterministic (same input, different output), and quality degrades silently rather than erroring loudly. Here are the system design patterns that handle these challenges.

The fundamental asymmetry: LLM calls are slow and expensive

The core constraint that drives all AI product system design:

LLM API call: 300ms–10 seconds, $0.0001–$0.10
Database query: 1ms–50ms, ~$0
External HTTP call: 50ms–500ms, ~$0

Every architectural decision should account for this asymmetry. Patterns that work for fast, cheap operations often don't work for slow, expensive LLM calls.
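
To make the asymmetry concrete, here is a back-of-the-envelope sketch using the per-call ranges above. The traffic figure (`CALLS_PER_DAY`) is a hypothetical number chosen for illustration, not a measurement from any product mentioned in this article.

```python
# Per-call figures come from the ranges quoted above.
# CALLS_PER_DAY is a hypothetical traffic volume, used only for illustration.
CALLS_PER_DAY = 100_000

LLM_COST_RANGE = (0.0001, 0.10)    # dollars per LLM call (cheapest, priciest)
LLM_LATENCY_RANGE = (0.3, 10.0)    # seconds per LLM call (fastest, slowest)
DB_LATENCY_RANGE = (0.001, 0.050)  # seconds per database query


def daily_cost(calls: int, cost_per_call: float) -> float:
    """Total daily spend if every request makes exactly one LLM call."""
    return calls * cost_per_call


low = daily_cost(CALLS_PER_DAY, LLM_COST_RANGE[0])
high = daily_cost(CALLS_PER_DAY, LLM_COST_RANGE[1])
print(f"daily LLM spend: ${low:,.2f} to ${high:,.2f}")

# Even the fastest LLM call is slower than the slowest database query.
latency_ratio = LLM_LATENCY_RANGE[0] / DB_LATENCY_RANGE[1]
print(f"fastest LLM call vs slowest DB query: {latency_ratio:.0f}x slower")
```

At that assumed volume, one LLM call per request costs anywhere from $10 to $10,000 per day depending on the model, which is why routing and caching decisions matter so much.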

Pattern 1: Async-first for user-facing latency

Synchronous LLM calls in the request path:

User → API → LLM (2–5 seconds) → Response
User stares at spinner

Async with polling or WebSocket:

User → API → Job enqueued → Immediate response (job ID)
User → Poll /status/{job_id} or WebSocket
LLM completes → Status update pushed

QuantumSketch uses this pattern — video generation takes 90+ seconds. The API returns a job ID immediately; the client polls for status.

Even for < 2-second operations, consider async if the user can do other things while waiting. "Generating your explanation..." with a progress indicator is better UX than a 2-second freeze.
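
The enqueue-then-poll flow above can be sketched in a few lines of `asyncio`. This is a minimal in-process illustration, not how QuantumSketch is implemented: the `jobs` dictionary stands in for a real job store, and `fake_llm`, `submit`, and `get_status` are hypothetical names introduced here.

```python
import asyncio
import uuid

# In-memory job store; a real deployment would use a database or Redis.
jobs: dict[str, dict] = {}
# Keep references to running tasks so they are not garbage-collected.
_background: set[asyncio.Task] = set()


async def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a slow LLM call."""
    await asyncio.sleep(0.05)
    return f"explanation for: {prompt}"


async def run_job(job_id: str, prompt: str) -> None:
    """Background worker: runs the slow call and records the outcome."""
    try:
        result = await fake_llm(prompt)
        jobs[job_id] = {"status": "done", "result": result}
    except Exception as exc:
        jobs[job_id] = {"status": "failed", "error": str(exc)}


def submit(prompt: str) -> str:
    """Enqueue the work and return a job ID immediately.

    Must be called from inside a running event loop.
    """
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "pending"}
    task = asyncio.create_task(run_job(job_id, prompt))
    _background.add(task)
    task.add_done_callback(_background.discard)
    return job_id


def get_status(job_id: str) -> dict:
    """What a GET /status/{job_id} handler would return."""
    return jobs.get(job_id, {"status": "unknown"})


async def demo() -> dict:
    job_id = submit("what is a closure?")
    first = get_status(job_id)  # returned before the LLM call has run
    while get_status(job_id)["status"] == "pending":
        await asyncio.sleep(0.01)  # the client polls
    return {"first": first, "final": get_status(job_id)}


result = asyncio.run(demo())
print(result["first"]["status"], "->", result["final"]["status"])
```

The caller gets a job ID before any LLM work has started, so the request handler returns in milliseconds regardless of how long generation takes.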

Pattern 2: Tiered model routing

type RoutePolicy struct {
    Default    string            // model for unspecified requests
    ByTask     map[string]string // task → model
    FallbackTo string            // if primary rate-limited
}

var BikroyBuddyPolicy = RoutePolicy{
    Default: "claude-haiku-4-5",
    ByTask: map[string]string{
        "negotiate":       "claude-sonnet-4-6",
        "product_search":  "", // no LLM needed, pure DB
        "intent_classify": "claude-haiku-4-5",
    },
    FallbackTo: "meta-llama/llama-3.1-70b-instruct:free",
}

Route each task type to the cheapest model that meets quality requirements. Test quality per task type with the candidate models before committing. The fallback model handles rate-limited periods gracefully.
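
The policy above says which model to use but not how the switch to the fallback happens. One possible approach, sketched below in Python, is a sliding-window call counter that routes to the fallback shortly before the provider's limit is reached. `RateLimitTracker` and `resolve_model` are hypothetical names, and the limit of 2 calls per 60 seconds is an illustrative value; in practice the limit would be set somewhat below the provider's real quota.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


class RateLimitTracker:
    """In-memory sliding-window counter of recent calls per model."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self.calls: dict[str, deque] = {}

    def near_limit(self, model: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.calls.setdefault(model, deque())
        while q and now - q[0] >= self.window:
            q.popleft()  # drop calls that have left the window
        return len(q) >= self.limit

    def record(self, model: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.calls.setdefault(model, deque()).append(now)


@dataclass
class RoutePolicy:
    default: str
    fallback_to: str
    by_task: dict[str, str] = field(default_factory=dict)


def resolve_model(policy: RoutePolicy, task: str, tracker: RateLimitTracker,
                  now: Optional[float] = None) -> Optional[str]:
    """Return the model to call, or None when the task needs no LLM."""
    model = policy.by_task.get(task, policy.default)
    if model == "":
        return None  # e.g. product_search: pure DB lookup
    if tracker.near_limit(model, now):
        return policy.fallback_to
    tracker.record(model, now)
    return model


policy = RoutePolicy(
    default="claude-haiku-4-5",
    fallback_to="meta-llama/llama-3.1-70b-instruct:free",
    by_task={
        "negotiate": "claude-sonnet-4-6",
        "product_search": "",
        "intent_classify": "claude-haiku-4-5",
    },
)
tracker = RateLimitTracker(limit=2, window_seconds=60.0)
# Explicit timestamps make the example deterministic.
picks = [resolve_model(policy, "negotiate", tracker, now=t) for t in (0.0, 1.0, 2.0, 61.5)]
print(picks)
```

The third request is routed to the fallback because two calls are already inside the window; by the fourth, both have expired and the primary model is used again. A Redis-backed counter would follow the same logic when several API instances share one quota.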

Pattern 3: Caching for repeated queries

LLM outputs can be cached when:

The same input produces the same desired output (deterministic use cases)
"Fresh enough" output is acceptable (not real-time personalization)

func (s *Service) GetExplanation(ctx context.Context, conceptID, userLevel string) (string, error) {
    cacheKey := fmt.Sprintf("explanation:%s:%s", conceptID, userLevel)
    
    // Check cache first
    if cached, err := s.redis.Get(ctx, cacheKey).Result(); err == nil {
        return cached, nil
    }
    
    // Generate via LLM
    explanation, err := s.llm.Complete(ctx, buildExplanationPrompt(conceptID, userLevel))
    if err != nil {
        return "", err
    }
    
    // Cache for 24 hours (explanations don't change often)
    s.redis.Set(ctx, cacheKey, explanation, 24*time.Hour)
    return explanation, nil
}

In offSchool, concept explanations are cached per (concept, mastery_level) pair. Cache hit rate is ~70% — meaning 70% of explanation requests cost zero LLM tokens.
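
To see how hit rate translates into avoided LLM calls, here is a small self-contained sketch with an in-memory TTL cache that counts hits and misses. It is not the offSchool implementation: `TTLCache` and `fake_generate` are hypothetical, and a TTL of 0 is treated as "do not cache", matching the guidance for personalized responses in the FAQ.

```python
import time
from typing import Callable


class TTLCache:
    """In-memory cache with per-entry expiry and hit/miss counters."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._store: dict[str, tuple[float, str]] = {}
        self._clock = clock
        self.hits = 0
        self.misses = 0

    def get_or_generate(self, key: str, ttl_seconds: float,
                        generate: Callable[[], str]) -> str:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = generate()  # the expensive LLM call happens only on a miss
        if ttl_seconds > 0:  # TTL 0 means "do not cache"
            self._store[key] = (now + ttl_seconds, value)
        return value

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


llm_calls = []  # one entry per simulated LLM call


def fake_generate() -> str:
    """Hypothetical stand-in for s.llm.Complete(...)."""
    llm_calls.append(1)
    return "A closure captures variables from its enclosing scope."


cache = TTLCache()
for _ in range(10):
    cache.get_or_generate("explanation:closures:beginner", 24 * 3600, fake_generate)

print(f"hit rate: {cache.hit_rate:.0%}, LLM calls made: {len(llm_calls)}")
```

Ten identical requests result in a single generation. Tracking the hit rate per key prefix is a cheap way to check whether a given TTL is actually paying for itself.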

Pattern 4: Fallback chains for reliability

LLM APIs have ~99.5% availability. Over a 30-day month, that means roughly 3.6 hours (about 216 minutes) of downtime, which is too much for a high-reliability product. Fallback chains handle provider outages:

import logging

log = logging.getLogger(__name__)

# call_llm, RateLimitError, and APIError are assumed to be defined by the
# application's own provider client wrapper (not shown here).


async def complete_with_fallback(prompt: str, task_type: str) -> str:
    providers = [
        ("anthropic", "claude-sonnet-4-6"),
        ("anthropic", "claude-haiku-4-5"),      # cheaper anthropic fallback
        ("openrouter", "meta-llama/llama-3.1-70b-instruct:free"),  # free fallback
    ]

    last_err = None
    for provider, model in providers:
        try:
            return await call_llm(provider, model, prompt, timeout=10.0)
        except (TimeoutError, RateLimitError, APIError) as e:
            last_err = e
            log.warning("LLM fallback (%s): %s/%s failed: %s", task_type, provider, model, e)

    # Chain the last exception so the root cause stays in the traceback
    raise RuntimeError(f"All LLM providers failed. Last error: {last_err}") from last_err

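The fallback chain degrades gracefully: primary model → cheaper model from the same provider → free open-weight model from a different provider. The first hop covers rate limits and timeouts on the primary model; only the last hop helps when the primary provider is down entirely. Users see slower or slightly lower-quality responses rather than errors.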
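
The arithmetic behind availability figures is worth making explicit. The sketch below converts an availability percentage into monthly downtime and estimates the combined availability of a chain, under the simplifying assumption that providers fail independently. That assumption is optimistic, and it applies only across distinct providers, not to two models served by the same one.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per 30-day month."""
    return (1.0 - availability) * MINUTES_PER_MONTH


def chain_availability(availabilities: list[float]) -> float:
    """Availability of a fallback chain across independent providers.

    The chain is down only when every provider is down at the same time.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


single = downtime_minutes(0.995)
combined = chain_availability([0.995, 0.995])  # two independent providers

print(f"one provider at 99.5%: {single:.0f} min/month")
print(f"two independent providers: {combined:.6f} availability, "
      f"{downtime_minutes(combined):.2f} min/month")
```

Under the independence assumption, a second provider at the same 99.5% takes expected downtime from 216 minutes to about one minute per month. Real outages are often correlated (shared cloud regions, shared upstream models), so treat this as an upper bound on the benefit.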

Pattern 5: Output validation gates

LLM outputs are non-deterministic. A prompt that works 99% of the time fails 1% of the time in ways that break downstream logic.

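type OutputValidator struct {
    // Raw JSON Schema documents by task type (encoding/json has no Schema
    // type); the hand-written checks below do not consult them.
    schemas map[string]json.RawMessage
}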

func (v *OutputValidator) Validate(taskType, output string) error {
    switch taskType {
    case "classify_product":
        var result struct {
            Category  string  `json:"category"`
            Confidence float64 `json:"confidence"`
        }
        if err := json.Unmarshal([]byte(output), &result); err != nil {
            return fmt.Errorf("invalid JSON: %w", err)
        }
        if result.Category == "" {
            return fmt.Errorf("empty category")
        }
        if result.Confidence < 0 || result.Confidence > 1 {
            return fmt.Errorf("confidence out of range: %f", result.Confidence)
        }
    }
    return nil
}

// Usage: validate before storing or acting on output
output, err := llm.Complete(ctx, prompt)
if err != nil {
    return fmt.Errorf("LLM call failed: %w", err)
}
if verr := validator.Validate(taskType, output); verr != nil {
    // Retry once with explicit format reminder
    output, err = llm.Complete(ctx, prompt+"\n\nIMPORTANT: Output ONLY valid JSON.")
    if err != nil {
        return fmt.Errorf("LLM retry failed: %w", err)
    }
    if verr := validator.Validate(taskType, output); verr != nil {
        return fmt.Errorf("LLM produced invalid output after retry: %w", verr)
    }
}

One retry with a format reminder resolves ~80% of validation failures. If the second attempt fails, it's a prompt design issue.

Pattern 6: Idempotent LLM jobs

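Temporal.io provides durable execution, but an LLM call is not naturally idempotent: calling it twice can produce two different outputs. Running the call as a Temporal activity makes its result stable at the workflow level: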

func GenerateVideoWorkflow(ctx workflow.Context, jobID string, prompt string) error {
    // ActivityOptions with RetryPolicy: LLM calls may fail transiently
    ao := workflow.ActivityOptions{
        StartToCloseTimeout: 2 * time.Minute,
        RetryPolicy: &temporal.RetryPolicy{
            MaximumAttempts: 3,
        },
    }
    
    // Each activity records its output; Temporal replays the recorded result
    // on replay — the LLM is NOT called again on workflow replay
    var script ScriptResult
    if err := workflow.ExecuteActivity(
        workflow.WithActivityOptions(ctx, ao),
        GenerateScript, jobID, prompt,
    ).Get(ctx, &script); err != nil {
        return err
    }
    // ... rest of workflow
    return nil
}

Temporal records activity outputs. On workflow replay (after crash), recorded outputs are used — the LLM is not called again. This means the script generated on the first run is the one used, even if the workflow restarts.
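
The record-and-replay idea can be illustrated with a toy model in plain Python. This is not Temporal's implementation or API, only a sketch of the semantics: `ActivityLog` is a hypothetical stand-in for the workflow's event history, and `generate_script` simulates a non-deterministic LLM call by returning a different draft each time it actually runs.

```python
from typing import Callable


class ActivityLog:
    """Toy event history: stores each activity's result under a stable key."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}

    def execute(self, key: str, activity: Callable[[], str]) -> str:
        if key in self._results:
            # Replay: return the recorded result without running the activity.
            return self._results[key]
        result = activity()
        self._results[key] = result
        return result


llm_calls = []  # one entry per simulated LLM call


def generate_script() -> str:
    """Simulated non-deterministic LLM call: each real call yields a new draft."""
    llm_calls.append(1)
    return f"script draft #{len(llm_calls)}"


def video_workflow(log: ActivityLog, job_id: str) -> str:
    # The key must be stable across restarts so a replay finds the record.
    return log.execute(f"{job_id}:generate_script", generate_script)


log = ActivityLog()
first_run = video_workflow(log, "job-42")
replayed = video_workflow(log, "job-42")  # simulates a restart after a crash

print(first_run, "|", replayed, "| LLM calls:", len(llm_calls))
```

Both runs return the first draft and the simulated LLM is called once. Note that this covers replay of completed activities only; an activity that fails and is retried under the retry policy does call the LLM again.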

System design checklist for AI products

[ ] LLM calls are async or have < 2s SLA
[ ] Model tiering: cheapest model for each task type
[ ] Fallback chain for LLM provider failures
[ ] Output validation with one retry
[ ] Caching for deterministic outputs (concepts, templates)
[ ] Cost tracking per task type
[ ] Rate limiting on LLM-backed endpoints
[ ] Prompt caching for stable system prompts
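
Cost tracking per task type is the one checklist item not illustrated above, so here is a minimal sketch of what it could look like. The model names and per-million-token prices in `PRICES` are placeholders rather than real provider pricing, and `CostTracker` is a hypothetical helper.

```python
from collections import defaultdict

# Placeholder prices in dollars per million tokens. These are NOT real
# provider prices; substitute the current pricing for the models in use.
PRICES = {
    "small-model": {"input": 1.0, "output": 5.0},
    "large-model": {"input": 3.0, "output": 15.0},
}


class CostTracker:
    """Accumulates LLM spend and call counts per task type."""

    def __init__(self) -> None:
        self.spend = defaultdict(float)
        self.calls = defaultdict(int)

    def record(self, task_type: str, model: str,
               input_tokens: int, output_tokens: int) -> float:
        price = PRICES[model]
        cost = (input_tokens * price["input"]
                + output_tokens * price["output"]) / 1_000_000
        self.spend[task_type] += cost
        self.calls[task_type] += 1
        return cost

    def report(self) -> dict:
        """Per-task totals and average cost per call."""
        return {
            task: {
                "calls": self.calls[task],
                "total": self.spend[task],
                "avg": self.spend[task] / self.calls[task],
            }
            for task in self.spend
        }


tracker = CostTracker()
tracker.record("intent_classify", "small-model", input_tokens=500, output_tokens=20)
tracker.record("negotiate", "large-model", input_tokens=2000, output_tokens=400)
tracker.record("negotiate", "large-model", input_tokens=2000, output_tokens=400)

for task, stats in tracker.report().items():
    print(f"{task}: {stats['calls']} calls, ${stats['total']:.4f} total")
```

Breaking spend down by task type shows which routes are worth moving to a cheaper model or putting behind a cache.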

FAQ

How do I handle LLM API rate limits in production? Use a fallback chain (primary model → cheaper model → free model). Add an in-memory or Redis rate limit tracker that routes to fallback before hitting the API rate limit.

Should LLM calls be synchronous or asynchronous? Async whenever the task takes > 1–2 seconds or when the user can continue working while waiting. Sync only for features where the result is immediately needed and latency is < 2 seconds.

How do you prevent LLM output from breaking downstream code? Schema validation + one retry with explicit format reminder. For critical paths (payments, user data writes), require structured outputs (JSON mode) and validate schema before acting on the result.

What's the right caching TTL for LLM outputs? Depends on freshness requirements. Concept explanations: 24–72 hours. Product recommendations: 5–30 minutes. Personalized responses: 0 (don't cache). Match TTL to how often the underlying data changes.

Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. See also: Temporal.io for Long-Running GenAI Workflows · LLM Cost Optimization.