
How I Scaled an AI Agent to 300k+ Users on Kubernetes

Scaling a social-commerce AI agent to 300k+ users meant queueing, autoscaling on Kubernetes, and tight cost control. Here's the architecture that worked.

#kubernetes #scaling #ai-agents #go

Scaling BikroyBuddy — an AI shopping agent for Bangladesh — from 5,000 to 300,000+ users required rethinking every layer of the architecture. The original single-instance Go server with synchronous LLM calls didn't survive 10x growth. Here's what the scaled architecture looks like and what broke along the way.

The original architecture (0–5k users)

WhatsApp → Webhook handler (Go, single EC2) → Claude API → Response

Simple. Worked. Broke at ~8,000 concurrent WebSocket sessions: the EC2 t3.medium ran out of memory holding open connections and buffering LLM responses at the same time.

The scaled architecture (300k+ users)

WhatsApp API
  → ALB (AWS Load Balancer)
  → Webhook receivers (Go, stateless, K8s Deployment, 3–20 replicas)
  → SQS FIFO queue (per-conversation ordering)
  → Message workers (Go, K8s Deployment, 10–50 replicas)
      → Intent classifier (Claude Haiku, 100ms SLA)
      → [Branch by intent]
          → Product search (pgvector, <20ms)
          → Negotiation handler (Claude Sonnet + state machine)
          → Simple reply (Claude Haiku)
  → Response sender (Go, WhatsApp API calls)
  → Redis (conversation state, 24-hour TTL)
  → PostgreSQL (user data, product catalog, permanent records)

Key changes: webhook receivers are now stateless (no in-memory state), message processing is async via SQS, and the LLM calls are isolated in worker pods that autoscale independently.
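The branch-by-intent step in the diagram can be sketched as a small routing function. This is a minimal sketch, not the production code: the intent names, the backend labels, and the choice to send unknown intents to the cheap Haiku reply are all assumptions made for illustration.

```go
package main

import "fmt"

// Intent is the label produced by the classifier. The names are hypothetical.
type Intent string

const (
	IntentProductSearch Intent = "product_search"
	IntentNegotiation   Intent = "negotiation"
	IntentSimpleReply   Intent = "simple_reply"
)

// Backend names the component that serves an intent (labels are illustrative).
type Backend string

const (
	BackendPgvector Backend = "pgvector-search"
	BackendSonnet   Backend = "claude-sonnet+state-machine"
	BackendHaiku    Backend = "claude-haiku"
)

// Route maps a classified intent to a backend.
// Unknown intents fall back to the Haiku reply, which is an assumption.
func Route(i Intent) Backend {
	switch i {
	case IntentProductSearch:
		return BackendPgvector
	case IntentNegotiation:
		return BackendSonnet
	default:
		return BackendHaiku
	}
}

func main() {
	for _, i := range []Intent{IntentProductSearch, IntentNegotiation, IntentSimpleReply, "unknown"} {
		fmt.Printf("%s -> %s\n", i, Route(i))
	}
}
```

Keeping the routing decision in one pure function makes it easy to test and to change the default without touching the workers.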

Kubernetes setup on EKS

# Horizontal autoscaling for message workers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bikroybuddy-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bikroybuddy-workers
  minReplicas: 10
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: sqs_queue_depth
        selector:
          matchLabels:
            queue: bikroybuddy-messages
      target:
        type: AverageValue
        averageValue: "100"  # scale up when >100 msgs/worker

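The HPA scales workers on SQS queue depth, which KEDA exposes as an external metric (hence `type: External` in the manifest). The target is an average of 100 queued messages per worker: above that, the HPA adds pods; once the backlog drains to under 10 messages per worker, it removes them. Scale-up takes ~90 seconds (pod scheduling + container start).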
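The scaling rule above reduces to simple arithmetic. For an AverageValue target, the HPA aims for roughly ceil(queue depth / target per pod) replicas, clamped to the configured bounds. The sketch below reproduces only that core calculation; it ignores the HPA's tolerance band, stabilization windows, and scaling policies, and the function name is hypothetical.

```go
package main

import "fmt"

// desiredReplicas approximates the HPA result for an AverageValue target:
// ceil(queueDepth / targetPerPod), clamped to [minReplicas, maxReplicas].
func desiredReplicas(queueDepth, targetPerPod, minReplicas, maxReplicas int) int {
	if targetPerPod <= 0 {
		return minReplicas // guard against a misconfigured target
	}
	n := (queueDepth + targetPerPod - 1) / targetPerPod // integer ceiling
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}

func main() {
	// Values from the manifest: target 100, min 10, max 50.
	for _, depth := range []int{0, 500, 2350, 10000} {
		fmt.Printf("queue=%d -> replicas=%d\n", depth, desiredReplicas(depth, 100, 10, 50))
	}
}
```

With the manifest's values, a backlog of 2,350 messages asks for 24 workers, and anything above 5,000 messages is capped at the 50-replica maximum.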

The SQS FIFO queue: per-conversation ordering

WhatsApp can deliver messages out of order under load. Without ordering, conversation state breaks: a worker might process message 2 and send its reply before message 1 has finished processing.

SQS FIFO with MessageGroupId = conversation_id ensures messages in the same conversation are processed in order:

func (h *WebhookHandler) Enqueue(msg WhatsAppMessage) error {
    _, err := h.sqs.SendMessage(&sqs.SendMessageInput{
        QueueUrl:               aws.String(h.queueURL),
        MessageBody:            aws.String(encodeMessage(msg)),
        MessageGroupId:         aws.String(msg.ConversationID),
        MessageDeduplicationId: aws.String(msg.MessageID),
    })
    return err
}

FIFO deduplication (MessageDeduplicationId) also handles WhatsApp's at-least-once delivery: a duplicate webhook that arrives within SQS's 5-minute deduplication interval is dropped, so it doesn't produce a duplicate response.
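To make the two guarantees concrete, here is a toy in-memory model of what the queue gives the workers: ordering within a conversation and dropping of repeated message IDs. It is a sketch for intuition only and not how SQS is implemented. The `fifoModel` type and its methods are hypothetical, and unlike SQS it remembers IDs forever instead of for a fixed interval.

```go
package main

import "fmt"

// Message mirrors the fields the enqueue code relies on.
type Message struct {
	ConversationID string
	MessageID      string
	Body           string
}

// fifoModel is a toy model of per-group ordering plus deduplication.
type fifoModel struct {
	seen   map[string]bool      // deduplication IDs already accepted
	groups map[string][]Message // per-conversation queues, oldest first
}

func newFIFOModel() *fifoModel {
	return &fifoModel{seen: map[string]bool{}, groups: map[string][]Message{}}
}

// Send accepts a message unless its MessageID was already seen.
// It reports whether the message was enqueued.
func (q *fifoModel) Send(m Message) bool {
	if q.seen[m.MessageID] {
		return false
	}
	q.seen[m.MessageID] = true
	q.groups[m.ConversationID] = append(q.groups[m.ConversationID], m)
	return true
}

// Receive pops the oldest pending message for one conversation.
func (q *fifoModel) Receive(convID string) (Message, bool) {
	pending := q.groups[convID]
	if len(pending) == 0 {
		return Message{}, false
	}
	q.groups[convID] = pending[1:]
	return pending[0], true
}

func main() {
	q := newFIFOModel()
	q.Send(Message{"conv-1", "m1", "price?"})
	q.Send(Message{"conv-1", "m2", "can you do 500?"})
	q.Send(Message{"conv-1", "m1", "price?"}) // duplicate webhook, dropped
	for {
		m, ok := q.Receive("conv-1")
		if !ok {
			break
		}
		fmt.Println(m.MessageID, m.Body)
	}
}
```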

Redis for conversation state

Each conversation's state (negotiation phase, current offer, product being discussed) lives in Redis with a 24-hour TTL:

type ConversationState struct {
    Phase       NegotiationPhase
    ProductID   string
    LastOffer   int
    Turns       int
    LastUpdated time.Time
}

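// Needs "context", "encoding/json", "errors", "fmt", "time", and a go-redis
// client whose commands take a context.

// GetState loads a conversation's state; a missing key means a new conversation.
func (r *RedisStore) GetState(ctx context.Context, convID string) (*ConversationState, error) {
    data, err := r.client.Get(ctx, "conv:"+convID).Bytes()
    if errors.Is(err, redis.Nil) {
        return &ConversationState{Phase: PhaseOpen}, nil // new conversation
    }
    if err != nil {
        return nil, fmt.Errorf("get state %s: %w", convID, err)
    }
    var state ConversationState
    if err := json.Unmarshal(data, &state); err != nil {
        return nil, fmt.Errorf("decode state %s: %w", convID, err)
    }
    return &state, nil
}

// SaveState writes the state and resets the 24-hour TTL.
func (r *RedisStore) SaveState(ctx context.Context, convID string, state *ConversationState) error {
    data, err := json.Marshal(state)
    if err != nil {
        return fmt.Errorf("encode state %s: %w", convID, err)
    }
    return r.client.Set(ctx, "conv:"+convID, data, 24*time.Hour).Err()
}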

Redis Cluster mode with 3 shards handles the ~300k active conversation states. Memory: ~2KB per conversation state × 300k = ~600MB (well within cluster capacity).
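The memory estimate is easy to re-run for other sizes. The sketch below uses the figures from the text (about 2KB per state, 300k conversations, 3 shards) with decimal units, and assumes keys spread evenly across shards, which real key distributions only approximate. The function names are hypothetical.

```go
package main

import "fmt"

// estimateBytes returns the total memory for n conversation states.
func estimateBytes(bytesPerState, n int) int {
	return bytesPerState * n
}

// perShard assumes keys are spread evenly across shards.
func perShard(totalBytes, shards int) int {
	return totalBytes / shards
}

func main() {
	total := estimateBytes(2_000, 300_000) // ~2KB x 300k conversations
	fmt.Printf("total: %d MB, per shard: %d MB\n",
		total/1_000_000, perShard(total, 3)/1_000_000)
}
```

This counts only the serialized state; Redis key names and per-key overhead come on top of it.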

Cost at 300k users

| Service | Monthly cost |
|---------|-------------|
| EKS cluster (3 m5.large nodes) | $220 |
| Worker pods (avg 20 replicas, spot) | $180 |
| SQS | $12 |
| ElastiCache Redis Cluster | $130 |
| RDS PostgreSQL (db.r6g.large) | $190 |
| Claude Haiku (intent classify) | $280 |
| Claude Sonnet (negotiations) | $420 |
| WhatsApp API (Meta) | $0 (conversation-based pricing, mostly free tier) |
| ALB + networking | $60 |
| Total | ~$1,492/mo |

At 300k users, ~$0.005/user/month. Subscription revenue covers this with 4× margin.
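The per-user figure follows directly from the line items. This short sketch sums the monthly costs quoted above and divides by the user count; the type and function names are hypothetical.

```go
package main

import "fmt"

// lineItem is one row of the monthly cost breakdown, in USD.
type lineItem struct {
	service string
	usd     float64
}

// monthlyCosts holds the figures quoted in the cost breakdown.
var monthlyCosts = []lineItem{
	{"EKS cluster", 220},
	{"Worker pods", 180},
	{"SQS", 12},
	{"ElastiCache Redis Cluster", 130},
	{"RDS PostgreSQL", 190},
	{"Claude Haiku", 280},
	{"Claude Sonnet", 420},
	{"WhatsApp API", 0},
	{"ALB + networking", 60},
}

// totalCost sums all line items.
func totalCost(items []lineItem) float64 {
	sum := 0.0
	for _, it := range items {
		sum += it.usd
	}
	return sum
}

// costPerUser divides the monthly total by the number of users.
func costPerUser(total float64, users int) float64 {
	return total / float64(users)
}

func main() {
	total := totalCost(monthlyCosts)
	fmt.Printf("total: $%.0f/mo, per user: $%.4f/mo\n", total, costPerUser(total, 300_000))
}
```

The two Claude rows add up to $700 of the $1,492, so LLM spend is a little under half of the bill.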

What broke at each growth stage

10k users: Connection pool exhaustion on PostgreSQL. Fix: pgBouncer in transaction mode.

50k users: Redis single instance OOM. Fix: Redis Cluster with 3 shards.

100k users: Claude Sonnet latency spikes under concurrent load (>10 parallel calls to Claude API). Fix: intent classifier routes only genuine negotiation to Sonnet; 75% of requests now go to Haiku.

200k users: EKS node autoscaler too slow for traffic spikes (WhatsApp usage peaks sharply at 7pm local). Fix: predictive scaling (pre-warm 15 minutes before predicted peak).
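The pre-warm fix boils down to a time-window check: raise the capacity floor 15 minutes before the predicted peak and lower it again afterwards. Here is a minimal sketch under stated assumptions. The 7pm peak and 15-minute lead come from the text; the one-hour hold, the warm floor of 6, and the function names are hypothetical. How the floor gets applied (for example, raising a node group's minimum size) is left out.

```go
package main

import (
	"fmt"
	"time"
)

const (
	lead = 15 * time.Minute // pre-warm lead time from the text
	hold = time.Hour        // hypothetical: keep the raised floor this long after the peak
)

// at returns a clock time on an arbitrary fixed date; only hour and minute matter here.
func at(hour, minute int) time.Time {
	return time.Date(2024, time.January, 1, hour, minute, 0, 0, time.UTC)
}

// capacityFloor returns warm from lead before the peak until hold after it,
// and base at all other times.
func capacityFloor(now, peak time.Time, base, warm int) int {
	start, end := peak.Add(-lead), peak.Add(hold)
	if !now.Before(start) && now.Before(end) {
		return warm
	}
	return base
}

func main() {
	peak := at(19, 0) // 7pm local peak
	for _, now := range []time.Time{at(18, 30), at(18, 45), at(19, 30), at(20, 0)} {
		// base 3 matches the 3-node cluster; warm 6 is a made-up value.
		fmt.Printf("%s floor=%d\n", now.Format("15:04"), capacityFloor(now, peak, 3, 6))
	}
}
```

A real predictor would derive the peak time from traffic history rather than hard-coding it.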

FAQ

How many users can a single Go process handle? It depends on the workload. For stateless webhook receivers, a single t3.medium handles ~500 concurrent requests comfortably. LLM-powered workers block on API calls, so each worker processes one message at a time; parallelism comes from adding replicas, not from concurrency within a pod.

Why Kubernetes instead of ECS for this scale? At 300k users with 50+ pods, ECS Fargate costs became prohibitive. EKS with spot instances on Graviton2 nodes reduced compute costs by ~40%. For smaller scales, ECS Fargate is simpler.

How do you handle WhatsApp rate limits? Meta's WhatsApp Business API has conversation-based pricing and rate limits per phone number. BikroyBuddy uses multiple numbers (one per region) with a router that distributes load across them.

What's the P99 response latency? Haiku-powered responses: 800ms P99. Sonnet-powered negotiations: 2.1s P99. The 2.1s is acceptable for negotiation but would be unacceptable for a simple reply — hence the tiered routing.


Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. See also: How I Built BikroyBuddy · Microservices as One Engineer.