Building freelm: A Free LLM Gateway in a Day

I kept paying for LLM calls in side projects that barely needed them, while six providers were handing out free tiers I wasn't using together. So I built freelm: an open-source Python client that pools OpenRouter, Google AI Studio, NVIDIA NIM, Groq, Cerebras, and Mistral behind one OpenAI-compatible call, with automatic key rotation, failover, and streaming. This is how it came together.

The problem I actually had

Every prototype I start needs an LLM. Paying per token for a demo that might get thrown away is annoying; wiring up one provider's free tier and then hitting its rate limit mid-demo is worse. The free capacity exists — Gemini gives ~1,500 requests/day, Groq is fast and free, OpenRouter has dozens of :free models — but each has its own SDK, auth, and limits. I wanted one call that used all of them and never fell over.

The core decision: one clean layer plus a shim

All six providers expose OpenAI-compatible HTTP, so I built one normalized client instead of six. The design is a small core — FreeLLM and AsyncFreeLLM — plus an OpenAI-compatible shim so existing code swaps in unchanged. I deliberately kept the dependency list to just httpx; everything else is standard-library dataclasses. A reliability library you drop into any project shouldn't drag a tree of dependencies with it.

Failover is the whole point

The hard part isn't calling an API; it's what happens when it says no. freelm classifies each failure and reacts: a 429 cools that key and rotates; a 5xx or timeout trips a per-key circuit breaker with backoff; a 401 disables a dead key; an unknown model falls through to the next. The candidate order interleaves across providers — best model of each provider first — so a flood of errors from one provider can't starve the others. That ordering bug was the first thing my live tests caught, and fixing it is what made "always-up" actually true.

from freelm import FreeLLM
llm = FreeLLM.from_env(strategy="quota_aware")
print(llm.text("hi"))   # rotates keys + fails over across 6 providers

Live model discovery, because IDs rot

My first version hardcoded model IDs. Within a day, two of them 404'd — free models churn constantly. So freelm now queries each provider's /models endpoint on first use, tags the results, caches them to disk, and only falls back to a built-in list when offline. The order is live API → cache → hardcoded. That one change turned a brittle list into something that self-heals as providers add and retire models.

What testing against real APIs taught me

I live-tested all six providers end-to-end — chat, streaming, and discovery. Reality corrected my assumptions: Cerebras is token-limited (1M/day), not request-limited; Mistral's free tier is a strict 2 requests/minute; and auto was picking a slow 550B reasoning model that buried the answer in hidden thinking. I added name-based detection so default calls lead with a fast plain instruct model. None of that was visible from the docs alone — only from running it.

Shipping it

freelm is on PyPI (pip install freelm) and open source on GitHub, MIT-licensed, with CI across Python 3.9–3.14 and a tagged release per version. It's deliberately scoped to free providers — when I tried adding a paid one, it didn't fit the premise, so I removed it. The goal is narrow and useful: free LLM access that stays up.

Frequently Asked Questions

Why not just use one provider's free tier? One free tier is capped and fails alone. Pooling six independent free tiers multiplies capacity and removes the single point of failure.

How long did it take? The working core — six providers, failover, streaming, discovery, tests — came together in about a day, then a few iterations from live testing to fix real-world quirks.

Is freelm production-ready? For prototypes, side projects, and bursty internal tools, yes. For high-volume production, use it as a cheap free fallback behind a paid primary.

What's next for freelm? Persistent quota tracking across restarts, then JS/TS and Go ports — the core is intentionally language-agnostic.

Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. freelm is open source: PyPI · GitHub.