Your API launches. Traffic is steady. Then a bot starts hammering your login endpoint with thousands of requests per second. Or a misbehaving client enters a retry loop. Or someone launches a brute-force attack against your authentication service. Your database buckles under load. Response times spike. Your perfectly healthy system is now on its knees.
Rate limiting is how you prevent this. It caps the number of requests a client can make within a given time window. Anything beyond that limit gets rejected immediately. Simple concept, but the implementation decisions matter a lot.
TL;DR
- Rate limiting caps requests per user/IP/token within a time window
- HTTP 429 (Too Many Requests) is the standard rejection response
- Placement options include proxy/gateway level, application middleware, or embedded as a library
- Redis is the go-to store because it's fast, supports key expiration, and has atomic INCR/EXPIRE commands
- Five algorithms, each with trade-offs: Fixed Window, Sliding Window Log, Sliding Window Counter, Token Bucket, Leaky Bucket
- Build a library, not a service: packaging rate limiting as a library that talks directly to Redis eliminates unnecessary network hops
- Scaling means sharding Redis by user ID or IP, since the bottleneck is compute (writes on every request), not storage
- Race conditions in concurrent environments are solved by making the check-and-increment atomic with Lua scripts (or MULTI/EXEC transactions)
What is Rate Limiting

A rate limiter sits in the request path and decides whether to allow or reject each request. If the client is within the allowed threshold, the request goes through. If they've exceeded the limit, the request is rejected immediately.
When a request is rejected, the standard HTTP response is 429 Too Many Requests. Along with the status code, well-designed rate limiters return headers that help clients understand their current state:
```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
Retry-After: 30
```

X-RateLimit-Limit tells the client how many requests are allowed per window. X-RateLimit-Remaining shows how many are left. Retry-After tells the client how many seconds to wait before trying again.
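As a sketch, assembling these headers from a limiter's state might look like this (the `RateLimitState` shape and field names are illustrative, not from any particular library):

```typescript
interface RateLimitState {
  limit: number;        // max requests allowed per window
  remaining: number;    // requests left in the current window
  resetSeconds: number; // seconds until the window resets
}

// Build the standard rate-limit headers from the limiter's state.
function rateLimitHeaders(state: RateLimitState): Record<string, string> {
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(state.limit),
    "X-RateLimit-Remaining": String(Math.max(0, state.remaining)),
  };
  if (state.remaining <= 0) {
    // Only rejected clients need to know when to retry.
    headers["Retry-After"] = String(state.resetSeconds);
  }
  return headers;
}
```

Attaching these headers even on allowed responses lets well-behaved clients back off before they ever hit the limit.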
Why Rate Limit
Prevent abuse and attacks. Brute-force login attempts, DDoS attacks, and scraping bots all generate high request volumes. Rate limiting stops them before they can cause damage.
Reduce cost. If you're using paid third-party APIs (payment gateways, credit checks, SMS providers), every call costs money. Rate limiting prevents runaway costs from misbehaving clients.
Protect your infrastructure. Even without malicious intent, a single client in a retry loop can overwhelm your database. Rate limiting ensures no single client can monopolize your resources.
Almost every major platform enforces rate limits. Twitter limits tweet creation to 300 per 3 hours. Google Docs API allows 300 read requests per user per 60 seconds. Stripe throttles API requests per key. It's table stakes for any production API.
Where Does a Rate Limiter Fit
There are two fundamental approaches to placing a rate limiter. Each has trade-offs.
At the Proxy or API Gateway

If your architecture has a front-end proxy (NGINX, HAProxy) or an API gateway, you can check the rate limiter there. When a request arrives, the proxy consults the rate limiter. If the request is over the limit, it returns 429 immediately. The request never reaches your backend service at all.
```typescript
async function proxyHandler(req: Request): Promise<Response> {
  const clientIp = req.headers.get("x-forwarded-for") ?? "unknown";
  const allowed = await checkRateLimit(clientIp);
  if (!allowed) {
    return new Response("Too Many Requests", { status: 429 });
  }
  return forwardToBackend(req);
}
```

This is ideal when you want to protect your backend from even seeing excess traffic.
As Application Middleware

If you don't have a proxy, or you want per-endpoint granularity, implement rate limiting as middleware in your application. The request reaches your service, but the middleware checks the rate limiter before invoking the actual handler.
```typescript
async function rateLimitMiddleware(
  req: Request,
  next: () => Promise<Response>,
): Promise<Response> {
  const userId = extractUserId(req);
  const endpoint = new URL(req.url).pathname;
  const key = `${userId}:${endpoint}`;
  const allowed = await checkRateLimit(key);
  if (!allowed) {
    return new Response("Too Many Requests", {
      status: 429,
      headers: { "Retry-After": "30" },
    });
  }
  return next();
}
```

This gives you granular control. You can set different limits per endpoint: maybe your /api/login endpoint allows 5 requests per minute while /api/search allows 100.
Which One to Pick
If you have a proxy and want to shield your entire backend, put the rate limiter at the proxy. If you need per-endpoint or per-user granularity inside your service, use middleware. Many production systems use both: a coarse global limit at the proxy and fine-grained limits in the application.
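The layered setup amounts to two independent checks: a request must pass the coarse limit and the fine one. A minimal sketch of the composition, using in-memory stand-in limiters purely for illustration (in a real deployment the coarse check lives at the proxy and the fine check in the application):

```typescript
type Limiter = (key: string) => Promise<boolean>;

// Compose a coarse global limiter with a fine-grained one.
// A request is allowed only if both checks pass; if the coarse
// check rejects, the fine limiter is never consulted.
function layered(coarse: Limiter, fine: Limiter): Limiter {
  return async (key: string) => (await coarse(key)) && (await fine(key));
}

// In-memory stand-in limiter: allow up to `limit` calls per key, ever.
// A real limiter would be window-based and backed by Redis.
function inMemoryLimiter(limit: number): Limiter {
  const counts = new Map<string, number>();
  return async (key: string) => {
    const n = (counts.get(key) ?? 0) + 1;
    counts.set(key, n);
    return n <= limit;
  };
}
```

Whichever limit is tighter trips first, which is exactly the point: the proxy limit shields the backend, while the application limit enforces per-endpoint policy.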
Choosing the Database
Before picking a database, think about the access pattern. For every incoming request, you need to:
- Read the current request count for this client
- Write an incremented count
- Expire the count after the time window passes
That's a key-value access pattern with expiration. You need something fast, since this check happens on every single request. Disk-based databases are too slow. You need an in-memory store.
Redis fits perfectly. It provides two commands that do exactly what rate limiting needs:
- INCR increments a counter by 1, atomically
- EXPIRE sets a TTL on a key so it auto-deletes after the time window
```typescript
async function checkRateLimit(key: string): Promise<boolean> {
  const current = await redis.incr(`ratelimit:${key}`);
  if (current === 1) {
    await redis.expire(`ratelimit:${key}`, 60);
  }
  return current <= 100;
}
```

The first time a key is incremented, it's created with value 1 and you set a 60-second expiry. Every subsequent request increments the counter. After 60 seconds, the key disappears and the cycle starts fresh.
The data model is simple: user_id -> count with a TTL. Redis stores this in memory, so reads and writes are sub-millisecond. For a rate limiter that's consulted on every request, that speed matters.
Rate Limiting Algorithms
There are five common algorithms for rate limiting. Each makes different trade-offs between accuracy, memory usage, and burst handling.
1. Fixed Window Counter
The simplest approach. Divide time into fixed windows (say, 1-minute intervals). Maintain a counter for each window. Increment on every request. Reject when the counter exceeds the limit. When the window ends, the counter resets.
```typescript
async function fixedWindow(
  key: string,
  limit: number,
  windowSeconds: number,
): Promise<boolean> {
  const windowKey = `${key}:${Math.floor(Date.now() / 1000 / windowSeconds)}`;
  const count = await redis.incr(windowKey);
  if (count === 1) {
    await redis.expire(windowKey, windowSeconds);
  }
  return count <= limit;
}
```

Pros: Memory efficient. Easy to understand. One counter per key per window.
Cons: Burst problem at window edges.

If the limit is 5 requests per minute, a client can send 5 requests at the end of one window and 5 more at the start of the next. That's 10 requests in a span of just a few seconds, despite the "5 per minute" limit. The algorithm is correct per-window, but the boundary creates a loophole.
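A small in-memory sketch makes the loophole concrete. The clock is injected so the boundary is easy to hit; in production the counter would live in Redis:

```typescript
// Fixed-window counter with an injectable clock, for demonstration only.
function makeFixedWindow(limit: number, windowMs: number) {
  const counts = new Map<string, number>();
  return (key: string, nowMs: number): boolean => {
    const windowKey = `${key}:${Math.floor(nowMs / windowMs)}`;
    const n = (counts.get(windowKey) ?? 0) + 1;
    counts.set(windowKey, n);
    return n <= limit;
  };
}

const allow = makeFixedWindow(5, 60_000); // 5 requests per minute

// 5 requests at t=59s (end of window 0) and 5 at t=60s (start of window 1):
let admitted = 0;
for (let i = 0; i < 5; i++) if (allow("u", 59_000)) admitted++;
for (let i = 0; i < 5; i++) if (allow("u", 60_000)) admitted++;
// admitted === 10: ten requests in one second, despite the "5 per minute" limit
```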
2. Sliding Window Log
Instead of fixed windows, track the timestamp of every request. When a new request arrives, remove all timestamps older than the window size. If the remaining count is under the limit, allow it.
```typescript
async function slidingWindowLog(
  key: string,
  limit: number,
  windowMs: number,
): Promise<boolean> {
  const now = Date.now();
  const windowStart = now - windowMs;
  await redis.zremrangebyscore(key, 0, windowStart);
  await redis.zadd(key, now, `${now}-${Math.random()}`);
  await redis.expire(key, Math.ceil(windowMs / 1000));
  const count = await redis.zcard(key);
  return count <= limit;
}
```

This uses a Redis sorted set where the score is the timestamp. ZREMRANGEBYSCORE removes outdated entries. ZCARD counts what's left.
Pros: Perfectly accurate. No edge-burst problem. Every rolling window is exact.
Cons: Memory-heavy. You're storing every single request timestamp, even for rejected requests. At high volumes, this adds up fast.
3. Sliding Window Counter
A hybrid that approximates the sliding window without storing every timestamp. It uses the counts from the current and previous fixed windows, weighted by how much each window overlaps with the sliding window.
The formula:
```
effective_count = current_window_count + previous_window_count * overlap_percentage
```

If you're 30% into the current window, the previous window contributes 70% of its count.
```typescript
async function slidingWindowCounter(
  key: string,
  limit: number,
  windowSeconds: number,
): Promise<boolean> {
  const now = Date.now() / 1000;
  const currentWindow = Math.floor(now / windowSeconds);
  const previousWindow = currentWindow - 1;
  const elapsed = (now % windowSeconds) / windowSeconds;
  const [currentCount, previousCount] = await Promise.all([
    redis.get(`${key}:${currentWindow}`).then((v) => Number(v) || 0),
    redis.get(`${key}:${previousWindow}`).then((v) => Number(v) || 0),
  ]);
  const effectiveCount = currentCount + previousCount * (1 - elapsed);
  if (effectiveCount >= limit) {
    return false;
  }
  await redis.incr(`${key}:${currentWindow}`);
  await redis.expire(`${key}:${currentWindow}`, windowSeconds * 2);
  return true;
}
```

Pros: Memory efficient (just two counters). Smooths out the edge-burst problem. Cloudflare uses this approach and found only 0.003% of requests were incorrectly handled among 400 million requests.
Cons: It's an approximation. Assumes requests in the previous window were evenly distributed.
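The weighting is easy to check by hand. Extracting the formula into a pure function and running one worked case:

```typescript
// Effective request count for the sliding window counter approximation.
// elapsed: fraction of the current window that has passed (0..1).
function effectiveCount(
  currentCount: number,
  previousCount: number,
  elapsed: number,
): number {
  return currentCount + previousCount * (1 - elapsed);
}

// 30% into the current window, with 20 requests so far in it and 100
// requests in the previous window: 20 + 100 * 0.7 = 90 effective requests.
const example = effectiveCount(20, 100, 0.3);
```

If the limit is 100, this request is still admitted, but the budget is nearly spent even though the current window has only seen 20 requests. That is the smoothing at work.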
4. Token Bucket
Instead of counting requests, imagine a bucket that holds tokens. Tokens are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, which allows short bursts.
```typescript
async function tokenBucket(
  key: string,
  capacity: number,
  refillRate: number,
): Promise<boolean> {
  const now = Date.now();
  const bucket = await redis.hgetall(`bucket:${key}`);
  let tokens = Number(bucket.tokens ?? capacity);
  const lastRefill = Number(bucket.lastRefill ?? now);
  const elapsed = (now - lastRefill) / 1000;
  tokens = Math.min(capacity, tokens + elapsed * refillRate);
  if (tokens < 1) {
    return false;
  }
  await redis.hmset(`bucket:${key}`, {
    tokens: (tokens - 1).toString(),
    lastRefill: now.toString(),
  });
  await redis.expire(`bucket:${key}`, Math.ceil(capacity / refillRate) + 60);
  return true;
}
```

Pros: Allows controlled bursts. A user who's been idle accumulates tokens and can send a burst of requests. Memory efficient (two values per key). Used by Amazon and Stripe.
Cons: Two parameters to tune (bucket size and refill rate), which can be tricky to get right.
5. Leaky Bucket
Similar to token bucket, but requests are processed at a fixed, constant rate. Incoming requests are added to a queue. If the queue is full, new requests are dropped. Requests are pulled from the queue and processed at a steady rate.
```typescript
async function leakyBucket(
  key: string,
  capacity: number,
  leakRate: number,
): Promise<boolean> {
  const now = Date.now();
  const bucket = await redis.hgetall(`leak:${key}`);
  let waterLevel = Number(bucket.level ?? 0);
  const lastLeak = Number(bucket.lastLeak ?? now);
  const elapsed = (now - lastLeak) / 1000;
  waterLevel = Math.max(0, waterLevel - elapsed * leakRate);
  if (waterLevel >= capacity) {
    return false;
  }
  await redis.hmset(`leak:${key}`, {
    level: (waterLevel + 1).toString(),
    lastLeak: now.toString(),
  });
  await redis.expire(`leak:${key}`, Math.ceil(capacity / leakRate) + 60);
  return true;
}
```

Pros: Produces a perfectly smooth, constant outflow rate. Memory efficient. Used by Shopify.
Cons: A burst of traffic fills the queue with old requests. Recent requests get dropped even if the old ones are no longer relevant. Same tuning challenge as token bucket.
Picking an Algorithm
| Algorithm | Burst Handling | Memory | Accuracy | Good For |
|---|---|---|---|---|
| Fixed Window | Poor (edge bursts) | Very low | Approximate | Simple use cases |
| Sliding Window Log | Exact | High | Perfect | Strict accuracy needs |
| Sliding Window Counter | Smoothed | Low | Very good | Most production systems |
| Token Bucket | Allows bursts | Low | Good | APIs with burst tolerance |
| Leaky Bucket | Prevents bursts | Low | Good | Steady throughput needs |
For most systems, start with fixed window counter. It's simple and works well enough. When you need more precision, move to sliding window counter. If your use case benefits from allowing bursts (like API gateways), consider token bucket.
A Library, Not a Service
This is where realistic system design diverges from textbook diagrams.
The instinct is to build a rate limiter "service" with its own load balancer, API servers, and Redis backend. Draw the boxes, add arrows. It looks impressive on a whiteboard.

But look at what happens when your payment service needs to check the rate limiter: the request goes from your service to the rate limiter's load balancer, to one of its API servers, which talks to Redis, gets the response, sends it back through the load balancer, back to your service. That's 4+ network hops before you've even started processing the actual request.
For something that runs on every single request, those hops add up. Your p99 latency shoots up. You've broken the fundamental requirement that rate limiting should not add massive overhead.
The better approach: make the rate limiter a library. Package the rate limiting logic (the algorithm, the Redis commands) as a library that your services import directly.
```typescript
import { createRateLimiter } from "@your-org/rate-limiter";

const limiter = createRateLimiter({
  redis: { host: "redis-shard-1.internal", port: 6379 },
  algorithm: "sliding-window-counter",
  defaultLimit: 100,
  defaultWindow: 60,
});

async function handler(req: Request): Promise<Response> {
  const userId = extractUserId(req);
  const allowed = await limiter.check(userId);
  if (!allowed) {
    return new Response("Too Many Requests", { status: 429 });
  }
  return processRequest(req);
}
```

Your service imports the library and talks directly to Redis. One network hop. No load balancer. No API server. No unnecessary infrastructure.
"But doesn't that duplicate logic across services?" Yes. That's what libraries are for. You write the logic once, package it as a library (npm package, pip package, Maven artifact), and every service that needs rate limiting imports it. The logic lives in one place (the library code), but runs wherever it's needed.
Your rate limiter "service" is now just Redis and a library. That's it. This is how production systems actually work.
Scaling the Rate Limiter
Since the rate limiter is now just a Redis database and a library, scaling the rate limiter means scaling Redis.
The Storage Math
Let's figure out how much storage we actually need. If you're rate limiting by user ID:
| Field | Size |
|---|---|
| User ID (integer) | 4 bytes |
| Count (integer) | 4 bytes |
| Total per entry | 8 bytes |
For IP-based limiting: IP address (16 bytes) + count (4 bytes) = 20 bytes per entry.
With 100 million users at 20 bytes each, total storage is about 2 GB. That comfortably fits in a single Redis instance. Storage is not the bottleneck.
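The back-of-the-envelope math, as a quick check (note this counts only raw payload; Redis adds per-key overhead on top, so real memory use will be a few times higher, but still nowhere near a bottleneck):

```typescript
const users = 100_000_000;   // 100 million tracked clients
const bytesPerEntry = 20;    // 16-byte IP address + 4-byte counter
const totalBytes = users * bytesPerEntry;
const totalGB = totalBytes / 1_000_000_000;
// totalGB === 2: about 2 GB of raw payload, well within one Redis instance
```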
The Compute Problem
The bottleneck is writes. For every single incoming request, you're doing INCR on a Redis key. At 100,000 requests per second across your platform, that's 100,000 write operations per second hitting Redis. A single Redis instance can handle a lot (around 100k operations/second), but as traffic grows, you'll need to distribute the load.
Vertical Scaling
Start here. Upgrade your Redis instance: more CPU, more RAM, faster network. A single well-provisioned Redis node can handle surprisingly high throughput. Don't shard until you actually need to.
Why Read Replicas Don't Help
This is a write-heavy workload. Every request writes (increments a counter). Read replicas only help when you have a high read-to-write ratio. For rate limiting, reads and writes are roughly 1:1. Adding read replicas doesn't reduce the write load on the primary.

Sharding

When one Redis node can't handle the write throughput, shard. Since you're rate limiting per user (or per IP or per token), the data is naturally partitioned. Each user's counter only lives on one shard. No cross-shard queries needed.
```typescript
function getShardForUser(userId: string, shards: Redis[]): Redis {
  const hash = hashCode(userId); // any stable string hash works here
  return shards[hash % shards.length];
}

async function checkRateLimit(userId: string): Promise<boolean> {
  const shard = getShardForUser(userId, redisShards);
  const count = await shard.incr(`ratelimit:${userId}`);
  if (count === 1) {
    await shard.expire(`ratelimit:${userId}`, 60);
  }
  return count <= 100;
}
```

The library needs to know about all Redis shards. Store shard addresses in a central config (a file on S3, a database entry, a config service). When the library initializes, it reads the config and creates connections to all shards.
```typescript
async function initRateLimiter(): Promise<Redis[]> {
  const config = await fetchShardConfig();
  return config.shards.map(
    (addr) => new Redis({ host: addr.host, port: addr.port }),
  );
}
```

On each request, hash the user ID, pick a shard, and do the rate limiting operations on that shard. Simple, scalable, minimal overhead.
Race Conditions in Distributed Environments
There's a subtle problem whenever the read, the limit check, and the write happen as separate operations, as in the sliding window counter and bucket implementations above. In a highly concurrent environment, two requests can arrive at the same time:
- Request A reads the counter: value is 99
- Request B reads the counter: value is 99
- Both check: 99 < 100, so both are allowed
- Request A increments to 100
- Request B increments to 101
The limit was 100, but 101 requests got through. This is a classic read-check-write race condition.
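The interleaving can be reproduced deterministically in memory by forcing both requests to read before either writes. This is a toy model of the race, not real Redis:

```typescript
// Shared counter standing in for a Redis key.
let counter = 99;
const limit = 100;

// Both "concurrent" requests read the counter before either writes it back.
const readA = counter; // 99
const readB = counter; // 99

const allowedA = readA < limit; // true: 99 < 100
const allowedB = readB < limit; // true: 99 < 100

// Both were admitted, so both increment.
counter = counter + 1; // 100 (request A)
counter = counter + 1; // 101 (request B)

// counter is now 101, and two requests passed a check that should
// have admitted only one of them.
```

With real network round-trips between the read and the write, the window for this interleaving is much wider than it looks here.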
Lua Scripts
The most common solution is a Lua script that runs atomically in Redis. Redis executes Lua scripts as a single atomic operation, so no interleaving can happen.
```lua
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call("INCR", key)
if current == 1 then
  redis.call("EXPIRE", key, window)
end

if current > limit then
  return 0
end
return 1
```

Calling it from TypeScript:

```typescript
const RATE_LIMIT_SCRIPT = `
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current = redis.call("INCR", key)
if current == 1 then
  redis.call("EXPIRE", key, window)
end
if current > limit then
  return 0
end
return 1
`;

async function atomicRateLimit(
  key: string,
  limit: number,
  windowSeconds: number,
): Promise<boolean> {
  const result = await redis.eval(
    RATE_LIMIT_SCRIPT,
    1,
    `ratelimit:${key}`,
    limit,
    windowSeconds,
  );
  return result === 1;
}
```

The INCR + conditional EXPIRE + limit check all happen as one atomic operation. No race condition possible.
Sorted Sets
For sliding window log implementations, each sorted set command is atomic on its own, but the sequence is not: ZADD, ZREMRANGEBYSCORE, and ZCARD must be combined in a Lua script or a MULTI/EXEC transaction to prevent races.
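As a sketch, the sliding window log from earlier can be made atomic with the same Lua pattern (this script is an illustration, not from a specific library; the caller supplies the timestamp and a unique member string, and the key names are assumptions):

```typescript
// Atomic sliding window log: trim, count, and conditionally add in one script.
// KEYS[1] = sorted-set key; ARGV = [nowMs, windowMs, limit, member]
const SLIDING_LOG_SCRIPT = `
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
redis.call("ZREMRANGEBYSCORE", key, 0, now - window)
local count = redis.call("ZCARD", key)
if count >= limit then
  return 0
end
redis.call("ZADD", key, now, ARGV[4])
redis.call("PEXPIRE", key, window)
return 1
`;
```

Because the check happens before the ZADD, this variant only records allowed requests, which also softens the memory con from earlier: rejected requests never enter the log.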
Hands-on
I've created runnable demos you can clone and run locally: rate-limiter. It's TypeScript on top of Redis, with atomic Lua scripts for fixed window, sliding window log, sliding window counter, and leaky bucket, plus a script that demonstrates the read-check-write race versus a Lua-backed limiter under concurrent requests.
Conclusion
Rate limiting protects your system from being overwhelmed, whether by malicious attacks, misbehaving clients, or unexpected traffic spikes. The core mechanics are straightforward: track request counts in Redis, reject when limits are exceeded, return 429.
The key design decisions come down to placement (proxy vs. middleware), algorithm choice (start with fixed window, graduate to sliding window counter), and architecture (library, not service).
Start simple. One Redis instance, a fixed window counter, rate limiting as middleware in your application. As traffic grows, switch to a more accurate algorithm. When one Redis node isn't enough, shard by user ID. Keep the architecture minimal: a database and a library, not a distributed service with its own load balancers.
The goal isn't to build the most sophisticated rate limiter possible. The goal is to protect your system without adding the very overhead you're trying to prevent.