
What Is Rate Limiting? | NOC.org


Rate limiting is a technique for controlling the number of requests a client can make to a server within a specified time window. When a client exceeds the allowed rate, subsequent requests are either rejected (typically with an HTTP 429 Too Many Requests response), delayed, or queued. Rate limiting is a fundamental defense mechanism that protects servers from abuse, ensures fair resource allocation, and maintains service availability under heavy load.

Every high-traffic website, API, and web application uses some form of rate limiting. Without it, a single misbehaving client — whether malicious or simply buggy — can consume all available server resources and deny service to everyone else.

How Rate Limiting Works

At its core, rate limiting tracks the number of requests from each client (usually identified by IP address, API key, or user account) and compares that count against a predefined threshold. The implementation details vary by algorithm, but the basic flow is consistent:

  • A request arrives and the server identifies the client.
  • The server checks how many requests this client has made in the current time window.
  • If the count is below the threshold, the request is processed normally and the counter increments.
  • If the count exceeds the threshold, the request is rejected or throttled.
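The flow above can be sketched as a minimal fixed-window counter in Python. The window length and threshold are illustrative values, and a production limiter would use shared storage (such as Redis) rather than an in-process dictionary:

```python
import time
from collections import defaultdict

# Illustrative limits: 100 requests per 60-second window.
WINDOW_SECONDS = 60
MAX_REQUESTS = 100

# client_id -> (window_start_timestamp, request_count)
counters = defaultdict(lambda: (0.0, 0))

def allow_request(client_id: str, now=None) -> bool:
    """Return True if the request is under the limit, False if it
    should be rejected (typically with HTTP 429)."""
    now = time.time() if now is None else now
    window_start, count = counters[client_id]

    # Start a fresh window if the current one has expired.
    if now - window_start >= WINDOW_SECONDS:
        counters[client_id] = (now, 1)
        return True

    if count < MAX_REQUESTS:
        counters[client_id] = (window_start, count + 1)
        return True

    return False  # over the threshold: reject or throttle
```

Each request is a single counter check and increment, which is why this approach adds negligible latency even at high request volumes.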

Rate Limiting Methods

Several algorithms are commonly used to implement rate limiting, each with different trade-offs:

  • Fixed window — The simplest approach. Time is divided into fixed intervals (e.g., one-minute windows), and each client is allowed a set number of requests per window. The counter resets at the start of each window. The downside is that a client can send a burst of requests at the end of one window and the start of the next, effectively doubling their rate briefly.
  • Sliding window — A refinement of the fixed window approach. Instead of resetting at fixed boundaries, the window slides continuously. The algorithm considers the weighted request count from the previous window combined with the current window, smoothing out the burst problem that fixed windows create.
  • Token bucket — Clients are assigned a bucket that fills with tokens at a steady rate. Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, which allows short bursts of traffic (up to the bucket size) while enforcing a sustained average rate. This is the most flexible algorithm and is widely used in production systems.
  • Leaky bucket — Similar to the token bucket, but requests are processed at a fixed rate regardless of arrival pattern. Incoming requests that exceed capacity overflow and are dropped. This enforces a perfectly smooth output rate at the cost of rejecting burst traffic.

Rate Limiting for DDoS Protection

Rate limiting is a critical layer of DDoS defense, particularly against application-layer (Layer 7) attacks. Volumetric attacks that flood the network can be handled by upstream bandwidth, but application-layer attacks like HTTP floods send seemingly legitimate requests designed to exhaust server resources. Rate limiting detects and blocks clients that exceed normal request patterns, stopping flood attacks before they overwhelm your application.

Effective DDoS rate limiting operates at the edge — at the WAF or CDN level — so malicious traffic is dropped before it reaches your origin server. Server-level rate limiting with iptables provides an additional layer of defense, but edge-level enforcement is essential for stopping large-scale attacks.
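One common way to implement that server-level layer is iptables' `recent` match module, which tracks source addresses and drops clients that open too many new connections. The port and threshold below are illustrative and should be tuned to your traffic profile:

```shell
# Drop clients that open more than 20 new HTTPS connections
# within 60 seconds; then record every new connection.
iptables -A INPUT -p tcp --dport 443 -m conntrack --ctstate NEW \
    -m recent --name HTTPS --update --seconds 60 --hitcount 20 -j DROP
iptables -A INPUT -p tcp --dport 443 -m conntrack --ctstate NEW \
    -m recent --name HTTPS --set
```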

WAF and API Rate Limiting

Web application firewalls use rate limiting as one of their core detection mechanisms. A WAF can apply different rate limits to different URL paths, so your login page (a common target for brute force attacks) can have a strict limit of 10 requests per minute while your homepage allows 100 requests per minute from the same client.
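The per-path limits described above can be expressed at the web-server level too. As one sketch, nginx's `limit_req` module supports exactly this pattern; the zone names, sizes, and backend address below are illustrative:

```nginx
# Two zones keyed by client IP: strict for login, looser elsewhere.
limit_req_zone $binary_remote_addr zone=login:10m rate=10r/m;
limit_req_zone $binary_remote_addr zone=general:10m rate=100r/m;

server {
    location = /login {
        limit_req zone=login burst=5 nodelay;
        limit_req_status 429;       # default rejection status is 503
        proxy_pass http://backend;
    }
    location / {
        limit_req zone=general burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://backend;
    }
}
```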

API rate limiting is equally important. Public APIs without rate limits are quickly abused — by scrapers, by competitors, or by poorly written client code that makes thousands of unnecessary calls. API rate limits are typically communicated through response headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) so well-behaved clients can adjust their request patterns accordingly.
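From the client side, honoring those headers is straightforward. This sketch assumes the common convention where X-RateLimit-Reset carries a Unix timestamp; some APIs instead return seconds-until-reset, so check the provider's documentation:

```python
import time

def seconds_until_allowed(headers: dict) -> float:
    """Given rate-limit response headers, return how long a
    well-behaved client should wait before its next request."""
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))  # Unix timestamp
    if remaining > 0:
        return 0.0                  # budget left: no delay needed
    return max(0.0, reset_at - time.time())
```

A client would call this after each response and sleep for the returned duration, avoiding 429 errors entirely instead of reacting to them.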
