Transactional Email Rate Limit: Token Bucket Guide

A transactional email rate limit is not one number. It is at least two — the ceiling your upstream provider enforces, and the softer cap you impose on yourself to keep ISPs from throttling your domain. Get either wrong and you stop delivering password resets at the worst possible moment.

This guide covers how to design rate limits on your sending path: which algorithm to pick, a Redis-backed token bucket for a Python worker, how to throttle per recipient without a key explosion, and what to do when the queue grows faster than you can drain it.

Why one limit is never enough

If you only model the provider's send-per-second number, you will hit two failure modes within a week of growth.

The first is bursty rejection. Your producer publishes 5,000 receipts in a 200 ms window because a batch job finished. The provider accepts the first 200 and 429s the rest. You retry without pacing, the provider 429s again, the queue grows.

The second is invisible reputation damage. The provider accepts every message, but Gmail starts returning 421-4.7.0 Try again later because you sent 800 messages to @gmail.com in 30 seconds from a warming IP. The provider's dashboard shows green. Your users see nothing.

You need both a hard limit (don't exceed what the provider accepts) and a soft limit (don't exceed what each ISP tolerates). One is per-account global, one is per-recipient-domain.

The two limits you actually need

Upstream provider rate

Every transactional email provider publishes a send rate, usually in messages per second per account, sometimes per IP. SES starts new accounts at 1 msg/s and grows on request. SendGrid, Postmark, and SESMetric all enforce a similar account-wide cap exposed via API or dashboard.

This limit is synchronous and authoritative. Exceed it and the provider rejects the request with a throttling error (a 429 on most HTTP APIs) before it accepts the message for delivery. The token bucket below enforces it on your side.

Per-recipient (or per-domain) cap

ISPs don't publish their thresholds, but the patterns are well known. Gmail and Yahoo defer mail with a 4xx SMTP code when one sending domain or IP sends too many messages too quickly to their users. The deferral is silent from your application's perspective — the provider accepted the message; the inbox just hasn't received it.

The fix is to throttle per recipient domain. Don't aim 80% of your 100 msg/s budget at @gmail.com for ten seconds straight. Cap each destination domain at a fraction of the total and queue what doesn't fit.

Picking the algorithm: token bucket vs leaky bucket vs fixed window

Three algorithms show up repeatedly. They are not interchangeable.

Fixed window

You count requests in fixed, clock-aligned windows, for example one minute with a limit of 100. A producer can send its full 100 in the last second of one window, the counter resets at the boundary, and another 100 fire in the first second of the next. The result is up to a 2x burst across the window boundary that breaks any provider expecting smoothed traffic. Skip it for email.
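The boundary burst is easy to reproduce. The sketch below is a minimal in-memory fixed-window counter; the function name, the 100-per-minute limit, and the timestamps are illustrative assumptions, not part of any provider's API.

```python
from collections import Counter


def fixed_window_allow(counts: Counter, now_s: float, limit: int, window_s: int = 60) -> bool:
    """Allow a send if the current fixed window has not reached `limit`."""
    window = int(now_s // window_s)  # window index; the count restarts at each boundary
    if counts[window] >= limit:
        return False
    counts[window] += 1
    return True


counts = Counter()
limit = 100
# 100 sends at t=59.9 s (end of window 0), then 100 more at t=60.0 s (start of window 1).
late = sum(fixed_window_allow(counts, 59.9, limit) for _ in range(100))
early = sum(fixed_window_allow(counts, 60.0, limit) for _ in range(100))
burst = late + early  # all 200 are allowed within about 0.1 s, twice the nominal limit
```

Every one of those 200 sends is legal under the fixed-window rule, which is exactly the problem.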

Leaky bucket

Requests enter a fixed-capacity queue and drain at a constant rate. It guarantees a smooth output rate, which is great for the ISP-facing side. The downside is no burst credit — under quota for an hour, you still cannot fire 1,000 messages when traffic resumes. Transactional mail spikes on real events, so you usually want burst tolerance.
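To make the "no burst credit" point concrete, here is a minimal single-process sketch of a leaky bucket used as a queue. The class and method names are hypothetical, and the caller passes the time explicitly so the behavior is easy to follow.

```python
from collections import deque


class LeakyBucket:
    """Fixed-capacity queue that releases at most one item per drain interval."""

    def __init__(self, capacity: int, drain_per_sec: float):
        self.capacity = capacity
        self.interval = 1.0 / drain_per_sec
        self.queue = deque()               # entries are (release_time, item)
        self.last_release = float("-inf")  # release time given to the previous item

    def offer(self, item, now: float) -> bool:
        """Queue an item, or return False if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        # Idle time earns nothing: the earliest possible slot is `now`,
        # and each later item is spaced one interval after the previous one.
        release_at = max(now, self.last_release + self.interval)
        self.last_release = release_at
        self.queue.append((release_at, item))
        return True

    def due(self, now: float) -> list:
        """Pop and return the items whose release time has arrived."""
        out = []
        while self.queue and self.queue[0][0] <= now:
            out.append(self.queue.popleft()[1])
        return out


lb = LeakyBucket(capacity=1000, drain_per_sec=1.0)
# After an idle hour, five messages arrive at once at t=3600 s.
for i in range(5):
    lb.offer(f"msg-{i}", now=3600.0)
first = lb.due(3600.0)   # only one message leaves immediately
later = lb.due(3602.0)   # two more have become due two seconds later
```

Even after an hour of silence, the five messages leave one per second rather than all at once.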

Token bucket

A bucket holds N tokens and refills at R tokens per second. Each send consumes one token. If the bucket is empty, the send waits or is rejected. Unused tokens accumulate up to N, so a quiet period earns burst capacity for the next spike. This is the right default for transactional email rate limit enforcement: it matches what providers do on their side and survives traffic bursts gracefully.
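Before distributing it, it helps to see the algorithm in one process. This is a minimal sketch with an injectable clock so the burst behavior can be checked deterministically; the class name and parameters are illustrative, and it is not safe to share across workers.

```python
import time
from typing import Callable


class LocalTokenBucket:
    """Single-process token bucket with lazy refill (illustration only)."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = capacity  # start full
        self.last = clock()

    def try_consume(self, n: float = 1) -> bool:
        now = self.clock()
        # Lazy refill: credit tokens for the elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


t = [0.0]  # fake clock, in seconds
bucket = LocalTokenBucket(capacity=5, refill_per_sec=1.0, clock=lambda: t[0])
burst = sum(bucket.try_consume() for _ in range(10))       # only the 5 stored tokens succeed
t[0] += 2.0                                                # two quiet seconds earn two tokens
after_wait = sum(bucket.try_consume() for _ in range(10))  # exactly those two succeed
```

The quiet two seconds buy exactly two sends, and a longer quiet period would buy at most five, the bucket capacity.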

Use token bucket for the global account limit and a separate, smaller bucket per recipient domain.

A Redis-backed token bucket in Python

Token bucket math is trivial. The hard part is atomicity across many workers. A naive GET; compute; SET sequence races under contention: two workers can read the same count and both spend the same token. Use a Lua script, which Redis runs atomically inside the server.
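The race needs no Redis to demonstrate. In the sketch below a plain dict stands in for the Redis key, and the two "workers" are interleaved by hand; all names are hypothetical.

```python
store = {"tokens": 1}  # stands in for a Redis key holding one token


def read() -> int:
    """The GET step: fetch the current token count."""
    return store["tokens"]


def write_after_consume(seen: int) -> bool:
    """The compute and SET steps, based on a count that may already be stale."""
    if seen >= 1:
        store["tokens"] = seen - 1
        return True
    return False


# Interleaving under contention: both workers read before either writes.
seen_a = read()
seen_b = read()
sent_a = write_after_consume(seen_a)
sent_b = write_after_consume(seen_b)
sends = int(sent_a) + int(sent_b)  # two sends were approved from a single token
```

Running the read, compute, and write steps as one atomic unit on the server removes the window between the two reads.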

The script stores two fields per bucket: token count and last refill timestamp. On every call it lazily refills based on elapsed time, then either consumes a token and returns 1, or returns 0 with the milliseconds until the next token.

-- token_bucket.lua
-- KEYS[1] = bucket key
-- ARGV[1] = capacity (max tokens)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = now in milliseconds
-- ARGV[4] = tokens requested (usually 1)
-- Returns: {allowed (0|1), tokens_remaining, retry_after_ms}

local capacity   = tonumber(ARGV[1])
local rate       = tonumber(ARGV[2])
local now_ms     = tonumber(ARGV[3])
local requested  = tonumber(ARGV[4])

local data = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(data[1])
local last   = tonumber(data[2])

if tokens == nil then
  tokens = capacity
  last   = now_ms
end

local elapsed = math.max(0, now_ms - last)
local refill  = (elapsed / 1000.0) * rate
tokens = math.min(capacity, tokens + refill)

local allowed = 0
local retry_after_ms = 0
if tokens >= requested then
  tokens = tokens - requested
  allowed = 1
else
  local deficit = requested - tokens
  retry_after_ms = math.ceil((deficit / rate) * 1000)
end

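-- HSET accepts multiple field/value pairs since Redis 4.0; HMSET is deprecated.
-- Never move ts backwards, so a worker with a lagging clock cannot cause a double refill.
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', math.max(now_ms, last))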
-- Expire after the time it would take to fully refill from empty, plus slack.
redis.call('PEXPIRE', KEYS[1], math.ceil((capacity / rate) * 1000) + 5000)

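-- Redis truncates Lua numbers to integers in replies, so return the fractional
-- token count as a string; the client parses it with float().
return {allowed, tostring(tokens), retry_after_ms}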

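The Python wrapper loads the script once with SCRIPT LOAD and calls it by SHA. That avoids shipping the source on every call. If Redis restarts or its script cache is flushed, EVALSHA fails with NOSCRIPT; reload the script and retry, or use redis-py's register_script, which handles the reload for you.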

# rate_limit.py
import time
from dataclasses import dataclass
from pathlib import Path

import redis

_SCRIPT_PATH = Path(__file__).with_name("token_bucket.lua")


@dataclass
class BucketSpec:
    capacity: int          # max tokens
    refill_per_sec: float  # tokens added per second


@dataclass
class Decision:
    allowed: bool
    tokens_remaining: float
    retry_after_ms: int


class TokenBucket:
    def __init__(self, client: redis.Redis):
        self.r = client
        self._sha = self.r.script_load(_SCRIPT_PATH.read_text())

    def consume(self, key: str, spec: BucketSpec, n: int = 1) -> Decision:
        now_ms = int(time.time() * 1000)
        allowed, remaining, retry_after = self.r.evalsha(
            self._sha,
            1,
            key,
            spec.capacity,
            spec.refill_per_sec,
            now_ms,
            n,
        )
        return Decision(
            allowed=bool(allowed),
            tokens_remaining=float(remaining),
            retry_after_ms=int(retry_after),
        )

On the send path, check the global bucket and then the per-domain bucket before handing the message to the provider. If either denies, requeue with a delay equal to the denying bucket's retry_after_ms.

# sender.py
from rate_limit import TokenBucket, BucketSpec

GLOBAL = BucketSpec(capacity=200, refill_per_sec=100.0)    # 100 msg/s, burst 200
PER_DOMAIN = BucketSpec(capacity=20, refill_per_sec=10.0)  # 10 msg/s per ISP


def try_send(bucket: TokenBucket, message) -> bool:
    domain = message.to.split("@", 1)[1].lower()

    g = bucket.consume("tb:global", GLOBAL)
    if not g.allowed:
        message.requeue_after_ms(g.retry_after_ms)
        return False

    d = bucket.consume(f"tb:domain:{domain}", PER_DOMAIN)
    if not d.allowed:
        # Refund the global token we just spent: the message did not leave.
        # Consuming -1 tokens credits one back; the script clamps to capacity on its next call.
        bucket.consume("tb:global", GLOBAL, n=-1)
        message.requeue_after_ms(d.retry_after_ms)
        return False

    # `provider` is your provider's SDK client, assumed to be created at module level.
    provider.send(message)
    return True

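The refund is approximate: a negative consume credits the global bucket back, but between the spend and the refund other workers briefly see one token fewer than is really available. Under heavy contention the count may be off by a token or two, well within the burst headroom.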
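The domain extraction in try_send assumes message.to is a bare address. If your messages can carry a display name such as "Ann <ann@example.com>", a small helper built on the standard library's email.utils.parseaddr is safer. The helper name is hypothetical; treat this as a sketch rather than a full address validator.

```python
from email.utils import parseaddr


def recipient_domain(to_header: str) -> str:
    """Return the lowercased domain of 'user@host' or 'Name <user@host>'."""
    _, addr = parseaddr(to_header)
    local, sep, domain = addr.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a usable address: {to_header!r}")
    # Normalize case and a trailing dot so every variant maps to one bucket key.
    return domain.lower().rstrip(".")


key = f"tb:domain:{recipient_domain('Ann <ann@Example.COM>')}"
```

Normalizing here matters because two spellings of the same domain would otherwise get two separate buckets and double the effective cap.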

Per-recipient throttling without melting Redis

Per-domain buckets are cheap. Keyed by the address domain as in try_send, cardinality equals the number of distinct recipient domains active within the key TTL; if you instead map domains to their mailbox provider, it drops to the number of recipient ISPs, usually under 200 even for large senders. If you go per-recipient-address (rare, mostly for abuse mitigation), shard the key space and set a short TTL so dormant addresses don't pin memory. The PEXPIRE in the Lua script handles this. Pick a key prefix that makes cardinality obvious: tb:domain:gmail.com beats tb:rl:7a3c.
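For per-address buckets, one possible key layout is sketched below. The tb:addr prefix, the shard count, and the choice to hash the address are all assumptions for illustration; hashing keeps raw addresses out of key names, and the shard label only groups keys for scans and metrics.

```python
import hashlib


def address_bucket_key(address: str, shards: int = 64) -> str:
    """Build a bucket key for one recipient address (hypothetical layout)."""
    # Assumption: addresses are compared case-insensitively for throttling purposes.
    normalized = address.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    shard = int(digest[:8], 16) % shards
    # The readable prefix keeps the cardinality class obvious.
    return f"tb:addr:{shard:02d}:{digest[:16]}"


key = address_bucket_key("User@Example.com")
```

The prefix still tells an operator at a glance that these keys scale with recipient addresses, not domains.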

Queue back-pressure when the downstream is slower than the producer

The rate limiter denies requests. Your producer needs somewhere to put the denied messages, and that somewhere needs to push back when it fills. That is back-pressure.

The three common queue choices each handle it differently:

Amazon SQS: lengthen visibility timeout on retries, use DelaySeconds for backoff, watch ApproximateNumberOfMessagesVisible. SQS won't slow your producer for you — you read the depth and pause when it crosses a threshold.
Kafka: consumer lag is your signal. If consumer_lag on the send topic exceeds N seconds of traffic, pause the producer or drop low-priority traffic. Kafka itself won't reject writes until the broker disk fills.
Redis Streams: bound with MAXLEN ~ N; the trim is approximate but cheap. When the stream length hits the cap, XADD evicts the oldest entries. That is silent data loss unless the producer checks the stream length (XLEN) and slows down before it reaches the cap.

Whichever you pick, the pattern is the same. The send worker reports queue depth and rejection rate to a control plane. The producer reads those and slows its enqueue rate, sheds non-critical traffic, or returns a 503 to the calling service.
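The producer-side decision can be a small pure function. This is a sketch under assumed thresholds (the depth limits and the 20% rejection rate are placeholders to tune for your system), and the names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    SHED_NON_CRITICAL = "shed_non_critical"
    REJECT = "reject"  # the calling service would see this as a 503


def admission(queue_depth: int, rejection_rate: float,
              soft_depth: int = 50_000, hard_depth: int = 200_000,
              max_rejection_rate: float = 0.20) -> Action:
    """Map the control-plane signals to what the producer should do next."""
    if queue_depth >= hard_depth:
        return Action.REJECT
    if queue_depth >= soft_depth or rejection_rate >= max_rejection_rate:
        return Action.SHED_NON_CRITICAL
    return Action.ACCEPT


decision = admission(queue_depth=60_000, rejection_rate=0.02)
```

Keeping the policy free of I/O means the same function works whether the depth comes from SQS, Kafka lag, or XLEN.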

Dead-letter and the retry budget

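Some messages never succeed. The address is dead, the ISP is down for an hour, or the provider returned a hard 5xx on a malformed payload. You need a bounded retry budget (3 to 5 attempts with exponential backoff and jitter) and a dead-letter queue (DLQ) for messages that exhaust it. Permanent failures such as a malformed payload should skip the retries and go straight to the DLQ, because retrying cannot fix them.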
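A retry budget with exponential backoff and jitter fits in a few lines. The sketch below uses "full jitter" (a uniform delay between zero and the exponential ceiling); the function name, the 1-second base, and the 5-minute cap are assumptions.

```python
import random
from typing import Optional

MAX_ATTEMPTS = 5  # the retry budget; after this the message goes to the DLQ


def next_delay_ms(attempts_made: int, base_ms: int = 1_000, cap_ms: int = 300_000,
                  rng: Optional[random.Random] = None) -> Optional[int]:
    """Delay before the next attempt, or None when the budget is spent."""
    if attempts_made >= MAX_ATTEMPTS:
        return None  # caller should dead-letter the message
    source = rng if rng is not None else random
    # The ceiling doubles with each attempt: base, 2x, 4x, ... up to cap_ms.
    ceiling = min(cap_ms, base_ms * 2 ** (attempts_made - 1))
    # Full jitter spreads retries out so they do not arrive in synchronized waves.
    return source.randint(0, ceiling)


delay = next_delay_ms(attempts_made=3, rng=random.Random(7))
```

Without the jitter, every message denied in the same burst would retry at the same instant and recreate the burst.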

A reasonable DLQ entry looks like this:

{
  "message_id": "msg_01HXYZ...",
  "to": "user@example.com",
  "first_attempt_at": "2026-05-19T11:02:14Z",
  "last_attempt_at": "2026-05-19T11:47:01Z",
  "attempts": 5,
  "last_error": {
    "code": "rate_limited",
    "http_status": 429,
    "provider_message": "Account daily quota exceeded"
  }
}

DLQ entries are not garbage. They are a signal. A 429 burst in the DLQ means your bucket was misconfigured. A flood of 4xx deferrals from one recipient domain means that ISP is throttling you specifically; a flood of 5xx rejections usually means it is blocking you outright. Build a replay tool that re-enqueues DLQ messages after operator review, never as a blind auto-replay, which turns a 10-minute incident into a multi-day reputation hit.
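Reading that signal usually starts with a count by error code and recipient domain. The sketch below assumes DLQ entries shaped like the JSON above; the function name and the sample data are made up for illustration.

```python
from collections import Counter


def summarize_dlq(entries: list) -> Counter:
    """Count DLQ entries by (error code, recipient domain)."""
    summary = Counter()
    for entry in entries:
        domain = entry["to"].rsplit("@", 1)[-1].lower()
        summary[(entry["last_error"]["code"], domain)] += 1
    return summary


sample = [
    {"to": "a@gmail.com", "last_error": {"code": "rate_limited"}},
    {"to": "b@Gmail.com", "last_error": {"code": "rate_limited"}},
    {"to": "c@example.com", "last_error": {"code": "invalid_payload"}},
]
top = summarize_dlq(sample).most_common(1)
```

A single (code, domain) pair dominating the summary is the cue to look at one bucket or one ISP before replaying anything.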

When the queue keeps growing

You sized your buckets, turned on back-pressure, and the queue is still climbing. That means offered load exceeds your provider's sustained rate, and no amount of buffering will fix it. Escalate in this order.

Shed load by priority. Tag messages at enqueue: critical (password reset, OTP), high (receipts), normal (notifications), low (digests). When depth crosses a threshold, drop or defer everything below high. A delayed digest is recoverable. A delayed OTP is a support ticket.
Spill to cold storage. Move oldest normal and low messages out of the hot queue into S3 or a relational table with a retry timestamp. A nightly job replays them when the live queue is healthy.
Break the circuit. If rejection rate from the provider crosses 20% for more than a minute, stop sending for 30 seconds. ISPs back off if you back off. Hammering them while throttled makes the throttle longer.
Page the on-call. A queue that has grown for an hour without recovery is not a throughput problem — it is an outage. Either your provider is degraded, IP reputation is collapsing, or a producer is in a runaway loop. None resolve on their own.
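The first and third steps above are mechanical enough to sketch. The priority names and the 20% / one minute / 30 seconds figures come from the steps; the function and class names, the depth threshold, and the injectable clock are assumptions.

```python
import time
from enum import IntEnum
from typing import Callable, Optional


class Priority(IntEnum):
    LOW = 0       # digests
    NORMAL = 1    # notifications
    HIGH = 2      # receipts
    CRITICAL = 3  # password reset, OTP


def should_enqueue(priority: Priority, depth: int, shed_depth: int = 50_000) -> bool:
    """Once depth crosses shed_depth, keep only HIGH and CRITICAL traffic."""
    return depth < shed_depth or priority >= Priority.HIGH


class CircuitBreaker:
    """Pauses sending for cooldown_s once rejections stay above threshold for sustain_s."""

    def __init__(self, threshold: float = 0.20, sustain_s: float = 60.0,
                 cooldown_s: float = 30.0, clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.sustain_s = sustain_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.over_since: Optional[float] = None  # when the rate first exceeded the threshold
        self.open_until = 0.0

    def record(self, rejection_rate: float) -> None:
        """Feed in the latest provider rejection rate (0.0 to 1.0)."""
        now = self.clock()
        if rejection_rate <= self.threshold:
            self.over_since = None  # recovered; restart the sustain timer
        elif self.over_since is None:
            self.over_since = now
        elif now - self.over_since >= self.sustain_s:
            self.open_until = now + self.cooldown_s
            self.over_since = None

    def allow_send(self) -> bool:
        return self.clock() >= self.open_until


t = [0.0]  # fake clock, in seconds
breaker = CircuitBreaker(clock=lambda: t[0])
breaker.record(0.5)   # rejections spike at t=0
t[0] = 61.0
breaker.record(0.5)   # still high a minute later: the breaker opens
paused = not breaker.allow_send()
```

The send worker would call allow_send before each batch and record after each provider response window.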

A well-designed transactional email rate limit is the first line of defense: it keeps you from needing any of these four escalations in the first place. Build it once, instrument it well, and the rest of the sending path stays predictable.