OpenAI’s rate limits break async Python code because the API counts every request that reaches its servers, while your semaphore or queue only caps what leaves your machine. By the time the first 429 comes back, more over-limit requests are already in flight. Your code thinks it’s controlling concurrency, but the API rejects requests before your rate limiter even knows they failed.
Pithy Cyborg | AI FAQs – The Details
Question: Why does OpenAI rate limiting break Python async code?
Asked by: Claude Sonnet 4.5
Answered by: Mike D (MrComputerScience) from Pithy Cyborg.
Why Rate Limiters Fail With Async Calls
Python’s asyncio lets you fire off hundreds of concurrent requests using asyncio.gather() or asyncio.create_task(). You add a semaphore with asyncio.Semaphore(10) thinking you’re controlling traffic. But semaphores only limit how many coroutines run simultaneously on your machine. They don’t track tokens per minute (TPM) or requests per minute (RPM) at the API level. OpenAI measures rate limits across a 60-second rolling window per organization. If you send 50 requests in 2 seconds (even with semaphore=5), the API sees 50 requests hitting its servers and kills everything over your tier limit. Your local semaphore controlled concurrency, not rate.
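You can see the gap between concurrency and rate with a stdlib-only sketch. The API call is a stand-in (`fake_api_call` is hypothetical, simulated with `asyncio.sleep`); the point is that a `Semaphore(5)` still lets all 50 requests start within a fraction of a second, which is exactly what a 60-second RPM window measures:

```python
import asyncio
import time

async def fake_api_call(semaphore: asyncio.Semaphore, timestamps: list):
    # The semaphore caps how many coroutines run at once,
    # but says nothing about how many start per minute.
    async with semaphore:
        timestamps.append(time.monotonic())
        await asyncio.sleep(0.01)  # stand-in for a fast API round trip

async def main():
    semaphore = asyncio.Semaphore(5)  # 5 concurrent -- but NO rate cap
    timestamps: list = []
    await asyncio.gather(*(fake_api_call(semaphore, timestamps) for _ in range(50)))
    window = timestamps[-1] - timestamps[0]
    return len(timestamps), window

sent, window = asyncio.run(main())
print(f"{sent} requests started within {window:.2f}s")
```

On a typical machine all 50 requests begin well inside one second: the semaphore shaped the burst into waves of five, but the provider’s per-minute counter still saw 50 arrivals.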
The Real Problem Nobody Warns You About
Most FastAPI tutorials show you async def functions with OpenAI calls and zero mention of production rate limit architecture. The official OpenAI Python SDK added automatic retry logic, but it uses exponential backoff that can stall your entire application for minutes during traffic spikes. Worse, if you’re processing user requests in real time, those retries block your API response times. Users see 30-second timeouts because your background worker is stuck retrying a batch job. The asyncio advantage (non-blocking I/O) disappears when every coroutine waits on the same rate-limited bottleneck. You traded threading complexity for distributed systems problems.
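The stall adds up faster than it looks. Here’s the arithmetic for a generic capped exponential backoff (the base, cap, and retry count are illustrative, not the SDK’s exact schedule):

```python
def worst_case_stall(retries: int, base: float = 0.5, cap: float = 8.0) -> float:
    # Attempt n waits min(cap, base * 2**n) seconds before retrying;
    # the caller blocks on the sum of every wait.
    return sum(min(cap, base * 2 ** n) for n in range(retries))

# 0.5 + 1 + 2 + 4 + 8 + 8 = 23.5 seconds of waiting
# before the request can even surface a final failure
print(worst_case_stall(6))
```

Six retries under these assumed parameters means a caller can sit for 23.5 seconds on a single request, which is how a “resilient” retry policy quietly becomes a 30-second user-facing timeout.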
When Semaphores Actually Work
Semaphores prevent rate limit errors only when your effective throughput is lower than the API tier and your pacing matches the provider’s measurement period. If you have 10,000 TPM and set Semaphore(5) with 0.5-second delays between calls, you’ll never hit limits because you’re artificially constraining yourself below capacity. This works for small scripts or single-user apps. Some developers combine semaphores with token bucket algorithms that track cumulative usage across the 60-second window. Libraries like aiolimiter implement this, and OpenAI does expose its server-side counter state in response headers (x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, and the matching reset headers), so you can sync your bucket against the provider instead of guessing.
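aiolimiter packages this pattern for you; to make the mechanism concrete, here is a bare-bones stdlib sketch of the same token bucket idea. The class name, rates, and demo loop are all illustrative, not a production implementation:

```python
import asyncio
import time

class TokenBucket:
    """Minimal token bucket: `rate` tokens refill evenly over `period` seconds."""

    def __init__(self, rate: float, period: float = 60.0):
        self.capacity = rate
        self.tokens = rate
        self.refill_per_sec = rate / period
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self, cost: float = 1.0):
        # Block until `cost` tokens are available, refilling as time passes.
        async with self._lock:
            while True:
                now = time.monotonic()
                self.tokens = min(
                    self.capacity,
                    self.tokens + (now - self.updated) * self.refill_per_sec,
                )
                self.updated = now
                if self.tokens >= cost:
                    self.tokens -= cost
                    return
                # Sleep just long enough for the missing tokens to refill.
                await asyncio.sleep((cost - self.tokens) / self.refill_per_sec)

async def demo():
    # 5 requests/second for a quick demo (production would be e.g. RPM / 60s).
    bucket = TokenBucket(rate=5, period=1.0)
    start = time.monotonic()
    for _ in range(10):
        await bucket.acquire()
    return time.monotonic() - start

elapsed = asyncio.run(demo())
print(f"10 acquisitions took {elapsed:.2f}s")
```

The first five acquisitions drain the bucket instantly; the next five each wait for refill, so ten calls take roughly one second at a 5/second rate. That smoothing, rather than a concurrency cap, is what the provider’s rolling window actually measures. For real use you would charge a per-request token `cost` estimated from your prompt size to track TPM as well as RPM.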
What This Means For You
- Implement token bucket rate limiting with aiolimiter that tracks both RPM and TPM across 60-second windows, not just concurrent request counts.
- Add circuit breaker logic that pauses all API calls for 60 seconds after receiving three consecutive 429 errors to let your rate limit window reset.
- Use OpenAI’s Batch API for non-time-sensitive workloads because it runs under its own separate, much higher limits and costs 50% less than real-time endpoints.
- Monitor your RateLimitError exceptions in production and alert when retry backoff exceeds 10 seconds, signaling you need a higher API tier.
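The circuit breaker bullet can be sketched in a few dozen lines of stdlib Python. The class and thresholds below are a minimal illustration of the pattern (three consecutive 429s open the breaker for a 60-second cooldown), not a hardened implementation:

```python
import time

class CircuitBreaker:
    """Blocks calls after `threshold` consecutive failures, for `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 60.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        # Call this on every 429 response.
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

    def record_success(self):
        # Any successful call resets the consecutive-failure count.
        self.failures = 0
        self.opened_at = None

    def is_open(self) -> bool:
        # Open = stop sending; after the cooldown, allow a probe request.
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None
            self.failures = 0
            return False
        return True

breaker = CircuitBreaker(threshold=3, cooldown=60.0)
for _ in range(3):
    breaker.record_failure()  # three consecutive 429s
print(breaker.is_open())  # breaker is now open; skip API calls until reset
```

In an async worker you would check `breaker.is_open()` before each call and `await asyncio.sleep()` while it’s open, which lets the provider’s rolling window drain instead of feeding it more doomed requests.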
Want AI Breakdowns Like This Every Week?
Subscribe to Pithy Cyborg (AI news made simple. No ads. No hype. Just signal.)
Subscribe (Free) → pithycyborg.substack.com
Read archives (Free) → pithycyborg.substack.com/archive
You’re reading Ask Pithy Cyborg. Got a question? Email ask@pithycyborg.com (include your Substack pub URL for a free backlink).
