For most of software history, the choice between threads, async, and processes was an implementation detail, something you could get wrong and fix later. The agent era ends this grace period. When your traffic is generated by software instead of people, the concurrency paradigm stops being a detail and becomes the decision that determines whether the system stands up or falls over. Let's discuss why.
The shift that changes everything: human scale vs. agent scale
Start with the number that reframes the whole problem.
A human using your software is rate-limited by behavior and biology. We read, we think, we click, we wait. A single person generates maybe one or two requests every few seconds, and there is a hard ceiling on how many humans can plausibly use a given system at once.
An agent has no such ceiling. It doesn't read at human speed, doesn't pause to think, and doesn't get tired. Worse, agents arrive in multiples: one workflow can spawn dozens of parallel tool calls, and one deployment can run thousands of those workflows at once. The result is that agent-driven load routinely runs 1,000× to 10,000× the concurrency of the human-driven equivalent, often in synchronized bursts rather than a smooth trickle.
This is not "more of the same traffic." It is a different regime. At human scale, almost any concurrency model works: you have enough headroom that inefficiency is invisible. At agent scale, the inefficiency is the system's behavior. The paradigm you picked is no longer hiding behind spare capacity.
Three words people use interchangeably and shouldn't
Before the argument can land, three terms have to be separated cleanly, because conflating them is the root of most bad architecture decisions.
- Concurrency is about structure: composing a program so that many logical tasks can be in progress at once. It's a property of how you organize work.
- Parallelism is about execution: literally running multiple tasks at the same instant on different CPU cores. It's a property of the hardware actually doing the work.
- Multithreading is one mechanism (spawning multiple threads of execution that share memory) that you can use to achieve either of the above.
The crucial, liberating insight is that these are independent:
- You can have concurrency without parallelism: one core rapidly interleaving thousands of tasks (this is exactly what an async event loop does).
- You can have parallelism without multithreading: multiple separate processes, each on its own core.
Most engineers reach for multithreading reflexively when they hear "concurrency," and at agent scale that reflex is expensive. The right question is never "how do I add threads?" It is "do I need concurrency, parallelism, or both, and what's the cheapest mechanism that delivers it?"
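To make "concurrency without parallelism" tangible, here is a minimal sketch using Python's asyncio. The names fake_io_task, run_many, and demo are hypothetical, and asyncio.sleep stands in for a real network wait. It runs many logical tasks and records which OS threads executed them.

```python
import asyncio
import threading


async def fake_io_task(task_id: int, thread_ids: set[int]) -> int:
    # Record which OS thread runs this task, then simulate waiting on I/O.
    thread_ids.add(threading.get_ident())
    await asyncio.sleep(0.01)  # yields to the event loop while "waiting"
    return task_id


async def run_many(n_tasks: int) -> tuple[int, int]:
    # Start all tasks at once; the event loop interleaves them on one thread.
    thread_ids: set[int] = set()
    results = await asyncio.gather(
        *(fake_io_task(i, thread_ids) for i in range(n_tasks))
    )
    return len(results), len(thread_ids)


def demo(n_tasks: int = 1000) -> tuple[int, int]:
    # Returns (tasks completed, distinct OS threads used).
    return asyncio.run(run_many(n_tasks))
```

Calling demo(1000) completes a thousand tasks while reporting a single thread: all of them were in progress at once, yet no two ever executed at the same instant.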
Why the obvious paradigm breaks first
The intuitive model is one thread per request: a connection arrives, you spawn a thread, it does the work, it exits. This is easy to reason about and it works beautifully, right up until it doesn't.
The reason it dies is not CPU. It's memory, and specifically thread stacks. Each OS thread reserves a stack measured in megabytes (8 MB is a common default on Linux). The arithmetic is brutal:
| Concurrent requests | Stack memory (at 8 MB/thread) |
|---|---|
| 1,000 (human scale) | ~8 GB - painful but survivable |
| 1,000,000 (agent scale) | ~8 TB - physically impossible on one machine |
One line of math ends the thread-per-request model for agent workloads. You exhaust RAM on stacks alone long before doing any useful work, and well before you run out of cores.
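The arithmetic in the table can be reproduced in a few lines. This is a sketch under the table's own assumption that every thread's full 8 MB stack reservation counts against memory. The function names are hypothetical, and decimal units are used to match the table's round figures.

```python
def stack_memory_bytes(concurrent_requests: int, stack_mb_per_thread: int = 8) -> int:
    # One thread per request, each charged its full stack reservation.
    return concurrent_requests * stack_mb_per_thread * 1_000_000


def human_readable(n_bytes: int) -> str:
    # Decimal units (1 GB = 10**9 bytes), matching the table above.
    for unit, size in (("TB", 10**12), ("GB", 10**9), ("MB", 10**6)):
        if n_bytes >= size:
            return f"{n_bytes / size:g} {unit}"
    return f"{n_bytes} B"
```

For example, human_readable(stack_memory_bytes(1_000)) gives "8 GB" and human_readable(stack_memory_bytes(1_000_000)) gives "8 TB", the two rows of the table.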
There's a second, subtler tax even below that ceiling: context switching. A 4-core / 8-thread machine can truly execute only 8 threads at any instant. Pile 1,000 runnable threads onto it, and the kernel spends its time saving and restoring thread state and thrashing CPU caches, burning cycles on bookkeeping instead of your logic. Performance degrades non-linearly: 1,000 threads isn't a little slower than 8; it can be 10–50× slower.
This is the heart of the matter. The paradigm that's easiest to write is the one that fails first under agent load. That inversion, easy-to-write equals first-to-break, is why the choice now demands deliberate thought rather than reflex.
The paradigms, and what each is actually for
Once thread-per-request is off the table, the real menu appears. Each option is a different answer to "concurrency, parallelism, or both?"
- Thread pool (bounded threads): A fixed set of worker threads, say one per core, pulling from a queue. Excellent for CPU-bound work because it maps directly onto physical cores with no oversubscription. But it caps concurrency at the pool size: if every worker blocks waiting on I/O, the queue backs up and new work simply waits. Great for computation, poor for massive I/O concurrency.
- Async / event loop: A small number of threads (often one per core) running an event loop that multiplexes thousands of connections via epoll/kqueue. When a task waits on I/O, it yields and another task runs on the same thread. Memory per task is kilobytes, not megabytes, so a single box handles millions of concurrent connections. This is the paradigm suitable for I/O-bound agent load. Its weakness is the mirror image of the thread pool's: a CPU-heavy task that doesn't yield blocks the entire loop, starving every other connection. "Async all the way down" is a requirement, not a slogan; one blocking call anywhere stalls everything.
- Multiprocessing: Separate processes, separate memory, separate everything: true parallelism that sidesteps shared-state hazards entirely. The cost is communication: data crossing the process boundary must be serialized and copied, which can dominate for chatty or large-payload workloads. This is the right tool when you need parallel computation with strong isolation.
The pattern across all three: there is no universally best paradigm. There is only the paradigm that matches your workload shape (I/O-bound vs. CPU-bound) and your scaling axis (concurrency vs. parallelism). Choosing well requires knowing which one you actually have.
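The "one blocking call stalls everything" failure mode is easy to reproduce. The sketch below is illustrative only: blocking_work, heartbeat, and run are hypothetical names, and time.sleep stands in for any call that does not yield. It counts how often a heartbeat task gets to run while the blocking call is in progress, first on the loop itself and then offloaded with asyncio.to_thread.

```python
import asyncio
import time


def blocking_work(seconds: float) -> str:
    # Stand-in for a call that never yields to the event loop.
    time.sleep(seconds)
    return "done"


async def heartbeat(ticks: list[float], interval: float, stop: asyncio.Event) -> None:
    # Appends a timestamp every time the loop gives this task a turn.
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def run(offload: bool) -> int:
    ticks: list[float] = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(ticks, 0.01, stop))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    if offload:
        await asyncio.to_thread(blocking_work, 0.2)  # loop stays free
    else:
        blocking_work(0.2)  # runs on the loop thread and freezes it
    stop.set()
    await hb
    return len(ticks)
```

With offload=False the heartbeat manages a single tick; with offload=True it keeps ticking throughout. Note that asyncio.to_thread helps here because time.sleep releases the GIL; for pure-Python CPU-heavy work on a standard build, a process pool would likely be the better offload target.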
A concrete illustration: Python's GIL and what "removing it" really buys
Nothing makes the I/O-vs-CPU distinction concrete quite like Python's Global Interpreter Lock (GIL), and the recent free-threaded builds turn it into a clean natural experiment.
The GIL is a single lock ensuring only one thread executes Python bytecode at a time. With Python 3.14 (October 2025), the free-threaded "no-GIL" build became officially supported. It is still optional: the default build keeps the GIL, and the free-threaded build carries a single-threaded performance penalty of roughly 5–10%. The interesting question isn't "is the GIL gone?" It's "which paradigm actually benefits when it is?"
The answer sharpens everything above:
- CPU-bound threading is the big winner. With the GIL, four threads doing heavy computation run at roughly one core's throughput as they queue for the lock. Without it, they run on four cores in genuine parallel, netting a 3–4× speedup on a 4-core box. The paradigm goes from pointless to correct.
- I/O-bound threading barely changes. Threads waiting on network or disk already release the GIL during the wait, so concurrency was never blocked there. Removing the lock gives almost nothing, possibly slightly less, due to the single-threaded tax.
- Async (asyncio) gains nothing at all. It is single-threaded by design; it never contended for a lock that only one thread was holding. Removing that lock changes nothing.
- Multiprocessing keeps working, but free-threaded threading now offers a lighter, shared-memory alternative for the cases that used processes only to dodge the GIL.
The lesson generalizes far beyond Python: a concurrency improvement only helps the paradigm whose actual bottleneck it removes. Free-threading is a CPU-parallelism feature. Pointing it at an I/O-bound async service (the very shape most agent traffic takes) yields nothing. Knowing which bottleneck you have is the whole game.
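One way to run this experiment yourself is a small harness like the following. It is a sketch, not a benchmark: cpu_task and timed_run are hypothetical names, and the GIL check relies on sys._is_gil_enabled(), which exists on CPython 3.13 and later, with a fallback that assumes the GIL is present on older versions.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def gil_enabled() -> bool:
    # sys._is_gil_enabled() is available on CPython 3.13+.
    # Assumption: interpreters without it are treated as having the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else bool(check())


def cpu_task(n: int) -> int:
    # Pure-Python arithmetic: on a GIL build only one thread runs this at a time.
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_run(n_threads: int, n: int) -> tuple[float, list[int]]:
    # Run the same CPU-bound task on n_threads threads and time the batch.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(cpu_task, [n] * n_threads))
    return time.perf_counter() - start, results
```

Comparing timed_run(1, n) against timed_run(4, n) for a large n, on a standard build and on a free-threaded build, should show the pattern described above: elapsed time growing roughly with the thread count when gil_enabled() is True, and staying much flatter when it is False. Exact numbers will vary by machine.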
Under the abstractions: the kernel doesn't negotiate
Even the right paradigm runs into hard limits imposed by the kernel and the hardware. These are the ceilings the agent era forces you to confront, which human-scale systems rarely or never reach.
- File descriptors: Every connection is a file descriptor, and the historical default limit of 1,024 per process is a relic. Raising it is easy; the deeper issue is how you wait on descriptors. select() rescans every descriptor on every call, an O(n) operation that collapses at scale. epoll/kqueue returns only the descriptors that are actually ready, O(ready), which is why every high-scale server is built on them. With a million idle connections and five hundred active, epoll wakes you for the five hundred.
- Memory per connection: Each connection costs kernel socket buffers, a TCP control block, your application buffers, and, often the biggest and most forgotten piece, TLS state. A modest ~56 KB per connection becomes ~56 GB at a million connections, before any application logic runs. This is why pushing a single machine toward ten million connections is fundamentally a memory-tuning exercise.
- CPU cache locality: A main-memory access is ~100× slower than an L1 cache hit, so real systems are memory-latency-bound, not compute-bound. Two forces destroy locality at scale: context switches (which leave the new task's data cold in cache) and false sharing (two cores fighting over the same 64-byte cache line even when touching different variables). The architectural answer is thread-per-core with sharded state: pin a worker to each core, give it its own hot data, and route each connection always to the same core so its state stays warm.
- Kernel scheduler behavior: The scheduler's own overhead grows with the number of runnable threads, thread migration between cores throws away warm caches, and naive wakeups cause "thundering herd" stampedes. The mitigations (CPU pinning, SO_REUSEPORT, EPOLLEXCLUSIVE, and batched I/O via io_uring) all push in the same direction.
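The "wakes you only for the ready ones" behavior from the file-descriptor point can be observed with Python's selectors module, whose DefaultSelector picks epoll on Linux and kqueue on BSD/macOS. The sketch below is a scaled-down illustration using local socket pairs rather than real network connections, and ready_among_idle is a hypothetical name.

```python
import selectors
import socket


def ready_among_idle(n_connections: int, n_active: int) -> int:
    # DefaultSelector uses the most efficient mechanism the platform offers.
    sel = selectors.DefaultSelector()
    pairs = [socket.socketpair() for _ in range(n_connections)]
    try:
        for ours, _theirs in pairs:
            ours.setblocking(False)
            sel.register(ours, selectors.EVENT_READ)
        # Only the first n_active peers send anything; the rest stay idle.
        for _ours, theirs in pairs[:n_active]:
            theirs.send(b"x")
        events = sel.select(timeout=1.0)
        return len(events)  # number of sockets reported as readable
    finally:
        sel.close()
        for ours, theirs in pairs:
            ours.close()
            theirs.close()
```

With 100 registered sockets and 5 senders, the selector reports only the 5 readable ones; the idle 95 never appear in the result, so the application does no per-call work for them.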
Follow these four constraints and they converge, by necessity rather than fashion, on one shape: async, epoll/io_uring-based, thread-per-core, pinned, with sharded state. Every high-scale system arrives at roughly this architecture because the underlying abstraction leaves no other path. The async paradigm wasn't an aesthetic preference; it's what these limits force on you.
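The "sharded state" half of that shape comes down to a stable routing rule. Here is a minimal sketch, assuming connections carry a string id; shard_for and ShardedState are hypothetical names. Actual CPU pinning (for example via os.sched_setaffinity on Linux) is deliberately left out, so this shows only the routing and ownership idea.

```python
import zlib


def shard_for(connection_id: str, n_shards: int) -> int:
    # Stable hash: the same connection id always maps to the same shard.
    # Python's built-in hash() is salted per process for str, so crc32 is used.
    return zlib.crc32(connection_id.encode("utf-8")) % n_shards


class ShardedState:
    """One private dict per worker; no dict is shared between shards."""

    def __init__(self, n_shards: int) -> None:
        self.shards: list[dict[str, int]] = [{} for _ in range(n_shards)]

    def record_request(self, connection_id: str) -> int:
        # Route to the owning shard and update only that shard's state.
        idx = shard_for(connection_id, len(self.shards))
        shard = self.shards[idx]
        shard[connection_id] = shard.get(connection_id, 0) + 1
        return idx
```

Because every request for a given connection lands on the same shard, a worker that owns that shard never needs a lock to touch the connection's state, and the data it touches is more likely to still be warm in its core's cache.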
Agents don't just generate more load, they generate different load
Finally, agents misbehave in ways humans don't, and these behaviors interact violently with a poorly chosen paradigm.
- Retry storms: A human who hits an error waits and tries again, maybe. An agent retries immediately, and if a thousand agents all retry a degraded service at once, they amplify the load that caused the degradation, turning a hiccup into a collapse. The defenses (exponential backoff with jitter, circuit breakers) only work if your paradigm can shed load gracefully instead of seizing up.
- Thundering herds: Agents scheduled to act at the top of the hour all wake at exactly the top of the hour. Synchronized bursts hammer systems that would handle the same total volume fine if it were spread out. Jitter on schedules is mandatory.
- Amplification and connection reuse: One agent action can fan out into many downstream calls. Cascading failures propagate fast, and agents make hundreds of requests per connection where humans make one or two. This makes connection reuse (HTTP/2 multiplexing, keep-alive) essential rather than optional.
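The first defense named above, exponential backoff with jitter, can be sketched as follows. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name, the base and cap values, and the injectable rng parameter are all illustrative assumptions.

```python
import random
from typing import Optional


def backoff_delays(max_retries: int, base: float = 0.1, cap: float = 10.0,
                   rng: Optional[random.Random] = None) -> list[float]:
    # rng is injectable so tests can be deterministic; default is unseeded.
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Ceiling doubles each attempt, but never exceeds cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick anywhere in [0, ceiling] so clients desynchronize.
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

The randomness is the important part: without it, a thousand agents that failed together would all retry together at 0.1 s, 0.2 s, 0.4 s, and so on, recreating the burst each time.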
A paradigm that can absorb a synchronized burst, apply backpressure, and reject excess work cleanly survives these behaviors. One that responds to a burst by spawning a thread per request amplifies them into an outage.
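"Reject excess work cleanly" can be as simple as a counter in front of the handler. The following is a minimal sketch of that idea for an asyncio service; AdmissionGate, Overloaded, and simulate_burst are hypothetical names, and a real service would typically translate the rejection into something like an HTTP 429 or 503 response.

```python
import asyncio


class Overloaded(Exception):
    """Raised when the gate is at capacity and rejects new work."""


class AdmissionGate:
    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    async def run(self, work):
        # Reject immediately rather than queueing without bound.
        if self.in_flight >= self.max_in_flight:
            raise Overloaded("at capacity; retry later")
        self.in_flight += 1
        try:
            return await work()
        finally:
            self.in_flight -= 1


async def simulate_burst(burst_size: int, max_in_flight: int) -> tuple[int, int]:
    gate = AdmissionGate(max_in_flight)

    async def work() -> str:
        await asyncio.sleep(0.01)  # stand-in for real request handling
        return "ok"

    # All requests arrive at once, like a synchronized agent burst.
    results = await asyncio.gather(
        *(gate.run(work) for _ in range(burst_size)), return_exceptions=True
    )
    accepted = sum(1 for r in results if r == "ok")
    rejected = sum(1 for r in results if isinstance(r, Overloaded))
    return accepted, rejected
```

A burst of 100 simultaneous requests against a limit of 10 yields 10 accepted and 90 rejected, and the rejected ones cost almost nothing. Combined with client-side backoff, this is what lets a service degrade instead of collapse.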
The bottom line
In the human era, the concurrency paradigm was an implementation detail because spare capacity hid your mistakes. In the agent era, there is no spare capacity; the load is 1,000× to 10,000× larger, arrives in synchronized bursts, and retries when it fails.
That changes the status of the decision. The paradigm is no longer how you write the system; it is what determines whether the system works. And the choice is not about finding the "best" paradigm; there isn't one. It's about matching the paradigm to two things: the shape of your workload (I/O-bound favors async; CPU-bound favors thread pools or true parallelism) and the physics of the machine (which, at scale, pushes everyone toward async, thread-per-core, and sharded state).
Get that match right and a single modest server absorbs thousands of agents. Get it wrong and no amount of hardware saves you, because you'll hit a memory wall, a context-switching wall, or a cache-locality wall long before you hit a compute wall. In the agent era, that match is the architecture.