Dylan Patel on the 100x hiding in hardware-software co-design
SemiAnalysis founder Dylan Patel argues the biggest AI gains no longer come from faster chips alone — they come from co-designing models, kernels, and silicon together, turning three stacked 2x wins into a single 100x, and reshaping the NVIDIA-vs-TPU, CUDA-moat, and compute-crunch debates in the process.
The 100x Hides Between the Layers
The biggest AI gains no longer come from any single layer but from co-designing model, kernels, and silicon together. Three independent 2x wins, one per layer, multiply to only 8x; optimized jointly across the stack, they can compound to 100x.
instead of being multiplicative to 8x, it's actually 100x because you've optimized across all three layers
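The arithmetic behind that claim can be made explicit. The sketch below only restates the 2x, 8x, and 100x figures quoted above. The variable names are illustrative, and the "per-layer equivalent" is a derived number, not something stated in the conversation.

```python
# Illustrative arithmetic only: the 2x and 100x figures are the ones quoted above.
independent_gains = [2.0, 2.0, 2.0]  # model, kernels, silicon, each tuned alone

stacked = 1.0
for gain in independent_gains:
    stacked *= gain  # independent wins simply multiply: 2 * 2 * 2

co_designed = 100.0  # the claimed result of optimizing all three layers jointly

# How much of the 100x is attributable to cross-layer effects rather than stacking.
extra_from_co_design = co_designed / stacked

# The gain each layer would need, if the layers were independent, to reach 100x.
per_layer_equivalent = co_designed ** (1.0 / 3.0)

print(f"stacked: {stacked}x")
print(f"extra from co-design: {extra_from_co_design}x")
print(f"per-layer equivalent: {per_layer_equivalent:.2f}x")
```

Put differently, the claimed 100x implies roughly 12.5x beyond what the three layers deliver when tuned separately.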
Everything Is Downstream of That Curve
Serving reduces to one tradeoff — throughput (users batched) versus interactivity (tokens per second per user) — and every infrastructure, model, and product decision is just a choice of where to sit on that curve.
The throughput interactivity curve is the most important one.
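A toy model makes the shape of that curve concrete. This is a sketch under loud assumptions: each decode step is taken to cost a fixed overhead plus a small per-sequence cost, and both millisecond figures (`fixed_ms`, `per_seq_ms`) are hypothetical values chosen for illustration, not measurements from any real system.

```python
def serving_point(batch_size, fixed_ms=20.0, per_seq_ms=0.5):
    """Return (tokens/s per user, total tokens/s) for one batch size.

    Assumption: one decode step emits one token per batched user and takes
    fixed_ms + per_seq_ms * batch_size milliseconds. Both constants are made up.
    """
    step_ms = fixed_ms + per_seq_ms * batch_size
    interactivity = 1000.0 / step_ms          # tokens per second seen by each user
    throughput = batch_size * interactivity   # tokens per second across all users
    return interactivity, throughput


# Sweep batch size to trace out points along the curve.
curve = [(b, *serving_point(b)) for b in (1, 8, 64, 256)]
for batch, interactivity, throughput in curve:
    print(f"batch={batch:4d}  tok/s/user={interactivity:6.1f}  total tok/s={throughput:8.1f}")
```

In this toy model, larger batches raise total throughput while lowering per-user speed, which is the tradeoff the section describes.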
Benchmarks Must Live or They Lie
Point-in-time benchmarks are stale on arrival — inference software ships about twice a week and cost drops ~60x a year — so the only honest benchmark re-runs every day on the latest hardware and models.
you can't have point in time benchmarking. You need to have benchmarks be living and breathing
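To see how quickly a snapshot goes stale, the ~60x-a-year figure can be converted into a per-day drift. The sketch below assumes the cost decline is a smooth exponential over the year, which is a simplification (real improvements arrive in discrete releases). The function name `staleness_factor` is hypothetical.

```python
def staleness_factor(days_old, yearly_drop=60.0):
    """Factor by which a benchmark's cost figure overstates current cost.

    Assumption: cost falls smoothly and exponentially by `yearly_drop` per
    365 days. The 60x default is the figure quoted in the text above.
    """
    return yearly_drop ** (days_old / 365.0)


for days in (1, 7, 30, 90, 365):
    print(f"{days:3d} days old -> cost overstated by {staleness_factor(days):.2f}x")
```

Under this assumption, a month-old result already overstates cost by roughly 1.4x, which is the case for re-running benchmarks continuously.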
NVIDIA vs TPU Is the Wrong Question
No chip wins in isolation — the winner is whichever silicon the model was co-designed for, because different matrix-multiply units and network topologies make the same model great on one and terrible on the other.
the NVLink can only connect 72 GPUs. For Google, their ICI can connect 8,000 chips at super high bandwidth, but you have to pass through other chips to get there because there's no switch
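The topology difference in that quote can be sketched with hop counts. The example below assumes the switchless network is a 3D torus and uses a hypothetical 20 x 20 x 20 layout to reach 8,000 chips. Those dimensions are an illustration only, not a description of any actual deployment. In a switched domain, by contrast, any two endpoints communicate through the switch without passing through other chips.

```python
def ring_distance(offset, n):
    """Shortest number of links between two nodes `offset` apart on a ring of n."""
    return min(offset, n - offset)


def torus_hop_stats(dims):
    """Return (max hops, average hops) between chips in a torus with these dims.

    The average is over a uniformly random destination (including the source
    itself), computed per dimension and summed, since each axis is independent.
    """
    max_hops = sum(n // 2 for n in dims)
    avg_hops = sum(
        sum(ring_distance(k, n) for k in range(n)) / n for n in dims
    )
    return max_hops, avg_hops


dims = (20, 20, 20)  # hypothetical layout, chosen only because it totals 8,000 chips
chips = dims[0] * dims[1] * dims[2]
max_hops, avg_hops = torus_hop_stats(dims)
print(f"{chips} chips: worst case {max_hops} hops, average {avg_hops} hops")
```

A model whose communication pattern keeps traffic between neighboring chips suits this kind of network. One that needs frequent all-to-all exchange pays for the extra hops, which is why the same model can behave very differently on the two designs.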
The CUDA Moat Was Never CUDA
Models write kernels well enough that CUDA-as-software is commoditizing, so the real lock-in is that popular open weights are co-designed for GPU shapes — the moat is the models, not the language.
models are just great at coding and all software gets commoditized
Memory: The Bottleneck Frozen in Time
Compute keeps leaping while the memory cell itself hasn't fundamentally changed in decades — DRAM is ~40 years old, NAND ~25 — and the next unlock is stacking memory directly onto the compute die.
you stack the memory directly on the chip and that makes your bandwidth explode
Every Specialized Chip Risks a Local Minimum
A custom ASIC can be optimal for today's model yet stranded the moment the architecture changes. Because labs don't even know their own architecture a year out, general-purpose silicon retains an option value that specialization can't.
some people will race to a local minima and then the question is like how do you leap how do you scoot back over to like the absolute minima
The Compute Crunch Is Demand Outrunning Supply
The crunch persists as long as the useful work models can do grows faster than compute does — and while per-token margins stay high, every rented GPU pays for itself, so the buildout keeps accelerating.
if the work that these models can do does not expand faster than the compute capacity then that tide turns