Sequoia Capital

Dylan Patel on the 100x hiding in hardware-software co-design

Dylan Patel · Founder & CEO of SemiAnalysis
~70 min · English · Sequoia Capital
GPU · AI Infrastructure · LLM
TL;DR

SemiAnalysis founder Dylan Patel argues the biggest AI gains no longer come from faster chips alone — they come from co-designing models, kernels, and silicon together, turning three stacked 2x wins into a single 100x, and reshaping the NVIDIA-vs-TPU, CUDA-moat, and compute-crunch debates in the process.

01 Core Mental Model

The 100x Hides Between the Layers

The biggest AI gains no longer come from any single layer but from co-designing model, kernels, and silicon together — three stacked 2x wins multiply to 8x, yet co-designed across the stack they compound to 100x.

instead of being multiplicative to 8x, it's actually 100x because you've optimized across all three layers

Dylan Patel, SemiAnalysis
Key Insight
The claim is quietly deflationary about hardware alone. If the frontier really lived in silicon, the Hopper-to-Blackwell jump (~30x on optimized DeepSeek) would dominate. Instead Patel puts the biggest single share of recent gains on the model layer, and the real leverage on the seams between layers — which is exactly where a pure-play chipmaker or a pure-play modeling team can't reach.
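The arithmetic behind the 8x-versus-100x claim can be sketched in a few lines. This is only an illustration: the 2x-per-layer and 100x figures come from the quote above, while `seam_multiplier` and `equivalent_per_layer` are hypothetical names introduced here to show how much of the gain would have to come from cross-layer effects.

```python
import math

def stacked_gain(per_layer_gains):
    """Combined speedup if each layer's gain is independent of the others."""
    return math.prod(per_layer_gains)

# Three layers (model, kernels, silicon), each improved 2x in isolation.
independent_total = stacked_gain([2, 2, 2])

# Patel's co-designed figure, taken from the quote above.
co_designed_total = 100

# Extra factor that would have to come from the seams between layers.
seam_multiplier = co_designed_total / independent_total

# Equivalent per-layer gain if 100x were spread evenly over three layers.
equivalent_per_layer = co_designed_total ** (1 / 3)

print(independent_total, seam_multiplier, round(equivalent_per_layer, 2))
```

Read this way, the claim is that co-design contributes a factor larger than any single layer's isolated 2x.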

02 The One Curve

Everything Is Downstream of That Curve

Serving reduces to one tradeoff — throughput (users batched) versus interactivity (tokens per second per user) — and every infrastructure, model, and product decision is just a choice of where to sit on that curve.

The throughput interactivity curve is the most important one.

Dylan Patel, SemiAnalysis
Key Insight
This reframes 'is my serving fast or cheap?' as a position, not a verdict. The same hardware can deliver ~1000 tokens/sec in aggregate across a big batch or ~250 tokens/sec to a single user: a 4x swing in cost per token with no hardware change. That is why fast-mode products (Claude Code fast mode, priority queues) can rationally charge multiples. They are selling a point on the curve, not a different chip.
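A toy model makes the tradeoff concrete. It is a sketch under stated assumptions, not a description of any real serving stack: each decode step is assumed to pay a fixed cost to read the weights plus a small per-sequence cost, and the timing constants and the $2/hour rental price are hypothetical.

```python
def serving_point(batch, weight_load_s=0.004, per_seq_s=0.0005):
    """Return (tokens/sec per user, total tokens/sec) for one batch size.

    Assumed model: every decode step pays a fixed weight-read time plus a
    per-sequence increment, and each sequence emits one token per step.
    """
    step_s = weight_load_s + batch * per_seq_s
    interactivity = 1.0 / step_s   # what a single user experiences
    throughput = batch / step_s    # what the operator can sell in total
    return interactivity, throughput

def cost_per_million_tokens(throughput_tok_s, dollars_per_hour=2.0):
    """Hardware cost per million tokens at a given total throughput."""
    return dollars_per_hour / 3600.0 / throughput_tok_s * 1e6

# Walk along the curve: larger batches are cheaper per token but slower per user.
for batch in (1, 8, 64):
    per_user, total = serving_point(batch)
    print(batch, round(per_user, 1), round(total, 1),
          round(cost_per_million_tokens(total), 3))

# The ~1000 vs ~250 tokens/sec figures from the text imply a 4x cost ratio.
print(cost_per_million_tokens(250) / cost_per_million_tokens(1000))
```

Every batch size is a different point on the same curve, so pricing a fast tier amounts to choosing a smaller batch on unchanged hardware.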

03 Measurement

Benchmarks Must Live or They Lie

Point-in-time benchmarks are stale on arrival — inference software ships about twice a week and cost drops ~60x a year — so the only honest benchmark re-runs every day on the latest hardware and models.

you can't have point in time benchmarking. You need to have benchmarks be living and breathing

Dylan Patel, SemiAnalysis
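How quickly a snapshot goes stale follows from the two rates quoted above. The sketch below assumes the ~60x yearly cost drop is a smooth exponential and that releases arrive evenly at two per week; both are simplifications made here, and the function names are hypothetical.

```python
def staleness_factor(days_old, yearly_cost_drop=60.0):
    """How many times too high a days_old cost figure is today.

    Assumes the yearly drop compounds smoothly day by day.
    """
    return yearly_cost_drop ** (days_old / 365.0)

def releases_missed(days_old, releases_per_week=2.0):
    """Inference-software releases shipped since the benchmark was run."""
    return days_old * releases_per_week / 7.0

for days in (7, 30, 90):
    print(days, round(staleness_factor(days), 2), round(releases_missed(days), 1))
```

Under these assumptions a benchmark only a month old already overstates cost by roughly 40 percent.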
Key Insight
A living benchmark is also a moat that looks like a public good. Getting every major cloud to donate $50M+ of hardware (about 15 chip types running daily, with fully public Pareto curves, and $100M+ expected by Patel once TPUs and Trainium are added) only works if the measurer is already trusted enough that no one wants to be caught looking slow. The neutrality is the business model.
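A Pareto curve here means the set of serving configurations that no other configuration beats on both interactivity and throughput at once. The following is a minimal sketch of how such a frontier could be extracted from measured points. The sample measurements are invented for illustration and are not SemiAnalysis data.

```python
def pareto_frontier(points):
    """Keep the (interactivity, throughput) points not dominated by another.

    A point is dominated when some other point is at least as good on both
    axes. Sorting by interactivity (descending) lets one pass suffice.
    """
    frontier = []
    best_throughput = float("-inf")
    for inter, thr in sorted(points, key=lambda p: (-p[0], -p[1])):
        if thr > best_throughput:
            frontier.append((inter, thr))
            best_throughput = thr
    return frontier

# Hypothetical measurements: (tokens/sec per user, total tokens/sec).
measured = [(250, 250), (100, 800), (50, 1000), (80, 600), (40, 900)]
print(pareto_frontier(measured))
```

Configurations such as (80, 600) drop out because another point, (100, 800), is both faster per user and higher in total throughput.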

04 Silicon

NVIDIA vs TPU Is the Wrong Question

No chip wins in isolation — the winner is whichever silicon the model was co-designed for, because different matrix-multiply units and network topologies make the same model great on one and terrible on the other.

the NVLink can only connect 72 GPUs. For Google, their ICI can connect 8,000 chips at super high bandwidth, but you have to pass through other chips to get there because there's no switch

Dylan Patel, SemiAnalysis
Key Insight
Because the matmul unit shape and network topology feed back up into the attention mechanism and expert layout, you literally cannot benchmark the chips in isolation: the model is different on each. That is why Patel says OpenAI's sparse models would be a bad fit for TPUs, while Anthropic's and Google's models would be a bad fit for training on GPUs. The chip debate is downstream of the model.
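The "pass through other chips" point can be illustrated by counting hops. The sketch below assumes a switchless 3D torus with wraparound links and a hypothetical 20 x 20 x 20 shape chosen only because it totals the 8,000 chips in the quote. The source does not give the actual layout, so treat the numbers as illustrative.

```python
def ring_distance(a, b, n):
    """Shortest hop count between positions a and b on a ring of n chips."""
    d = abs(a - b) % n
    return min(d, n - d)

def torus_hops(src, dst, dims):
    """Chip-to-chip hops between two coordinates in a wraparound torus."""
    return sum(ring_distance(s, t, n) for s, t, n in zip(src, dst, dims))

def mean_ring_distance(n):
    """Average ring distance from one chip to every chip (itself included)."""
    return sum(min(d, n - d) for d in range(n)) / n

dims = (20, 20, 20)                                   # hypothetical shape, 8,000 chips
worst_case = sum(n // 2 for n in dims)                # farthest pair of chips
average = sum(mean_ring_distance(n) for n in dims)    # typical pair of chips

print(worst_case, average)
print(torus_hops((0, 0, 0), (19, 0, 0), dims))        # wraparound neighbor
```

In a switched domain of 72 GPUs, by contrast, any pair is one switch traversal apart, which is why communication-heavy layouts such as sparse experts map differently onto the two designs.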

05 Moats

The CUDA Moat Was Never CUDA

Models write kernels well enough that CUDA-as-software is commoditizing, so the real lock-in is that popular open weights are co-designed for GPU shapes — the moat is the models, not the language.

models are just great at coding and all software gets commoditized

Dylan Patel, SemiAnalysis
Key Insight
This quietly inverts a decade of NVIDIA-moat narrative. If the lock-in is co-designed open weights rather than the CUDA language, then the counter-move is symmetric and cheap: Google could ship strong Gemma models that run poorly on GPUs and so pull inference toward TPUs. The moat becomes a race to seed the most-adopted open weights, not to own a programming model.

06 Bottleneck

Memory: The Bottleneck Frozen in Time

Compute keeps leaping while the memory cell itself hasn't fundamentally changed in decades — DRAM is ~40 years old, NAND ~25 — and the next unlock is stacking memory directly onto the compute die.

you stack the memory directly on the chip and that makes your bandwidth explode

Dylan Patel, SemiAnalysis
Key Insight
This is the roofline model showing up in silicon history. Decades of compute scaling against a near-static memory cell is precisely why so much of modern serving is memory-bandwidth-bound, not compute-bound, and why the interesting fixes (memory-on-die, and pushing past the ~1 W/mm² power ceiling) target the bandwidth-limited slope of the roofline rather than its flat compute ceiling. Faster FLOPs stopped being the constraint a while ago.
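The roofline model itself is one line: attainable performance is the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below uses a hypothetical chip (1000 TFLOP/s peak, 4 TB/s of memory bandwidth) and an assumed decode intensity of 2 FLOPs per byte. None of these figures come from the source.

```python
def attainable_tflops(peak_tflops, bandwidth_tb_s, flops_per_byte):
    """Roofline: performance is capped by compute or by memory traffic.

    TB/s multiplied by FLOP/byte gives TFLOP/s, so both terms share units.
    """
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)

PEAK, BANDWIDTH = 1000.0, 4.0    # hypothetical chip
DECODE_INTENSITY = 2.0           # assumed FLOPs per byte for small-batch decode

base = attainable_tflops(PEAK, BANDWIDTH, DECODE_INTENSITY)
faster_compute = attainable_tflops(2 * PEAK, BANDWIDTH, DECODE_INTENSITY)
more_bandwidth = attainable_tflops(PEAK, 2 * BANDWIDTH, DECODE_INTENSITY)

# Intensity at which the workload stops being bandwidth-bound.
ridge_point = PEAK / BANDWIDTH

print(base, faster_compute, more_bandwidth, ridge_point)
```

With these numbers, doubling peak FLOPs changes nothing for decode, while doubling bandwidth doubles delivered performance, which is the argument for stacking memory on the compute die.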

07 Chip Strategy

Every Specialized Chip Risks a Local Minimum

A custom ASIC can be optimal for today's model yet stranded the moment the architecture changes. Because labs don't even know their own architecture a year out, general-purpose silicon keeps an option value that specialization can't.

some people will race to a local minima and then the question is like how do you leap how do you scoot back over to like the absolute minima

Dylan Patel, SemiAnalysis
Key Insight
General-purpose silicon is really a bet on architectural uncertainty. Patel's tell is that labs 'literally don't know what architecture they're going to be doing in a year' — so a five-year ASIC commitment is a wager that the model won't move. That option value is why even Google, the deepest TPU shop, runs three different TPU design programs — Broadcom, MediaTek, and an undisclosed third, each a different architecture — and still rents GPUs for non-Gemini workloads.

08 Economics

The Compute Crunch Is Demand Outrunning Supply

The crunch persists as long as the useful work models can do grows faster than compute does — and while per-token margins stay high, every rented GPU pays for itself, so the buildout keeps accelerating.

if the work that these models can do does not expand faster than the compute capacity then that tide turns

Dylan Patel, SemiAnalysis
Key Insight
This gives the buildout a falsifiable kill-switch. The whole levered, high-growth compute economy holds only while model progress keeps expanding the work AI can profitably do faster than gigawatts come online. The day model progress stalls, the tide 'turns' — which is why Patel keeps probing whether anyone at the labs sees a ceiling, and why they all insist they don't.
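The kill-switch condition can be written as a comparison of two growth rates. This is a toy sketch: demand and supply are in arbitrary units, growth is a yearly multiplier assumed constant, and every number in the example is hypothetical.

```python
import math

def years_until_supply_catches_up(demand, supply, demand_growth, supply_growth):
    """Years until compute supply meets demand, or None if it never does.

    Growth arguments are yearly multipliers (2.0 means doubling each year).
    """
    if supply >= demand:
        return 0.0                 # no crunch to begin with
    if supply_growth <= demand_growth:
        return None                # demand keeps outrunning supply
    return math.log(demand / supply) / math.log(supply_growth / demand_growth)

# Useful work expanding faster than capacity: the crunch persists.
print(years_until_supply_catches_up(2.0, 1.0, demand_growth=3.0, supply_growth=2.0))

# Model progress stalls relative to the buildout: the tide turns.
print(years_until_supply_catches_up(2.0, 1.0, demand_growth=2.0, supply_growth=4.0))
```

The first case returns `None` because the gap widens every year. The second returns a finite horizon, which is the scenario Patel is probing for when he asks the labs about a ceiling.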