The news. On June 24, 2026, OpenAI and Broadcom unveiled Jalapeño, OpenAI's first "Intelligence Processor": a purpose-built ASIC for LLM inference, not a repurposed training accelerator or a general-purpose AI chip. It pairs a single reticle-sized compute chiplet with HBM (not commodity DRAM) to hold high throughput and low latency together, and was co-designed from first design to tape-out in roughly nine months. Engineering samples are already running production workloads in the lab, including GPT-5.3-Codex-Spark, with early testing reporting performance-per-watt "substantially better" than current state-of-the-art (final numbers still being measured). Initial deployment is targeted for end of 2026.

Picture a restaurant kitchen that can cook anything on the menu — pastry, grill, soup, all of it. That flexibility is wonderful, and it is exactly what a general-purpose GPU gives you: thousands of programmable cores that will run any parallel workload you throw at them, from training a model to rendering a game. Jalapeño is that kitchen torn down and rebuilt to cook one dish — LLM inference — and nothing else. The bet is that if you only ever cook one dish, a kitchen shaped around that single dish will cook it faster and far more cheaply than the do-everything kitchen ever could.

So what is the "one dish" actually limited by? Here is the part that surprises people: at decode time, the thing slowing the kitchen down is not the chef's hands — it is the cooks walking ingredients in from a far pantry. When a model generates a token, at small batch sizes it must stream the model's weights out of memory and through the compute units once, while doing comparatively little arithmetic per byte read. That makes single-token decode memory-bandwidth-bound — the roofline tips toward memory, and the math units sit mostly idle, waiting on data. The bottleneck the whole chip is fighting is data movement.
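The memory-bound claim above can be sketched with a tiny roofline calculation. This is a rough illustration, not a model of any real chip: the peak compute rate, the memory bandwidth, the 16-bit weight size, and the function names are all assumptions chosen for the example, and the estimate counts only weight traffic (it ignores KV-cache and activation reads).

```python
# Illustrative roofline sketch. All hardware numbers below are assumptions,
# not published figures for Jalapeño or for any GPU.

PEAK_FLOPS = 200e12     # assumed peak arithmetic rate, FLOP/s
BANDWIDTH = 4e12        # assumed memory bandwidth, bytes/s
BYTES_PER_PARAM = 2.0   # assumed 16-bit weights


def arithmetic_intensity(batch_size, bytes_per_param=BYTES_PER_PARAM):
    """FLOPs performed per byte of weights read in one decode step.

    Weights-only approximation: each parameter is read once per step and
    used for one multiply-add (2 FLOPs) per request in the batch.
    """
    return 2.0 * batch_size / bytes_per_param


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """Roofline: throughput is capped by compute or by bandwidth * intensity."""
    return min(peak_flops, intensity * bandwidth)


# Intensity (FLOPs per byte) at which the two limits meet.
RIDGE_POINT = PEAK_FLOPS / BANDWIDTH

for batch in (1, 4, 64, 256):
    ai = arithmetic_intensity(batch)
    used = attainable_flops(ai) / PEAK_FLOPS
    regime = "memory-bound" if ai < RIDGE_POINT else "compute-bound"
    print(f"batch={batch:>3}  intensity={ai:6.1f} FLOP/byte  "
          f"compute used={used:6.1%}  {regime}")
```

Under these assumed numbers a batch of one uses only a few percent of the available arithmetic, which is the "idle math units" picture in the text. Where the crossover to compute-bound falls depends entirely on the assumed peak rate and bandwidth.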

[Figure: decode timelines across three steps, with each step split into "loading cache (bandwidth)" and "computing (math)" segments. Small batch (1–4 requests): the GPU spends most of its time loading the KV cache from memory, waiting rather than computing. Large batch (64+ requests): the GPU spends most of its time computing, because data loads are amortized across many requests.]

The diagram makes the imbalance concrete: in the bandwidth-bound regime, the pink "moving data" segment dominates and the green "computing" segment is a sliver. Jalapeño's answer is the obvious one once you see the problem — move the pantry next to the stove. It pairs that big compute chiplet with HBM kept physically close, so the costly trip between memory and compute is as short and as fast as the silicon allows. OpenAI says the design was derived from its own measurements of how its models behave at serving time, which is what "co-designed" really means here: the chip is shaped around the bottleneck the company actually observed, not a generic one.

Walk the decode math on a single token (illustrative numbers — OpenAI has not published Jalapeño's figures). Say a model holds 100 GB of weights and the accelerator reads them from memory at 4 TB/s. Generating one token must stream those weights through compute roughly once, so the time is about 100 GB ÷ 4 TB/s = 25 ms — and across that 25 ms the arithmetic units are mostly idle, waiting. Now double the effective memory bandwidth and that 25 ms roughly halves; double the raw compute instead and almost nothing changes. That is the whole reason an inference chip is built around feeding the math units, not stacking more of them — and why the headline metric is performance-per-watt, not peak FLOPs.
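The arithmetic above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the 100 GB and 4 TB/s figures are the article's illustrative numbers, while the peak compute rate, the 2-byte weight size, and the "step time is the slower of memory and compute" model are assumptions added here for the example. `decode_step_seconds` is a hypothetical helper, not part of any library.

```python
def decode_step_seconds(weight_bytes, bandwidth, flops_per_token,
                        peak_flops, batch_size=1):
    """Estimate one decode step as the slower of memory and compute time.

    Assumes weights are streamed once per step and that memory traffic and
    arithmetic overlap perfectly, so the larger of the two sets the pace.
    """
    memory_time = weight_bytes / bandwidth
    compute_time = batch_size * flops_per_token / peak_flops
    return max(memory_time, compute_time)


WEIGHT_BYTES = 100e9   # 100 GB of weights (illustrative, from the text)
BANDWIDTH = 4e12       # 4 TB/s memory bandwidth (illustrative, from the text)
PEAK_FLOPS = 200e12    # assumed peak compute rate, FLOP/s
# Assumed 2-byte weights: 50e9 parameters, 2 FLOPs (multiply-add) each.
FLOPS_PER_TOKEN = 2 * (WEIGHT_BYTES / 2)

baseline = decode_step_seconds(WEIGHT_BYTES, BANDWIDTH,
                               FLOPS_PER_TOKEN, PEAK_FLOPS)
double_bandwidth = decode_step_seconds(WEIGHT_BYTES, 2 * BANDWIDTH,
                                       FLOPS_PER_TOKEN, PEAK_FLOPS)
double_compute = decode_step_seconds(WEIGHT_BYTES, BANDWIDTH,
                                     FLOPS_PER_TOKEN, 2 * PEAK_FLOPS)

print(f"baseline:         {baseline * 1e3:.1f} ms per token")
print(f"double bandwidth: {double_bandwidth * 1e3:.1f} ms per token")
print(f"double compute:   {double_compute * 1e3:.1f} ms per token")
```

With these inputs the baseline comes out at 25 ms, doubling bandwidth gives 12.5 ms, and doubling compute leaves the step at 25 ms, because the assumed arithmetic takes well under a millisecond either way. Real hardware overlaps memory and compute imperfectly, so the `max` model is an optimistic simplification.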

None of this means GPUs are going away. The trade Jalapeño makes is real and one-directional: you give up the GPU's ability to train, to switch to a very different kind of workload, to run the whole range of models and tasks a GPU handles. A custom ASIC only pays off when you run one workload at enormous, sustained scale — which is precisely OpenAI's situation, and precisely why a startup serving a thousand requests a day would still reach for a GPU. The interesting signal is not "ASICs beat GPUs"; it is that LLM inference has become a large and stable enough workload to justify burning a chip for it.

| Chip | Built for | Flexibility | Where it wins |
| --- | --- | --- | --- |
| General-purpose GPU | training + inference + any parallel workload | Highest | The default: runs anything, backed by a mature software ecosystem |
| Repurposed training accelerator | training, also used to serve | High | Strong throughput, but carries training-only hardware that idles during inference |
| Inference ASIC (Jalapeño) | LLM inference only [announcement] | Lowest | Built for top performance-per-watt on its one workload at scale (early results); inference-only, far less flexible |
