What is an inference ASIC like Jalapeño?

An inference ASIC is an Application-Specific Integrated Circuit, silicon built for one kind of job rather than general-purpose computing, made to run (not train) large language models. OpenAI and Broadcom's Jalapeño, unveiled June 24, 2026, is OpenAI's first such chip: a reticle-sized compute chiplet paired with HBM, co-designed around the data-movement bottleneck of serving models at scale. It gives up a GPU's general-purpose flexibility in exchange for higher performance-per-watt on that single workload; early testing reports a substantial improvement, with final numbers still being measured.

Why build a custom inference chip instead of using GPUs?

At decode time, generating a token is usually memory-bandwidth-bound — the chip spends most of its time moving the model's weights out of memory, not doing arithmetic. A general-purpose GPU pays in silicon and power for flexibility that inference never uses. A chip co-designed around the data-movement bottleneck — a large compute chiplet with HBM kept close — can serve the same tokens at substantially better performance-per-watt in early testing (final numbers still being measured), which at OpenAI's scale materially changes serving cost.

How is Jalapeño different from a GPU?

A GPU is general-purpose: thousands of programmable cores that run training, graphics, and any model. Jalapeño is an ASIC built for LLM inference only — it cannot train and is far less flexible than a general-purpose GPU. That is the trade: it loses the GPU's versatility and gains a shorter, faster path between memory and compute, which is what matters when the bottleneck is data movement rather than raw math. A custom ASIC pays off only when you run one workload at enormous, sustained scale.

OpenAI and Broadcom's Jalapeño, a custom inference ASIC — Inference ASIC vs GPU

TL;DR

What is it: Jalapeño, announced by OpenAI and Broadcom on June 24, 2026, is OpenAI's first custom LLM-inference ASIC: a reticle-sized compute chiplet paired with HBM, built to run models rather than train them. The idea it makes concrete is an inference-optimized ASIC versus a general-purpose GPU.
Why it’s needed: At decode time the bottleneck is usually moving data, not doing math, so a chip co-designed around that movement can serve the same tokens using less energy per token. Early testing reports substantially better performance-per-watt (final numbers still being measured), which at OpenAI's scale materially changes serving cost.
vs previous: A general-purpose GPU runs anything — training, graphics, every model — and pays in silicon and power for that flexibility; Jalapeño is hard-wired for inference only, trading the GPU's versatility for a shorter, faster path between memory and compute.

Jargon

ASIC: An Application-Specific Integrated Circuit — silicon built for one kind of job rather than general-purpose computing. Giving up a general processor's flexibility buys speed and energy efficiency on that job. Jalapeño's job is LLM inference.
HBM: High-Bandwidth Memory — stacked DRAM placed physically very close to the compute die so data reaches the math units faster. It is the same fast memory used on high-end GPUs, and it is where the model actually lives during serving.
Inference vs training: Training builds a model's weights; inference runs the finished weights to generate tokens. They stress hardware differently, so a chip can be excellent at one and unable to do the other. Jalapeño is inference-only.
Memory-bandwidth-bound: When a computation spends most of its time waiting for data to arrive from memory rather than doing arithmetic. Single-token decode is the classic example: lots of bytes read, little math per byte.
Tape-out: The moment a chip design is finished and sent to the fab to be manufactured. Jalapeño went from first design to tape-out in roughly nine months, which OpenAI describes as one of the fastest such cycles to date.
Reticle-sized chiplet: The reticle limit is the largest area a lithography scanner can pattern in a single exposure, roughly 26 mm × 33 mm (about 858 mm²). A reticle-sized compute chiplet is about as large as one die can physically get; Jalapeño pairs one such tile with HBM.
Performance-per-watt: Useful work per unit of electrical power. For inference that is throughput (tokens per second) divided by power draw (watts), which works out to tokens per joule. At data-center scale this, not peak speed alone, sets the bill, which is why a custom inference chip targets it directly.
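To make the performance-per-watt entry concrete, here is a minimal sketch of the arithmetic. The two accelerators (`chip_a`, `chip_b`), their throughput, and their power draw are hypothetical round numbers, not figures for Jalapeño or any real GPU.

```python
# Performance-per-watt for inference, expressed as tokens per joule.
# Every figure here is hypothetical and chosen only to make the arithmetic easy.

JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Throughput divided by power; a watt is one joule per second."""
    return tokens_per_second / watts


def kwh_for_tokens(total_tokens: float, tokens_per_second: float, watts: float) -> float:
    """Electrical energy, in kWh, needed to generate total_tokens."""
    joules = total_tokens / tokens_per_joule(tokens_per_second, watts)
    return joules / JOULES_PER_KWH


# Two hypothetical accelerators with the same throughput but different power draw.
chips = {
    "chip_a": {"tokens_per_second": 10_000.0, "watts": 1_000.0},
    "chip_b": {"tokens_per_second": 10_000.0, "watts": 500.0},
}

for name, spec in chips.items():
    efficiency = tokens_per_joule(**spec)
    energy = kwh_for_tokens(1e9, **spec)
    print(f"{name}: {efficiency:.0f} tokens/J, {energy:.1f} kWh per billion tokens")
```

Under these made-up inputs both chips are equally fast, yet the one drawing half the power needs half the energy for the same billion tokens. That is the sense in which efficiency, not peak speed, drives the bill.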

The news. On June 24, 2026, OpenAI and Broadcom unveiled Jalapeño, OpenAI's first "Intelligence Processor": a purpose-built ASIC for LLM inference, not a repurposed training accelerator or a general-purpose AI chip. It pairs a single reticle-sized compute chiplet with HBM (not commodity DRAM) to hold high throughput and low latency together, and was co-designed from first design to tape-out in roughly nine months. Engineering samples are already running production workloads in the lab, including GPT-5.3-Codex-Spark, with early testing reporting performance-per-watt "substantially better" than current state-of-the-art (final numbers still being measured). Initial deployment is targeted for end of 2026.

Picture a restaurant kitchen that can cook anything on the menu — pastry, grill, soup, all of it. That flexibility is wonderful, and it is exactly what a general-purpose GPU gives you: thousands of programmable cores that will run any parallel workload you throw at them, from training a model to rendering a game. Jalapeño is that kitchen torn down and rebuilt to cook one dish — LLM inference — and nothing else. The bet is that if you only ever cook one dish, a kitchen shaped around that single dish will cook it faster and far more cheaply than the do-everything kitchen ever could.

So what is the "one dish" actually limited by? Here is the part that surprises people: at decode time, the thing slowing the kitchen down is not the chef's hands — it is the cooks walking ingredients in from a far pantry. When a model generates a token, at small batch sizes it must stream the model's weights out of memory and through the compute units once, while doing comparatively little arithmetic per byte read. That makes single-token decode memory-bandwidth-bound — the roofline tips toward memory, and the math units sit mostly idle, waiting on data. The bottleneck the whole chip is fighting is data movement.

The diagram makes the imbalance concrete: in the bandwidth-bound regime, the pink "moving data" segment dominates and the green "computing" segment is a sliver. Jalapeño's answer is the obvious one once you see the problem — move the pantry next to the stove. It pairs that big compute chiplet with HBM kept physically close, so the costly trip between memory and compute is as short and as fast as the silicon allows. OpenAI says the design was derived from its own measurements of how its models behave at serving time, which is what "co-designed" really means here: the chip is shaped around the bottleneck the company actually observed, not a generic one.
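The memory-bound claim above can be checked with a rough roofline calculation. The sketch below is a simplification under stated assumptions: the peak compute (1 PFLOP/s), the memory bandwidth (4 TB/s), and 16-bit weights are hypothetical, and the intensity estimate counts only weight reads, ignoring KV-cache and activation traffic.

```python
# Roofline check: is a decode step limited by memory bandwidth or by compute?
# All hardware numbers are hypothetical, not Jalapeño's or any real chip's.

PEAK_FLOPS = 1.0e15     # 1 PFLOP/s, hypothetical
MEM_BANDWIDTH = 4.0e12  # 4 TB/s, hypothetical
BYTES_PER_WEIGHT = 2.0  # 16-bit weights, an assumption


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """FLOPs per byte at which the compute and memory limits meet."""
    return peak_flops / mem_bandwidth


def decode_intensity(batch_size: int, bytes_per_weight: float) -> float:
    """Approximate FLOPs per byte of weights read in one decode step.

    Each weight takes part in one multiply and one add (2 FLOPs) per
    sequence in the batch, and is read from memory once per step.
    """
    return 2.0 * batch_size / bytes_per_weight


def attainable_flops(intensity: float, peak_flops: float, mem_bandwidth: float) -> float:
    """Roofline: performance is capped by compute or by bandwidth x intensity."""
    return min(peak_flops, intensity * mem_bandwidth)


ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
for batch in (1, 8, 64, 512):
    intensity = decode_intensity(batch, BYTES_PER_WEIGHT)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    used = attainable_flops(intensity, PEAK_FLOPS, MEM_BANDWIDTH) / PEAK_FLOPS
    print(f"batch {batch:>3}: {intensity:6.1f} FLOPs/byte, {regime}, "
          f"{used:.1%} of peak compute usable")
```

With these assumed figures the ridge point sits at 250 FLOPs per byte, so a batch of one can keep only a small fraction of the arithmetic units busy, and the step stays memory-bound until the batch grows into the hundreds.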

Walk the decode math on a single token (illustrative numbers — OpenAI has not published Jalapeño's figures). Say a model holds 100 GB of weights and the accelerator reads them from memory at 4 TB/s. Generating one token must stream those weights through compute roughly once, so the time is about 100 GB ÷ 4 TB/s = 25 ms — and across that 25 ms the arithmetic units are mostly idle, waiting. Now double the effective memory bandwidth and that 25 ms roughly halves; double the raw compute instead and almost nothing changes. That is the whole reason an inference chip is built around feeding the math units, not stacking more of them — and why the headline metric is performance-per-watt, not peak FLOPs.
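The same walk-through can be written as a few lines of code. This is a back-of-envelope sketch, not a performance model: it reuses the illustrative 100 GB and 4 TB/s figures from the paragraph above, and adds three assumptions of its own, namely 16-bit weights, about 2 FLOPs per weight per token, and a hypothetical 1 PFLOP/s of peak compute. It also assumes memory traffic and arithmetic overlap perfectly, so a step takes as long as the slower of the two.

```python
# Back-of-envelope decode latency for a single token at batch size 1.
# Illustrative numbers only; none of these are published Jalapeño figures.

WEIGHT_BYTES = 100e9    # 100 GB of weights (from the worked example)
MEM_BANDWIDTH = 4e12    # 4 TB/s (from the worked example)
BYTES_PER_WEIGHT = 2.0  # 16-bit weights, an assumption
PEAK_FLOPS = 1e15       # 1 PFLOP/s, hypothetical

# Roughly one multiply and one add per weight for each generated token.
FLOPS_PER_TOKEN = 2.0 * WEIGHT_BYTES / BYTES_PER_WEIGHT


def decode_step_seconds(weight_bytes: float, mem_bandwidth: float,
                        flops_per_token: float, peak_flops: float) -> float:
    """Time for one decode step: the slower of streaming weights and doing math."""
    memory_time = weight_bytes / mem_bandwidth
    compute_time = flops_per_token / peak_flops
    return max(memory_time, compute_time)


baseline = decode_step_seconds(WEIGHT_BYTES, MEM_BANDWIDTH, FLOPS_PER_TOKEN, PEAK_FLOPS)
double_bandwidth = decode_step_seconds(WEIGHT_BYTES, 2 * MEM_BANDWIDTH,
                                       FLOPS_PER_TOKEN, PEAK_FLOPS)
double_compute = decode_step_seconds(WEIGHT_BYTES, MEM_BANDWIDTH,
                                     FLOPS_PER_TOKEN, 2 * PEAK_FLOPS)

print(f"baseline:         {baseline * 1e3:.1f} ms per token")
print(f"2x bandwidth:     {double_bandwidth * 1e3:.1f} ms per token")
print(f"2x compute:       {double_compute * 1e3:.1f} ms per token")
```

Under these assumptions the arithmetic alone would take about 0.1 ms, far less than the 25 ms spent streaming weights, which is why doubling bandwidth halves the step time while doubling compute leaves it unchanged.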

None of this means GPUs are going away. The trade Jalapeño makes is real and one-directional: you give up the GPU's ability to train, to switch to a very different kind of workload, to run the whole range of models and tasks a GPU handles. A custom ASIC only pays off when you run one workload at enormous, sustained scale — which is precisely OpenAI's situation, and precisely why a startup serving a thousand requests a day would still reach for a GPU. The interesting signal is not "ASICs beat GPUs"; it is that LLM inference has become a large and stable enough workload to justify burning a chip for it.

Chip	Built for	Flexibility	Where it wins
General-purpose GPU	training + inference + any parallel workload	Highest	The default — runs anything, backed by a mature software ecosystem
Repurposed training accelerator	training, also used to serve	High	Strong throughput, but carries training-only hardware that idles during inference
Inference ASIC (Jalapeño)	LLM inference only [announcement]	Lowest	Built for top performance-per-watt on its one workload at scale (early results); inference-only, far less flexible


Related explainers

NVIDIA AI factories — tokens per megawatt — frames serving as a performance-per-watt problem at the datacenter scale Jalapeño is built to win.
AMD Atom — prefill/decode disaggregation — another hardware answer to the fact that prefill and decode stress the chip in opposite ways.
Blackwell on MLPerf 6.0 — strong scaling — the general-purpose GPU side of the same inference-efficiency race.
Jetson Thor — edge Blackwell — purpose-built inference silicon at the opposite end of the scale, the edge.
