Running AI models is starting to feel a lot like a game of memory. When people talk about the cost of AI infrastructure, the conversation usually centers on Nvidia GPUs — and for good reason. But increasingly, memory itself is becoming just as critical. In fact, as hyperscale companies prepare to pour billions into new data centers, the price of DRAM chips has jumped roughly sevenfold over the past year.
At the same time, there’s a growing focus on how memory is orchestrated. Making sure the right data reaches the right AI agent at the right time is becoming an art — and a science. Companies that master this can handle the same queries with fewer tokens. In a space where efficiency often makes the difference between profit and loss, that’s a huge competitive advantage.
Semiconductor analyst Doug O’Laughlin dives into this topic on his Substack, chatting with Val Bercovici, chief AI officer at Weka. Both come from a semiconductor background, so the discussion leans heavily into the hardware side of things. But the implications for AI software are massive.
One particularly striking insight comes from Bercovici’s take on Anthropic’s prompt-caching documentation. He notes that it has become surprisingly complex:
“The tell is if we go to Anthropic’s prompt caching pricing page. It started off as a very simple page six or seven months ago, especially as Claude Code was launching — just ‘use caching, it’s cheaper.’ Now it’s an encyclopedia of advice on exactly how many cache writes to pre-buy. You’ve got 5-minute tiers, which are very common across the industry, or 1-hour tiers — and nothing above. That’s a really important tell. Then of course you’ve got all sorts of arbitrage opportunities around the pricing for cache reads based on how many cache writes you’ve pre-purchased.”
The main idea here is that Anthropic holds your cached prompt for only a limited window: you can pay for a short 5-minute tier or a longer 1-hour one. Reading data that's still in the cache is far cheaper than reprocessing it from scratch, so smart memory management can save a lot of money. But there's a catch: every new piece of data added to a query can bump something else out of the cache.
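To make the tiers concrete, here is a minimal sketch in Python of how a cache-enabled request is structured and what the pricing math looks like. The `cache_control` field follows Anthropic's documented Messages API format; the price multipliers (1.25x base input for a 5-minute cache write, 2x for a 1-hour write, roughly 0.1x for a cache read) reflect Anthropic's published pricing at the time of writing and may change, and the model name is just an example.

```python
# Sketch of Anthropic-style prompt caching, with illustrative price multipliers.
# Multipliers reflect Anthropic's published pricing at the time of writing
# (cache writes: 1.25x base input for 5-minute TTL, 2x for 1-hour;
#  cache reads: ~0.1x base input) and may change.

BASE_INPUT = 1.0                        # relative cost of one uncached input token
WRITE_MULT = {"5m": 1.25, "1h": 2.0}    # cache-write premium per tier
READ_MULT = 0.1                         # cache-read discount

def build_request(system_context: str, user_message: str, ttl: str = "5m") -> dict:
    """Assemble a Messages API payload whose system prompt is marked cacheable."""
    cache_control = {"type": "ephemeral"}
    if ttl == "1h":
        cache_control["ttl"] = "1h"     # opts into the 1-hour tier
    return {
        "model": "claude-sonnet-4-5",   # example model name
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_context,
             "cache_control": cache_control}  # everything up to here is cached
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

def relative_cost(prompt_tokens: int, n_queries: int, ttl: str = "5m") -> tuple:
    """Compare n_queries reusing one cached prefix vs. no caching at all."""
    cached = (WRITE_MULT[ttl] * prompt_tokens              # first call writes the cache
              + READ_MULT * prompt_tokens * (n_queries - 1))  # later calls read it
    uncached = BASE_INPUT * prompt_tokens * n_queries
    return cached, uncached
```

Under these assumed multipliers, a 100,000-token prefix reused across ten calls inside the 5-minute window costs 215,000 cost units versus 1,000,000 without caching, roughly a 4.6x saving. That's the arbitrage Bercovici is pointing at: the break-even shifts depending on which write tier you pre-buy and how often you actually hit the cache before it expires.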
This might sound technical, but the takeaway is simple: memory management in AI is becoming a major differentiator. Companies that get it right are likely to pull ahead.
There’s still plenty of room to innovate. Back in October, I wrote about a startup called Tensormesh, which is tackling cache optimization — one layer in the larger memory stack. But opportunities exist throughout the system. Lower in the stack, for example, engineers are figuring out when to use DRAM versus HBM in data centers, a decision that affects speed and cost. Higher up, end users are experimenting with how to structure multiple AI models (or “swarms”) to make the most of shared memory.
The benefits are clear: better memory orchestration means fewer tokens used, cheaper inference, and more efficient AI. At the same time, model architectures themselves are improving at processing tokens, further driving costs down. As server costs drop, applications that currently seem unprofitable could soon become viable — opening the door to new AI-driven businesses and innovations.