KV Cache
A memory-speed tradeoff in transformer inference: store previous keys and values to generate faster, while paying in memory.
Short notes on technical concepts, mental models, and engineering principles that shape how I think and build.
A memory-speed tradeoff in transformer inference: store previous keys and values to generate faster, while paying in memory.
Use a small draft model to guess tokens and a larger model to verify them. The idea is simple: spend less time waiting on the expensive model.
Longer context is not the same as better memory. The useful question is what the model can actually retrieve, weigh, and use.
AI systems change as prompts, users, models, and data change. Evaluation has to be continuous, not a one-time report.
Batching improves hardware utilization, but it also changes latency, queueing, and cost. Empty seats on the bus still cost money.
A practical way to reason about whether a workload is limited by compute or memory bandwidth.
A reminder that noisy individual measurements can still produce stable aggregate behavior.
When direct sampling is hard, build a process that wanders through the right distribution over time.
A measure of uncertainty, disorder, and surprise. Useful in information theory, thermodynamics, and everyday life.
Strip a problem down to what must be true, then rebuild from there instead of inheriting old assumptions.
Choose predictable tools until the problem clearly earns something novel. Novelty should pay rent.
Good systems are often built from narrow contracts. The smaller the surface area, the easier it is to reason about change.
If you cannot see what the system is doing, clever fixes are mostly guesses.
Documentation, tests, automation, and clean abstractions compound quietly. They make later work cheaper.
The cost of a choice is the best alternative you gave up, not just the money or time you spent.
Ask whether the next hour, next dollar, or next unit is worth it from where you are now.
Instead of asking how to succeed, ask what would guarantee failure and avoid that first.
When uncertainty is high, make the next experiment cheap enough that learning is the main outcome.