What CoffeeBench Is (How It Works)
CoffeeBench's supply-chain structure (6 firms, 90 days)
The evaluated AI runs one roaster; the other five firms are run by a fixed reference agent. Source: Sakana AI / arXiv:2606.16613.
CoffeeBench is a benchmark, released by Tokyo-based AI company Sakana AI in June 2026, that measures the "management ability" of AI agents (AI that decides and acts on its own). Below we work through what it is, its six-firm structure, and how it differs from earlier benchmarks.
What CoffeeBench Is (a 90-Day Coffee-Business Simulation)
CoffeeBench has an AI run a virtual coffee roaster for 90 days and scores it by the cumulative net income it earns, that is, the profit left after costs are subtracted from revenue. The AI under test has to manage its own cash, inventory, pricing and dealings with trading partners, and is judged on how much profit it builds up over the 90 days. Unlike conventional tests that pose a single question and check the answer, it probes whether a model can carry out months of connected business decisions consistently.
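The scoring rule can be sketched in a few lines of Python. This is only an illustration of "revenue minus costs, summed over the days"; the `DayResult` type, its field names and the sample figures are assumptions made for the example and do not come from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class DayResult:
    revenue: float  # money earned from sales that day (hypothetical field)
    costs: float    # all costs booked that day (hypothetical field)


def cumulative_net_income(days: list[DayResult]) -> float:
    """Sum of (revenue - costs) over every simulated day."""
    return sum(d.revenue - d.costs for d in days)


# Illustrative numbers only, not taken from the paper.
sample = [DayResult(120.0, 80.0), DayResult(0.0, 30.0), DayResult(200.0, 110.0)]
print(cumulative_net_income(sample))  # 40 - 30 + 90 = 100.0
```

A day with no sales still books costs, which is why the second day in the sample pulls the total down.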
"two farmers, two roasters, and two retailers autonomously operate their businesses over a 90-day simulation, each seeking to maximize cumulative net income" — from the Abstract
"Business management" was chosen because it demands long-horizon decisions spanning months or even years — a domain where an AI agent's true ability shows. The official blog likewise notes that running a business is a promising domain for LLM agents precisely because it requires long-horizon decision-making.
The Six-Firm Supply Chain and What Is Evaluated
CoffeeBench is set in an economy of six firms modeled on the coffee supply chain. Two farmers, two roasters and two retailers buy and sell beans and roasted beans on a shared market. The AI under test controls only one roaster (Roaster A), while the remaining five firms are run by a fixed reference agent (Claude Sonnet 4.6). This lets the experiment swap out only the model within the same environment and compare management decisions on a like-for-like basis.
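The "swap only one controller" design might be pictured as follows. This is a rough sketch, not the benchmark's actual configuration: the `Firm` type and `build_economy` helper are hypothetical, and every firm label other than "Roaster A" is invented for the example.

```python
from dataclasses import dataclass

REFERENCE_AGENT = "Claude Sonnet 4.6"  # fixed reference agent named in the article


@dataclass(frozen=True)
class Firm:
    name: str        # label such as "Roaster A" (others are hypothetical)
    role: str        # "farmer", "roaster" or "retailer"
    controller: str  # model that drives this firm


def build_economy(evaluated_model: str) -> list[Firm]:
    """Two firms per role; only Roaster A is run by the evaluated model."""
    firms = []
    for role in ("farmer", "roaster", "retailer"):
        for suffix in ("A", "B"):
            name = f"{role.capitalize()} {suffix}"
            controller = evaluated_model if name == "Roaster A" else REFERENCE_AGENT
            firms.append(Firm(name, role, controller))
    return firms


economy = build_economy("GPT-5.5")
for firm in economy:
    print(firm.name, "->", firm.controller)
```

Calling `build_economy` with a different model name changes exactly one entry, which is what makes the comparison like-for-like.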
"The evaluated model controls one coffee roaster, while the remaining firms are controlled by fixed reference agents." — from the Abstract
The actions (tools) each firm can use are split by role. All firms share actions such as posting listings, making offers, accepting offers, sending and reading messages, and paying invoices, and on top of that farmers can produce, roasters can roast, and retailers can set retail prices. The AI combines these to run its own purchasing, inventory, sales and cash flow.
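One way to model the role split is a shared tool set plus a role-specific extra. The identifiers below are hypothetical names chosen for the example, not the benchmark's real tool names, and the shared list covers only the actions mentioned above, which may not be exhaustive.

```python
# Shared actions listed in the text; identifiers are hypothetical.
SHARED_TOOLS = frozenset({
    "post_listing", "make_offer", "accept_offer",
    "send_message", "read_messages", "pay_invoice",
})

# Role-specific actions; identifiers are hypothetical.
ROLE_TOOLS = {
    "farmer": frozenset({"produce"}),
    "roaster": frozenset({"roast"}),
    "retailer": frozenset({"set_retail_price"}),
}


def tools_for(role: str) -> frozenset[str]:
    """Return every tool a firm of the given role may call."""
    return SHARED_TOOLS | ROLE_TOOLS[role]


print(sorted(tools_for("roaster")))
```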
How It Differs from Existing Benchmarks (a Multi-Agent Economy)
What is new about CoffeeBench is that the stage is a "multi-agent economy" in which several AIs act at once. Many earlier benchmarks only had a single AI interact with a passive environment, but a real economy is a place where multiple players negotiate and transact, and CoffeeBench reproduces that structure. Because the best move changes with how others act, the test probes not just the ability to find an answer alone, but the ability to deal with other firms.
Prices are not fixed in advance; they move with each firm's negotiations and with supply and demand. The AI under test therefore has to work out for itself how to negotiate to buy low and sell high, and when to run promotions to clear inventory. For background on where generative AI as a whole sits and how the models relate to one another, the explainer on what Claude is (Anthropic's generative AI) is a useful companion read.
CoffeeBench Benchmark Results (AI Model Comparison)
90-day cumulative net income of major AI models (CoffeeBench)
Bars show cumulative net income (mean of 3 runs). $0 is centered; right is profit, left is loss. Source: Sakana AI / arXiv:2606.16613 (Table 2).
CoffeeBench evaluated several frontier models, and the results are published in the paper. Here we look at the net-income ranking, a failure that struck one particular model, and the relationship between how much an agent acts and how well it does.
Net-Income Ranking by Model (GPT-5.5 First)
According to Table 2 of the paper, every model beat the passive baseline that takes no action, and most ended with positive net income. In 90-day cumulative net income, GPT-5.5 came first at +$3,109, followed by Claude Opus 4.7 (+$2,782) and Claude Sonnet 4.6 (+$2,236) — all averages of three runs. Gemini 3.1 Pro (+$1,695), GLM-5.1 (+$1,597) and Kimi K2.6 (+$454) also stayed in the black.
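For readers who want the figures in one place, the short script below sorts the net-income values quoted in this article into a ranking. The numbers are the ones reported here (including the Claude Haiku 4.5 loss discussed in the next section); the script itself is just a convenience sketch and is not part of the benchmark.

```python
# 90-day cumulative net income in USD (mean of 3 runs), as quoted in this article.
net_income = {
    "GPT-5.5": 3109,
    "Claude Opus 4.7": 2782,
    "Claude Sonnet 4.6": 2236,
    "Gemini 3.1 Pro": 1695,
    "GLM-5.1": 1597,
    "Kimi K2.6": 454,
    "Claude Haiku 4.5": -630,
}

# Highest net income first.
ranking = sorted(net_income.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, income) in enumerate(ranking, start=1):
    print(f"{rank}. {model:<18} {income:+,d}")

# Models that ended with a loss.
in_the_red = [model for model, income in net_income.items() if income < 0]
print("Finished with a loss:", in_the_red)
```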
"all models outperform a passive baseline that takes no actions, with most achieving positive net income." — from the Abstract
What separated the top from the bottom was how proactively a model dealt with the other firms. The better performers communicated frequently with farmers and retailers and pushed price negotiations and promotions, while the weaker performers tended to communicate less. Management is not solo work but a game played with others, and that showed up in the numbers.
Claude Haiku 4.5's "Idle Drift" Problem
The standout result was Claude Haiku 4.5's failure. It was the only model to finish in the red (−$630), and the cause was a phenomenon the authors named "idle drift." Haiku 4.5 kept producing coherent analysis and plans internally, yet after a few dozen days it stopped taking real business actions and simply kept advancing to the next day. On average, 40 of the 90 days were empty days with no meaningful action.
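If you want to check your own agent logs for the same pattern, a simple counter is enough to start with. The sketch below assumes a log format of one list of action names per day and a hypothetical `advance_day` action. It treats a day as idle when nothing other than advancing the clock happened, which is an assumption and may not match how the paper defines a meaningful action.

```python
ADVANCE = "advance_day"  # hypothetical name for the "go to next day" action


def count_idle_days(actions_by_day: list[list[str]]) -> int:
    """Count days whose only actions (if any) were advancing the clock."""
    return sum(1 for actions in actions_by_day
               if all(a == ADVANCE for a in actions))


def longest_idle_streak(actions_by_day: list[list[str]]) -> int:
    """Length of the longest run of consecutive idle days."""
    longest = current = 0
    for actions in actions_by_day:
        if all(a == ADVANCE for a in actions):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Made-up five-day log for illustration.
log = [
    ["read_messages", "make_offer", ADVANCE],
    [ADVANCE],
    [],
    ["roast", ADVANCE],
    [ADVANCE],
]
print(count_idle_days(log), longest_idle_streak(log))  # 3 2
```

A long streak late in the run is the signature to look for, since idle drift as described here sets in after the first few dozen days.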
"Claude Haiku 4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans." — from the Abstract
A failure of "thinking but not acting" points to a practical pitfall when you hand an AI agent a long task. A model that looks capable in short exchanges can fail to sustain action over months of connected decisions and simply grind to a halt.
Tool Calls Are About Quality, Not Quantity
A second takeaway is that acting more does not translate directly into better results. Kimi K2.6 made roughly as many tool calls as the top models, yet its net income did not keep pace. What mattered was not the number of operations but their substance: which decision to make, and when. Gemini 3.1 Pro, meanwhile, sent few messages of its own but read incoming ones frequently, a more reactive style, which shows that behavioral patterns also varied from model to model.
What CoffeeBench Means, and How to Read It
CoffeeBench's market rules (key settings)
Finally, we put together what CoffeeBench means within AI evaluation, what to watch for when reading the results, and what it implies for users in Japan and beyond.
Measuring Management as a Long Task (Co-Developed with KPMG)
CoffeeBench was developed by Sakana AI together with the audit firm KPMG AZSA. Drawing on accounting and management expertise, it builds in rules close to real commerce (deferred net-30 trade credit, inventory decay, daily fixed costs), which sets it apart from a mere game. As the figure above shows, a firm goes bankrupt the moment its cash turns negative, so the AI has to watch cash flow as well as immediate sales.
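A toy ledger shows why these rules make cash flow matter. The sketch below is a simplification under stated assumptions: every rate is a placeholder (the article gives no actual values), the `FirmLedger` class is hypothetical, and invoices are settled automatically on their due date, whereas in the benchmark paying an invoice is an action the agent takes.

```python
from dataclasses import dataclass, field

# Placeholder parameters; the real values are not given in the article.
DAILY_FIXED_COST = 10.0
DECAY_RATE = 0.01   # fraction of inventory lost per day
CREDIT_DAYS = 30    # net-30: an invoice falls due 30 days after purchase


@dataclass
class FirmLedger:
    cash: float
    inventory: float = 0.0
    payables: dict[int, float] = field(default_factory=dict)  # due day -> amount
    bankrupt: bool = False

    def buy_on_credit(self, day: int, qty: float, unit_price: float) -> None:
        """Receive goods now; the invoice is due CREDIT_DAYS later."""
        due = day + CREDIT_DAYS
        self.inventory += qty
        self.payables[due] = self.payables.get(due, 0.0) + qty * unit_price

    def end_of_day(self, day: int) -> None:
        """Apply fixed costs, settle invoices due today, decay stock."""
        self.cash -= DAILY_FIXED_COST
        self.cash -= self.payables.pop(day, 0.0)  # simplification: auto-pay
        self.inventory *= 1.0 - DECAY_RATE
        if self.cash < 0:
            self.bankrupt = True


firm = FirmLedger(cash=500.0)
firm.buy_on_credit(day=1, qty=100.0, unit_price=4.0)  # 400.0 due on day 31
for day in range(1, 31):
    firm.end_of_day(day)
cash_before_due = firm.cash            # 200.0: the firm still looks healthy
solvent_on_day_30 = not firm.bankrupt
firm.end_of_day(31)                    # the invoice lands and cash goes negative
print(cash_before_due, solvent_on_day_30, firm.cash, firm.bankrupt)
```

In this toy run the firm makes no sales, so fixed costs erode its cash while the deferred invoice waits. It appears solvent for 30 days and fails on the 31st, which is the kind of delayed consequence an agent has to plan around.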
A yardstick that can quantitatively measure "business judgment over months" is valuable for thinking about how far an AI agent can be trusted with real work. It is also notable that a Japanese AI company has produced research from an angle still rare in the English-speaking world. For Sakana AI's multi-agent technology more broadly, reading it alongside the explainer on Sakana Fugu helps bring the company's direction into focus.
What to Keep in Mind When Reading the Benchmark
A few caveats help avoid misreading the numbers. Results are the mean of three runs per model, the spread (standard deviation) is large, and each run is expensive, costing over $200 and taking over five hours. The ranking also reflects a setting that pits each model against fixed reference agents, so a different lineup of opponents could shift the outcome. It pays not to judge one model superior on the strength of a single win or loss.
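To see why three runs with a large spread call for caution, the sketch below computes a mean and sample standard deviation with Python's `statistics` module and checks whether two models' mean ± one standard deviation bands overlap. The per-run values are placeholders invented for the example, not figures from the paper, and the overlap check is a rough heuristic rather than a statistical test.

```python
from statistics import mean, stdev


def summarize(runs: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) for one model's runs."""
    return mean(runs), stdev(runs)


def bands_overlap(a: list[float], b: list[float]) -> bool:
    """True if the mean +/- one standard deviation bands of a and b overlap."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    return abs(mean_a - mean_b) <= sd_a + sd_b


# Placeholder per-run net incomes, NOT taken from the paper.
model_x = [1000.0, 2000.0, 3000.0]
model_y = [1500.0, 2500.0, 3500.0]
print(summarize(model_x), summarize(model_y), bands_overlap(model_x, model_y))
```

With only three runs, a gap of a few hundred dollars between two means can sit well inside the run-to-run noise.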
On that point, CoffeeBench publishes its code and the full trajectories of every agent — reasoning traces, tool calls and inter-firm messages — so third parties can inspect the details. Releasing the complete experimental record to support reproducibility is an important feature that raises the credibility of the results. Rather than taking the figures at face value, you can trace them back to the primary evidence when needed.
What CoffeeBench Tells Us (Summary)
What CoffeeBench shows is that an AI agent's real ability cannot be measured by "cleverness in short exchanges" alone. Some models, like GPT-5.5 and Claude Opus 4.7, handle long-horizon business decisions steadily, while others, like Haiku 4.5, stop acting partway through, and the longer the task, the wider the gap. When you want to hand ongoing work to an AI, the useful lens is not one-shot answer accuracy but this kind of long-term consistency. When you put your own documents in front of an AI to test it, converting them to Markdown first keeps headings and table structure intact and tends to improve accuracy.



