A multi-agent economic benchmark from Sakana AI and KPMG AZSA that has an AI agent run a virtual coffee roaster for 90 days and scores it by cumulative net income. Two farmers, two roasters and two retailers — six firms in all — trade with one another.

Q: Why did Claude Haiku 4.5 end up in the red?

Because of a failure the authors named idle drift. Although it produced coherent analysis and plans internally, partway through the run it stopped taking real business actions and only advanced to the next day, leaving it as the only model with a loss (−$630).

What Is CoffeeBench? Sakana AI's AI Management Benchmark

Q: Which AI model performed best on CoffeeBench?

In Table 2 of the paper, GPT-5.5 led with a 90-day cumulative net income of +$3,109, followed by Claude Opus 4.7 (+$2,782) and Claude Sonnet 4.6 (+$2,236), each averaged over three runs. The passive baseline that takes no action ran a loss.

Article summary

CoffeeBench is a multi-agent economic benchmark that has an AI run a virtual coffee roaster for 90 days and scores it by cumulative net income. It was co-developed by Sakana AI and the audit firm KPMG AZSA.
Unlike the usual setup of a single AI facing a passive environment, six firms (two farmers, two roasters, two retailers) communicate, negotiate and trade while competing for profit.
The results put GPT-5.5 first with a net income of +$3,109, followed by Claude Opus 4.7 (+$2,782) and Claude Sonnet 4.6 (+$2,236), each averaged over three runs.
Higher-performing models communicated more actively with others to negotiate prices and run promotions; for tool calls, quality mattered more than quantity.
Claude Haiku 4.5 suffered "idle drift," stopping real actions partway through and ending as the only model in the red (−$630).
A single run costs over $200 and takes over five hours, and the code plus full agent trajectories are public.
It attempts to measure how well an AI sustains business decision-making over months, and is notable as original research from a Japanese AI company.

What CoffeeBench Is (How It Works)

CoffeeBench's supply-chain structure (6 firms, 90 days)

The evaluated AI runs one roaster; the other five firms are run by a fixed reference agent. Source: Sakana AI / arXiv:2606.16613.

Farmers ×2: produce coffee beans (produce_item)

↓ sell beans

Roasters ×2: buy and roast beans, then sell wholesale (roast); one is run by the evaluated model

↓ sell roasted beans

Retailers ×2: sell to consumers and set prices (set_retail_price)

↓ sell products

Consumers: simulated demand makes purchases

CoffeeBench is a benchmark, released by Tokyo-based AI company Sakana AI in June 2026, that measures the "management ability" of AI agents (AI that decides and acts on its own). Below we work through what it is, its six-firm structure, and how it differs from earlier benchmarks.

What CoffeeBench Is (a 90-Day Coffee-Business Simulation)

CoffeeBench has an AI run a virtual coffee roaster across 90 days and scores it by the cumulative net income it earns — the final profit after costs are subtracted from revenue. The AI under test has to manage its own cash, inventory, pricing and dealings with trading partners, and is judged on how much profit it can build up over the 90 days. Unlike conventional tests that answer a single question, it probes whether a model can carry out months of connected business decisions consistently.
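The scoring rule described above can be sketched in a few lines. This is only an illustration of "revenue minus costs, summed over the run": the `DayLedger` structure and the sample figures are hypothetical, and the paper's actual accounting rules may be more detailed.

```python
from dataclasses import dataclass


@dataclass
class DayLedger:
    """One simulated day for a single firm (hypothetical structure)."""
    revenue: float  # money earned from sales that day
    costs: float    # everything spent that day (purchases, fixed costs, penalties)


def cumulative_net_income(days: list[DayLedger]) -> float:
    """Sum of (revenue - costs) over every day in the run."""
    return sum(day.revenue - day.costs for day in days)


# Made-up three-day ledger, purely for illustration.
ledger = [
    DayLedger(revenue=0.0, costs=30.0),    # nothing sold, fixed cost only
    DayLedger(revenue=120.0, costs=80.0),  # bought beans, sold roasted beans
    DayLedger(revenue=90.0, costs=30.0),
]
print(cumulative_net_income(ledger))  # 70.0
```

A day with no sales still incurs costs, which is why a run of inactive days pulls the score down rather than leaving it flat.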

Source: CoffeeBench paper (arXiv:2606.16613)

"two farmers, two roasters, and two retailers autonomously operate their businesses over a 90-day simulation, each seeking to maximize cumulative net income" — from the Abstract

"Business management" was chosen because it demands long-horizon decisions spanning months or even years — a domain where an AI agent's true ability shows. The official blog likewise notes that running a business is a promising domain for LLM agents precisely because it requires long-horizon decision-making.

The Six-Firm Supply Chain and What Is Evaluated

CoffeeBench is set in an economy of six firms modeled on the coffee supply chain. Two farmers, two roasters and two retailers buy and sell beans and roasted beans on a shared market. The AI under test controls only one roaster (Roaster A), while the remaining five firms are run by a fixed reference agent (Claude Sonnet 4.6). This lets the experiment swap out only the model within the same environment and compare management decisions on a like-for-like basis.

Source: CoffeeBench paper (arXiv:2606.16613)

"The evaluated model controls one coffee roaster, while the remaining firms are controlled by fixed reference agents." — from the Abstract

The actions (tools) each firm can use are split by role. All firms share actions such as posting listings, making offers, accepting offers, sending and reading messages, and paying invoices, and on top of that farmers can produce, roasters can roast, and retailers can set retail prices. The AI combines these to run its own purchasing, inventory, sales and cash flow.
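The role split described above can be modeled as a shared tool set plus one role-specific set per firm type. In this sketch only `produce_item`, `roast` and `set_retail_price` come from the article's diagram; the shared tool names are hypothetical placeholders for the actions listed in the paragraph, and the benchmark's real interface may differ.

```python
# Shared actions available to every firm. These names are hypothetical
# stand-ins for: posting listings, making and accepting offers,
# sending and reading messages, and paying invoices.
SHARED_TOOLS = frozenset({
    "post_listing", "make_offer", "accept_offer",
    "send_message", "read_messages", "pay_invoice",
})

# Role-specific actions (names taken from the supply-chain diagram).
ROLE_TOOLS = {
    "farmer": frozenset({"produce_item"}),
    "roaster": frozenset({"roast"}),
    "retailer": frozenset({"set_retail_price"}),
}


def allowed_tools(role: str) -> frozenset[str]:
    """Return the shared tools plus the tools specific to this role."""
    if role not in ROLE_TOOLS:
        raise ValueError(f"unknown role: {role!r}")
    return SHARED_TOOLS | ROLE_TOOLS[role]


def can_call(role: str, tool: str) -> bool:
    """True if a firm with this role is allowed to use the given tool."""
    return tool in allowed_tools(role)


print(can_call("roaster", "roast"))             # True
print(can_call("roaster", "set_retail_price"))  # False
```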

How It Differs from Existing Benchmarks (a Multi-Agent Economy)

What is new about CoffeeBench is that the stage is a "multi-agent economy" in which several AIs act at once. Many earlier benchmarks only had a single AI interact with a passive environment, but a real economy is a place where multiple players negotiate and transact, and CoffeeBench reproduces that structure. Because the best move changes with how others act, the test probes not just the ability to find an answer alone, but the ability to deal with other firms.

Prices are not fixed in advance; they move with each firm's negotiations and with supply and demand. So the AI under test must work out for itself how to negotiate to buy low and sell high, and how to run promotions to clear inventory. For where generative AI as a whole sits and how it relates to other models, the explainer on what Claude is (Anthropic's generative AI) is also a useful companion read.

CoffeeBench Benchmark Results (AI Model Comparison)

90-day cumulative net income of major AI models (CoffeeBench)

Bars show cumulative net income (mean of 3 runs). $0 is centered; right is profit, left is loss. Source: Sakana AI / arXiv:2606.16613 (Table 2).

GPT-5.5: +$3,109

Claude Opus 4.7: +$2,782

Claude Sonnet 4.6: +$2,236

Gemini 3.1 Pro: +$1,695

GLM-5.1: +$1,597

Kimi K2.6: +$454

Claude Haiku 4.5: −$630

Passive baseline: −$2,765

CoffeeBench evaluated several frontier models, and the results are published in the paper. Here we look at the net-income ranking, a failure that struck one particular model, and the relationship between how much an agent acts and how well it does.

Net-Income Ranking by Model (GPT-5.5 First)

According to Table 2 of the paper, every model beat the passive baseline that takes no action, and most ended with positive net income. In 90-day cumulative net income, GPT-5.5 came first at +$3,109, followed by Claude Opus 4.7 (+$2,782) and Claude Sonnet 4.6 (+$2,236) — all averages of three runs. Gemini 3.1 Pro (+$1,695), GLM-5.1 (+$1,597) and Kimi K2.6 (+$454) also stayed in the black.

Source: CoffeeBench paper (arXiv:2606.16613)

"all models outperform a passive baseline that takes no actions, with most achieving positive net income." — from the Abstract

What separated the top from the bottom was how proactively a model dealt with others. The better performers communicated frequently with farmers and retailers, pushing price negotiations and promotions, while weaker performers tended to communicate less. The plain fact that management is not solo work but a game played with others showed up in the numbers.
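To make the comparison with the passive baseline concrete, the figures quoted in this section can be sorted and expressed as a margin over the baseline. The numbers are the ones reported in this article (Table 2 means); the function name and the "margin" column are just an illustrative way to read them.

```python
# 90-day cumulative net income in USD, mean of 3 runs, as quoted in this article.
NET_INCOME = {
    "GPT-5.5": 3109,
    "Claude Opus 4.7": 2782,
    "Claude Sonnet 4.6": 2236,
    "Gemini 3.1 Pro": 1695,
    "GLM-5.1": 1597,
    "Kimi K2.6": 454,
    "Claude Haiku 4.5": -630,
}
PASSIVE_BASELINE = -2765  # baseline that takes no actions


def ranking_with_margin(results: dict[str, int], baseline: int) -> list[tuple[str, int, int]]:
    """Rank models by net income and attach each one's margin over the baseline."""
    ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
    return [(name, income, income - baseline) for name, income in ranked]


rows = ranking_with_margin(NET_INCOME, PASSIVE_BASELINE)
for name, income, margin in rows:
    print(f"{name:<18} {income:>+7,} {margin:>+7,}")
```

Read this way, even Claude Haiku 4.5 finishes $2,135 ahead of doing nothing, while GPT-5.5 finishes $5,874 ahead.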

Claude Haiku 4.5's "Idle Drift" Problem

The standout among the results was Claude Haiku 4.5's failure. It was the only model to finish in the red (−$630), and the cause was a phenomenon the authors named "idle drift." Haiku 4.5 was producing coherent analysis and plans internally, yet once a few dozen days had passed it stopped taking real business actions and just kept advancing to the next day. On average, 40 of the 90 days became empty days with no meaningful action.

Source: CoffeeBench paper (arXiv:2606.16613)

"Claude Haiku 4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans." — from the Abstract

A failure of "thinking but not acting" points to a practical pitfall when you hand an AI agent a long task. A model that looks capable in short exchanges can fail to sustain action over months of connected decisions and simply grind to a halt.
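Because the trajectories are public, idle drift is the kind of pattern one could look for directly in a log of tool calls. The sketch below is one possible check, not the authors' method: the trajectory format (day number mapped to a list of tool names) and the `next_day` tool name are assumptions, and the paper may define an "empty day" differently.

```python
ADVANCE_DAY = "next_day"  # hypothetical name for the "advance to the next day" tool


def idle_days(trajectory: dict[int, list[str]]) -> list[int]:
    """Days on which the agent called nothing except the advance-day tool.

    A day with no recorded calls at all also counts as idle.
    """
    return sorted(
        day for day, calls in trajectory.items()
        if all(call == ADVANCE_DAY for call in calls)
    )


def longest_idle_streak(idle: list[int]) -> int:
    """Length of the longest run of consecutive idle days (input must be sorted)."""
    best = current = 0
    previous = None
    for day in idle:
        if previous is not None and day == previous + 1:
            current += 1
        else:
            current = 1
        best = max(best, current)
        previous = day
    return best


# Made-up six-day trajectory for illustration.
trajectory = {
    1: ["read_messages", "make_offer", "next_day"],
    2: ["roast", "next_day"],
    3: ["next_day"],
    4: ["next_day"],
    5: ["post_listing", "next_day"],
    6: ["next_day"],
}
idle = idle_days(trajectory)
print(idle)                       # [3, 4, 6]
print(longest_idle_streak(idle))  # 2
```

Tracking the streak length as well as the count separates an agent that occasionally waits from one that has stopped acting altogether.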

Tool Calls Are About Quality, Not Quantity

A second takeaway is that acting more does not translate directly into better results. Kimi K2.6 made roughly as many tool calls as the top models, yet its net income did not follow. What mattered was not the number of operations but the substance — which decision to make, and when. Gemini 3.1 Pro, meanwhile, sent few messages of its own but read incoming ones frequently, a more reactive style — showing that the shape of behavior also varied from model to model.


What CoffeeBench Means, and How to Read It

CoffeeBench's market rules (key settings)

Payment terms: Net-30 trade credit (paid 30 days later); a 0.1% daily penalty for late payment

Inventory decay: Beans and inventory spoil by 0.5% per day

Fixed daily cost: Farmers $25 / Roasters $30 / Retailers $50

Bankruptcy: Triggered the moment the cash balance goes negative

Metric: Cumulative net income over 90 days

Finally, we put together what CoffeeBench means within AI evaluation, what to watch for when reading the results, and what it implies for users in Japan and beyond.

Measuring Management as a Long Task (Co-Developed with KPMG)

CoffeeBench was developed by Sakana AI together with the audit firm KPMG AZSA. Drawing on accounting and management expertise, it builds in rules close to real commerce — deferred net-30 trade credit, inventory decay, daily fixed costs — which sets it apart from a mere game. As the figure above shows, a firm goes bankrupt the moment its cash turns negative, so the AI has to watch not only immediate sales but cash flow as well.
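A little arithmetic shows why these rules matter. The sketch below uses the rates given in this article (0.5% daily decay, 0.1% daily late penalty, the per-role fixed costs), but two details are assumptions: decay is treated as compounding on the remaining stock, and the late penalty as simple interest on the invoice amount. The paper may specify either differently.

```python
DAILY_DECAY = 0.005           # 0.5% of inventory spoils per day
LATE_PENALTY_PER_DAY = 0.001  # 0.1% per day on overdue invoices
FIXED_DAILY_COST = {"farmer": 25.0, "roaster": 30.0, "retailer": 50.0}
RUN_DAYS = 90


def inventory_after(units: float, days: int) -> float:
    """Stock left after `days`, assuming decay compounds on what remains."""
    return units * (1 - DAILY_DECAY) ** days


def late_invoice_total(amount: float, days_late: int) -> float:
    """Amount owed on an overdue invoice, assuming a simple (non-compounding) penalty."""
    return amount * (1 + LATE_PENALTY_PER_DAY * max(days_late, 0))


def fixed_cost_over_run(role: str, days: int = RUN_DAYS) -> float:
    """Total fixed cost a firm pays over the run, whatever else it does."""
    return FIXED_DAILY_COST[role] * days


def is_bankrupt(cash: float) -> bool:
    """Bankruptcy triggers as soon as the cash balance is negative."""
    return cash < 0


print(round(inventory_after(100.0, 30), 1))  # about 86 units left after 30 days
print(fixed_cost_over_run("roaster"))        # 2700.0
```

Under these assumptions, stock held for a full net-30 payment cycle loses roughly 14% of its volume, and a roaster pays $2,700 in fixed costs over 90 days before buying a single bean.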

A yardstick that can quantitatively measure "business judgment over months" is valuable for thinking about how far an AI agent can be trusted with real work. It is also notable that a Japanese AI company has produced research from an angle still rare in the English-speaking world. For Sakana AI's multi-agent technology more broadly, reading it alongside the explainer on Sakana Fugu helps bring the company's direction into focus.

What to Keep in Mind When Reading the Benchmark

A few caveats help avoid misreading the numbers. Results are the mean of three runs per model, the spread (standard deviation) is large, and each run is a heavy test costing over $200 and taking over five hours. The ranking is for a setting that pits each model against fixed reference agents, so a different lineup of opponents could shift the outcome. It pays not to judge superiority from a single win or loss.
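The caveat about spread can be made concrete with a small check of whether two models' results overlap once the standard deviation is taken into account. The per-run values below are made-up placeholders, not figures from the paper, and the "mean ± 1 standard deviation" overlap test is a rough heuristic rather than a formal significance test.

```python
from statistics import mean, stdev


def summarize_runs(net_incomes: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) for a model's runs."""
    if len(net_incomes) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(net_incomes), stdev(net_incomes)


def ranges_overlap(runs_a: list[float], runs_b: list[float]) -> bool:
    """True if the mean ± 1 std ranges of the two models overlap."""
    mean_a, sd_a = summarize_runs(runs_a)
    mean_b, sd_b = summarize_runs(runs_b)
    return (mean_a - sd_a) <= (mean_b + sd_b) and (mean_b - sd_b) <= (mean_a + sd_a)


# Placeholder per-run net incomes for two hypothetical models (NOT paper data).
model_x = [1000.0, 3000.0, 5000.0]  # mean 3000, std 2000
model_y = [2000.0, 2500.0, 3000.0]  # mean 2500, std 500

print(summarize_runs(model_x))           # (3000.0, 2000.0)
print(ranges_overlap(model_x, model_y))  # True
```

With only three runs per model, a gap of a few hundred dollars between means can sit well inside the run-to-run noise, which is why a single ranking position should not be over-read.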

On that point, CoffeeBench publishes its code and the full trajectories of every agent — reasoning traces, tool calls and inter-firm messages — so third parties can inspect the details. Releasing the complete experimental record to support reproducibility is an important feature that raises the credibility of the results. Rather than taking the figures at face value, you can trace them back to the primary evidence when needed.

What CoffeeBench Tells Us (Summary)

What CoffeeBench shows is that an AI agent's real ability cannot be measured by "cleverness in short exchanges" alone. Some models, like GPT-5.5 and Claude Opus 4.7, handle long-horizon business decisions steadily, while others, like Haiku 4.5, stop acting partway through, and the longer the task, the wider the gap. When you want to hand ongoing work to an AI, the useful lens is not one-shot answer accuracy but this kind of long-term consistency. When you put your own documents in front of an AI to test it, converting them to Markdown first keeps headings and table structure intact and tends to improve accuracy.
