The scale of the wager

In 2025, the five largest hyperscalers are expected to invest more than US$300 billion in data‑center construction, graphics processing units (GPUs), memory and power contracts,1 with aggregate spending over 2025–2027 forecast to exceed US$1 trillion.

Adjusted for inflation, the projected 2025 spend alone is more than the US government spent on its entire 13-year Apollo moon landing program.2

Hyperscaler boards therefore face a familiar strategic dilemma: is this similar to the Internet bubble, where early CapEx overshot demand, or the cloud wave, where capacity was eventually absorbed and richly monetized? Potentially worse, each firm likely fears that underspending cedes a permanent artificial general intelligence (AGI) lead to a rival. The result is a high‑stakes “prisoner’s dilemma.”

A simple check-up before another billion

As the graph indicates, in 2023 and 2024,1 hyperscalers allocated over 50 percent of their operating cash flows to capital expenditure (CapEx). Despite these heavy investments, returns on compute have become a growing concern.

A simple back‑of‑envelope return‑on‑compute (ROC) check‑up could be something like this:

Net present value (NPV) ≈ Σ [(price per token – variable cost per token) × tokens served × utilization] – (CapEx + fixed operating expenditure (OpEx))

If the NPV stays positive after realistic sensitivity tests, the project clears the investment hurdle. If not, capital is at risk of being stranded in a fast‑depreciating asset class — GPUs typically amortize in 18‑24 months; some high‑bandwidth‑memory (HBM) cards in just twelve.
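As a minimal sketch, the check-up can be run in a few lines of Python. Every figure below (cluster cost, token volumes, prices, utilization) is an illustrative assumption rather than a forecast, and the discount rate is an addition the prose formula leaves implicit:

```python
# Back-of-envelope return-on-compute (ROC) check-up mirroring the formula above.
# All inputs are illustrative assumptions, not forecasts.

def roc_npv(price_per_m_tokens, var_cost_per_m_tokens, m_tokens_per_year,
            utilization, capex, fixed_opex_per_year, years=2, discount_rate=0.10):
    """NPV ~ sum[(price - variable cost) x tokens served x utilization] - (CapEx + fixed OpEx).

    The 2-year horizon mirrors an 18-24 month GPU amortization window;
    the discount rate is our addition to the prose formula.
    """
    npv = -capex
    for t in range(1, years + 1):
        contribution = ((price_per_m_tokens - var_cost_per_m_tokens)
                        * m_tokens_per_year * utilization)
        npv += (contribution - fixed_opex_per_year) / (1 + discount_rate) ** t
    return npv

# Hypothetical US$500m cluster serving 200m million-token units per year.
base = dict(price_per_m_tokens=6.0, var_cost_per_m_tokens=2.5,
            m_tokens_per_year=200_000_000, utilization=0.6,
            capex=500_000_000, fixed_opex_per_year=40_000_000)
print(f"Base case NPV: US${roc_npv(**base):,.0f}")

# Sensitivity test: competition halves the achievable token price.
stressed = {**base, "price_per_m_tokens": 3.0}
print(f"Price-war NPV: US${roc_npv(**stressed):,.0f}")
```

In this hypothetical case, halving the token price flips a comfortably positive NPV deeply negative, which is exactly the kind of sensitivity the check-up is designed to expose.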

Here is an overview of some of the critical drivers for each of the ‘swim lanes’ of this formula:

ROC swim-lanes (critical drivers only): what really moves the needle in 2025-27

1. Price per token
 (a) Closed- versus open-source quality gap
 (b) Monetization model: seat licenses, application programming interface (API) calls, revenue-share
 (c) Competitive discounting and hyperscaler bundles
 (d) Enterprise shift to on-premise (on-prem) / private deployments

2. Variable cost per token
 (a) Electricity (US$/kWh)
 (b) GPU and high-bandwidth memory (HBM) depreciation schedules
 (c) Foundation-model royalties / revenue-share
 (d) Data-center networking costs

3. Tokens served
 (a) Adoption of agentic copilot workflows
 (b) Average tokens per query (reasoning chains versus single-shot)
 (c) Physical-AI workloads (robots, autonomous vehicles (AVs), industrial Internet of Things (IoT))

4. Utilization rate
 (a) How often GPUs sit idle versus billed
 (b) Bottlenecks (memory, networking, energy, etc.)
 (c) Edge / off-prem inference off-loading

5. Fixed outlays (CapEx + OpEx)
 (a) GPU average selling price (ASP) trends
 (b) Data-center build cost per megawatt (MW): land, cooling, fit-out
 (c) Long-term power lock-in contracts (e.g., small modular reactor (SMR) nuclear power purchase agreements (PPAs))
 (d) Staff and support overhead; software licensing

The rest of this article explores six of the major variables that can swing the ROC calculation.

Where are we on the compute S-curve?

For fifteen years GPU growth followed a classic first S‑curve: every additional dollar of pre‑training compute delivered impressively higher model quality. The proposed xAI Colossus (100,000 GPUs requiring 300 MW of power) and OpenAI ‘Stargate’ project (500,000 GPUs) assume this exponential trajectory holds.

Source: Bonus Clouded Judgement – Inference Time Compute3

Compute scaling phases

1. Pre-training
 Primary spend: massive GPU clusters
 Bottlenecks: GPU supply; access to fresh data

2. Post-training (fine-tuning)
 Primary spend: GPU + high-bandwidth memory (HBM), human-in-the-loop
 Bottlenecks: data-labor for human feedback; fast, low-cost fine-tune silicon

3. Test-time (inference)
 Primary spend: inference accelerators, memory bandwidth
 Bottlenecks: latency and memory (context window, etc.); energy cost per token

But evidence from late‑2024 runs suggests marginal gains from brute‑force pre‑training are flattening or even diminishing.4 Many laboratories are now redirecting spend into the next two overlapping, higher S‑curves: post-training and test-time scaling.

In January 2025, DeepSeek-R1, a Chinese large language model (LLM), was recognized as one of the leading reasoning models, surpassing many US models. The development team achieved this at low CapEx through a focus on post-training, including reinforcement learning with human feedback (RLHF).

Strategic implication: As the era of massive pre-training subsides, ongoing inference for tasks such as agents and robotics pushes workloads toward distributed high-performance computing (HPC). Enterprises will also use on-prem solutions to keep data local, and re-use idle GPUs for inference workloads. With this compute shift, high-bandwidth memory becomes the critical factor: the ability to field longer ‘context windows’ at speed.
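A rough sketch of why: the key/value cache a transformer must hold per request grows linearly with context length, so long contexts are gated by memory capacity and bandwidth rather than raw compute. The model dimensions below are hypothetical, not any specific product:

```python
# Approximate key/value (KV) cache held in HBM per inference request.
# Model dimensions are illustrative assumptions.

def kv_cache_gb(context_len, n_layers=80, n_kv_heads=8, head_dim=128,
                bytes_per_value=2, batch_size=1):
    # Factor of 2: one key tensor and one value tensor per layer.
    total_bytes = (2 * n_layers * context_len * n_kv_heads
                   * head_dim * bytes_per_value * batch_size)
    return total_bytes / 1e9

for ctx in (8_000, 128_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> {kv_cache_gb(ctx):6.1f} GB of HBM per request")
```

At these assumed dimensions, a single million-token request would outgrow the HBM of any single accelerator, which is why memory, not FLOPs, gates test-time scaling.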

Monetization and business models

How exactly will these AI capabilities make money? It’s one thing to have a jaw-dropping demo; it’s another to have users or enterprises pay for it sustainably. Several models are being tried:

  • cloud usage fees (e.g. pay per 1,000 API calls)
  • software-as-a-service (SaaS)-style subscriptions (e.g. ChatGPT Plus at US$20/month)
  • indirect monetization (more engagement leading to more advertising revenue)

The unit economics of AI services can be challenging – serving one AI query can cost 10× or 100× more than a traditional software query – and this cost is increasing with the latest ‘reasoning’ models. To remain viable, either prices must rise or costs fall.

At the same time, token floor prices for older models decline rapidly as new models are introduced. As of May 2025, hyperscaler wholesale rates for GPT‑4 Turbo hover around US$5–7 per million tokens in committed enterprise deals, down from US$30 just a year earlier.5
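A quick worked example of the squeeze, using the prices above and an assumed variable cost of US$2.50 per million tokens:

```python
# Contribution margin per million tokens as floor prices deflate.
# Prices follow the text; the variable cost figure is an assumption.

VAR_COST = 2.5  # assumed US$ per million tokens (energy, depreciation, royalties)

for label, price in [("Mid-2024 rate", 30.0), ("May 2025 committed rate", 6.0)]:
    margin = price - VAR_COST
    print(f"{label}: US${price:.0f}/M tokens -> US${margin:.1f} margin "
          f"({margin / price:.0%} of revenue)")
```

Unless variable cost per token falls at least as fast, the same price deflation compresses gross margin from roughly 92 percent to under 60 percent.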

Strategic implication: A big question is whether end-users will pay directly for AI, or whether it will mostly be an embedded feature where companies absorb the cost and try to recoup via higher productivity or retention. Enterprises appear to be willing to pay for tangible productivity gains – hence all the AI copilots targeting coding, document generation, etc. However, few consumers pay for search or email now, so will they be willing to pay for an AI assistant? OpenAI’s success in growing paid subscriptions suggests some will, but loyalty is fickle — consumers quickly migrate to the new ‘best’ model.

Open-source squeezes token prices

Will open-source models and tools dominate the landscape, or will proprietary offerings maintain an edge? This debate is heated.

When a recent, high-profile model was open‑sourced, few foresaw how quickly community fine‑tunes would close the quality gap with closed-source models such as ChatGPT. January 2025’s launch of a Chinese model went further, delivering frontier‑level accuracy on a cluster costing roughly a tenth of GPT‑4’s; MiLM‑2 followed weeks later.

Cheaper, high‑quality, open-source models place downward pressure on price per token and potentially push value up the stack toward applications and software, data owners, and device and distribution moats.

Strategic implication: Enterprise buyers’ perceptions could be critical: if companies feel that open-source models are “good enough” for their needs, it could shift spend away from proprietary APIs to more on-prem AI.

Energy moves to center stage

Silicon is no longer the only scarce input. The International Energy Agency (IEA) projects global data center electricity consumption to more than double to around 945 TWh by 2030, with more than half of such growth in the US driven by AI.

Many electricity grids are already under strain. This is compounded by the concentration of data centers in certain locations – i.e. near large population centers. Wait times for critical grid components are extending. Many in the sector believe that power will be the key future bottleneck.

To mitigate these risks, operators have been looking to new energy sources, with some signing two‑decade nuclear power purchase agreements (PPAs), including small‑modular‑reactor (SMR) deals, to secure 24/7 baseload at predictable cost. For more on this, please refer to our first Future Forward article on the Electricity Economy.

Strategic implication: For the ROC equation, variable cost per token may soon be driven as much by US$/kWh as by chip amortization.
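To gauge the scale of that exposure, consider a sketch of the annual power bill for a 300 MW campus (the Colossus-scale figure cited earlier); the load factor and tariffs are assumptions:

```python
# Annual electricity bill for a 300 MW AI campus under two tariff scenarios.
# Load factor and tariffs are illustrative assumptions.

SITE_MW = 300          # campus power envelope, per the Colossus example above
LOAD_FACTOR = 0.8      # assumed average draw versus the envelope
HOURS_PER_YEAR = 8760

annual_mwh = SITE_MW * LOAD_FACTOR * HOURS_PER_YEAR

for scenario, usd_per_mwh in [("long-term nuclear/SMR PPA", 40),
                              ("strained-grid spot market", 120)]:
    cost_m = annual_mwh * usd_per_mwh / 1e6
    print(f"{scenario}: ~US${cost_m:,.0f}m per year")
```

Under these assumptions, the tariff alone swings annual operating cost by more than US$150 million per site, which is precisely why operators are locking in two-decade PPAs.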

Demand: Agentic and physical AI multiply tokens

Many institutions believe agentic AI is the next frontier.6 There is already experimentation with systems where multiple specialized AI agents communicate via API calls into multiple software apps and collaborate to handle complex tasks (one agent might be good at math, another at coding, another at planning; together they solve a problem). This has the potential to be more efficient than one monolithic model trying to do everything. If multi-agent approaches prove effective, the infrastructure might shift to orchestrating many smaller models. That would change the profile of compute (more distributed, possibly more memory- and communication-heavy).

Related, inference demand is not linear. Early chatbots averaged 1,000–2,000 tokens per call; multi‑step agentic tasks easily consume 50,000+ tokens, while a single autonomous‑vehicle fleet can generate terabytes of inference data daily. That swings tokens served and utilization sharply higher, as the rough comparison below illustrates.
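In the sketch below, call volumes and token counts are illustrative assumptions:

```python
# Tokens served scale with tokens per task, not just user counts.
# Workload volumes below are illustrative assumptions.

workloads = {
    "single-shot chatbot": dict(calls_per_day=10_000_000, tokens_per_call=1_500),
    "agentic workflow":    dict(calls_per_day=1_000_000, tokens_per_call=50_000),
}

for name, w in workloads.items():
    daily_tokens = w["calls_per_day"] * w["tokens_per_call"]
    print(f"{name}: {daily_tokens / 1e9:.0f}B tokens per day")
```

In this sketch the agentic workload serves ten times fewer requests yet more than triples the tokens served.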

Strategic implication: Rapid adoption of agentic and physical AI could exponentially increase demand for inference. However, in the case of physical AI, much of this will need to take place at the edge rather than in the cloud.

Asia Pacific’s parallel supply chain

North American export controls on A100/H100 GPUs7 have accelerated indigenous GPU projects and local lithography efforts in Asia Pacific.

Strategic implication: A dual-track market could compress global chip prices. In the short term, this is likely to undermine hyperscalers’ past depreciation assumptions. In the longer term, it could make AI compute less capital-intensive, boosting return on capital and model access in cost-sensitive markets.

Investors see opportunities in real-world AI applications

As AI applications become more integrated into everyday operations, the demand for efficient, real-time, and specialized AI solutions grows. Investors are recognizing the value in startups that not only develop AI models but also focus on optimizing and deploying these models effectively. The surge in venture capital (VC) investments towards post-training and inference technologies underscores a broader industry transition.

Over the past three years, VC investments in AI have increasingly focused on post-training, inference, and fine-tuning technologies. This trend reflects a strategic shift towards deploying and optimizing AI models in real-world applications.

Looking forward

AI has already set a new high‑water mark for corporate investment, but history should remind us that extraordinary CapEx does not automatically translate into extraordinary returns. Whether the current cycle ends up looking like the profitable cloud build‑out or the over‑built dot‑com era may hinge on how management teams navigate six interlocking forces that feed directly into the return‑on‑compute (ROC) equation:

Factor 1: Compute S-curves (pre-training → post-training → test-time)
 ROC swim-lanes: tokens served; utilization
 Why it now matters: Diminishing pre-training gains force the industry to chase volume-heavy inference and memory-heavy reasoning, shifting demand from GPUs to HBM and smarter scheduling.

Factor 2: Token economics (pricing versus cost deflation)
 ROC swim-lanes: price per token; variable cost per token
 Why it now matters: Open-source fine-tunes and usage-based business models push price per token down; success will depend on driving cost per token down even faster.

Factor 3: Open-source acceleration
 ROC swim-lanes: fixed OpEx (royalties); CapEx agility
 Why it now matters: Community models such as DeepSeek-R1 prove frontier quality at a fraction of historical budgets, allowing enterprises to redeploy spend higher up the stack.

Factor 4: Emerging bottlenecks (energy, HBM, latency)
 ROC swim-lanes: variable cost per token; utilization
 Why it now matters: Power, memory bandwidth and edge latency, not GPUs, become the new choke-points, determining how much of the installed fleet is actually sweated.

Factor 5: Demand shock from agentic and physical AI
 ROC swim-lanes: tokens served
 Why it now matters: Agents, factory robots and autonomous fleets multiply inference load, but only if end-users perceive clear value and shoulder the bill.

Factor 6: China's parallel supply chain
 ROC swim-lanes: CapEx; variable cost per token
 Why it now matters: Export controls have catalyzed a domestic GPU and lithography stack that could cut accelerator average selling prices (ASPs) by ~40%, lowering both upfront build costs and ongoing depreciation.

Strategic takeaway: Infrastructure owners that embed a ‘return‑on‑compute’ discipline — optimizing each swim‑lane before committing the next dollar — should translate today’s spend into tomorrow’s cash‑flows. Those that chase capacity for capacity’s sake risk owning stranded, fast‑depreciating assets.




Our people

Barnaby Robson

Partner, Head of Value Creation, China, Head of Deal Strategy, Hong Kong, Head of Financial Services Deals, Hong Kong

KPMG in China

Javier Rodriguez

Global Head of Strategy

KPMG International

Sanjay Sehgal

Head of AI and Analytics Solutions – Global Advisory, KPMG US

KPMG in the U.S.


1 Capital IQ analysis and datacenterdynamics.com
2 The Planetary Society
3 Bonus Clouded Judgement - Inference Time Compute
4 Centre for Future Generations
5 https://gptforwork.com/help/billing/pay-per-use-packs/how-it-works
6 https://www.adamsstreetpartners.com/insights/the-next-frontier-the-rise-of-agentic-ai/; CMO Today; Harvard Business Review; WisdomTree
7 Silicon technology powering business