A Quick Guide to AI

The cost of compute

AI doesn't run on magic. It runs on electricity, specialised hardware, and water, and it needs a lot of all three. Understanding these costs helps explain why AI systems work the way they do, why some capabilities are more expensive than others, and why the infrastructure behind AI is becoming a significant economic and environmental issue.

Why AI Is So Compute-Intensive

AI systems, particularly large language models and generative tools, require enormous computational resources for two reasons:

Scale of operations: A single LLM might have hundreds of billions of parameters (the adjustable values that encode patterns from training data). Every time you send a prompt, the system performs calculations across all those parameters to generate a response. Multiply that by millions of users sending billions of requests, and the computational load becomes staggering.

Processing requirements: AI workloads run on specialised chips called GPUs (Graphics Processing Units) that can perform many calculations simultaneously. Training a frontier model requires thousands of these chips running continuously for months. A single high-end GPU like the Nvidia H100 costs upward of $30,000, and training runs can require tens of thousands of them.

The result: training a model like GPT-4 is estimated to have cost over $100 million and consumed around 50 gigawatt-hours of electricity. That's enough to power a city like San Francisco for three days.
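To get a feel for where these numbers come from, here is a rough back-of-envelope sketch in Python using two widely cited rules of thumb: a dense model performs roughly 2 × N floating-point operations per generated token (where N is the parameter count), and training costs roughly 6 × N × D operations (where D is the number of training tokens). The model size and training-token count below are illustrative assumptions, not published figures for any real model.

```python
# Back-of-envelope compute estimates using common rules of thumb.
# The model size and token counts below are illustrative assumptions.

def inference_flops_per_token(n_params: float) -> float:
    """Rough rule of thumb: ~2 FLOPs per parameter per generated token."""
    return 2 * n_params

def training_flops(n_params: float, n_training_tokens: float) -> float:
    """Rough rule of thumb: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_training_tokens

# A hypothetical 500-billion-parameter model trained on 10 trillion tokens.
N = 500e9
D = 10e12

print(f"Inference: {inference_flops_per_token(N):.1e} FLOPs per token")
print(f"Training:  {training_flops(N, D):.1e} FLOPs in total")
# Inference: ~1e12 FLOPs per token, so a 500-token reply is ~5e14 FLOPs.
# Training:  ~3e25 FLOPs, which is why it takes thousands of GPUs for months.
```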

Training vs. Inference: Where the Costs Actually Are

There are two phases of AI compute:

Training is the upfront investment: running massive datasets through the model to adjust its parameters. It's expensive but happens once (or periodically when models are updated).

Inference is what happens every time someone uses the model (processing a prompt and generating a response). Each individual inference uses far less compute than training, but it happens constantly, at scale, for potentially millions of users.

Inference typically accounts for 80–90% of the total compute cost over a model's lifetime.

This is why efficiency matters so much. Even small reductions in inference cost, multiplied across billions of queries, translate to significant savings in electricity, hardware, and money.
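To see why inference ends up dominating, here is a minimal sketch comparing a one-off training bill with cumulative inference spend. All of the figures (training cost, cost per query, query volume) are made-up assumptions chosen only to illustrate the shape of the curve, not real pricing.

```python
# Sketch: when does cumulative inference cost overtake the one-off training cost?
# All dollar figures below are illustrative assumptions, not real pricing.

TRAINING_COST = 100e6          # one-off training cost, in dollars (assumed)
COST_PER_QUERY = 0.002         # average inference cost per query (assumed)
QUERIES_PER_DAY = 200e6        # daily query volume (assumed)

daily_inference_cost = COST_PER_QUERY * QUERIES_PER_DAY
days_to_match_training = TRAINING_COST / daily_inference_cost

print(f"Daily inference spend: ${daily_inference_cost:,.0f}")
print(f"Inference matches the training bill after ~{days_to_match_training:.0f} days")
# With these assumptions, inference overtakes training well within a year,
# and keeps growing for as long as the model is being served.
```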

Which AI Tasks Use the Most Compute?

Not all AI tasks are created equal. The computational demands vary dramatically:

Text generation is relatively efficient, but can vary greatly. A simple query on a smaller, more efficient model might use around 100–150 joules, about what it takes to ride six feet on an e-bike. A more complex query on a larger model could require 5,000–8,000 joules, enough to carry a person about 400 feet on an e-bike.

Image generation typically sits closer to the complex text queries. Creating a single high-resolution image uses around 2,000–5,000 joules, about 250 feet on an e-bike.

Video generation is the most demanding by far. By some estimates, generating a five-second video clip could consume around 3.4 million joules (close to one kilowatt-hour), equivalent to riding 38 miles on an e-bike, or running a microwave for over an hour.

Agentic systems compound these costs. When an AI agent runs in a loop (generating outputs, executing actions, processing results, and iterating), it's making multiple inference calls for a single task. A complex agent workflow might involve dozens of LLM calls, tool executions, and processing steps, each consuming compute.
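The comparisons above are easier to check, and to extend to agent workflows, with a small unit-conversion sketch. The per-task joule figures are the rough estimates quoted above; the number of calls in the agent loop is an assumption for illustration.

```python
# Convert the rough per-task energy estimates above into kilowatt-hours,
# and show how an agent loop multiplies them. All figures are rough estimates.

JOULES_PER_KWH = 3.6e6

tasks_joules = {
    "simple text query": 125,          # midpoint of the 100-150 J estimate
    "complex text query": 6_500,       # midpoint of the 5,000-8,000 J estimate
    "image generation": 3_500,         # midpoint of the 2,000-5,000 J estimate
    "5-second video clip": 3.4e6,      # estimate quoted above
}

for task, joules in tasks_joules.items():
    print(f"{task:>20}: {joules / JOULES_PER_KWH:.6f} kWh")

# An agent that loops through many model calls compounds the cost.
# Assume a workflow that makes 40 complex-model calls for a single task:
agent_calls = 40
agent_joules = agent_calls * tasks_joules["complex text query"]
print(f"Agent workflow: ~{agent_joules:,.0f} J "
      f"({agent_joules / JOULES_PER_KWH:.3f} kWh)")
```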

Data Centres: The Physical Infrastructure

All this computation happens in data centres: specialised facilities that house thousands of servers, cooling systems, and networking equipment.

Where they're located: Data centres cluster in specific regions for practical reasons. Northern Virginia hosts the largest concentration in the world — over 250 facilities handling roughly 70% of global internet traffic. The region became dominant due to proximity to federal government customers, early investment in fibre-optic infrastructure, available land, relatively cheap electricity, and favourable tax incentives. Other major hubs include Dallas, Chicago, Phoenix, and parts of Northern California.

Why they're needed: Every time you use a cloud service, send a message, stream video, or query an AI model, your request travels to a data centre where servers process it. AI has dramatically increased the computational demands on these facilities. A typical AI-focused data centre consumes as much electricity as 100,000 households; the largest ones under development will use twenty times that amount.

Energy consumption: U.S. data centres consumed approximately 183 terawatt-hours of electricity in 2024 — over 4% of total national electricity consumption, roughly equivalent to the annual electricity demand of Pakistan. By 2030, this is projected to grow by 133% to 426 terawatt-hours. Globally, data centres consumed around 415 terawatt-hours in 2024 (about 1.5% of global electricity), projected to reach 945 terawatt-hours by 2030.

Water consumption: Data centres generate enormous heat and require cooling systems that often use significant amounts of water. U.S. data centres directly consumed about 17 billion gallons of water in 2023. A medium-sized data centre can use 110 million gallons annually — equivalent to about 1,000 households. Large facilities can consume up to 5 million gallons per day. This is becoming a significant concern in water-stressed regions.

The Performance-Efficiency Tradeoff

There's an inherent tension in AI development: more capable models generally require more compute.

Larger models with more parameters tend to produce better outputs but require more electricity to run. Longer context windows (the amount of text a model can process at once) improve usefulness but increase memory and processing requirements. More sophisticated reasoning (like having the model "think through" problems step by step) improves accuracy but multiplies inference costs.
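One concrete way to see the cost of longer context windows is the memory a model needs for its attention cache, which grows linearly with the amount of text being processed. The sketch below uses a hypothetical model configuration; the layer counts and sizes are assumptions for illustration, not the figures for any real model.

```python
# Sketch: attention (KV) cache memory grows linearly with context length.
# The model configuration below is a hypothetical example, not a real model.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Keys and values are stored per layer, per head, per token of context."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Hypothetical large model: 80 layers, 8 KV heads, head dimension 128, 16-bit values.
for context_len in (8_000, 32_000, 128_000):
    gib = kv_cache_bytes(80, 8, 128, context_len) / 2**30
    print(f"{context_len:>7} tokens -> ~{gib:.1f} GiB of cache per request")
# Quadrupling the context quadruples the cache, on top of the model weights.
```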

AI developers are constantly navigating this tradeoff. A model that's twice as capable but ten times as expensive to run may not be commercially viable.

Given the significant operating losses of frontier AI models, it can be reasonably assumed that the cost to the consumer is being heavily subsidised at the moment. This is especially relevant for AI companies building applications using frontier model APIs — their supply costs may increase significantly in the coming years.

GPU Access: A Bottleneck

The specialised chips required for AI are in limited supply. Nvidia dominates the market for AI-capable GPUs, and demand far exceeds production capacity. This creates several dynamics:

  • High costs: Cutting-edge GPUs are expensive and often allocated months in advance

  • Concentration of power: Only well-funded organisations can afford the hardware needed to train frontier models

  • Cloud dependence: Many companies rent GPU time from cloud providers (Amazon, Google, Microsoft) rather than building their own infrastructure

  • Geopolitical implications: GPU manufacturing is concentrated in Taiwan, creating supply chain vulnerabilities and geopolitical concerns

For smaller organisations, access to sufficient compute is often the binding constraint on what AI capabilities they can develop or deploy.

Efficiency Techniques: Doing More with Less

Engineers have developed various techniques to reduce compute requirements without proportionally sacrificing performance:

Mixture-of-experts architectures divide a model into specialised sub-networks, or "experts". Instead of activating the entire model for every query, a routing network engages only the relevant experts. This means a model can have many parameters but only use a fraction of them for any given request, dramatically reducing inference costs.
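Here is a minimal sketch of the routing idea using NumPy. The "experts" are stand-in weight matrices and the gating weights are random; it only illustrates how a router picks the top few experts per token, not how a real model is built.

```python
# Minimal mixture-of-experts routing sketch (illustrative only, not a real model).
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 16, 2

# Each "expert" here is just a small weight matrix standing in for a sub-network.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))   # gating network weights

def moe_forward(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    scores = x @ router                               # one score per expert
    top = np.argsort(scores)[-top_k:]                 # indices of the best experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over top-k
    # Only top_k of the n_experts sub-networks do any work for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
output = moe_forward(token)
print(output.shape)  # (16,): same shape as the input, but only 2 of 8 experts ran
```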

Quantisation reduces the precision of the numbers used in calculations. Instead of using high-precision formats (like 32-bit floating point), models can run with lower precision (16-bit, 8-bit, or even 4-bit integers). This reduces memory requirements and speeds up processing, sometimes by 2–4x, with relatively modest impact on output quality.
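A tiny NumPy sketch of the core idea behind 8-bit quantisation: map floating-point weights onto integers with a scale factor, then map back at compute time. Real quantisation schemes are considerably more sophisticated; this only shows the intuition and the memory saving.

```python
# Toy 8-bit quantisation sketch: the core idea, not a production scheme.
import numpy as np

weights = np.random.default_rng(1).standard_normal(1000).astype(np.float32)

# Quantise: scale the float range onto signed 8-bit integers (-127..127).
scale = np.abs(weights).max() / 127.0
q_weights = np.round(weights / scale).astype(np.int8)     # 1 byte per weight

# Dequantise at compute time: multiply back by the scale factor.
recovered = q_weights.astype(np.float32) * scale

print(f"Memory: {weights.nbytes} bytes -> {q_weights.nbytes} bytes (4x smaller)")
print(f"Mean absolute error: {np.abs(weights - recovered).mean():.4f}")
```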

Model distillation trains a smaller "student" model to mimic the outputs of a larger "teacher" model. The student model can often achieve similar performance to the teacher while requiring far less compute to run.
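The training signal for distillation is typically a loss that pushes the student's output distribution towards the teacher's. The sketch below shows one such loss on made-up logits; in practice it is combined with the ordinary training loss and run over the whole training set.

```python
# Distillation loss sketch: match the student's token distribution to the teacher's.
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the (softened) teacher distribution to the student's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return float(np.sum(p_teacher * np.log(p_teacher / p_student)))

# Made-up logits over a five-token vocabulary.
teacher = np.array([4.0, 1.5, 0.2, -1.0, -2.0])
student = np.array([3.0, 2.0, 0.0, -0.5, -1.5])
print(f"Distillation loss: {distillation_loss(teacher, student):.4f}")
# Training nudges the student's logits to shrink this number.
```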

Caching and retrieval means storing common outputs and reusing them rather than regenerating them from scratch. If many users ask similar questions, the system can serve previous responses rather than computing new ones.
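A minimal sketch of response caching: normalise the prompt, use it as a key, and only call the model on a cache miss. The `call_model` function is a hypothetical stand-in for a real, expensive inference call.

```python
# Minimal response-cache sketch. call_model() is a hypothetical stand-in
# for an expensive model call; the cache avoids repeating identical work.
import hashlib

cache: dict[str, str] = {}

def call_model(prompt: str) -> str:
    """Placeholder for an expensive inference call."""
    return f"(model answer to: {prompt})"

def cached_generate(prompt: str) -> str:
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in cache:                 # cache miss: pay for inference once
        cache[key] = call_model(prompt)
    return cache[key]                    # cache hit: free on every repeat

print(cached_generate("What is a data centre?"))
print(cached_generate("what is a data centre?  "))  # normalised to the same key
print(f"Model was called {len(cache)} time(s) for 2 requests")
```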

These techniques explain why AI capabilities can improve even without proportional increases in compute.

The Bottom Line

AI runs on physical infrastructure with real costs:

  • Specialised hardware that's expensive and supply-constrained

  • Electricity consumption that's growing rapidly

  • Water for cooling systems

  • Data centres that cluster in specific regions and strain local resources

Understanding these costs helps explain why AI services are priced the way they are, why some capabilities (like video generation) are more expensive than others, why efficiency is a major focus of AI development, and why the infrastructure buildout for AI is a significant economic and environmental issue.

The impressive capabilities of modern AI don't emerge from nowhere. They're built on massive investments in compute infrastructure, and that infrastructure has real-world consequences.