A Quick Guide to AI

How do other AI systems work?

Large Language Models power chatbots, but AI extends far beyond conversational interfaces. This page covers the broader landscape of AI systems you're likely to encounter.

Applications of LLMs Beyond Chatbots

The chatbot interface is just one way to use an LLM. Behind the scenes, many AI companies simply send data to ChatGPT or Claude through APIs (Application Programming Interfaces) to power a wide range of applications. This is the core product of the vast majority of AI companies receiving VC funding.

Document Summarisation Feed a long report into an LLM via API, and it returns a condensed version. The model isn't "reading" in any human sense. It's processing tokens and generating a statistically probable shorter version that preserves key information. Human engineers define the prompts, length constraints, and quality checks.
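A minimal sketch of what such an API call can look like, assuming the OpenAI Python client; the model name, prompt wording, and length limit are illustrative, and any comparable LLM API follows the same shape:

    # Illustrative summarisation call. The model name and prompt are assumptions,
    # not a recommendation; the same pattern works with any chat-style LLM API.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def summarise(report_text: str, max_words: int = 200) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model name
            messages=[
                {"role": "system",
                 "content": f"Summarise the user's document in at most {max_words} words, "
                            "preserving key figures and conclusions."},
                {"role": "user", "content": report_text},
            ],
        )
        return response.choices[0].message.content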

Translation LLMs can convert text between languages because their training data included parallel texts in multiple languages. The model has encoded statistical relationships between phrases across languages. Quality varies significantly depending on the language pair and domain.

Content Generation Marketing copy, email drafts, product descriptions, code snippets — organisations that produce these typically generate them through API calls to ChatGPT or Claude.

Classification and Extraction Rather than generating text, LLMs can categorise inputs ("Is this email a complaint or a compliment?") or extract structured data from unstructured text ("Pull the date, amount, and vendor from this invoice"). Engineers constrain outputs to specific formats by adding standard text to the prompt, e.g., “This is the request from the user; please format the date as MM/DD/YYYY”.
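A sketch of how that constraint might look in code, again using the OpenAI client as an example; the field names, date format, and model name are illustrative assumptions:

    # Illustrative extraction call: the wrapper text added around the user's input
    # forces a fixed output format. Model name and field layout are assumptions.
    from openai import OpenAI

    client = OpenAI()

    FORMAT_INSTRUCTIONS = (
        "Extract the date, amount, and vendor from the invoice text below. "
        "Reply with exactly three lines:\n"
        "date: MM/DD/YYYY\namount: <number>\nvendor: <name>"
    )

    def extract_invoice_fields(invoice_text: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model name
            messages=[
                {"role": "system", "content": FORMAT_INSTRUCTIONS},
                {"role": "user", "content": invoice_text},
            ],
        )
        return response.choices[0].message.content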

Predictive and Classification AI Systems

Many high-impact AI systems don't generate content at all; they take inputs and produce classifications, scores, or predictions.

How They Work

These models are trained on historical data where the outcomes are already known. For example, a fraud detection system might be trained on millions of past transactions that were later confirmed as legitimate or fraudulent. During training, the system's parameters are adjusted to minimise prediction errors — getting better at distinguishing fraud from legitimate transactions based on patterns in the data.

Once trained, the system can score new transactions it hasn't seen before, flagging those that match patterns associated with past fraud.
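A toy sketch of this train-then-score pattern using scikit-learn; the features, data, and labels are invented purely for illustration:

    # Toy example of training on historical outcomes, then scoring new inputs.
    # The features and labels are invented; real fraud systems use far richer data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Historical transactions: [amount, hour_of_day, is_foreign] with known outcomes
    X_history = np.array([[25.0, 14, 0], [900.0, 3, 1], [40.0, 19, 0], [1500.0, 2, 1]])
    y_history = np.array([0, 1, 0, 1])  # 0 = legitimate, 1 = confirmed fraud

    model = LogisticRegression()
    model.fit(X_history, y_history)  # parameters adjusted to minimise prediction error

    # Score a transaction the system has never seen before
    new_transaction = np.array([[1200.0, 4, 1]])
    fraud_probability = model.predict_proba(new_transaction)[0, 1]
    print(f"Fraud risk score: {fraud_probability:.2f}")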

For these kinds of predictions, a purpose-built model is far more accurate than asking a chatbot: it is trained directly on relevant, labelled outcomes, whereas a chatbot's general training data contains little of that specific signal.

Common Applications

  • Credit scoring: input financial history, output a risk score

  • Medical diagnosis support: input symptoms and test results, output probability of conditions

  • Fraud detection: input transaction details, output likelihood of fraud

  • Demand forecasting: input historical sales data, output predicted future demand

  • Insurance pricing: input applicant details, output risk assessment

  • Hiring screening: input résumé data, output candidate ranking

The Stakes

These systems often make or influence consequential decisions about people's lives — who gets a loan, who gets flagged for additional security screening, whose résumé gets seen by a human. This makes the training data and optimisation targets especially important: if historical data reflects past discrimination, the system may perpetuate it. Human oversight of such systems is critical.

Computer Vision

Beyond generating images, AI systems can analyse them:

Object Detection and Recognition Systems can identify and locate objects in images: finding all the cars in a photo, reading text from documents, or identifying products on store shelves. These are trained by showing the system millions of labelled examples: images where humans have drawn boxes around objects and tagged them — often difficult, manual work performed by underpaid workers in the Global South. You also contribute to this labelling every time you complete a CAPTCHA.
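For a sense of what using such a system looks like, here is a sketch with a pretrained detector from torchvision; the image path is a placeholder and the model is just one common option:

    # Sketch: running a pretrained object detector from torchvision.
    # "street.jpg" is a placeholder path; the model is one common choice, not a recommendation.
    import torch
    from torchvision import transforms
    from torchvision.models.detection import fasterrcnn_resnet50_fpn
    from PIL import Image

    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # trained on human-labelled boxes
    model.eval()

    image = Image.open("street.jpg").convert("RGB")
    tensor = transforms.ToTensor()(image)

    with torch.no_grad():
        prediction = model([tensor])[0]  # dict with 'boxes', 'labels', 'scores'

    for box, label, score in zip(prediction["boxes"], prediction["labels"], prediction["scores"]):
        if score > 0.8:  # keep only confident detections
            print(label.item(), [round(v) for v in box.tolist()], round(score.item(), 2))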

Facial Recognition These systems match faces to identities, powering phone unlocking, photo organisation, and security applications. They work by extracting numerical representations of facial features and comparing them to stored templates. Accuracy varies significantly across demographic groups, often performing worse on faces underrepresented in training data.
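A simplified sketch of the matching step described above; in a real system the embeddings come from a neural network trained on faces, but here they are random vectors purely to illustrate the comparison logic:

    # Simplified sketch of face matching by comparing numerical representations.
    # The embeddings below are random stand-ins for the output of a real face model.
    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def match_face(query_embedding, stored_templates, threshold=0.6):
        best_name, best_score = None, threshold
        for name, template in stored_templates.items():
            score = cosine_similarity(query_embedding, template)
            if score > best_score:
                best_name, best_score = name, score
        return best_name  # None means no enrolled face matched closely enough

    # Illustrative data: 128-dimensional random vectors standing in for real embeddings
    rng = np.random.default_rng(0)
    templates = {"alice": rng.normal(size=128), "bob": rng.normal(size=128)}
    query = templates["alice"] + rng.normal(scale=0.1, size=128)  # a noisy re-capture
    print(match_face(query, templates))  # likely prints "alice"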

Medical Imaging AI systems can flag potential tumours, fractures, or anomalies in X-rays, MRIs, and CT scans. They're trained on images that have been annotated by medical professionals, and they serve as a "second opinion" rather than a replacement for human diagnosis.

Image Generation

Systems like DALL-E, Midjourney, and Stable Diffusion generate images from text descriptions. The underlying approach differs from LLMs but shares some principles.

How It Works: Diffusion Models 

Most modern image generators use "diffusion" models. During training, the system is shown millions of images paired with text descriptions. It's trained on a seemingly odd task: start with an image, gradually add random noise until it's pure static, then reverse the process — predicting how to remove noise step by step to recover the original image, conditioned on the text description. How close the recovered image is to the original serves as the training signal: the model's parameters are gradually adjusted to shrink that error.
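A heavily simplified sketch of that training step; denoiser() stands in for the real neural network and the noise schedule is crude, so treat this as illustration only:

    # Heavily simplified sketch of the diffusion training objective.
    # denoiser() is a stand-in for the real network; the noise schedule is illustrative.
    import torch

    def diffusion_training_step(denoiser, image, text_embedding, num_steps=1000):
        # Pick a random noise level and corrupt the image toward pure static
        t = torch.randint(0, num_steps, (1,))
        noise = torch.randn_like(image)
        alpha = 1.0 - t.float() / num_steps          # crude noise schedule
        noisy_image = alpha.sqrt() * image + (1 - alpha).sqrt() * noise

        # The model predicts the noise that was added, conditioned on the text
        predicted_noise = denoiser(noisy_image, t, text_embedding)

        # The gap between predicted and actual noise is the error the training
        # process gradually shrinks by adjusting the model's parameters
        loss = torch.mean((predicted_noise - noise) ** 2)
        return loss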

After training, generation works in reverse: start with pure noise, and iteratively remove it while steering toward an image that matches your text prompt. Each step slightly refines the image based on patterns encoded during training.
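And an equally simplified sketch of generation running in reverse; real samplers use carefully derived update rules, which are glossed over here:

    # Simplified sketch of generation: start from noise and iteratively denoise,
    # steering toward the text prompt. denoiser() and text_embedding are stand-ins.
    import torch

    def generate_image(denoiser, text_embedding, shape=(3, 64, 64), num_steps=50):
        image = torch.randn(shape)  # pure static
        for t in reversed(range(num_steps)):
            predicted_noise = denoiser(image, torch.tensor([t]), text_embedding)
            # Remove a fraction of the predicted noise at each step; real samplers
            # (e.g. DDIM) use a carefully derived update rule instead
            image = image - predicted_noise / num_steps
        return image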

The Human Layer

  • Training data consists of image-caption pairs, often scraped from the internet or licensed from stock photo services. What's included shapes what the model can generate.

  • Engineers define safety filters to block generation of certain content (violence, explicit material, public figures in compromising situations).

  • "Fine-tuning" on curated datasets can specialise a model for particular styles or domains.

  • Prompt engineering (how you phrase your request) dramatically affects output quality.

Limitations These systems don't understand images conceptually. They've encoded statistical patterns about what pixels tend to appear together given certain text descriptions. This is why they struggle with:

  • Accurate text rendering within images

  • Consistent counting (ask for "five apples" and you might get four or six)

  • Coherent spatial relationships

  • Anatomical accuracy (the infamous "too many fingers" problem)

Video Generation

Video generation extends image generation into the time dimension, with significant additional complexity.

Current Approaches Systems like Sora, Runway, and Pika work by generating sequences of frames that are temporally coherent: each frame follows plausibly from the previous one. Some approaches generate keyframes first and then fill in intermediate frames; others process video as a continuous sequence across both space and time.

The Challenges:

Video is dramatically harder than images because:

  • Temporal consistency: objects must maintain their appearance and obey physics across frames

  • Computational cost: a 10-second video at 30fps requires generating 300 coherent images

  • Motion dynamics: movements must look natural, not jerky or physically impossible

Current State:

As of 2025, video generation can produce impressive short clips but struggles with:

  • Longer durations (quality degrades over time)

  • Complex actions and interactions

  • Consistent characters across scenes

  • Accurate physics (objects may morph, float, or behave strangely)

These systems are improving rapidly, but they remain tools for creating short-form content or rough drafts that require significant human curation.

Recommendation Systems

When Netflix suggests your next show or Amazon recommends products, you're interacting with a recommendation system. These are among the most widely deployed AI systems, shaping what billions of people see online every day.

How They Work

Recommendation systems predict what content you'll engage with by finding patterns in behaviour data. The core approaches include:

Collaborative filtering: The system finds users whose past behaviour resembles yours, then recommends things those similar users engaged with that you haven't seen yet. If you and another user both watched the same ten shows and rated them similarly, the system assumes you might like what they watched next. You're never compared to just one person; the system identifies patterns across millions of users (a toy sketch of this idea follows below).

Content-based filtering: The system categorises items by their attributes (genre, director, keywords, price range) and recommends items similar to ones you've engaged with before. If you've watched three documentaries about nature, it might recommend more documentaries or nature content.

Hybrid approaches: Most real-world systems combine both methods, along with additional signals like trending content, recency, and business rules set by humans (like promoting original content or sponsored products).
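Here is the promised toy sketch of collaborative filtering; the rating matrix is invented and tiny, whereas real systems work over millions of users and items:

    # Toy sketch of collaborative filtering: weight other users' ratings by how
    # similar their history is to yours, then rank items you haven't seen yet.
    import numpy as np

    # Rows = users, columns = shows; 0 means "not watched"
    ratings = np.array([
        [5, 4, 0, 0, 1],   # you
        [5, 5, 4, 0, 1],   # a user with similar taste
        [1, 0, 5, 4, 5],   # a user with different taste
    ], dtype=float)

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    you = ratings[0]
    similarities = np.array([cosine_similarity(you, other) for other in ratings[1:]])

    predicted = similarities @ ratings[1:] / similarities.sum()  # weighted average
    unseen = np.where(you == 0)[0]
    recommended = unseen[np.argsort(-predicted[unseen])]
    print("Recommend show columns:", recommended.tolist())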

What They Optimise For

Recommendation systems optimise for whatever metric humans tell them to optimise for. Common targets include:

  • Clicks or taps

  • Watch time or reading time

  • Purchases

  • Return visits

These metrics are proxies for engagement, but they don't necessarily align with user satisfaction or wellbeing. A system optimised for watch time might recommend content that's compulsive rather than fulfilling. The choice of what to optimise is a human decision with significant consequences.

The Feedback Loop

Recommendation systems create self-reinforcing cycles. The system shows you content based on your past behaviour, you engage with some of it, and that engagement becomes new data that shapes future recommendations. Over time, this can narrow what you're exposed to, or amplify engagement patterns that may not reflect your actual preferences.

Speech Systems

Speech-to-Text Systems that convert spoken audio to written text power voice assistants, transcription services, and video captioning. They're trained on thousands of hours of audio paired with transcripts, adjusting parameters to minimise transcription errors.
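A minimal sketch of transcription using OpenAI's open-source Whisper model; the audio filename is a placeholder and the model size is one of several available options:

    # Sketch: transcribing audio with the open-source Whisper model.
    # "meeting.mp3" is a placeholder path; smaller models are faster but less accurate.
    import whisper  # installed via the openai-whisper package

    model = whisper.load_model("base")
    result = model.transcribe("meeting.mp3")   # returns text plus timestamped segments
    print(result["text"])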

Text-to-Speech The reverse process powers audiobook narration, GPS directions, and voice assistant responses. Modern systems can produce remarkably human-like speech, though they're generating audio based on statistical patterns, not communicating meaning.

The Common Thread

Across all these systems, the pattern is consistent:

  1. Humans define the task — what the system should optimise for

  2. Humans curate the data — what examples the system adjusts its parameters on

  3. Humans design the architecture — what structure the system uses

  4. Humans set the constraints — what the system is and isn't allowed to do

  5. Humans evaluate the outputs — whether the system is performing acceptably

AI systems are tools that encode patterns from data into adjustable parameters. They can be powerful and useful, but they don't have goals, understanding, or judgment. Those come from the humans who build and deploy them.