What are input and output tokens in AI?

Tokens are the units into which our interactions with AI are broken down, and the basis on which many AI services are priced.

In AI, and particularly in language models such as GPT (Generative Pre-trained Transformer), input tokens and output tokens are the units of text that the model processes and generates, respectively. These tokens are the building blocks that allow the model to interpret and generate language.

Input tokens

Input tokens are the pieces of text that you provide to the model as input. This could be a sentence, a question or any other kind of prompt the model needs to process.

When you enter text, the language model first breaks it down into smaller units called tokens. These tokens can be individual characters, words or sub-words, depending on the model’s tokenization process.


For example, the sentence “Hello, how are you?” might be broken down into several tokens, such as: “Hello”, “,”, “how”, “are”, “you”, “?”.
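As a rough illustration, the split in this example can be reproduced with a simple word-and-punctuation rule. This is only a toy sketch: the function name toy_tokenize is hypothetical, and production tokenizers use learned sub-word vocabularies rather than a fixed rule like this one.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens.

    Toy rule, for illustration only: a run of word characters is one
    token, and every non-space punctuation character is its own token.
    """
    return re.findall(r"\w+|[^\w\s]", text)


tokens = toy_tokenize("Hello, how are you?")
print(tokens)       # ['Hello', ',', 'how', 'are', 'you', '?']
print(len(tokens))  # 6
```

Counting tokens this way gives only a ballpark figure; the count reported by a real model's tokenizer will usually differ.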

The model uses these tokens to understand the meaning and context of the input and generate a response.

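Tokenization is typically done with Byte Pair Encoding (BPE) or a similar algorithm. BPE starts from small units such as characters or bytes and repeatedly merges the most frequent adjacent pair, so common words end up as single tokens while rare words are split into several sub-word pieces.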
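To make the idea behind BPE more concrete, here is a minimal sketch of its training loop on a tiny corpus: start from single characters and repeatedly merge the most frequent adjacent pair. The function names and the sample corpus are assumptions made for illustration, and real implementations add details (byte-level handling, pre-tokenization, tie-breaking rules) that are left out here.

```python
from collections import Counter


def most_frequent_pair(words: dict[tuple[str, ...], int]):
    """Return the adjacent symbol pair with the highest total count, or None."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # On ties, max() keeps the pair that was counted first.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words: dict[tuple[str, ...], int], pair: tuple[str, str]):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged: dict[tuple[str, ...], int] = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_merges(corpus: str, num_merges: int):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    # Each distinct word starts as a tuple of single characters.
    words = {tuple(word): count for word, count in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = learn_merges("low low low lower lowest", num_merges=2)
print(merges)       # [('l', 'o'), ('lo', 'w')]
print(list(words))  # [('low',), ('low', 'e', 'r'), ('low', 'e', 's', 't')]
```

After two merges the frequent word "low" has become a single token, while the rarer "lower" and "lowest" are still split into "low" plus leftover characters.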

Output tokens

Output tokens are the pieces of text that the model generates as a response to the input. After processing the input tokens, the model predicts the next most likely tokens to produce a coherent and contextually relevant output.

The model generates output tokens one at a time, predicting each new token from the input and the tokens generated so far, until it reaches a predefined length limit or emits a special end-of-sequence token that marks the response as complete.

For example, if the input is “What is the capital of France?”, the model might generate the output “The capital of France is Paris.” In this simplified view, each word or punctuation mark in the output counts as one token; a real tokenizer may split or group the text differently.
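The one-token-at-a-time loop can be sketched with a toy stand-in for the model. Everything below is hypothetical: the lookup table NEXT_TOKEN replaces a real model's prediction step, and it looks only at the single previous token, whereas a real model conditions on the whole input and everything generated so far.

```python
END = "<end>"

# Hypothetical next-token table standing in for a real model's prediction.
NEXT_TOKEN = {
    "<start>": "The",
    "The": "capital",
    "capital": "of",
    "of": "France",
    "France": "is",
    "is": "Paris",
    "Paris": ".",
    ".": END,
}


def generate(max_tokens: int) -> list[str]:
    """Emit tokens one at a time until the limit or the end marker is hit."""
    output: list[str] = []
    previous = "<start>"
    while len(output) < max_tokens:
        token = NEXT_TOKEN.get(previous, END)
        if token == END:  # the "response is complete" case
            break
        output.append(token)
        previous = token
    return output


print(generate(max_tokens=20))  # ['The', 'capital', 'of', 'France', 'is', 'Paris', '.']
print(generate(max_tokens=3))   # ['The', 'capital', 'of']
```

The second call shows the other stopping condition: with a limit of three tokens, generation is cut off before the sentence is finished.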

Tokens and model limitations

Language models like GPT have a token limit, often called the context window, which is the maximum number of tokens they can handle in a single input-output interaction. This limit covers the input tokens and the output tokens together. For example, if a model has a token limit of 4,096 tokens, the number of tokens in the input plus the number in the output must not exceed 4,096.
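The arithmetic behind a shared limit is simple enough to sketch. The function name below is hypothetical, and the 4096 figure is just the example limit used in this section.

```python
def max_output_tokens(token_limit: int, input_tokens: int) -> int:
    """Return how many output tokens still fit once the input is counted."""
    if input_tokens > token_limit:
        raise ValueError("input alone exceeds the token limit")
    return token_limit - input_tokens


print(max_output_tokens(token_limit=4096, input_tokens=1000))  # 3096
print(max_output_tokens(token_limit=4096, input_tokens=4096))  # 0
```

A 1,000-token prompt leaves room for at most 3,096 output tokens, and a prompt that fills the whole limit leaves no room for a response at all.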

If the input is too long, the service may reject or truncate it, and even an input that fits can leave too little room for a sufficiently long output.
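One possible way to handle an over-long input is to trim it before it is sent, while reserving part of the limit for the response. The sketch below keeps the most recent tokens, which is only one of several strategies; the function and parameter names are hypothetical and do not describe any particular service.

```python
def truncate_to_fit(tokens: list, token_limit: int, reserved_for_output: int) -> list:
    """Keep the most recent tokens so input plus reserved output fits the limit."""
    budget = token_limit - reserved_for_output
    if budget <= 0:
        raise ValueError("nothing left for the input after reserving output space")
    if len(tokens) <= budget:
        return list(tokens)
    # Drop the oldest tokens and keep the last `budget` of them.
    return tokens[-budget:]


tokens = list(range(10))  # stand-in for 10 input tokens
print(truncate_to_fit(tokens, token_limit=8, reserved_for_output=3))   # [5, 6, 7, 8, 9]
print(truncate_to_fit(tokens, token_limit=20, reserved_for_output=3))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Dropping the oldest tokens suits chat-style inputs where recent context matters most; other applications might instead summarize or drop text from the middle.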

Token limits vary between models. For example, the original GPT-4 was offered with a context window of 8,192 or 32,768 tokens, depending on the version, and that window covers the prompt and the response together.

Why tokens matter

Tokenizing text into manageable pieces allows the model to process and generate language more efficiently. It also helps the model deal with the complexities of human language, such as word variations, sentence structures, and punctuation.

In many AI systems, the number of tokens processed directly determines the cost of using the model: services commonly charge per token for both input and output, often at a different rate for each.
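A token-based bill can be estimated with a short calculation. The prices in this sketch are made-up placeholders, not any provider's actual rates, and the function name is hypothetical; separate input and output rates are included as an assumption about how a service might price its tokens.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate the cost of one request from its token counts."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Hypothetical rates: 2.00 per million input tokens, 8.00 per million output tokens.
cost = estimate_cost(
    input_tokens=1200,
    output_tokens=300,
    input_price_per_million=2.0,
    output_price_per_million=8.0,
)
print(f"{cost:.4f}")  # 0.0048
```

With these placeholder rates, the 300 output tokens cost as much as the 1,200 input tokens, which is why response length can matter as much as prompt length.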

Other modalities for handling inputs and outputs

Text tokens are the primary way large language models such as GPT handle inputs and outputs, but they are not the only way. Other AI systems work with inputs and outputs that go beyond text.

Models such as DALL·E and Stable Diffusion generate images from text prompts, while CLIP takes both images and text as input and relates them to each other. In these cases, the model works with pixels or image embeddings rather than text tokens alone. The input might be an image (for image recognition) or a text prompt (for image generation).

For speech models, the input and output depend on the task. A speech recognition model such as Whisper takes audio signals (converted into spectrograms or other representations) and outputs a transcription, while a text-to-speech model such as Tacotron takes text and outputs a spoken response.

Video AI models process and generate sequences of frames, allowing for tasks like video analysis, generation and transformation.

Some AI models are designed to process structured data such as graphs, tables and databases. These models do not use tokens in the same way that text-based models do. For example, graph neural networks (GNNs) work with nodes and edges, and models that deal with tabular data (such as those produced by AutoML tools) process features in a structured form.

Some advanced AI systems, like GPT-4 and CLIP, are multimodal, meaning they can handle both text and images. These models don’t always use tokens in the traditional sense but instead work with various embeddings (vector representations) of input data, like a combination of textual and visual features.

Is token-based pricing the only model for AI?

No. Token-based pricing is the most common approach for text-based AI models, but it is not the only one. Pricing varies with the type of AI service, the complexity of the model, and the application. Here are some common pricing models for AI:

1. Token-Based Pricing

Common for Text Models: In the case of large language models like GPT, token-based pricing is often used because it directly correlates with the amount of text processed (both input and output). Because the token count roughly tracks the processing effort required, it serves as a reasonable metric for charging users based on resource usage.

2. Time-Based Pricing

Usage in Real-Time Processing: Some AI systems, particularly those with more real-time needs like speech recognition or video processing, may charge based on the time spent processing the input, such as seconds or minutes of audio or video analyzed.

3. Subscription or Tiered Pricing

For SaaS Models: Many AI services, particularly in cloud-based platforms, use subscription models where customers pay a fixed price based on the volume of usage (like API calls) or a set of features included. These may include monthly or yearly subscriptions. Some platforms offer tiered pricing, where higher levels come with more features, increased usage limits, or priority processing.

4. Pay-Per-Request or Pay-Per-Feature

For Specialized AI Services: Certain AI platforms, especially those in fields like image recognition, video processing, or AI-driven analytics, may charge based on specific requests or features used. This might be based on the complexity of the task (e.g., detecting objects in an image vs. simple image tagging).

5. Resource-Based Pricing

For Model Training or Compute-Intensive Tasks: When training large models or using cloud-based AI infrastructure, pricing may be based on the compute resources used (such as CPU/GPU time or memory). In these cases, you’re paying for the underlying infrastructure that the model runs on.
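To compare the non-token pricing models side by side, each can be reduced to a small formula. All function names, parameters and rates below are hypothetical placeholders chosen to keep the arithmetic easy to follow, and the overage charge in the subscription example is an added assumption rather than something every plan includes.

```python
def time_based_cost(media_seconds: float, price_per_minute: float) -> float:
    """Charge by the duration of audio or video processed."""
    return media_seconds / 60 * price_per_minute


def subscription_cost(
    monthly_fee: float,
    included_calls: int,
    calls_made: int,
    overage_price_per_call: float,
) -> float:
    """Flat fee covering a usage allowance, plus an assumed per-call overage."""
    extra_calls = max(0, calls_made - included_calls)
    return monthly_fee + extra_calls * overage_price_per_call


def per_request_cost(requests: int, price_per_request: float) -> float:
    """Charge a fixed amount for each request of a given kind."""
    return requests * price_per_request


def resource_based_cost(gpu_hours: float, price_per_gpu_hour: float) -> float:
    """Charge for the compute time consumed."""
    return gpu_hours * price_per_gpu_hour


print(time_based_cost(media_seconds=120, price_per_minute=0.5))  # 1.0
print(subscription_cost(20.0, included_calls=1000, calls_made=1200,
                        overage_price_per_call=0.25))            # 70.0
print(per_request_cost(requests=500, price_per_request=0.5))     # 250.0
print(resource_based_cost(gpu_hours=3.5, price_per_gpu_hour=2.0))  # 7.0
```

What changes between models is the unit being metered (minutes, calls, requests or compute hours), while the bill remains a simple product of usage and rate.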