Inference vs. training
In AI computing, there are two primary tasks: training and inference.
Training is the stage in which vast amounts of data are fed into an AI model as it is built. This could include a corpus of text from books and websites, huge image datasets like LAION-5B, or millions of hours of YouTube videos. The process is extremely computationally intensive. During the first phase of the current generative AI boom, the industry followed a simple scaling rule: more data fed into more GPUs = a more capable model.
That approach worked for several generations of models, but eventually the leaps in performance began to shrink. Other techniques, such as reasoning models, provided the next wave of breakthroughs.
Inference is the process of running a trained AI model—ingesting a prompt and generating a response. It is the most common form of AI computing: a model is trained occasionally but queried constantly, since companies are not always training a big new model. A single inference request is also far less computationally intensive than a training run.
While the early rush to acquire Nvidia GPUs was focused on amassing large numbers of chips to train successive generations of models, tech companies like Google and Amazon are now building custom inference chips that are far cheaper and more efficient than the most powerful GPUs used for training.
"In AI inference, the trained model makes predictions on real-world input data. AI inference works by using what it has 'learned'—that is, the model parameter updates that were made in order to improve its performance on the training data—to infer the correct output for the new input data. Unlike in model training, inference entails only a forward pass." — IBM
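The training/inference split described above can be sketched with a toy model: training runs a forward pass, computes gradients in a backward pass, and updates the parameters; inference reuses those learned parameters in a forward pass only. This is an illustrative linear-regression sketch with synthetic data, not how production AI models are actually trained, but the shape of the two loops is the same.

```python
import numpy as np

# Toy linear model: y = w * x + b. Training adjusts w and b; inference only
# runs the forward pass with whatever values training produced.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y_true = 3.0 * x + 0.5          # synthetic "training data"

w, b = 0.0, 0.0                 # parameters to be learned
lr = 0.1                        # learning rate

# --- Training: forward pass + backward pass + parameter update, repeated ---
for _ in range(200):
    y_pred = w * x + b                             # forward pass
    grad_w = 2 * np.mean((y_pred - y_true) * x)    # backward pass: gradients
    grad_b = 2 * np.mean(y_pred - y_true)          # of the mean squared error
    w -= lr * grad_w                               # parameter update
    b -= lr * grad_b

# --- Inference: forward pass only, on new input ---
new_input = 0.25
prediction = w * new_input + b
print(f"w={w:.2f}, b={b:.2f}, prediction={prediction:.2f}")
```

Note that the inference step touches no gradients and updates nothing, which is why it is so much cheaper per request and why purpose-built inference chips can trade away training-oriented hardware features.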