AI Magic Part 2—How important is model size?

This is Part 2 of a series of blog posts explaining how AI language models work, what makes them great, how we test their quality, and how you can get the most out of your AI experiences. These posts should help novices and experts alike have a better understanding of the technology that makes AI Dungeon, Voyage, and other AI experiences work.

In this post, we’re going to explore the factors that go into improving the quality of language models. It’s about more than just size!

What goes into making an AI Dungeon model?

Although AI Dungeon’s models start out as generic language models, there are several important things we do to customize the models to work as well as possible for the task at hand.

An effective AI requires the right balance of size, finetuning, selection, and parameters.

The steps we take include choosing a size, finetuning, selection, and setting parameters.

Choosing a Size

The size of a language model influences how well it works. Model size is measured in parameters, which are the elements of the AI language model that have learned “facts” from training data.

At the time of this writing, one of the largest language models available in the world is GPT-3 Davinci, developed by OpenAI. It was trained on a supercomputer built specially for that purpose and uses immense resources. Davinci is estimated to have 175 billion parameters. Google is currently working on a new model that reportedly has 1.6 trillion parameters.

A larger model can theoretically understand a wider variety of topics, writing styles, and writing applications. And in many cases, quality correlates with size. However, we’re starting to learn there may be a point of diminishing returns with size. Larger models need more training data, and as the amount of data grows, its quality can begin to decrease, which can actually make a larger model perform worse than a smaller one. Size is important, but so is the quality of the training data.

Limitations of larger models

Larger models are also more expensive to run, as they require larger, more capable supercomputers to generate outputs. They also take more time to generate a response. To counter these issues, Google is developing a technique called “Switch Transformer” to work with their 1.6 trillion parameter model. “Switch Transformer” will essentially use only a fraction of the full model for certain tasks, saving on costs significantly. It’s sort of like turning off the lights in the rooms of your house you aren’t using to save power.
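To make the "use only a fraction of the model" idea concrete, here is a toy sketch in Python. It is not Google's implementation. The experts, the router scores, and the function names are all made up for illustration. The only point is that a router picks one sub-network per input and the others never run.

```python
# Hypothetical "experts": each stands in for one slice of a much larger model.
EXPERTS = [
    lambda x: x + 1.0,
    lambda x: x * 2.0,
    lambda x: x - 3.0,
]


def sparse_forward(x: float, router_scores: list[float]) -> tuple[float, int]:
    """Run only the single highest-scoring expert and skip the rest."""
    if len(router_scores) != len(EXPERTS):
        raise ValueError("need one router score per expert")
    # Pick the index of the expert the (hypothetical) router prefers.
    chosen = max(range(len(router_scores)), key=lambda i: router_scores[i])
    return EXPERTS[chosen](x), chosen


# The router scores are invented here; a real router would compute them.
output, expert_index = sparse_forward(5.0, router_scores=[0.1, 2.3, -1.0])
print(output, expert_index)
```

In this sketch only one of the three experts does any work for a given input, which is where the cost savings would come from.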

Training the language model for Griffin, our smallest model, took 15 billion trillion (1.5e22) operations. Training the model behind Dragon took over 200 times more operations than Griffin did.
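As a quick sanity check on those numbers, the arithmetic can be written out in a few lines of Python. The figures come straight from the paragraph above. Because Dragon took "over" 200 times more operations, the result is only a lower bound.

```python
# Figures from the post: Griffin's training took about 1.5e22 operations,
# and Dragon's took over 200 times more.
GRIFFIN_TRAINING_OPS = 15e9 * 1e12  # "15 billion trillion"
DRAGON_MULTIPLIER = 200             # "over 200 times", so this is a lower bound

dragon_training_ops_lower_bound = GRIFFIN_TRAINING_OPS * DRAGON_MULTIPLIER

print(f"Griffin: {GRIFFIN_TRAINING_OPS:.1e} operations")
print(f"Dragon:  more than {dragon_training_ops_lower_bound:.1e} operations")
```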

The earliest version of AI Dungeon used a model small enough to be run on a powerful desktop PC, but those days are long gone. When a single player takes one action using Griffin, that action consumes significantly more computing power than a home PC could provide. With AI Dungeon, we have thousands playing at any given time. We have to rent time on supercomputers to handle all this traffic.

Unlike traditional games, AI-based tools and experiences will likely limit your actions in some way to mitigate cost. They may also cost more than traditional games. Most traditional games you play on your phone run on the computer inside your phone. Your phone does all the computing work, and you pay only for the electricity to power it, plus the cost of buying a new phone every so often so you can keep running those games. The computing costs are so small you don’t even think about them. With AI, the costs are far more visible to companies and users alike.

Finetuning

One way of making models work better is a process called finetuning. The neural net is trained with additional text data that more closely resembles the desired output. The training is set up so that the AI pays more attention to this supplemental data. Instead of simply predicting what letters come next based on all possible scenarios, finetuning guides the model to concentrate its predictive ability on the scenarios we care about. For AI Dungeon, this would be an interactive roleplaying game.

We select examples of the kind of language we want in an RPG: rich in action, full of descriptive language, second person. The latest version of Griffin was finetuned on the equivalent of hundreds of novels’ worth of text.

We have to carefully select the text used as finetuning data. It affects not only the style, but also what kinds of events tend to happen, and even the names of characters who show up. Our collection of finetuning stories is frequently being improved to give a better experience. Some products even allow you to select different finetunes, to have some influence over the output of the language models.

Selection

If we use the largest language models, it's just too expensive to generate more than one result. But with the smaller models, we can generate multiple possible continuations of an adventure and then select the best one from among them. To do that, we've trained another neural network to decide which generated text is the least repetitive, the most entertaining, or the most fun continuation. When you turn on the "Train the AI" setting (thanks for doing that, by the way!) your feedback helps to improve this selection. This can have major positive effects. Try the Hydra models to see how much this improves results. It allows faster, smaller models to deliver the complexity and nuance often associated with larger models.
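Here is a minimal sketch of the "generate several, keep the best" idea in Python. It is an illustration, not our production code. The scoring function is a hypothetical stand-in that simply rewards less repetitive text, whereas the real selector is a trained neural network. The candidate sentences are made up.

```python
from collections import Counter


def repetition_score(text: str) -> float:
    """Hypothetical stand-in for the learned ranking model.

    Returns the fraction of words that are unique, so a higher score
    means less repetitive text.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    return len(Counter(words)) / len(words)


def select_best(candidates: list[str], score=repetition_score) -> str:
    """Return the candidate continuation with the highest score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# Made-up continuations, as if a small model had generated three options.
candidates = [
    "You walk and walk and walk and walk.",
    "You draw your sword and step into the cave.",
    "The cave is dark. The cave is dark.",
]
best = select_best(candidates)
print(best)
```

Swapping in a better scoring function changes which continuation wins without touching the generation step, which is why player feedback on the selector can improve results on its own.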

Parameters

There are many parameters that allow you to change what a model does with the probabilities it outputs. (These are settings, not the parameters that measure a model’s size.) If you always choose the most likely continuation, the model gets predictable and repetitive. Using a setting called temperature, you can encourage the AI to select from a wider variety of potential responses. With a higher temperature, the model’s output will seem more creative, wild, and entertaining. We also use parameters to penalize repetition and guide the model to choose words that are varied but still appropriate to the situation.
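To show what temperature does to the probabilities, here is a small Python sketch using the common approach of dividing the model's raw scores by the temperature before turning them into probabilities. The words and scores are invented for the example, and this is a simplified picture rather than our exact sampling code.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens: list[str], logits: list[float],
                 temperature: float, rng: random.Random) -> str:
    """Pick one token at random, weighted by the adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Invented next-word options and raw scores.
tokens = ["door", "dragon", "teapot"]
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, 0.5)  # low temperature: safer choices
hot = softmax_with_temperature(logits, 2.0)   # high temperature: more variety

print("low temperature: ", [round(p, 2) for p in cold])
print("high temperature:", [round(p, 2) for p in hot])
print("sampled:", sample_token(tokens, logits, 2.0, random.Random(0)))
```

At the low temperature nearly all of the probability sits on the top-scoring word. At the high temperature the unlikely words get a real chance of being picked.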

How do you know you're improving?

We constantly get feedback that our models are improving, or getting worse...even when nothing has changed. That’s because AI model quality can feel completely subjective. Model response quality is difficult to quantify, measure and track.

We're constantly testing our changes to see how they compare to previous versions in the most objective way we can. When people are given a choice between two outputs, each of which comes from a different variation of a model, their choice provides feedback on which model is working better. This helps us decide which changes really make an improvement in the user experience.
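As a rough illustration of how those side-by-side choices turn into a number, here is a Python sketch that tallies which variant people picked. The votes are invented, and the labels "A" and "B" are placeholders for two model variations. A real comparison would also need enough votes to be statistically meaningful.

```python
from collections import Counter


def win_rates(choices: list[str]) -> dict[str, float]:
    """Return the share of comparisons each variant won."""
    counts = Counter(choices)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {variant: n / total for variant, n in counts.items()}


# Invented data: each entry is the variant one person preferred.
votes = ["B", "A", "B", "B", "A", "B", "B", "A", "B", "B"]
rates = win_rates(votes)

for variant in sorted(rates):
    print(f"Variant {variant} preferred in {rates[variant]:.0%} of comparisons")
```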

Coming in part 3...

The next post in our series will dive a bit deeper into AI costs to give players real-world examples of what it takes to run AI experiences.