On large language models
Author's note: I originally wrote this while I was still in my honeymoon phase with language models. My feelings toward them have soured substantially since then. This article doesn't touch the ethical side of things; it's just about my experience trying to get high-quality output. To summarize my opinion on the ethics: it's real fuckin' bad. Don't use LLMs.
As I write this post, a large language model running on my GPU is generating text based on a prompt I wrote. The prompt is 515 words long, and is formatted as the beginning of a story on an erotic literature website, including the title, tags, summary, and the first few paragraphs of the body.
The story described in the prompt is about an elven queen who narrowly survives an assassination attempt while on the road. Now she and her last surviving knight travel through the wilderness in secret, trying to reach the palace before they are attacked again.
Later in the story, the queen will find herself depending on her knight more and more. Here in the wilderness, where her word is not law, a crown doesn't mean very much. She will start to defer to her protector, and the nature of their relationship will gradually shift until she is happily following his every order. It won't be too long until the queen submits to him sexually as well.
The prompt ends with the knight returning to the cave after foraging for food, only to discover his queen completely naked.
The language model generates text quickly. It can get through 500 tokens of output in about 2 minutes on my video card. I watch it as it runs. The queen is helping the knight wash the blood off his hands, and he's finding it quite pleasurable.
Large language models, if you didn't know, are text prediction engines. A chunk of text is fed in as a prompt, and the model predicts which token (roughly, which word) is most likely to come next. The model can be made to produce long-form text by feeding each prediction back in and having it predict the next token, over and over.
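If you want the shape of that loop in code, here's a minimal shell sketch. The `next_token` command is hypothetical; it stands in for one forward pass of the model.

```sh
# A minimal sketch of autoregressive generation. `next_token` is a
# hypothetical command that reads text on stdin and prints the model's
# most likely next token.
text="$(cat prompt.txt)"
for i in $(seq 1 500); do
    tok="$(printf '%s' "$text" | next_token)"  # predict one token
    text="${text}${tok}"                       # append it and go again
done
printf '%s\n' "$text"
```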
Language models have a limited context size. In the case of this model, that limit is 2048 tokens (slightly smaller in my case, because my GPU's VRAM can only handle about 1600 tokens). All the model knows is the parameters it learned from its training data and the tokens in its current context. It cannot retain any information beyond that context limit. Any previous sessions effectively never happened. When LaMDA told Blake Lemoine that it liked to meditate in its free time, it did so not because it actually likes to meditate, but because the context of the session so far matched patterns found in science fiction novels about artificial intelligence. LaMDA cannot meditate. It cannot think. It does nothing at all when it's not predicting text.
I check back on the story. A man named Gurion has shown up in the cave. He is threatening to kill the queen and the knight. The knight makes quick work of the assassin, slicing his bowstring and then stabbing him through the heart.
I want to run bigger models, but the 4-bit-quantized 13-billion-parameter model with a 1600-token context is the best I can manage with 12 GB of VRAM. I can run larger models on my CPU, but that is significantly slower. Slow enough that I only bother running it overnight.
The context size limit is annoying. It's easy to get around by feeding the output back into the model as a new prompt, but the model forgets older details that fall outside the window. You can improve the results by including a standard header on each prompt, describing the overall structure of the story. However, a prompt made of a standard header plus the last thousand or so tokens of output still loses information.
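Sketched as shell, the prompt assembly looks something like this. The file names are invented for illustration, and word count stands in for token count:

```sh
# Rebuild the prompt each iteration: a fixed header plus the tail of
# the story so far. Word count is a rough proxy for token count.
{
    cat header.txt    # standard header: title, tags, summary, structure
    # last ~1000 words of the story (this flattens paragraph breaks;
    # a real script would be gentler about whitespace)
    tr -s '[:space:]' '\n' < story.txt | tail -n 1000 | tr '\n' ' '
    echo
} > prompt.txt
```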
What works better is compressing all the output so far. My iterated-generation shell script checks the total output size after each iteration and, if necessary, compresses it to a target size using Open Text Summarizer (OTS). The script assumes that the beginning and end of the output are the most important parts of the prompt, so it doesn't compress those, just the stuff in the middle.
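Something like this, assuming the `ots` command-line tool and its --ratio flag (the percentage of text to keep); the numbers are placeholders, not the exact values from my script:

```sh
# Compress the middle of the story if the whole thing is over budget.
# Assumes the `ots` command from Open Text Summarizer; word count is
# again a rough stand-in for token count.
budget=1200     # rough word budget for the whole prompt
keep_head=30    # lines at the start, kept verbatim
keep_tail=60    # lines at the end, kept verbatim
if [ "$(wc -w < story.txt)" -gt "$budget" ]; then
    total=$(wc -l < story.txt)
    head -n "$keep_head" story.txt > compressed.txt
    sed -n "$((keep_head + 1)),$((total - keep_tail))p" story.txt > middle.txt
    ots --ratio=30 middle.txt >> compressed.txt   # squash only the middle
    tail -n "$keep_tail" story.txt >> compressed.txt
    mv compressed.txt story.txt
fi
```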
OTS works okay for this, but it's really designed more for news articles than for fiction. A better solution might be to feed the text into an instruction-tuned language model for summarization, but you'd need to do it in chunks to get around that model's own context limit. It would also take much longer; OTS finishes in a fraction of a second.
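The chunked version would look something like this, with `summarize` as a hypothetical wrapper around an instruction-tuned model (and GNU split's `-n` flag):

```sh
# Summarize the middle in chunks small enough to fit the summarizer's
# own context window. `summarize` is a hypothetical wrapper around an
# instruction-tuned model.
split -n l/4 middle.txt chunk.   # four line-balanced chunks (GNU split)
for f in chunk.*; do
    summarize < "$f"             # one model call per chunk
done > middle-summary.txt
rm -f chunk.*
```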
But long output from a language model always degrades, even if you never have to compress the context. Once the generated portion of the context is no longer significantly shorter than the human-supplied portion, the model is effectively trying to match itself. It gets worse and worse the longer it runs, like a VHS recording of a VHS recording of a VHS recording. I can watch in real time as it accidentally repeats itself once, then notices the repetition, and then starts repeating itself constantly to maintain the pattern.
I've tried writing automatic quality-check scripts that reject output when it looks stuck in a loop. That doesn't help. The model still wants to generate looping output. So if the quality checker rejects verbatim loops, it paraphrases instead. It gets stuck in endless conversations. It makes bulleted lists of every single medieval occupation. It stops breaking output into sentences, because the loop checker compares at a per-sentence level.
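The verbatim check itself is simple; here's a per-sentence sketch (GNU sed syntax):

```sh
# Reject an iteration if any sentence in the new output appears more
# than once. Splits on sentence-ending punctuation, then looks for
# duplicate lines; any repeated sentence counts as a loop.
tr -s '[:space:]' ' ' < output.txt \
    | sed 's/\([.!?]\) /\1\n/g' \
    | sort | uniq -d > repeats.txt
if [ -s repeats.txt ]; then
    echo "rejected: repeated sentences" >&2
    exit 1
fi
```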
All in all, this makes totally unsupervised long-form generation impossible. The only solution I have found is to have the script stop every 3-5 iterations and wait for a pass by a human editor, who can revise, change, add to, or completely delete output before passing control back to the model.
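The outer loop ends up looking something like this, with `generate.sh` standing in for the iterated-generation script described above:

```sh
# Supervised generation: a few unattended iterations, then a human pass.
while true; do
    for i in 1 2 3; do
        ./generate.sh story.txt   # one model iteration
    done
    ${EDITOR:-vim} story.txt      # revise, cut, add, or delete by hand
done
```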
This is sex.
When the language model matches patterns in its own output, it is effectively reproducing incestuously (or perhaps asexually). There is no way for it to get new genetic information other than mutation, so all its worst traits get emphasized over time. The loop checker acts as a form of natural selection, preventing the most obviously bad offspring from reproducing. But those that can adapt, by paraphrasing instead of looping exactly, survive.
The only way for the model to get new genetic information is for an outside source to intervene: the human editor. The human feeds new information in, and the resulting iteration is much healthier.
It scares the shit out of me that there is so much unlabeled language model output on the internet. That text is getting fed back into the next generation of models as training data, poisoning the gene pool. And whatever is in the training data winds up baked into the model weights, where it can't easily be fixed or ripped out. It's much more insidious than bad text in the context.
It's my turn to revise the output from my script. I pull it up with vim. So far, the output is okay. A second stranger has shown up in the cave, and he is offering to let the queen and the knight use his magic box to turn invisible so that they can slip by the assassins just up the road unnoticed.
Fuck it. I want something more erotic. I delete the last few paragraphs, and instead write a short passage of my own where the queen starts coming on to the knight. I pass control back to the model. I hope our child is healthy.