On large language models
Author's note: I originally wrote this
while I was
still in my honeymoon phase with language models. My feelings
toward them have soured substantially since then. This article
doesn't touch the ethical side of things; it's just about my
experience trying to get high quality output. To summarize my
opinion on the ethics: it's real fuckin' bad. Don't use
LLMs.

As I write this post, a
large language model
running on my GPU is generating text based on a prompt I
wrote. The prompt is 515 words long, and is formatted as the
beginning of a story on an erotic literature website, including
the title, tags, summary, and the first few paragraphs of the
body. The story described in the prompt is about an elven queen
who narrowly survives an assassination attempt while on the
road. Now she and her last surviving knight travel through the
wilderness in secret, trying to reach the palace before they
are attacked again.

Later in the story, the queen will find herself depending on
her knight more and more. Here in the wilderness, where her
word is not law, a crown doesn't mean very much. She will start
to defer to her protector, and the nature of their relationship
will gradually shift until she is happily following his every
order. It won't be too long until the queen submits to him
sexually as well.

The prompt ends with the knight returning to the cave after
foraging for food, only to discover his queen completely
naked.

The language model generates text quickly. It can get
through 500 tokens of output in about 2 minutes on my
video card. I watch it as it runs. The queen is helping the
knight wash the blood off his hands, and he's finding it quite
pleasurable.

Large language models, if you didn't know, are text
prediction engines. A chunk of text is fed in as a prompt, and
the model predicts what word is most likely to come next. The
model can be made to output longform text by having it continue
to predict what word comes next, over and over.
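The whole trick looks something like this (a minimal sketch using
the Hugging Face transformers library, which is not my actual setup;
the model, prompt, and token count are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model only; any causal language model works the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The queen and her last knight fled into the hills.",
          return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(200):                   # 200 tokens of output
        logits = model(ids).logits[0, -1]  # scores for the next token only
        next_id = logits.argmax()          # "most likely next word"
        # (real setups usually sample with a temperature instead of
        # always taking the top token)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tok.decode(ids[0]))
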
Language models have a limited context size. In the case of
this model, that context limit is 2048 tokens (slightly smaller
in my case, because my GPU's VRAM can only handle about 1600
tokens). All the model knows is the set of parameters it
generated from its original training data, and the tokens in
its current context. It cannot retain any information beyond
that context limit. Any previous sessions effectively never
happened. When
LaMDA
told Blake Lemoine that it liked to meditate in its free time,
it did so, not because it actually likes to meditate, but
because the context of the session so far matched
patterns found in science fiction novels about artificial
intelligence. LaMDA cannot meditate. It cannot think.
It does nothing at all when it's not predicting text.

I check back on the story. A man named Gurion has shown up
in the cave. He is threatening to kill the queen and the
knight. The knight makes quick work of the assassin, slicing
his bowstring and then stabbing him through the heart.

I want to run bigger models, but the 4-bit-quantized
13-billion-parameter model with a 1600-token context is the best
I can manage with 12 GB of VRAM. I can run larger models on my
CPU, but that is significantly slower. Slow enough that I only
bother running it overnight.

The context size limit is annoying. It's easy to get around
the limit by simply feeding the output back in as a new
prompt, but the model will forget older details that fall beyond
the context limit. You can further improve the results by
including a standard header on each prompt, describing the
overall structure of the story. However, a prompt made of a
standard header plus the last thousand or so tokens generated
still loses information.
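Sketched out, that naive scheme is nothing more than a header plus
a tail (the character budget here is made up):

def naive_prompt(header: str, story: str, budget_chars: int = 4000) -> str:
    # A fixed header describing the story, plus whatever fits of the
    # tail of the text so far. Anything older simply falls away.
    room = budget_chars - len(header)
    tail = story[-room:] if room > 0 else ""
    return header + tail
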
What works better is compressing all the output
so far. My iterated generation shell script checks the total
output size after each iteration, and compresses it to a target
size if necessary using
Open Text Summarizer.
The script assumes that the beginning and end of the output are
the most important parts of the prompt, so it doesn't compress
those, just the stuff in the middle.
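Roughly, the idea is this (a Python sketch, not the shell script
itself; the 4800-character threshold matches the log at the end of
this post, the head and tail sizes are invented, and the ots flag
is from memory, so check ots --help):

import subprocess
import tempfile

TARGET_CHARS = 4800   # "Total size is less than 4800" in the log below
KEEP_HEAD = 800       # beginning of the story, never compressed
KEEP_TAIL = 1600      # most recent output, never compressed

def ots_summarize(text: str, ratio: int) -> str:
    # Compress to roughly `ratio` percent of the sentences using the
    # Open Text Summarizer command-line tool.
    with tempfile.NamedTemporaryFile("w", suffix=".txt") as f:
        f.write(text)
        f.flush()
        result = subprocess.run(["ots", "--ratio", str(ratio), f.name],
                                capture_output=True, text=True, check=True)
    return result.stdout

def build_prompt(header: str, story: str) -> str:
    if len(header) + len(story) <= TARGET_CHARS:
        return header + story
    head = story[:KEEP_HEAD]
    middle = story[KEEP_HEAD:-KEEP_TAIL]
    tail = story[-KEEP_TAIL:]
    budget = TARGET_CHARS - len(header) - len(head) - len(tail)
    ratio = max(5, min(95, 100 * budget // max(len(middle), 1)))
    return header + head + ots_summarize(middle, ratio) + tail
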
OTS works okay for this, but it's really more designed for
news articles than for fiction. A better solution might involve
feeding the text into an instruction-tuned language model for
summarization, but you'd need to do it in chunks to get around
the model's context size limit. It would also take much longer;
OTS works in a fraction of a second.
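If you did go that route, it might look something like this (a
sketch only; the model here is an ordinary fine-tuned summarizer
standing in for an instruction-tuned one, and the chunk size is
arbitrary):

from transformers import pipeline

# Stand-in summarizer; swap in whatever model you prefer.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_in_chunks(text: str, chunk_chars: int = 2000) -> str:
    # Split the story into pieces small enough for the summarizer's own
    # context window, summarize each, and stitch the results together.
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    pieces = summarizer(chunks, max_length=120, min_length=30, do_sample=False)
    return " ".join(p["summary_text"] for p in pieces)
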
But long output from a language model is always going to
have bad results, even if you don't have to compress the
context. When the output portion of the context is not
significantly shorter than the human-supplied portion, the
model is effectively trying to match itself. It just
gets worse and worse the longer it runs, like a VHS recording
of a VHS recording of a VHS recording. I can watch in real time
as it accidentally repeats itself once, and then notices that
it repeated itself, and then starts repeating itself constantly
to maintain the pattern.

I've tried writing automatic quality check scripts that will
reject output if it looks like it's stuck in a loop. That
doesn't help. It still wants to generate looping
output. And so if the quality checker rejects verbatim loops,
it paraphrases. It gets stuck in endless conversations.
It makes bulleted lists of every single medieval occupation.
It stops breaking output into sentences, because the loop
checker compares at a per-sentence level.
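The check itself is simple enough to sketch (the normalization and
the thresholds here are guesses based on the loop check lines in
the log at the end of this post):

import re
from collections import Counter

MAX_REPEATED = 2      # "Maximum acceptable: 2" in the log below

def looks_loopy(text: str) -> bool:
    # Split into sentences, strip everything but letters, and count how
    # many distinct sentences appear more than once.
    sentences = re.split(r"[.!?]+", text)
    normalized = [re.sub(r"[^a-z]", "", s.lower()) for s in sentences]
    counts = Counter(n for n in normalized if len(n) > 10)
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated > MAX_REPEATED
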
All in all, this makes totally unsupervised long-form
generation impossible. The only solution I have found is to
have it stop every 3-5 iterations and wait for a pass by a
human editor. The editor can revise, change, add to, or
completely delete output before passing control back to the
model.

This is sex. When the language model matches patterns in its own output,
it is effectively reproducing incestuously (or perhaps
asexually). There is no way for it to get new genetic
information other than mutation, so all its worst traits
get emphasized over time. The loop checker acts as a form of
natural selection, preventing the most obviously bad offspring
from reproducing. But those that can adapt, by paraphrasing
instead of looping exactly, survive.

The only way for the model to get new genetic information is
for an outside source to intervene -- the human editor. The
human feeds new information in, and the resulting iteration is
much healthier as a result.

It scares the shit out of me that there is so much unlabeled
language model output on the internet. This stuff is getting
fed back into the next generation of models as training data,
poisoning the gene pool. And stuff in the training data winds
up in the model weights, where it can't be easily fixed or
ripped out. It's much more insidious than having it in the
context.

It's my turn to revise the output from my script. I pull
it up with vim. So far, the output is okay. A second stranger
has shown up in the cave, and he is offering to let the queen
and the knight use his magic box to turn invisible so that
they can slip by the assassins just up the road unnoticed.

Fuck it. I want something more erotic. I delete the last few
paragraphs, and instead write a short passage of my own where
the queen starts coming on to the knight. I pass control back
to the model.

I hope our child is healthy.

conda activate textgen
cd ~/llm
./scripts/page9/page9.sh ./prompts/queens-knight 3
===
Iteration 1
Beginning generation of queens-knight.
To view log:
tail -f './output/queens-knight/llm.log'
To view output in real time:
tail -f './output/queens-knight/queens-knight.txt.response'
2023-04-01:T21:08:10: Started generating
Iteration PID: 386
2023-04-01:T21:10:03: Output complete.
Response length: 1758, min acceptable: 600
Truncating last 200 from output just in case there's an [end of text] tendency.
Loop check report: Found 0 lines repeated. Maximum acceptable: 2
Acceptable
Considering summarizing...
Input size is 4421
Total size is less than 4800, no need to summarize
===
Iteration 2
Beginning generation of queens-knight.
To view log:
tail -f './output/queens-knight/llm.log'
To view output in real time:
tail -f './output/queens-knight/queens-knight.txt.response'
2023-04-01:T21:10:03: Started generating
Iteration PID: 527
2023-04-01:T21:12:02: Output complete.
Response length: 1858, min acceptable: 600
Truncating last 200 from output just in case there's an [end of text] tendency.
Loop check report:
2 saidthequen
Found 1 lines repeated. Maximum acceptable: 2
Acceptable
Considering summarizing...
Input size is 6079
Beginning summarization...
Target body compression: 75%
===
Iteration 3
Beginning generation of queens-knight.
To view log:
tail -f './output/queens-knight/llm.log'
To view output in real time:
tail -f './output/queens-knight/queens-knight.txt.response'
2023-04-01:T21:12:02: Started generating
Iteration PID: 699
2023-04-01:T21:14:09: Output complete.
Response length: 1766, min acceptable: 600
Truncating last 200 from output just in case there's an [end of text] tendency.
Loop check report:
4 saidthequen
Found 1 lines repeated. Maximum acceptable: 2
Acceptable
To view and edit the output:
vim './output/queens-knight/queens-knight.txt'
To continue generating:
'./scripts/page9/cont9.sh' './prompts-queens-knight/' '3'