• qwertyasdef@programming.dev
    12
    ·
    1 year ago

    Ask it a question about basketball. It looks through all documents it can find about basketball…

    I get that this is a simplified explanation, but I want to add that this part can be misleading. The model doesn’t contain the original documents and doesn’t have internet access to look them up (that can be added as an extra feature, but even then the retrieved text is used more as a source to show humans than as something for the model to learn from on the fly). The actual word associations are all learned during training, and during inference the model just uses the stored weights. One implication of this is that the model doesn’t know about anything that happened after its training data was collected.

      • Taival@suppo.fi
        2
        ·
        1 year ago

        Not quite ELI5 but I’ll try “basic understanding of calculus” level.

        In very broad terms, the model learns complex relationships between words (or tokens to be specific, explained below) as probabilistic scores. At its simplest, this could mean the likelihood of one word appearing next to another in the massive amounts of text the model was trained on: the words “apple” and “pie” are often found together, so they might have a high-ish score of 0.7, while the words “apple” and “chair” might have a lower score of just 0.2. Recent GPT models consist of billions of these scores, known as the weights. Once their values have been established by feeding lots of text through the model’s training process, they are all that’s needed to generate more text.
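
To make the “apple”/“pie” idea concrete, here is a minimal sketch that counts how often one word directly follows another in a tiny made-up corpus and turns the counts into a score between 0 and 1. The corpus and the `next_word_score` helper are invented for illustration. A real GPT model does not store a lookup table like this; its weights are learned numerically and capture far more complicated patterns.

```python
from collections import Counter

# Tiny made-up corpus; a real model is trained on vastly more text.
corpus = [
    "i baked an apple pie",
    "she ate apple pie",
    "apple pie is tasty",
    "the apple fell near a chair",
]

pair_counts = Counter()   # how often (first, second) appear side by side
first_counts = Counter()  # how often `first` is followed by any word

for sentence in corpus:
    words = sentence.split()
    for first, second in zip(words, words[1:]):
        pair_counts[(first, second)] += 1
        first_counts[first] += 1


def next_word_score(first, second):
    """Fraction of the time `second` directly follows `first` in the corpus."""
    if first_counts[first] == 0:
        return 0.0
    return pair_counts[(first, second)] / first_counts[first]


print(next_word_score("apple", "pie"))    # high: the pair is common here
print(next_word_score("apple", "chair"))  # low: the pair never occurs here
```

In this toy corpus “pie” follows “apple” three times out of four, while “chair” never follows it directly, so the first score comes out much higher than the second.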

        Without getting into the math too much, this is how a GPT model then uses these numbers to come up with words:

        • The input prompt is first chopped up into tokens that are each assigned a number. For example, the OpenAI tokenizer translates “Hello world!” into the numbers [15496, 995, 0]. You can think of this as the A=1, B=2, C=3… cipher we all learnt as kids, but the numbers are also assigned to common words, syllables and punctuation.
        • These numbers are inserted into a massive system of equations where they are multiplied together with the billions of weights of the model in a specific manner. This calculation results in a probability score from 0 to 1 for each token known by the model, representing how likely that token is to appear next in sequences that look similar to your input.
        • One of the highest-scoring tokens is then chosen semi-randomly as the model’s output, which gives the generated text some variety.
        • This cycle is then repeated over and over, generating the text one token at a time.
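
The four steps above can be sketched as a toy loop. Everything here is an assumption made for illustration: the six-token vocabulary, the whitespace `tokenize` function, and the `WEIGHTS` table filled with random numbers instead of trained values. It also looks only at the last token, whereas a real GPT model takes the whole input into account, so treat it as a picture of the cycle rather than of the actual math.

```python
import math
import random

# Hypothetical toy vocabulary; real tokenizers know tens of thousands of tokens.
VOCAB = ["<end>", "hello", "world", "apple", "pie", "!"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

rng = random.Random(0)

# Stand-in "weights": one row of raw scores per previous token.
# These are random numbers; a trained model would have learned its values.
WEIGHTS = [[rng.uniform(-1, 1) for _ in VOCAB] for _ in VOCAB]


def tokenize(text):
    """Step 1: chop the text into tokens and map each to its number."""
    return [TOKEN_ID[word] for word in text.lower().split()]


def next_token_probs(tokens):
    """Step 2: turn raw scores into probabilities from 0 to 1 (softmax)."""
    logits = WEIGHTS[tokens[-1]]  # simplification: only the last token is used
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_top_k(probs, k=3):
    """Step 3: choose semi-randomly among the k highest-scoring tokens."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return rng.choices(top, weights=[probs[i] for i in top])[0]


def generate(prompt, max_new_tokens=5):
    """Step 4: repeat the cycle, adding one token at a time."""
    tokens = tokenize(prompt)
    for _ in range(max_new_tokens):
        next_token = sample_top_k(next_token_probs(tokens))
        if next_token == TOKEN_ID["<end>"]:
            break
        tokens.append(next_token)
    return " ".join(VOCAB[t] for t in tokens)


print(generate("hello world"))
```

Because the weights are random, the output is gibberish. The point is only that the same tokenize, score, sample, repeat loop produces sensible text once the weights come from training.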

        In reality we’re not quite sure what exactly the weights represent to the model, but this is the gist of it. All we know is that they signify how much or how little importance the model places on some pattern that was present in the training data. Some of these patterns could be just simple two-word pairs, but many are probably much more complicated. Lots of researchers are currently trying to get a better idea of how these numbers actually affect the model’s output.