A Tutorial on LLM. Generative artificial intelligence… | by Haifeng Li | Sep, 2023 | Medium

Generative artificial intelligence (GenAI), especially ChatGPT, has captured everyone’s attention. Transformer-based large language models (LLMs), trained on vast quantities of unlabeled data at scale, demonstrate the ability to generalize to many different tasks. To understand why LLMs are so powerful, this post takes a deep dive into how they work.

Formally, a decoder-only language model is simply a conditional distribution p(x_i | x_1, …, x_{i−1}) over the next token x_i given the context x_1, …, x_{i−1}. This formulation is an example of a Markov process, which has been studied in many use cases. This simple setup also allows us to generate text token by token in an autoregressive way: sample a token from the conditional distribution, append it to the context, and repeat.
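The autoregressive loop above can be sketched with a toy model. The bigram table below is a made-up, hypothetical stand-in for the conditional distribution (a real LLM conditions on the whole context, not just the last token):

```python
import random

# Toy stand-in for p(next_token | context): a hard-coded bigram
# table (an illustrative assumption, not from the article).
BIGRAMS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def sample_next(context):
    # Here we condition only on the last token (a first-order
    # Markov simplification); an LLM uses the full context.
    dist = BIGRAMS[context[-1]]
    tokens, probs = zip(*dist.items())
    return random.choices(tokens, weights=probs)[0]

def generate(max_len=10):
    # Autoregressive decoding: sample, append, repeat.
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        tokens.append(sample_next(tokens))
    return tokens

print(generate())
```

Swapping the bigram table for a Transformer's predicted distribution gives exactly the decoding loop LLMs use at inference time.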

Before the deep dive, I have to call out a limitation of this formulation for reaching artificial general intelligence (AGI). Thinking is a non-linear process, but our communication device, the mouth, can only speak linearly, so language appears as a linear sequence of words. It is a reasonable start to model language with a Markov process, but I doubt that this formulation can capture the thinking process (or AGI) completely. On the other hand, thinking and language are interrelated, and a strong enough language model may still demonstrate some sort of thinking capability, as GPT-4 shows. In what follows, let’s check out the scientific innovations that make LLMs appear intelligent.

There are many ways to model/represent the conditional distribution p(x_i | x_1, …, x_{i−1}). In LLMs, we attempt to estimate this conditional distribution with a neural network architecture called the Transformer. In fact, neural networks, especially a variety of recurrent neural networks (RNNs), had been employed in language modeling for a long time before the Transformer. RNNs process tokens sequentially, maintaining a state vector that contains a representation of the data seen prior to the current token. To process the n-th token, the model combines the state representing the sentence up to token n−1 with the information of the new token to create a new state, representing the sentence up to token n. Theoretically, the information from one token can propagate arbitrarily far down the sequence, if at every point the state continues to encode contextual information about the token. Unfortunately, the vanishing gradient problem leaves the model’s state at the end of a long sentence without precise, extractable information about preceding tokens. The dependency of each token’s computation on the results of the previous token’s computation also makes it hard to parallelize computation on modern GPU hardware.
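The sequential state update can be made concrete with a minimal sketch of a vanilla RNN cell (the weights and sizes below are illustrative assumptions, not the article's model). Note how step n reads the state produced at step n−1, which is exactly why the time loop cannot be parallelized:

```python
import numpy as np

# Minimal vanilla RNN cell sketch. Sizes and random weights are
# illustrative assumptions for this example.
rng = np.random.default_rng(0)
d_state, d_embed = 8, 4
W_h = rng.normal(0, 0.1, (d_state, d_state))  # state -> state
W_x = rng.normal(0, 0.1, (d_state, d_embed))  # input -> state
b = np.zeros(d_state)

def rnn_step(state, x):
    # The new state summarizes everything seen so far plus token x.
    return np.tanh(W_h @ state + W_x @ x + b)

def encode(tokens):
    state = np.zeros(d_state)
    for x in tokens:  # strictly sequential: step n needs step n-1
        state = rnn_step(state, x)
    return state

# Five random token embeddings standing in for a sentence.
tokens = [rng.normal(size=d_embed) for _ in range(5)]
final_state = encode(tokens)
print(final_state.shape)  # (8,)
```

The repeated multiplication by W_h inside tanh is also where the vanishing gradient problem arises: gradients flowing back through many steps shrink geometrically, so the final state retains little precise information about early tokens.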
