A solid grasp of the valid shapes and dimensions for matrix multiplication is essential, so make sure you're comfortable with that topic before proceeding.
Most generative language models are built on a decoder-only architecture. In this blog post, we’ll explore a simple text generation model, as illustrated below.
Let’s start with an input example for reference. The sentence Hello world ! can be tokenized into three parts: Hello, world, and !. Two auxiliary tokens are also added: <bos> (beginning of sentence) is prepended and <eos> (end of sentence) is appended. These markers let us construct the correctly shifted input/target pair for next-token prediction: the input begins with <bos>, and the target is the same sequence shifted one position left, ending with <eos>.
After tokenization, the model’s input becomes a tensor like [12, 15496, 2159, 5145], i.e., <bos> followed by the three word tokens. When passed to the model in a batch, an extra dimension is added, resulting in [[12, 15496, 2159, 5145]]. For simplicity, we’ll focus on tensor dimensionalities, representing the input as [1, 4], where 1 is the batch size and 4 is the sequence length.
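To make the shift concrete, here is a minimal sketch in PyTorch. The token IDs (including the BOS and EOS values) are illustrative assumptions, not the output of any particular tokenizer:

```python
import torch

# Hypothetical token IDs for illustration; real IDs depend on the tokenizer's vocabulary.
BOS, EOS = 12, 50256          # assumed special-token IDs
tokens = [15496, 2159, 5145]  # "Hello", "world", "!"

# The input starts with <bos>; the target is the same sequence shifted
# one position left, ending with <eos>. This shift is what lets the model
# learn next-token prediction.
input_ids = torch.tensor([[BOS] + tokens])   # [[12, 15496, 2159, 5145]]
target_ids = torch.tensor([tokens + [EOS]])  # [[15496, 2159, 5145, 50256]]

print(input_ids.shape)   # torch.Size([1, 4]) -> batch size 1, sequence length 4
print(target_ids.shape)  # torch.Size([1, 4])
```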
The positional encoding layer does not alter the tensor dimensions but injects positional information into the input. This is crucial because the input undergoes parallel computations later in the architecture, and positional encoding ensures the model retains information about the order of tokens in the sequence.
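As a sketch of how this works, the classic sinusoidal encoding from "Attention Is All You Need" can be added element-wise to the input without changing its shape. This assumes the [1, 4] token IDs have already been embedded to [1, 4, d_model]; the d_model value and the random stand-in for the embedded input are assumptions for illustration:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal encoding: sin on even feature indices, cos on odd ones."""
    position = torch.arange(seq_len).unsqueeze(1)  # [seq_len, 1]
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even indices
    pe[:, 1::2] = torch.cos(position * div_term)  # odd indices
    return pe

d_model = 512                    # assumed embedding size
x = torch.randn(1, 4, d_model)   # stand-in for the embedded [1, 4] input
x = x + sinusoidal_positional_encoding(4, d_model)  # add positions
print(x.shape)  # torch.Size([1, 4, 512]) -> dimensions unchanged
```

Because the encoding is simply added to the embeddings, the tensor keeps its shape while each position receives a distinct, order-dependent signature.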