
Tokenization


Notebook author: Chung-En Johnny Yu

Update date: 2025/09/29

Practice hands-on with this notebook in Google Colab:

  1. Click here to open this notebook
  2. Now, run the code and practice it!

Introduction

LLMs work with embeddings in high-dimensional spaces (i.e., thousands of dimensions).

Since we can’t visualize such high-dimensional spaces (we humans think in 1, 2, or 3 dimensions), the figure below illustrates a 2-dimensional embedding space.

Tokenizing text

# Packages that are being used in this notebook

from importlib.metadata import version

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))
torch version: 2.8.0+cu126
tiktoken version: 0.11.0
# Fetch the-verdict.txt

import os
import urllib.request

if not os.path.exists("the-verdict.txt"):
    url = ("https://raw.githubusercontent.com/rasbt/"
           "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
           "the-verdict.txt")
    file_path = "the-verdict.txt"
    urllib.request.urlretrieve(url, file_path)
# Open txt file and have a glimpse

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of character:", len(raw_text))
print(raw_text[:99])
Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 

The goal is to tokenize and embed this text for an LLM.

Let’s develop a simple tokenizer based on some sample text that we can later apply to the text above.

The following regular expression will split on whitespaces.

# Split on whitespaces

import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)

print(result)
['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']

We don’t only want to split on whitespaces but also commas and periods, so let’s modify the regular expression to do that as well.

# Split on whitespaces, commas, periods

result = re.split(r'([,.]|\s)', text)

print(result)
['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']

As we can see, this creates empty strings; let’s remove them.

# Strip whitespace from each item and then filter out any empty strings

result = [item for item in result if item.strip()]
print(result)
['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']

This looks pretty good, but let’s also handle other types of punctuation, such as question marks, quotation marks, and the double-dash.

# Handle more punctuation

text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)
['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']

This is pretty good, and we are now ready to apply this tokenization to the raw text.

# Tokenize the raw text and inspect the first 30 tokens

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']

Let’s calculate the total number of tokens.

print(len(preprocessed))
4690

Converting tokens into token IDs

Next, we convert the text tokens into token IDs that we can process via embedding layers later.

From these tokens, we can now build a vocabulary that consists of all the unique tokens.

all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

vocab = {token:integer for integer,token in enumerate(all_words)}
1130
# Print the first 21 entries in this vocabulary

for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 20:
        break
('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)

Below, we illustrate the tokenization of a short sample text using a small vocabulary.

Let’s now put it all together into a tokenizer class.

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        """Turn text into token IDs"""
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        """Turn token IDs back into text"""
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text) # replace spaces before the specified punctuations
        return text

We can use the tokenizer to encode (that is, tokenize) texts into integers.

These integers can later be embedded and used as input for the LLM.

tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)
[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]

We can decode the integers back into text:

tokenizer.decode(ids)
'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'
tokenizer.decode(tokenizer.encode(text))
'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

Adding special tokens

It’s useful to add some “special” tokens for unknown words and to denote the end of a text.

  • Some tokenizers use special tokens to help the LLM with additional context.
  • Some of these special tokens are:
    • [BOS] (beginning of sequence) marks the beginning of text
    • [EOS] (end of sequence) marks where the text ends (this is usually used to concatenate multiple unrelated texts, e.g., two different Wikipedia articles or two different books, and so on)
    • [PAD] (padding) if we train LLMs with a batch size greater than 1 (we may include multiple texts with different lengths; with the padding token we pad the shorter texts to the longest length so that all texts have an equal length)
    • [UNK] to represent words that are not included in the vocabulary

Note that GPT-2 does not need any of the tokens mentioned above; to reduce complexity, it only uses an <|endoftext|> token.

  • GPT also uses <|endoftext|> for padding (since we typically use a mask when training on batched inputs, we would not attend to padded tokens anyway, so it does not matter what these tokens are).
  • GPT-2 does not use an <UNK> token for out-of-vocabulary words; instead, it uses a byte-pair encoding (BPE) tokenizer that breaks words down into subword units, which we will discuss in a later section (see the quick check after this list).
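
As a quick check (peeking ahead to the tiktoken GPT-2 tokenizer introduced in the BPE section below), we can confirm that <|endoftext|> is assigned the largest token ID and that an unseen word is split into subwords rather than mapped to an <UNK> token; this is only an illustrative sketch.

# Quick check with tiktoken's GPT-2 tokenizer (introduced in the BPE section below)
import tiktoken

gpt2_tokenizer = tiktoken.get_encoding("gpt2")

# <|endoftext|> is the last entry in the 50,257-token vocabulary
print(gpt2_tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # [50256]

# There is no <UNK> token: unseen words are broken into known subword pieces
print(gpt2_tokenizer.encode("someunknownPlace"))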

The following code produces an error because the word “Hello” is not contained in the vocabulary.

tokenizer = SimpleTokenizerV1(vocab)

text = "Hello, do you like tea. Is this-- a test?"

tokenizer.encode(text)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipython-input-2162118319.py in <cell line: 0>()
      3 text = "Hello, do you like tea. Is this-- a test?"
      4 
----> 5 tokenizer.encode(text)

/tmp/ipython-input-1925100589.py in encode(self, text)
     11             item.strip() for item in preprocessed if item.strip()
     12         ]
---> 13         ids = [self.str_to_int[s] for s in preprocessed]
     14         return ids
     15 

KeyError: 'Hello'

To deal with such cases, we can add special tokens like "<|unk|>" to the vocabulary to represent unknown words.

Since we are already extending the vocabulary, let’s add another token called "<|endoftext|>", which GPT-2 uses during training to denote the end of a text (it is also placed between concatenated texts, e.g., if our training dataset consists of multiple articles, books, and so on).

# Expand vocabulary

all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}
len(vocab.items())
1132
# Check out the last 5 items in the vocabulary

for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)
('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)

We also need to adjust the tokenizer accordingly so that it knows when and how to use the new <|unk|> token.

class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

Let’s try to tokenize text with the modified tokenizer:

tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)
Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
tokenizer.encode(text)
[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]
tokenizer.decode(tokenizer.encode(text))
Loading...

BytePair encoding (BPE)

BytePair encoding (BPE) allows the model to break down words that aren’t in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words.

  • Basic idea: Train the tokenizer on raw text to automatically determine the vocabulary.
  • Intuition: Common sequences of characters are represented by a single token, while rare sequences are represented by many tokens (a toy sketch of the merge procedure follows this list).
  • For instance, if GPT-2’s vocabulary doesn’t have the word “unfamiliarword,” it might tokenize it as [“unfam”, “iliar”, “word”] or some other subword breakdown, depending on its trained BPE merges.
  • GPT-2 used it as its tokenizer.
  • Link to the original BPE tokenizer
  • In this notebook, we are using the BPE tokenizer from OpenAI’s open-source tiktoken library, which implements its core algorithms in Rust to improve computational performance.
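
To build intuition before using tiktoken, below is a minimal, hypothetical sketch of the BPE training loop on a tiny word-frequency corpus (the corpus and helper names are made up for illustration; real tokenizers such as tiktoken work on raw bytes and are heavily optimized).

# Toy sketch of BPE training: repeatedly merge the most frequent adjacent symbol pair
from collections import Counter

def get_pair_counts(corpus):
    # corpus maps a tuple of symbols (a "word") to its frequency
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # replace every occurrence of `pair` with a single merged symbol
    merged_corpus = {}
    for symbols, freq in corpus.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        merged_corpus[tuple(new_symbols)] = freq
    return merged_corpus

# Start from individual characters; "</w>" marks the end of a word
corpus = {
    tuple("low") + ("</w>",): 5,
    tuple("lower") + ("</w>",): 2,
    tuple("newest") + ("</w>",): 6,
    tuple("widest") + ("</w>",): 3,
}

for step in range(5):
    best_pair = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best_pair)
    print(f"merge {step + 1}: {best_pair}")

Each learned merge becomes a vocabulary entry; applying the merges in the same order is how a BPE tokenizer segments new text.
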
# pip install tiktoken
# Check the package version

import importlib.metadata
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))
tiktoken version: 0.11.0
tokenizer = tiktoken.get_encoding("gpt2")
# Encode text containing the <|endoftext|> special token (token ID 50256) and an unknown word

text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    "of someunknownPlace."  # note: no space between the two literals, hence "terracesof" in the output
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)
[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]
# Decode back to see it handle the unknown word correctly

strings = tokenizer.decode(integers)

print(strings)
Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.

BPE tokenizers break down unknown words into subwords and individual characters:
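
We can see the subword pieces directly by decoding each token ID on its own (a quick check reusing the tiktoken tokenizer from above):

# Decode each token ID of an out-of-vocabulary word individually
word = "someunknownPlace"
token_ids = tokenizer.encode(word)
print(token_ids)
print([tokenizer.decode([token_id]) for token_id in token_ids])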

Data sampling with a sliding window

We train LLMs to generate one word at a time, so we want to prepare the training data accordingly, where the next word in a sequence is the target to predict:

# Tokenize the previously fetched text

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))
5145

For each text chunk, we want the inputs and targets.

Since we want the model to predict the next word, the targets are the inputs shifted by one position to the right.

enc_sample = enc_text[50:]
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")
x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]

One by one, the predictions look as follows:

for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)
[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))
 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a

For now, we implement a simple data loader that iterates over the input dataset and returns the inputs and targets shifted by one.

import torch
print("PyTorch version:", torch.__version__)
PyTorch version: 2.8.0+cu126

We use a sliding window approach, shifting the position by +1 each step.

Let’s create a dataset and dataloader that extract chunks from the input text.

# Create a dataset for dataloader

from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
# Create a dataloader

def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

Let’s test the dataloader with a batch size of 1 for an LLM with a context size of 4:

# Get text and dataloader ready

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)
[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
second_batch = next(data_iter)
print(second_batch)
[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]

An example using a stride equal to the context length (here: 4) is shown below.

We can also create batched outputs.

  • Note that we increase the stride here so that we don’t have overlaps between the batches, since more overlap could lead to increased overfitting.
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])

Creating token embeddings

The data is already almost ready for an LLM.

Lastly, let’s embed the tokens in a continuous vector representation using an embedding layer.

Usually, these embedding layers are part of the LLM itself and are updated (trained) during model training.

Suppose we have the following four input examples with input ids 2, 3, 5, and 1 (after tokenization):

input_ids = torch.tensor([2, 3, 5, 1])

For the sake of simplicity, suppose we have a small vocabulary of only 6 words and we want to create embeddings of size 3:

vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim) # a trainable lookup table (equivalent to one-hot + matmul)

This would result in a 6x3 weight matrix:

print(embedding_layer.weight)
Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)

For those who are familiar with one-hot encoding, the embedding layer approach above is essentially just a more efficient way of implementing one-hot encoding followed by matrix multiplication in a fully-connected layer.

  • Because the embedding layer is just a more efficient implementation that is equivalent to the one-hot encoding and matrix-multiplication approach, it can be seen as a neural network layer that can be optimized via backpropagation (see the short check after this list).
  • An embedding layer is essentially a look-up operation.
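
Below is a short check of that equivalence, reusing the embedding_layer, input_ids, vocab_size, and output_dim defined above (a sketch for illustration):

# Check: an embedding lookup equals one-hot encoding followed by a bias-free linear layer
onehot = torch.nn.functional.one_hot(input_ids, num_classes=vocab_size).float()

linear = torch.nn.Linear(vocab_size, output_dim, bias=False)
linear.weight = torch.nn.Parameter(embedding_layer.weight.T)  # nn.Linear stores its weight transposed

print(torch.allclose(linear(onehot), embedding_layer(input_ids)))  # expected: True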

Then, to convert a token with id 3 into a 3-dimensional vector, we do the following:

# Print the 4th row in the `embedding_layer` weight matrix

print(embedding_layer(torch.tensor([3])))
tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)

To embed all four input_ids values above, we do:

print(embedding_layer(input_ids))
tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)

Encoding word positions

The embedding layer converts a token ID into the same vector representation regardless of where it appears in the input sequence:

Positional embeddings are combined with the token embedding vectors to form the input embeddings for an LLM:

The BytePair encoder has a vocabulary size of 50,257.
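
We can confirm this from the encoding object itself (reusing the tiktoken tokenizer created earlier):

# The GPT-2 BPE vocabulary size, straight from the tiktoken encoding
print(tokenizer.n_vocab)  # 50257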

Suppose we want to encode the input tokens into a 256-dimensional vector representation:

vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

If we sample data from the dataloader, we embed the tokens in each batch into a 256-dimensional vector.

If we have a batch size of 8 with 4 tokens each, this results in an 8 x 4 x 256 tensor:

max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)
Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)
print(token_embeddings)
torch.Size([8, 4, 256])
tensor([[[ 0.4913,  1.1239,  1.4588,  ..., -0.3995, -1.8735, -0.1445],
         [ 0.4481,  0.2536, -0.2655,  ...,  0.4997, -1.1991, -1.1844],
         [-0.2507, -0.0546,  0.6687,  ...,  0.9618,  2.3737, -0.0528],
         [ 0.9457,  0.8657,  1.6191,  ..., -0.4544, -0.7460,  0.3483]],

        [[ 1.5460,  1.7368, -0.7848,  ..., -0.1004,  0.8584, -0.3421],
         [-1.8622, -0.1914, -0.3812,  ...,  1.1220, -0.3496,  0.6091],
         [ 1.9847, -0.6483, -0.1415,  ..., -0.3841, -0.9355,  1.4478],
         [ 0.9647,  1.2974, -1.6207,  ...,  1.1463,  1.5797,  0.3969]],

        [[-0.7713,  0.6572,  0.1663,  ..., -0.8044,  0.0542,  0.7426],
         [ 0.8046,  0.5047,  1.2922,  ...,  1.4648,  0.4097,  0.3205],
         [ 0.0795, -1.7636,  0.5750,  ...,  2.1823,  1.8231, -0.3635],
         [ 0.4267, -0.0647,  0.5686,  ..., -0.5209,  1.3065,  0.8473]],

        ...,

        [[-1.6156,  0.9610, -2.6437,  ..., -0.9645,  1.0888,  1.6383],
         [-0.3985, -0.9235, -1.3163,  ..., -1.1582, -1.1314,  0.9747],
         [ 0.6089,  0.5329,  0.1980,  ..., -0.6333, -1.1023,  1.6292],
         [ 0.3677, -0.1701, -1.3787,  ...,  0.7048,  0.5028, -0.0573]],

        [[-0.1279,  0.6154,  1.7173,  ...,  0.3789, -0.4752,  1.5258],
         [ 0.4861, -1.7105,  0.4416,  ...,  0.1475, -1.8394,  1.8755],
         [-0.9573,  0.7007,  1.3579,  ...,  1.9378, -1.9052, -1.1816],
         [ 0.2002, -0.7605, -1.5170,  ..., -0.0305, -0.3656, -0.1398]],

        [[-0.9573,  0.7007,  1.3579,  ...,  1.9378, -1.9052, -1.1816],
         [-0.0632, -0.6548, -1.0296,  ..., -0.9538, -0.5026, -0.1128],
         [ 0.6032,  0.8983,  2.0722,  ...,  1.5242,  0.2030, -0.3002],
         [ 1.1274, -0.1082, -0.2195,  ...,  0.5059, -1.8138, -0.0700]]],
       grad_fn=<EmbeddingBackward0>)

GPT-2 uses absolute position embeddings, so we just create another embedding layer:

context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

print(pos_embedding_layer.weight)
Parameter containing:
tensor([[ 1.7375, -0.5620, -0.6303,  ..., -0.2277,  1.5748,  1.0345],
        [ 1.6423, -0.7201,  0.2062,  ...,  0.4118,  0.1498, -0.4628],
        [-0.4651, -0.7757,  0.5806,  ...,  1.4335, -0.4963,  0.8579],
        [-0.6754, -0.4628,  1.4323,  ...,  0.8139, -0.7088,  0.4827]],
       requires_grad=True)
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print("Shape of position embeddings:")
print(pos_embeddings.shape)
print("\nAbsolute position embeddings:")
print(pos_embeddings)
Shape of position embeddings:
torch.Size([4, 256])

Absolute position embeddings:
tensor([[ 1.7375, -0.5620, -0.6303,  ..., -0.2277,  1.5748,  1.0345],
        [ 1.6423, -0.7201,  0.2062,  ...,  0.4118,  0.1498, -0.4628],
        [-0.4651, -0.7757,  0.5806,  ...,  1.4335, -0.4963,  0.8579],
        [-0.6754, -0.4628,  1.4323,  ...,  0.8139, -0.7088,  0.4827]],
       grad_fn=<EmbeddingBackward0>)
  • To create the input embeddings used in an LLM, we simply add the token and the positional embeddings:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)
print(input_embeddings)
torch.Size([8, 4, 256])
tensor([[[ 2.2288,  0.5619,  0.8286,  ..., -0.6272, -0.2987,  0.8900],
         [ 2.0903, -0.4664, -0.0593,  ...,  0.9115, -1.0493, -1.6473],
         [-0.7158, -0.8304,  1.2494,  ...,  2.3952,  1.8773,  0.8051],
         [ 0.2703,  0.4029,  3.0514,  ...,  0.3595, -1.4548,  0.8310]],

        [[ 3.2835,  1.1749, -1.4150,  ..., -0.3281,  2.4332,  0.6924],
         [-0.2199, -0.9114, -0.1750,  ...,  1.5337, -0.1998,  0.1462],
         [ 1.5197, -1.4240,  0.4391,  ...,  1.0494, -1.4318,  2.3057],
         [ 0.2893,  0.8346, -0.1884,  ...,  1.9602,  0.8709,  0.8796]],

        [[ 0.9662,  0.0952, -0.4640,  ..., -1.0320,  1.6290,  1.7771],
         [ 2.4468, -0.2154,  1.4984,  ...,  1.8766,  0.5595, -0.1423],
         [-0.3856, -2.5393,  1.1556,  ...,  3.6157,  1.3267,  0.4944],
         [-0.2487, -0.5275,  2.0009,  ...,  0.2930,  0.5977,  1.3300]],

        ...,

        [[ 0.1219,  0.3991, -3.2740,  ..., -1.1921,  2.6637,  2.6728],
         [ 1.2438, -1.6436, -1.1101,  ..., -0.7464, -0.9816,  0.5118],
         [ 0.1439, -0.2428,  0.7786,  ...,  0.8001, -1.5986,  2.4871],
         [-0.3077, -0.6329,  0.0536,  ...,  1.5188, -0.2060,  0.4254]],

        [[ 1.6095,  0.0535,  1.0871,  ...,  0.1512,  1.0996,  2.5603],
         [ 2.1284, -2.4306,  0.6478,  ...,  0.5593, -1.6896,  1.4126],
         [-1.4224, -0.0750,  1.9386,  ...,  3.3712, -2.4016, -0.3237],
         [-0.4752, -1.2234, -0.0847,  ...,  0.7834, -1.0744,  0.3429]],

        [[ 0.7802,  0.1387,  0.7277,  ...,  1.7101, -0.3304, -0.1471],
         [ 1.5791, -1.3749, -0.8234,  ..., -0.5420, -0.3528, -0.5756],
         [ 0.1382,  0.1226,  2.6528,  ...,  2.9576, -0.2933,  0.5577],
         [ 0.4520, -0.5711,  1.2128,  ...,  1.3198, -2.5226,  0.4127]]],
       grad_fn=<AddBackward0>)

In the initial phase of the input processing workflow, the input text is segmented into separate tokens.

Following this segmentation, these tokens are transformed into token IDs based on a predefined vocabulary.
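
As a compact recap, the sketch below chains these steps together, reusing the tiktoken tokenizer, token_embedding_layer, pos_embedding_layer, and max_length defined above (illustrative only; the sample sentence is arbitrary):

# Recap: text -> token IDs -> token embeddings + positional embeddings -> input embeddings
sample_text = "In the sunlit terraces of the palace."

token_ids = torch.tensor(tokenizer.encode(sample_text))[:max_length].unsqueeze(0)  # keep context_length tokens
token_emb = token_embedding_layer(token_ids)                      # shape: (1, 4, 256)
pos_emb = pos_embedding_layer(torch.arange(token_ids.shape[1]))   # shape: (4, 256)

input_emb = token_emb + pos_emb  # positional embeddings broadcast over the batch dimension
print(input_emb.shape)           # torch.Size([1, 4, 256])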