Let's build our own GPT model from scratch with PyTorch.

Today we step away from our Vision Transformer series to discuss building a basic variant of the Generative Pre-trained Transformer (GPT).

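More precisely, we will build an autoregressive (bigram) model, meaning that every time we generate a token we take all of the previous tokens into account. Autoregressive models generate tokens (characters or words) sequentially, conditioning on what came before. For example, given the sentence "I like to eat", the model has to pick the next token from several plausible candidates.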

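Statistical/classical autoregressive models specify that the output variable depends linearly on its own previous values and on a stochastic term (a term that is not perfectly predictable).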
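As a rough illustration, this is the standard textbook form of an autoregressive model of order p (the symbols below are the usual conventions, not notation taken from this article):

```latex
X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t
```

Here the \varphi_i are the coefficients on the previous values, c is a constant, and \varepsilon_t is the stochastic term.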

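This not-perfectly-predictable, stochastic term can be related, very loosely, to the next predicted token in our "I like to eat" example: instead of always taking the single most likely token, we let the model sample the next token at random from its predicted distribution, which keeps it from being overly certain in its choice. We will come back to this later in the article.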

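Since we are implementing a very basic autoregressive model, we will do everything from scratch, using a dataset to generate text in the style of William Shakespeare. This will be my longest article so far, so take a deep breath and feel free to take breaks. Let's get started!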

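Contents

Loading Data - creating the data batch loader and the data split.
Bigram Language Model - coding the language model.
Training - training the model and generating text.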

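Note: the code in this article follows Andrej Karpathy's video on GPT. In fact, his video was my reference the first time I implemented the attention mechanism, after which I went on to study various other architectures and papers on convolutional attention, shifted windows, and so on.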

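If you have already watched it, you won't find much difference here apart from some code changes, so you can treat this as a quick revision. If you haven't... let's dive right in.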

Loading Data

# Importing torch specific modules
import torch
import torch.nn as nn
from torch.nn import functional as F

# We start by downloading our shakespeare txt file (stored with the name input.txt)
! wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

# reading txt file (encode decode)
text = open('input.txt', 'r',).read()
vocab = sorted(list(set(text)))
encode = lambda s: [vocab.index(c) for c in s]
decode = lambda l: [vocab[c] for c in l]

Instead of using an external tokenizer, we implement our own lambda functions to perform character-level tokenization on the data.

ids = encode("I like to eat")
txt = decode(ids)
print(f"ids: {ids}")
print(f"txt: {txt}")
print(f"".join(txt))

# Output:-
ids: [21, 1, 50, 47, 49, 43, 1, 58, 53, 1, 43, 39, 58]
txt: ['I', ' ', 'l', 'i', 'k', 'e', ' ', 't', 'o', ' ', 'e', 'a', 't']
I like to eat
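vocab.index(c) scans the vocabulary list for every character, which gets slow on the full corpus. Here is a minimal sketch of the same tokenizer using dictionary lookups instead. The names stoi, itos, encode_fast, and decode_fast are hypothetical, and a short string stands in for input.txt:

```python
# Toy corpus standing in for input.txt (assumption for illustration only)
sample_text = "I like to eat"
sample_vocab = sorted(set(sample_text))

# Hypothetical helper tables: character -> id and id -> character
stoi = {ch: i for i, ch in enumerate(sample_vocab)}
itos = {i: ch for ch, i in stoi.items()}

def encode_fast(s):
    """Map each character to its id with a constant-time dictionary lookup."""
    return [stoi[c] for c in s]

def decode_fast(ids):
    """Map ids back to characters and join them into a string."""
    return "".join(itos[i] for i in ids)

sample_ids = encode_fast("I like")
print(sample_ids)
print(decode_fast(sample_ids))
```

Because the toy vocabulary has only 9 characters, the ids differ from those printed above for the full 65-character vocabulary; the round trip behaves the same way.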

Splitting the data 90/10

x = int(0.9*len(text)) # text is a big string with our entire data
text = torch.tensor(encode(text), dtype=torch.long)
train, val = text[:x], text[x:]

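Remember that since we tokenize at the character level, we will also generate at the character level. The sensible approach here is to build batches of random chunks from the corpus and feed them to the model for training.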

batch_size = 32 # batch_size - is how many independent sequences will we process in parallel?
block_size = 8 # block_size = is the maximum context length for predictions

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train if split == 'train' else val  
    ix = torch.randint(len(data) - block_size, (batch_size,)) 
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

xb, yb = get_batch('train')

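The batches are created as follows...

We want random chunks of block size 8 from the corpus, so we draw 32 (the batch size) random start indices (ix). For each index we take the next 8 token ids and stack the results into a batch. The targets (y) are built from the same indices shifted by one, (i+1, i + block_size + 1), because for every position we need to predict the next token in the sequence.

Example: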

ix  = [33]
for i in ix:
    print(train[i:i+18])
    print(train[i+3:i+18+3]) # I've chosen +3 over +1 only for the sake of example
for i in ix:
    print("".join(decode(train[i:i+18])).replace("\n", ""))
    print("".join(decode(train[i+3:i+18+3])).replace("\n", ""))

# Output:-
tensor([39, 52, 63,  1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1])
tensor([ 1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1])
any further, hear 
 further, hear me
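To make the one-step shift concrete, here is a small sketch in plain Python. It uses made-up token ids and a fixed start index rather than the real corpus, and lists every (context, target) pair that a single row of a batch contains:

```python
block_size = 8
data = list(range(100, 120))  # stand-in token ids; any sequence works
i = 3                         # hypothetical start index, like one drawn by torch.randint

x = data[i : i + block_size]          # inputs
y = data[i + 1 : i + block_size + 1]  # targets: the same window shifted by one

# Each position t yields one training example: context x[:t+1] -> target y[t]
pairs = [(x[: t + 1], y[t]) for t in range(block_size)]
for context, target in pairs:
    print(f"context {context} -> target {target}")
```

So a single chunk of 8 tokens gives the model 8 next-token predictions to learn from, with context lengths from 1 up to block_size.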

Bigram Language Model

A bigram is an n-gram with n=2: a sequence of two consecutive units (such as words or characters) in a text.

1-gram (unigram) for "i like to eat": ["i", "like", "to", "eat"]
2-gram (bigram) for "i like to eat": ["i like", "like to", "to eat"]
3-gram (trigram) for "i like to eat": ["i like to", "like to eat"]
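These lists can be reproduced with a few lines of Python. The helper name ngrams is hypothetical and only meant as a sketch:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

words = "i like to eat".split()
print(ngrams(words, 1))
print(ngrams(words, 2))
print(ngrams(words, 3))
```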

Since we are performing an autoregressive task, we need to load our data in bigram form, which is what the batching code above does.

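Now let's talk about the heart of the article: multi-head attention. Since I have already implemented this part in more than ten articles of my Vision Transformer series, I will try not to waste your time and go straight to the concepts. (God... I could have stretched this blog into two parts, but never mind, let's carry on anyway.)

GPT borrows from the Transformer architecture proposed in the "Attention Is All You Need" paper. It differs in that it stacks only the decoder part, built on multi-head attention.

Bigram language model.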

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, embed_size)
        self.possitional_embedding = nn.Embedding(block_size, embed_size)
        self.linear = nn.Linear(embed_size, vocab_size)
        self.block = nn.Sequential(*[Block(embed_size, num_head) for _ in range(num_layers)])
        self.layer_norm = nn.LayerNorm(embed_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)
        ps = self.possitional_embedding(torch.arange(T, device=device))
        x = logits + ps    #(B, T, C)
        logits = self.block(x)     #(B, T, c)
        logits = self.linear(self.layer_norm(logits)) # maps from embed_size to vocab_size

        B, T, C = logits.shape
        logits = logits.view(B*T, C)
        targets = targets.view(B*T)
        loss = F.cross_entropy(logits, targets)

        return logits, loss
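The class above refers to a few module-level names (embed_size, num_head, num_layers, dropout) that must exist before it is instantiated. A minimal sketch of one possible configuration: embed_size = 64 and num_head = 4 are the values quoted later in this article, while num_layers and dropout are assumptions chosen only for illustration.

```python
# Hyperparameters referenced by the model code.
block_size = 8    # maximum context length (from the data-loading section)
embed_size = 64   # embedding dimension C (value quoted later in the text)
num_head = 4      # attention heads per block (value quoted later in the text)
num_layers = 4    # ASSUMPTION: number of stacked decoder blocks
dropout = 0.2     # ASSUMPTION: dropout probability

# Each head works on an equal slice of the embedding, so this must divide evenly.
assert embed_size % num_head == 0
head_size = embed_size // num_head
print(head_size)
```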

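Here the input idx is the batch we generated earlier, of shape (B, T), where T is the block size, i.e. the number of tokens.

The forward pass first produces an embedding for every token, of shape (B, T, C). As we saw in the figure above, we need to add positional embeddings to the token embeddings. We embed the input (idx) so that the information carried by each token is represented in a fixed embedding dimension, but that says nothing about where the token sits in the sequence. This is why we add positional embeddings on top: they give the model the context of each token's position. If anything is unclear, please refer to my previous blogs, where I explain all of this in more detail.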

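In PyTorch, nn.Embedding is a layer that maps discrete categorical values (such as word indices) to continuous dense vectors. The layer takes integer indices as input, where each index stands for a unique categorical item (a word, a token, or some other categorical data). Internally, nn.Embedding maintains an embedding matrix of shape (num_embeddings, embedding_dim) so that every token gets its own dense representation. Since we are working with a simplified version of GPT, we use nn.Embedding directly for the positional embeddings as well, rather than one of the other standard approaches.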
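A short sketch of the shapes involved, using the sizes from this article (65 characters, block size 8, embedding size 64). The variable names tok_emb and pos_emb are hypothetical:

```python
import torch
import torch.nn as nn

vocab_size, block_size, embed_size = 65, 8, 64  # sizes used in this article

tok_emb = nn.Embedding(vocab_size, embed_size)  # one row per character
pos_emb = nn.Embedding(block_size, embed_size)  # one row per position

idx = torch.randint(vocab_size, (32, block_size))  # (B, T) batch of token ids
tok = tok_emb(idx)                                 # (B, T, C)
pos = pos_emb(torch.arange(block_size))            # (T, C)
x = tok + pos                                      # broadcast over the batch -> (B, T, C)

print(tok_emb.weight.shape, tok.shape, pos.shape, x.shape)
```

The lookup is plain row indexing into the weight matrix, and the same (T, C) positional table is added to every sequence in the batch.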

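From here it is simple... we have our blocks (a stack of decoder modules), and at the end we get a new, attention-enriched matrix with the same shape as the input, in which every token carries some information about all the tokens before it.

Finally, we apply Layer Norm (common practice for stabilizing training) and pass the result to a linear layer that maps the embedding dimension C to our vocabulary dimension. The vocabulary dimension is simply the number of unique characters in input.txt. One common way to measure how good our predictions are is to compare the output of the blocks (the attention modules) with the target indices.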

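The output logits are treated as describing a probability distribution over the vocabulary of size V for the next token (the target) in the sequence. Strictly speaking they are unnormalized scores; F.cross_entropy applies the softmax internally. The cross-entropy loss therefore tells us how close our output is to the target token sequence.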
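To see what F.cross_entropy does with the flattened logits, here is a small sketch that compares it with a manual computation. The toy sizes and random values are assumptions for illustration only:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, V = 2, 3, 5                     # toy batch, sequence, and vocabulary sizes
logits = torch.randn(B * T, V)        # unnormalized scores, one row per token
targets = torch.randint(V, (B * T,))  # correct next-token id for each row

# Built-in: log-softmax over the vocabulary, then mean negative log-likelihood
loss = F.cross_entropy(logits, targets)

# The same quantity computed by hand
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(B * T), targets].mean()

print(loss.item(), manual.item())
```

Both numbers agree, which is why the model can hand raw logits to the loss without applying softmax first.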

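That covers the Bigram implementation. Now it is time to see how Block uses multi-head attention (MHA) to build the attention matrix.

Multi-Head Attention

Why multi-head attention? We could pass the input (B, T, embedding_size) straight into a single attention block. A faster way, though, is not to generate Q, K, V and compute the attention weights at the full embedding_size dimension, but to split the attention module into parts (heads), compute the attention weights for each part separately, and concatenate the results at the end.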

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.head_size = head_size
        self.key = nn.Linear(embed_size, head_size, bias=False)
        self.query = nn.Linear(embed_size, head_size, bias=False)
        self.value = nn.Linear(embed_size, head_size, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)
        wei = q@k.transpose(2, 1)/self.head_size**0.5

Figure 2 : Attention Mechanism | Source Image

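Following the logic above, we create a single attention head. Now let's understand it.

After adding the positional embeddings, we have an input of dimension (batch size, token length, embedding dimension). Every token in the input is represented by an embedding vector (of size 64). However, no token yet contains any information about the tokens that precede it.

To create information-rich embeddings, we use the attention mechanism, generating Key, Query, and Value vectors.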

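The attention mechanism in the Head class is meant to help the model focus on different parts of the input sequence while producing its output, which is particularly useful in tasks such as language modeling.

The key, query, and value projections come from the idea of querying for context-relevant information about every token in the sequence. Each token is represented by a vector (x), and by linearly transforming it into separate key, query, and value vectors we can compute which tokens in the sequence should attend to one another.

When q (query) is dot-multiplied with k (key), the result (wei) gives the "relevance" or "attention" score between each token and every other token. A higher score means that, in this context, one token is more relevant or "important" to the other. The scaling factor 1 / sqrt(head_size) keeps these scores from growing too large, which would make the softmax distribution too sharp and hard to optimize.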
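The effect of the scaling factor is easy to see on a toy example. The raw scores below are made-up values, and head_size = 16 matches the head size used later in this article:

```python
import torch
from torch.nn import functional as F

head_size = 16
scores = torch.tensor([8.0, 4.0, 0.0])  # toy raw q.k scores (assumed values)

unscaled = F.softmax(scores, dim=-1)
scaled = F.softmax(scores / head_size**0.5, dim=-1)

print(unscaled)  # nearly one-hot: almost all weight on the first entry
print(scaled)    # softer: the other entries keep a meaningful share
```

Without scaling, the largest score takes almost all of the weight, so very little gradient flows to the other positions.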

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.head_size = head_size
        self.key = nn.Linear(embed_size, head_size, bias=False)
        self.query = nn.Linear(embed_size, head_size, bias=False)
        self.value = nn.Linear(embed_size, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)
        wei = q@k.transpose(2, 1)/self.head_size**0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=2)    # (B , block_size, block_size)
        wei = self.dropout(wei)
        out = wei@v
        return out

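The causal mask "tril" is applied to ensure that each token can only "see" itself and the tokens before it. This is essential for autoregressive tasks such as text generation, where the model must not peek at future tokens when predicting the next one. masked_fill sets the disallowed positions to negative infinity, so the softmax turns them into zeros and they contribute nothing to the final attention computation. This stops the model from cheating by looking at the very tokens we want it to predict. Finally, we take the matrix product of the weights and the value matrix and return the output.

You can look at this example output to get a better feel for the transformations involved.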

q = torch.randint(10, (1, 3, 3))
v = torch.randint(10, (1, 3, 3))
print("Query:\n",q)
print("Value:\n",v)
wei = q@v.transpose(2, 1)/3**0.5
print("weights:\n", wei)
tril = torch.tril(torch.ones(3, 3))
print("Triangular Metrics:\n",tril)
wei = wei.masked_fill(tril == 0, float('-inf'))
print("Masked Weights\n", wei)
print("Softmax ( e^-inf = 0 )\n", F.softmax(wei, dim=2))

# Output:-
Query:
 tensor([[[2, 8, 8],
         [4, 2, 4],
         [1, 2, 9]]])
Value:
 tensor([[[9, 5, 7],
         [3, 1, 4],
         [6, 2, 9]]])
weights:
 tensor([[[65.8179, 26.5581, 57.7350],
         [42.7239, 17.3205, 36.9504],
         [47.3427, 23.6714, 52.5389]]])
Triangular Metrics:
 tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
Masked Weights
 tensor([[[65.8179,    -inf,    -inf],
         [42.7239, 17.3205,    -inf],
         [47.3427, 23.6714, 52.5389]]])
Softmax ( e^-inf = 0 )
 tensor([[[1.0000e+00, 0.0000e+00, 0.0000e+00],
         [1.0000e+00, 9.2777e-12, 0.0000e+00],
         [5.5073e-03, 2.8880e-13, 9.9449e-01]]])
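One way to convince yourself that the mask really blocks information from the future is to change the last token and check that the outputs at earlier positions stay the same. This is a simplified sketch: the hypothetical function causal_attention skips the learned key/query/value projections and dropout, and simply uses the input as q, k, and v.

```python
import torch
from torch.nn import functional as F

def causal_attention(q, k, v):
    """Scaled dot-product attention with a lower-triangular (causal) mask."""
    T = q.shape[1]
    wei = q @ k.transpose(2, 1) / q.shape[-1] ** 0.5
    mask = torch.tril(torch.ones(T, T))
    wei = wei.masked_fill(mask == 0, float("-inf"))
    return F.softmax(wei, dim=-1) @ v

torch.manual_seed(0)
x = torch.randn(1, 4, 8)          # (B, T, C) toy input
out = causal_attention(x, x, x)

x2 = x.clone()
x2[:, -1] = torch.randn(8)        # change only the last token
out2 = causal_attention(x2, x2, x2)

# Earlier positions are unaffected by the change; the last position is not
print(torch.allclose(out[:, :-1], out2[:, :-1]))
print(torch.allclose(out[:, -1], out2[:, -1]))
```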

Since we are using multi-head attention, this is how we implement it:

class MultiHeadAttention(nn.Module):
    def __init__(self, head_size, num_head):
        super().__init__()
        self.sa_head = nn.ModuleList([Head(head_size) for _ in range(num_head)])
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(embed_size, embed_size)

    def forward(self, x):
        x = torch.cat([head(x) for head in self.sa_head], dim= -1)
        x = self.dropout(self.proj(x))
        return x

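Here we pass the input x (B, T, E) to the different attention heads. Each head returns a tensor of size (B, T, head_size), where head_size = E (64) / num_head (4) = 16. Since we do this in a loop over num_head (4) heads, concatenating the results brings us back to the original size (B, T, 4*16).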
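The shape bookkeeping can be checked with random stand-ins for the per-head outputs (these are not real attention outputs, just tensors of the right size):

```python
import torch

B, T, embed_size, num_head = 32, 8, 64, 4
head_size = embed_size // num_head  # 16

# Stand-ins for what each Head would return: (B, T, head_size)
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_head)]

# Concatenating along the last dimension restores the embedding size
merged = torch.cat(head_outputs, dim=-1)
print(merged.shape)
```

Each head ends up occupying its own contiguous slice of 16 features in the merged tensor.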

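Multi-head attention is generally regarded as faster and more efficient when the embedding dimension is large.

After the concatenation we pass the result through a linear projection layer. The point of this is to let the features in the final vector, which came from separate heads, share with each other what they learned during the attention computation. The output then goes through a dropout layer and is returned.

Putting it all together, a standard decoder block is implemented as shown in Figure 1.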

Figure 3 | Source: Attention Is All You Need

class FeedForward(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
    
        self.ff = nn.Sequential(
              nn.Linear(embed_size, 4*embed_size),
              nn.ReLU(),
              nn.Linear(4*embed_size, embed_size),
              nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.ff(x)
    
class Block(nn.Module):
    def __init__(self, embed_size, num_head):
        super().__init__()
        head_size = embed_size // num_head
        self.multihead = MultiHeadAttention(head_size, num_head)
        self.ff = FeedForward(embed_size)
        self.ll1 = nn.LayerNorm(embed_size)
        self.ll2 = nn.LayerNorm(embed_size)

    def forward(self, x):
        x = x + self.multihead(self.ll1(x))
        x = x + self.ff(self.ll2(x))
        return x

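The head size is computed as explained earlier. Inside the block, the input goes through a layer normalization and then our multi-head attention network, followed by a second layer normalization and the feed-forward network. Each of the two sub-layers is wrapped in a residual connection (x = x + ...).

Back to the Bigram Model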

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, embed_size)
        self.possitional_embedding = nn.Embedding(block_size, embed_size)
        self.linear = nn.Linear(embed_size, vocab_size)
        self.block = nn.Sequential(*[Block(embed_size, num_head) for _ in range(num_layers)])
        self.layer_norm = nn.LayerNorm(embed_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)
        ps = self.possitional_embedding(torch.arange(T, device=device))
        x = logits + ps    #(B, T, C)
        logits = self.block(x)     #(B, T, c)
        logits = self.linear(self.layer_norm(logits)) # maps from embed_size to vocab_size
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            crop_idx= idx[:, -block_size:].to(device)
            # get the predictions
            logits, loss = self(crop_idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C) from (B, T, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # We sample one index from the filtered distribution
            idx_next = torch.multinomial(probs, num_samples=1).to(device)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

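Here we see that Block is instantiated num_layers times and chained together inside nn.Sequential.

Generating tokens...

We start by passing the idx tensor (representing our tokens) to the generate function, along with the maximum number of new tokens we want to generate. Since our model is built for a block size of 8, we can only feed it 8 tokens at a time, so we crop idx to its last 8 tokens (if idx is shorter than the block size, all of its tokens are kept).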
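The cropping relies on ordinary negative slicing, which behaves sensibly in both cases. A quick sketch with made-up contexts (short_ctx and long_ctx are hypothetical names):

```python
import torch

block_size = 8
short_ctx = torch.arange(5).unsqueeze(0)   # (1, 5): shorter than the context window
long_ctx = torch.arange(20).unsqueeze(0)   # (1, 20): longer than the context window

short_crop = short_ctx[:, -block_size:]    # keeps all 5 tokens
long_crop = long_ctx[:, -block_size:]      # keeps only the last 8 tokens

print(short_crop)
print(long_crop)
```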

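We pass the cropped idx to our BigramLanguageModel. The logits it returns hold scores for the target tokens at every position, but we are only interested in the last position, because the last token of the target (y) is the one that comes right after the sequence (x) (as explained in the batch loader section).

The logits we now have are of shape (B, C), where C is the vocabulary size; they score the whole vocabulary for the token that follows the last index. We then apply softmax to turn this vector into a probability vector (i.e. its elements sum to 1).

Now, remember how at the start of the article we talked about the not-perfectly-predictable, stochastic term, and how we let the model pick the next token in the sequence at random? For that we use torch.multinomial, which draws samples from a given probability distribution. Here it samples an index at random according to the specified probabilities.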
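The difference between always taking the most likely token and sampling can be sketched on a toy distribution (the probabilities below are made up):

```python
import torch

torch.manual_seed(0)
probs = torch.tensor([[0.7, 0.2, 0.1]])  # toy next-token distribution (assumed)

# Greedy decoding: always the same index
greedy = torch.argmax(probs, dim=-1)

# Sampling: draw one index per row, here for 10,000 identical rows
samples = torch.multinomial(probs.repeat(10_000, 1), num_samples=1).squeeze(1)
freq = torch.bincount(samples, minlength=3).float() / samples.numel()

print(greedy)  # index of the largest probability
print(freq)    # empirical frequencies, close to probs
```

Sampling still favors likely tokens, but the less likely ones get picked in proportion to their probability, which is what gives the generated text its variety.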

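We end up with the next predicted index, concatenate it to the previous indices, and continue the for loop, generating each new index from the ones before it until we reach the maximum number of tokens.

Training

Fortunately, the training part is very simple.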


m = BigramLanguageModel(65).to(device)

optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

# training the model, cause I won't give up without a fight
batch_size = 32
for epoch in range(5000):

    # Printing the Training and Validation Loss
    if epoch % 1000 == 0:
        m.eval()  # disable dropout while measuring the loss
        Loss = 0.0
        Val_Loss = 0.0
        eval_iters = 200
        with torch.no_grad():  # no gradients are needed for evaluation
            for k in range(eval_iters):
                x, y = get_batch('val')
                _, val_loss = m(x, y)

                x1, y1 = get_batch('train')
                _, train_loss = m(x1, y1)

                Loss += train_loss.item()
                Val_Loss += val_loss.item()
        avg_loss = Val_Loss / eval_iters
        avg_train_loss = Loss / eval_iters
        m.train()

        print("Epoch: {} \n The validation loss is:{}    The Loss is:{}".format(epoch, avg_loss, avg_train_loss))
    # Forward (get_batch expects 'train' or 'val'; we train on the training split)
    data, target = get_batch('train')
    logits, loss = m(data, target)
    # Backward
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

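Here we train for 5000 iterations (called epochs in the code), which takes roughly 2 minutes on an Nvidia RTX 3050 with 4 GB of VRAM.

We first do the forward pass: we fetch a batch from get_batch() and pass it to our BigramLanguageModel. We then reset the gradients with optimizer.zero_grad(), run loss.backward(), and call optimizer.step(). We use the AdamW optimizer, which is more than adequate for what we need.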
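The training loop switches the model with m.eval() before measuring the loss and back with m.train() afterwards. The reason is dropout, which behaves differently in the two modes. A minimal sketch on a standalone dropout layer (p = 0.5 is an assumed value, chosen to make the effect obvious):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)  # roughly half the entries zeroed, the rest scaled by 1/(1-p) = 2

drop.eval()
eval_out = drop(x)   # dropout is a no-op in eval mode

print(train_out[:10])
print(eval_out[:10])
```

Measuring the loss in eval mode gives a deterministic number that reflects the full network rather than a randomly thinned one.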

Now for the moment you have been waiting for..

We create a tensor for the following sentence:

ids = torch.tensor(encode("i like to eat food"), dtype=torch.long).unsqueeze(0)

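ids has shape (1, 18) (batch size, tokens). Starting from just these 18 tokens (each an index into the vocabulary), we generate 2000 characters. For context, our vocabulary is the set of all unique characters in input.txt, built earlier in the data-loading section as vocab = sorted(list(set(text))).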

print("".join(decode(m.generate(ids.to(device), max_new_tokens=2000)[0].tolist())))

Output:-

i like to eat food, noBANIO:
Here and
I shake married entreature by colantied at to women oword this swamind-betweet you are
As Grave eare his sun.

LUCENTIO:
Go what a doubled mistressed well.

Taildoes, not to memble, the peashat you;--are master, in thou comsand of the for slawake to bound and to of off this couchio;
Petruchio?
Fece poor this cockepopen neve so it do old loaps islied I'comment and curh
and blate sure poccient you the miad e'er a to partink,
Unory speitied where buzzarr'd formorns,
Pitedame,
Beach, and whom I firit.

ANDO:
O the virtuus a parros that that is acleast, not for suck could mighreature well; thy,
I'll toence after counteent,
Signior to paptista?
Shile you cappier?

BIANCA:
Where womand betire asleck him snall conglithing.

PROSPERO:
I, as expase caspierfed success,
This all no be trutes from the good the island mognied buzent; tensting in this garmortwant;
Do be marriage.

TRANIO:
'Tis, jointer.

GRUCHIO:
Soubt sI'll show I freek born.

PROSPETRUCHIO:
The vant mine; it it