PyTorch Hands-On: Training a Chatbot

This article follows the official PyTorch CHATBOT TUTORIAL.

PyTorch functions you need to know (a short runnable sketch follows this list)

  1. torch.Tensor is shorthand for the default tensor type, torch.FloatTensor (32-bit floating point).

  2. torch.cat((x, x, x), dim=1)
    Concatenates a sequence of tensors along the given dimension. For example, if x has shape (2, 3), the call above produces a tensor of shape (2, 9).

  3. torch.topk(input, k, dim)
    Returns the k largest values along dimension dim, together with the indices at which they occur.

  4. torch.gather(input, dim, index)
    Gathers values along axis dim at the positions given by the index tensor.

    a = torch.tensor([[1, 2, 3], [4, 5, 6]])
    index = torch.LongTensor([[0, 1], [2, 0]])
    b = torch.gather(a, dim=1, index=index)
    # b: [[1, 2], [6, 4]]
  5. torch.nn.utils.rnn.pack_padded_sequence(input, lengths, batch_first=False)
    Packs a padded batch of variable-length sequences for an RNN. The input shape can be (T×B×*), where T is the length of the longest sequence, B is the batch size, and * stands for any number of trailing dimensions (possibly zero). With batch_first=True, the input shape is (B×T×*) instead.
    The sequences must be sorted by length in decreasing order: input[:, 0] holds the longest sequence and input[:, B-1] the shortest.
    lengths gives the length of each sequence.

  6. torch.nn.utils.rnn.pad_packed_sequence(sequence, batch_first=False)
    The inverse of pack_padded_sequence(): it pads a packed sequence back into a regular tensor.
    The returned tensor has size T×B×*, where T is the length of the longest sequence and B is the batch size; if batch_first=True, the size is B×T×* instead. Batch elements are ordered decreasingly by their length.

  7. torch.masked_select(input, mask, out=None)
    Returns a new 1-D tensor which indexes the input tensor according to the binary mask mask which is a ByteTensor.
    The shapes of the mask tensor and the input tensor don’t need to match, but they must be broadcastable.

  8. torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=2)
    Gradient clipping: rescales the gradients of the given parameters in place so that their total norm does not exceed max_norm.
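
As promised above, here is a minimal sketch (toy shapes, not part of the tutorial) exercising several of these ops:

import torch
import torch.nn as nn

x = torch.randn(2, 3)
print(torch.cat((x, x, x), dim=1).shape)             # torch.Size([2, 9])
print(torch.topk(torch.tensor([1., 5., 3.]), k=2))   # values [5., 3.], indices [1, 2]

# pack/pad round trip on a (T, B, *) batch that is already sorted by length
rnn = nn.GRU(input_size=4, hidden_size=4)
padded = torch.randn(3, 2, 4)                        # T=3, B=2
lengths = torch.tensor([3, 2])
packed = nn.utils.rnn.pack_padded_sequence(padded, lengths)
out, hidden = rnn(packed)
out, out_lengths = nn.utils.rnn.pad_packed_sequence(out)
print(out.shape, out_lengths)                        # torch.Size([3, 2, 4]) tensor([3, 2])

# clip gradients in place after backward()
out.sum().backward()
nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=50.0)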

Loading and preprocessing the data

%matplotlib inline

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math


USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")
device
device(type='cuda')

Next, we reformat our data file and load the data into structures we can work with.

The Cornell Movie-Dialogs Corpus is a rich dataset of movie-character dialogs:

  • 220,579 conversational exchanges between 10,292 pairs of movie characters
  • 9,035 characters from 617 movies

This dataset is large and diverse, with great variation in language formality, time period, sentiment, and so on. Our hope is that this diversity makes our model robust to many forms of inputs and queries.

corpus_name = "cornell movie-dialogs corpus"
corpus = os.path.join("data", corpus_name)

def printLines(file, n=10):
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)

printLines(os.path.join(corpus, "movie_lines.txt"))
b'L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n'
b'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!\n'
b'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.\n'
b'L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?\n'
b"L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.\n"
b'L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow\n'
b"L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.\n"
b'L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No\n'
b'L870 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I\'m kidding.  You know how sometimes you just become this "persona"?  And you don\'t know how to quit?\n'
b'L869 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Like my fear of wearing pastels?\n'

Creating a formatted data file

Now let's create a nicely formatted data file in which each line contains a tab-separated query sentence and response sentence pair.

The following functions parse the raw movie_lines.txt data file.

  • loadLines splits each line of the file into a dictionary of fields (lineID, characterID, movieID, character, text)
  • loadConversations groups fields of lines from loadLines into conversations based on movie_conversations.txt
  • extractSentencePairs extracts pairs of sentences from conversations
# Splits each line of the file into a dictionary of fields
def loadLines(fileName, fields):
    lines = {}
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            lineObj = {}
            for i, field in enumerate(fields):
                lineObj[field] = values[i]
            lines[lineObj['lineID']] = lineObj
    return lines


# Groups fields of lines from `loadLines` into conversations based on *movie_conversations.txt*
def loadConversations(fileName, lines, fields):
    conversations = []
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            convObj = {}
            for i, field in enumerate(fields):
                convObj[field] = values[i]
            # Convert string to list (convObj["utteranceIDs"] == "['L598485', 'L598486', ...]")
            lineIds = eval(convObj["utteranceIDs"])
            # Reassemble lines
            convObj["lines"] = []
            for lineId in lineIds:
                convObj["lines"].append(lines[lineId])
            conversations.append(convObj)
    return conversations


# Extracts pairs of sentences from conversations
def extractSentencePairs(conversations):
    qa_pairs = []
    for conversation in conversations:
        # Iterate over all the lines of the conversation
        for i in range(len(conversation["lines"]) - 1):  # We ignore the last line (no answer for it)
            inputLine = conversation["lines"][i]["text"].strip()
            targetLine = conversation["lines"][i+1]["text"].strip()
            # Filter wrong samples (if one of the lists is empty)
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    return qa_pairs

Now we call these functions and create the new file, which we name formatted_movie_lines.txt.

# Define path to new file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")

delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Initialize lines dict, conversations list, and field ids
lines = {}
conversations = []
MOVIE_LINES_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
MOVIE_CONVERSATIONS_FIELDS = ["character1ID", "character2ID", "movieID", "utteranceIDs"]
# Load lines and process conversations
print("\nProcessing corpus...")
lines = loadLines(os.path.join(corpus, "movie_lines.txt"), MOVIE_LINES_FIELDS)
Processing corpus...
print("\nLoading conversations...")
conversations = loadConversations(os.path.join(corpus, "movie_conversations.txt"),
                                  lines, MOVIE_CONVERSATIONS_FIELDS)
Loading conversations...
# Write new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter)
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

# Print a sample of lines
print("\nSample lines from file:")
printLines(datafile)
Writing newly formatted file...

Sample lines from file:
b"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\r\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.  Please.\r\n"
b"Not the hacking and gagging and spitting part.  Please.\tOkay... then how 'bout we try out some French cuisine.  Saturday?  Night?\r\n"
b"You're asking me out.  That's so cute. What's your name again?\tForget it.\r\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\r\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\r\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\tSeems like she could get a date easy enough...\r\n"
b'Why?\tUnsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.\r\n'
b"Unsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.\tThat's a shame.\r\n"
b'Gosh, if only we could find Kat a boyfriend...\tLet me see what I can do.\r\n'

Loading and trimming the data

Next we build a vocabulary and load the query/response sentence pairs into memory.

# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token

class Voc:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count SOS, EOS, PAD

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    # Remove words below a certain count threshold
    def trim(self, min_count):
        if self.trimmed:
            return
        self.trimmed = True

        keep_words = []

        for k, v in self.word2count.items():
            if v >= min_count:
                keep_words.append(k)

        print('keep_words {} / {} = {:.4f}'.format(
            len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)
        ))

        # Reinitialize dictionaries
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count default tokens

        for word in keep_words:
            self.addWord(word)
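
A quick toy check of the Voc class (illustrative only; the printed values follow from the code above):

voc_demo = Voc("toy")
voc_demo.addSentence("hello world hello")
print(voc_demo.word2count)   # {'hello': 2, 'world': 1}
voc_demo.trim(min_count=2)   # prints: keep_words 1 / 2 = 0.5000
print(voc_demo.word2index)   # {'hello': 3} -- only 'hello' survives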

Now we can assemble our vocabulary and query/response sentence pairs. Before we can use this data, we must perform some preprocessing.

First, we convert Unicode strings to ASCII using unicodeToAscii. Next, we convert all letters to lowercase and trim all non-letter characters except basic punctuation (normalizeString). Finally, to aid training convergence, we filter out sentences longer than the MAX_LENGTH threshold (filterPairs).

MAX_LENGTH = 10  # Maximum sentence length to consider

# Turn a Unicode string to plain ASCII, thanks to
# http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

# Read query/response pairs and return a voc object
def readVocs(datafile, corpus_name):
    print("Reading lines...")
    # Read the file and split into lines
    lines = open(datafile, encoding='utf-8').\
        read().strip().split('\n')
    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    voc = Voc(corpus_name)
    return voc, pairs

# Returns True iff both sentences in a pair 'p' are under the MAX_LENGTH threshold
def filterPair(p):
    # Input sequences need to preserve the last word for EOS token
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

# Filter pairs using filterPair condition
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

# Using the functions defined above, return a populated voc object and pairs list
def loadPrepareData(corpus, corpus_name, datafile, save_dir):
    print("Start preparing training data ...")
    voc, pairs = readVocs(datafile, corpus_name)
    print("Read {!s} sentence pairs".format(len(pairs)))
    pairs = filterPairs(pairs)
    print("Trimmed to {!s} sentence pairs".format(len(pairs)))
    print("Counting words...")
    for pair in pairs:
        voc.addSentence(pair[0])
        voc.addSentence(pair[1])
    print("Counted words:", voc.num_words)
    return voc, pairs


# Load/Assemble voc and pairs
save_dir = os.path.join("data", "save")
voc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)
# Print some pairs to validate
print("\npairs:")
for pair in pairs[:10]:
    print(pair)
Start preparing training data ...
Reading lines...
Read 221282 sentence pairs
Trimmed to 64271 sentence pairs
Counting words...
Counted words: 18008

pairs:
['there .', 'where ?']
['you have my word . as a gentleman', 'you re sweet .']
['hi .', 'looks like things worked out tonight huh ?']
['you know chastity ?', 'i believe we share an art instructor']
['have fun tonight ?', 'tons']
['well no . . .', 'then that s all you had to say .']
['then that s all you had to say .', 'but']
['but', 'you always been this selfish ?']
['do you listen to this crap ?', 'what crap ?']
['what good stuff ?', 'the real you .']

Another strategy that helps achieve faster convergence during training is trimming rarely used words from our vocabulary. Shrinking the feature space also softens the difficulty of the function the model must learn. We do this in two steps:

  • Trim words used fewer than MIN_COUNT times with the voc.trim function
  • Filter out pairs that contain trimmed words
MIN_COUNT = 3    # Minimum word count threshold for trimming

def trimRareWords(voc, pairs, MIN_COUNT):
    # Trim words used under the MIN_COUNT from the voc
    voc.trim(MIN_COUNT)
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check input sentence
        for word in input_sentence.split(' '):
            if word not in voc.word2index:
                keep_input = False
                break
        # Check output sentence
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break

        # Only keep pairs that do not contain trimmed word(s) in their input or output sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)

    print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))
    return keep_pairs


# Trim voc and pairs
pairs = trimRareWords(voc, pairs, MIN_COUNT)
keep_words 7823 / 18005 = 0.4345
Trimmed from 64271 pairs to 53165, 0.8272 of total

Preparing data for the model

The model ultimately expects numeric tensors as input, and we train with mini-batches of data.

Since sentences have varying lengths, using mini-batches means we must account for the variation of sentence length within a batch. To accommodate sentences of different sizes in the same batch, we make our batched input tensor of shape (max_length, batch_size), where sentences shorter than max_length are padded with PAD_token after their EOS_token.

As an exercise, consider why we shape the tensor as (max_length, batch_size) rather than (batch_size, max_length) (think about how an RNN steps through time). A tiny illustration follows.
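
A minimal sketch of this time-major layout (toy token ids, using PAD_token == 0 and EOS_token == 2 as defined above):

import itertools

batch = [[5, 7, 9, 2], [8, 3, 2], [4, 2]]   # index sequences ending in EOS, sorted by length
padded = list(itertools.zip_longest(*batch, fillvalue=0))
print(padded)   # [(5, 8, 4), (7, 3, 2), (9, 2, 0), (2, 0, 0)] -> shape (max_length, batch_size)
# padded[t] is exactly the batch of tokens the RNN consumes at time step t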

def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]


def zeroPadding(l, fillvalue=PAD_token):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))

def binaryMatrix(l, value=PAD_token):
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == PAD_token:
                m[i].append(0)
            else:
                m[i].append(1)
    return m

# Returns padded input sequence tensor and lengths
def inputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    padVar = torch.LongTensor(padList)
    return padVar, lengths

# Returns padded target sequence tensor, padding mask, and max target length
def outputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    mask = binaryMatrix(padList)
    mask = torch.ByteTensor(mask)
    padVar = torch.LongTensor(padList)
    return padVar, mask, max_target_len

# Returns all items for a given batch of pairs
def batch2TrainData(voc, pair_batch):
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lengths = inputVar(input_batch, voc)
    output, mask, max_target_len = outputVar(output_batch, voc)
    return inp, lengths, output, mask, max_target_len


# Example for validation
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lengths, target_variable, mask, max_target_len = batches

print("input_variable:", input_variable)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("mask:", mask)
print("max_target_len:", max_target_len)
input_variable: tensor([[ 271,   25,   34,   25, 4164],
        [ 117,  247,    4,  197,  329],
        [3232,  117,  101,  117, 5736],
        [2095,   47,   37,   24,    6],
        [  96,  349,   34,    4,    2],
        [  53,   33,    4,    2,    0],
        [   4,   98,    2,    0,    0],
        [   4,    7,    0,    0,    0],
        [   4,    4,    0,    0,    0],
        [   2,    2,    0,    0,    0]])
lengths: tensor([10, 10,  7,  6,  5])
target_variable: tensor([[  34,   34,  124,  410,  371],
        [ 101,    7,  125,   53,    7],
        [  37,   68,    4,  851,   89],
        [ 479,  274,  124,   47,  534],
        [  96,    4,  125,  371,    4],
        [7435,    2,    4,   40,    2],
        [   4,    0,    2,  170,    0],
        [   2,    0,    0,    6,    0],
        [   0,    0,    0,    2,    0]])
mask: tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 0, 1, 1, 0],
        [1, 0, 0, 1, 0],
        [0, 0, 0, 1, 0]], dtype=torch.uint8)
max_target_len: 9

Defining the model

We use a seq2seq model with an encoder and a decoder; here, both the encoder and decoder are built on GRUs.

The overall model architecture is shown below:

[figure: seq2seq encoder-decoder architecture]

Encoder

The encoder uses a bidirectional GRU.

Computation graph:

  1. Convert word indexes to embeddings.
  2. Pack the padded batch of sequences.
  3. Forward pass through the GRU.
  4. Unpack the padding.
  5. Sum the bidirectional GRU outputs.
  6. Return the output and the final hidden state.

Inputs:

  • input_seq: a batch of input sentences; shape = (max_length, batch_size)
  • input_lengths: the length of each sentence in the batch; shape = (batch_size,)
  • hidden: hidden state; shape = (n_layers × num_directions, batch_size, hidden_size)

Outputs:

  • output: the output features from the last hidden layer of the GRU (sum of the two directions); shape = (max_length, batch_size, hidden_size)
  • hidden: the updated hidden state of the GRU; shape = (n_layers × num_directions, batch_size, hidden_size)
class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding

        # Initialize GRU; the input_size and hidden_size params are both set to 'hidden_size'
        # because our input size is a word embedding with number of features == hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = torch.nn.utils.rnn.pad_packed_sequence(outputs)

        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, :, self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden
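
A quick shape check of the encoder (toy tensors; hidden_size=500 matches the configuration used later):

emb = nn.Embedding(7823, 500)
enc = EncoderRNN(hidden_size=500, embedding=emb, n_layers=1)
seq = torch.randint(3, 7823, (10, 5))         # (max_length, batch_size)
seq_lengths = torch.tensor([10, 9, 8, 7, 6])  # sorted in decreasing order
outputs, hidden = enc(seq, seq_lengths)
print(outputs.shape)   # torch.Size([10, 5, 500]) -- the two directions are summed
print(hidden.shape)    # torch.Size([2, 5, 500]) -- n_layers * num_directions = 2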

Decoder

The encoder's outputs encode the information of the input sentence. The decoder combines its own hidden state with the encoder's outputs to generate the next word of the response, one step at a time, until it emits an EOS_token. A common problem with a vanilla seq2seq decoder is that relying on a single context vector (the encoder's final hidden state) to encode the meaning of the entire input sequence tends to lose information. This is especially apparent for long input sequences, and it greatly limits the capability of the decoder.

One solution is an attention mechanism. For a detailed explanation of attention models, see the attention mechanism post.

We implement three ways of computing the attention score, shown below:
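
For reference, these are the three score functions from Luong et al. (2015), which the Attn class below implements as dot_score, general_score, and concat_score ($h_t$ is the current decoder hidden state, $\bar{h}_s$ an encoder output):

$$
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^\top \bar{h}_s & \text{dot} \\
h_t^\top W_a \bar{h}_s & \text{general} \\
v_a^\top \tanh\!\left(W_a [h_t ; \bar{h}_s]\right) & \text{concat}
\end{cases}
$$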

Below is the implementation of the attention layer.

# Luong attention layer
class Attn(torch.nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        if self.method not in ['dot', 'general', 'concat']:
            raise ValueError(self.method, "is not an appropriate attention method.")
        self.hidden_size = hidden_size
        if self.method == 'general':
            self.attn = torch.nn.Linear(self.hidden_size, hidden_size)
        elif self.method == 'concat':
            self.attn = torch.nn.Linear(self.hidden_size * 2, hidden_size)
            self.v = torch.nn.Parameter(torch.FloatTensor(hidden_size))

    def dot_score(self, hidden, encoder_output):
        return torch.sum(hidden * encoder_output, dim=2)

    def general_score(self, hidden, encoder_output):
        energy = self.attn(encoder_output)
        return torch.sum(hidden * energy, dim=2)

    def concat_score(self, hidden, encoder_output):
        energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()
        return torch.sum(self.v * energy, dim=2)

    def forward(self, hidden, encoder_outputs):
        # Calculate the attention weights (energies) based on the given method
        if self.method == 'general':
            attn_energies = self.general_score(hidden, encoder_outputs)
        elif self.method == 'concat':
            attn_energies = self.concat_score(hidden, encoder_outputs)
        elif self.method == 'dot':
            attn_energies = self.dot_score(hidden, encoder_outputs)

        # Transpose max_length and batch_size dimensions
        attn_energies = attn_energies.t()

        # attn_energies shape: batch_size × time_steps
        # e.g. 5 × 10: the energy of each of the encoder's 10 time steps for every batch element.
        # Apply softmax over dim=1 (the time steps), then add a dimension;
        # returned shape: batch_size × 1 × time_steps
        return F.softmax(attn_energies, dim=1).unsqueeze(1)
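
A quick shape sanity check for the Attn layer (toy tensors, hidden_size=500):

attn = Attn('dot', hidden_size=500)
dec_hidden = torch.randn(1, 5, 500)      # (1, batch_size, hidden_size)
enc_outputs = torch.randn(10, 5, 500)    # (time_steps, batch_size, hidden_size)
weights = attn(dec_hidden, enc_outputs)
print(weights.shape)                     # torch.Size([5, 1, 10])
print(weights.sum(dim=2))                # every row sums to 1 (softmax over time steps)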

Next we implement the decoder. We feed the decoder one time step of a batch at a time, which means both its embedded word tensor and its GRU output have shape (1, batch_size, hidden_size).

Computation graph:

  1. Get the embedding of the current input word.
  2. Forward pass through the unidirectional GRU.
  3. Compute attention weights from the current GRU output.
  4. Multiply the attention weights by the encoder outputs to get a new "weighted sum" context vector.
  5. Concatenate the weighted context vector and the GRU output.
  6. Predict the next word through two linear layers and a softmax.
  7. Return the output and the final hidden state.

Inputs:

  • input_step: one time step (one word) of the input sequence batch; shape = (1, batch_size)
  • last_hidden: the final hidden state of the GRU; shape = (n_layers × num_directions, batch_size, hidden_size)
  • encoder_outputs: the encoder's outputs; shape = (time_steps, batch_size, hidden_size)

Outputs:

  • output: a softmax-normalized tensor giving, for each word, the probability of it being the correct next word in the decoded sequence; shape = (batch_size, voc.num_words)

  • hidden: the final hidden state of the GRU; shape = (n_layers × num_directions, batch_size, hidden_size)

class LuongAttnDecoderRNN(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderRNN, self).__init__()

        # Keep for reference
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout

        # Define layers
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

        self.attn = Attn(attn_model, hidden_size)

    def forward(self, input_step, last_hidden, encoder_outputs, debug=False):
        # Note: we run this one step (word) at a time
        # Get embedding of current input word
        # encoder_outputs shape: time_steps × batch_size × hidden_size, example: 10 × 5 × 500
        embedded = self.embedding(input_step)
        embedded = self.embedding_dropout(embedded)
        # Forward through unidirectional GRU
        # rnn_output shape: 1 × batch_size × hidden_size, example: 1 × 5 × 500
        # hidden shape: n_layers × batch_size × hidden_size, example: 1 × 5 × 500
        rnn_output, hidden = self.gru(embedded, last_hidden)
        # Calculate attention weights from the current GRU output
        # attn_weights shape: batch_size × 1 × time_steps, example: 5 × 1 × 10
        attn_weights = self.attn(rnn_output, encoder_outputs)
        # Multiply attention weights to encoder outputs to get new "weighted sum" context vector
        # context shape: (batch_size, 1, hidden_size), example: 5 × 1 × 500
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        # Concatenate weighted context vector and GRU output using Luong eq. 5
        # rnn_output shape: batch_size × hidden_size, example: 5 × 500
        rnn_output = rnn_output.squeeze(0)
        # context shape: batch_size × hidden_size, example: 5 × 500
        context = context.squeeze(1)
        # concat_input shape: batch_size × (2 × hidden_size), example: 5 × 1000
        concat_input = torch.cat((rnn_output, context), 1)
        # concat_output shape: batch_size × hidden_size, example: 5 × 500
        concat_output = torch.tanh(self.concat(concat_input))
        # Predict next word using Luong eq. 6
        # output shape: batch_size × voc.num_words, example: 5 × 7823
        output = self.out(concat_output)
        output = F.softmax(output, dim=1)
        # Return output and final hidden state
        return output, hidden
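
A one-step shape walkthrough of the decoder (toy tensors; hidden_size=500 and voc.num_words=7823 match the running example in the comments):

dec = LuongAttnDecoderRNN('dot', nn.Embedding(7823, 500), hidden_size=500, output_size=7823, n_layers=1)
input_step = torch.LongTensor([[3, 4, 5, 6, 7]])   # (1, batch_size=5)
last_hidden = torch.randn(1, 5, 500)
enc_outputs = torch.randn(10, 5, 500)
out, hidden = dec(input_step, last_hidden, enc_outputs)
print(out.shape)      # torch.Size([5, 7823]); each row is a softmax distribution over the vocabulary
print(hidden.shape)   # torch.Size([1, 5, 500])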

Defining the training procedure

Computing the loss

Our target tensors are padded with PAD tokens, so when computing the loss we must exclude the contribution of the padded positions. We define a separate maskNLLLoss function for this.

def maskNLLLoss(inp, target, mask):
    # inp: (batch_size, voc.num_words), example: (5, 7823)
    # target shape == mask shape == (batch_size,), example: (5,)
    nTotal = mask.sum()
    crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
    loss = crossEntropy.masked_select(mask).mean()
    loss = loss.to(device)
    return loss, nTotal.item()
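
A toy sanity check of maskNLLLoss (illustrative numbers; note that recent PyTorch versions expect a bool rather than uint8 mask for masked_select):

probs = F.softmax(torch.randn(5, 7823), dim=1)               # fake decoder output for one time step
target = torch.randint(0, 7823, (5,))
step_mask = torch.tensor([1, 1, 1, 0, 0], dtype=torch.bool)  # last two positions are padding
loss, nTotal = maskNLLLoss(probs, target, step_mask)
print(loss.item(), nTotal)   # mean NLL over the 3 unpadded positions, and 3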

Training a single batch

During training, the encoder consumes a whole mini-batch in one call, while the decoder processes the batch one time step at a time.

We use a couple of tricks to aid convergence:

  • The first trick is teacher forcing: with some probability, set by teacher_forcing_ratio, we use the current target word as the decoder's next input rather than the decoder's previous prediction. This technique makes training more efficient. However, teacher forcing can lead to model instability during inference, since the decoder may not get a sufficient chance to truly craft its own output sequences during training. Thus, we must be mindful of how we set teacher_forcing_ratio and not be fooled by fast convergence.

  • The second trick is gradient clipping, a common technique for countering the "exploding gradient" problem. In essence, by clipping or thresholding gradients to a maximum value, we prevent them from growing exponentially and causing overflow (NaN) in the cost function.

Sequence of operations:

  1. Forward pass the entire input batch through the encoder.
  2. Initialize the decoder input as SOS_token and its hidden state as the encoder's final hidden state.
  3. Forward pass through the decoder one time step at a time.
  4. If teacher forcing: set the next decoder input to the current target; otherwise: set it to the decoder's current output.
  5. Compute and accumulate the loss.
  6. Perform backpropagation.
  7. Clip gradients.
  8. Update the encoder and decoder parameters.
def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer, batch_size, clip, max_length=MAX_LENGTH):

    # target_variable shape: (time_steps, batch_size), example: (10, 5)
    # mask shape: (time_steps, batch_size), example: (10, 5)
    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Set device options
    input_variable = input_variable.to(device)
    lengths = lengths.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)

    # Initialize variables
    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass through encoder
    # encoder_outputs: (time_steps, batch_size, hidden_size), example: (10, 5, 500)
    # encoder_hidden: (n_layers × num_directions, batch_size, hidden_size), example: (2, 5, 500)
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    # Create initial decoder input (start with SOS tokens for each sentence)
    # decoder_input: (1, batch_size)
    decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    # Set initial decoder hidden state to the encoder's final hidden state
    # decoder_hidden: (decoder.n_layers, batch_size, hidden_size), example: (1, 5, 500)
    decoder_hidden = encoder_hidden[:decoder.n_layers]

    # Determine if we are using teacher forcing this iteration
    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            # decoder_output shape: (batch_size, voc.num_words), example: (5, 7823)
            # decoder_hidden shape: (decoder.n_layers, batch_size, hidden_size), example: (1, 5, 500)
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # Teacher forcing: next input is current target
            # decoder_input shape: (1, batch_size), example: (1, 5)
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            # mask_loss: mean loss over the non-padding positions
            # nTotal: number of positions that actually count toward the loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal

    # Perform backpropagation
    loss.backward()

    # Clip gradients: gradients are modified in place
    _ = torch.nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = torch.nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals

Training iterations

def trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size, print_every, save_every, clip, corpus_name, loadFilename):

    # Load batches for each iteration
    training_batches = [batch2TrainData(voc, [random.choice(pairs) for _ in range(batch_size)])
                        for _ in range(n_iteration)]

    # Initializations
    print('Initializing ...')
    start_iteration = 1
    print_loss = 0
    if loadFilename:
        start_iteration = checkpoint['iteration'] + 1

    # Training loop
    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        # Extract fields from batch
        # input_variable: (time_steps, batch_size), example: (10, 5)
        # lengths: (batch_size,), example: (5,)
        # target_variable: (time_steps, batch_size), example: (10, 5)
        # mask: (time_steps, batch_size), example: (10, 5)
        # max_target_len: scalar, example: 10
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        # Run a training iteration with batch
        # loss: average loss
        loss = train(input_variable, lengths, target_variable, mask, max_target_len, encoder,
                     decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)
        print_loss += loss

        # Print progress
        if iteration % print_every == 0:
            print_loss_avg = print_loss / print_every
            print("Iteration: {}; Percent complete: {:.1f}%; Average loss: {:.4f}".format(iteration, iteration / n_iteration * 100, print_loss_avg))
            print_loss = 0

        # Save checkpoint
        if (iteration % save_every == 0):
            directory = os.path.join(save_dir, model_name, corpus_name, '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size))
            if not os.path.exists(directory):
                os.makedirs(directory)
            torch.save({
                'iteration': iteration,
                'en': encoder.state_dict(),
                'de': decoder.state_dict(),
                'en_opt': encoder_optimizer.state_dict(),
                'de_opt': decoder_optimizer.state_dict(),
                'loss': loss,
                'voc_dict': voc.__dict__,
                'embedding': embedding.state_dict()
            }, os.path.join(directory, '{}_{}.tar'.format(iteration, 'checkpoint')))

Defining evaluation

After training the model, we want to be able to talk to the bot ourselves. First, we must define how we want the model to decode the encoded input.

Greedy decoding

Greedy decoding is the decoding method we use during training when we are not using teacher forcing. In other words, at each time step we simply pick the word with the highest softmax value from decoder_output. This decoding method is optimal on a single time-step level.

To facilitate greedy decoding, we define a GreedySearchDecoder class. When run, an object of this class takes an input sequence of shape (input_seq length, 1), a scalar input_length, and a max_length that bounds the length of the response sentence.

Computation graph:

  1. Forward the input through the encoder model.
  2. Prepare the encoder's final hidden state as the decoder's first hidden input.
  3. Initialize the decoder's first input as SOS_token.
  4. Initialize the tensors that decoded words will be appended to.
  5. Iteratively decode one word token at a time:
    1. Forward pass through the decoder.
    2. Obtain the most likely word token and its softmax score.
    3. Record the token and score.
    4. Prepare the current token to be the next decoder input.
  6. Return the collections of word tokens and scores.
class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super(GreedySearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        decoder_hidden = encoder_hidden[:self.decoder.n_layers]
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            # Obtain most likely word token and its softmax score
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores

Evaluating your own input

Now that we have our decoding method defined, we can write functions for evaluating a string input sentence. The evaluate function manages the low-level process of handling the input sentence. We first format the sentence as an input batch of word indexes with batch_size == 1, converting the words of the sentence to their corresponding indexes and transposing the dimensions to prepare the tensor for our model. We also create a lengths tensor containing the length of the input sentence; here it is a scalar because we only evaluate one sentence at a time (batch_size == 1). Next, we obtain the decoded response sentence tensor using our GreedySearchDecoder object. Finally, we convert the response's indexes to words and return the list of decoded words.

evaluateInput acts as the user interface to our chatbot. When called, it spawns an input text field in which we can enter our query sentence. After typing the input sentence and pressing Enter, the text is normalized in the same way as our training data and is ultimately fed to the evaluate function to obtain the decoder's output sentence. We loop this process until we enter "q" or "quit".

Finally, if a sentence is entered that contains a word not in the vocabulary, we handle this gracefully by printing an error message and prompting the user to enter another sentence.

def evaluate(encoder, decoder, searcher, voc, sentence, max_length=MAX_LENGTH):
    ### Format input sentence as a batch
    # words -> indexes
    indexes_batch = [indexesFromSentence(voc, sentence)]
    # Create lengths tensor
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    # Transpose dimensions of batch to match models' expectations
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)
    # Use appropriate device
    input_batch = input_batch.to(device)
    lengths = lengths.to(device)
    # Decode sentence with searcher
    tokens, scores = searcher(input_batch, lengths, max_length)
    # indexes -> words
    decoded_words = [voc.index2word[token.item()] for token in tokens]
    return decoded_words


def evaluateInput(encoder, decoder, searcher, voc):
    input_sentence = ''
    while(1):
        try:
            # Get input sentence
            input_sentence = input('> ')
            # Check if it is quit case
            if input_sentence == 'q' or input_sentence == 'quit': break
            # Normalize sentence
            input_sentence = normalizeString(input_sentence)
            # Evaluate sentence
            output_words = evaluate(encoder, decoder, searcher, voc, input_sentence)
            # Format and print response sentence
            output_words[:] = [x for x in output_words if not (x == 'EOS' or x == 'PAD')]
            print('Bot:', ' '.join(output_words))

        except KeyError:
            print("Error: Encountered unknown word.")

Running the model

Whether we want to train or test the chatbot model, we must initialize the individual encoder and decoder models. In the following block, we set our desired configuration, choose to start from scratch or load from a checkpoint, and build and initialize the models.

# Configure models
model_name = 'cb_model'
attn_model = 'dot'
#attn_model = 'general'
#attn_model = 'concat'
hidden_size = 500
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
batch_size = 64

# Set checkpoint to load from; set to None if starting from scratch
loadFilename = None
checkpoint_iter = 4000
#loadFilename = os.path.join(save_dir, model_name, corpus_name,
#                            '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size),
#                            '{}_checkpoint.tar'.format(checkpoint_iter))


# Load model if a loadFilename is provided
if loadFilename:
    # If loading on same machine the model was trained on
    checkpoint = torch.load(loadFilename)
    # If loading a model trained on GPU to CPU
    #checkpoint = torch.load(loadFilename, map_location=torch.device('cpu'))
    encoder_sd = checkpoint['en']
    decoder_sd = checkpoint['de']
    encoder_optimizer_sd = checkpoint['en_opt']
    decoder_optimizer_sd = checkpoint['de_opt']
    embedding_sd = checkpoint['embedding']
    voc.__dict__ = checkpoint['voc_dict']


print('Building encoder and decoder ...')
# Initialize word embeddings
embedding = nn.Embedding(voc.num_words, hidden_size)
if loadFilename:
    embedding.load_state_dict(embedding_sd)
# Initialize encoder & decoder models
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)
if loadFilename:
    encoder.load_state_dict(encoder_sd)
    decoder.load_state_dict(decoder_sd)
# Use appropriate device
encoder = encoder.to(device)
decoder = decoder.to(device)
print('Models built and ready to go!')
Building encoder and decoder ...
Models built and ready to go!

Training

# Configure training/optimization
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 4000
print_every = 200
save_every = 500

# Ensure dropout layers are in train mode
encoder.train()
decoder.train()

# Initialize optimizers
print('Building optimizers ...')
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)
if loadFilename:
    encoder_optimizer.load_state_dict(encoder_optimizer_sd)
    decoder_optimizer.load_state_dict(decoder_optimizer_sd)

# Run training iterations
print("Starting Training!")
trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
           embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size,
           print_every, save_every, clip, corpus_name, loadFilename)
Building optimizers ...
Starting Training!
Initializing ...
Training...
Iteration: 200; Percent complete: 5.0%; Average loss: 4.0215
Iteration: 400; Percent complete: 10.0%; Average loss: 3.3358
Iteration: 600; Percent complete: 15.0%; Average loss: 3.1570
Iteration: 800; Percent complete: 20.0%; Average loss: 3.0571
Iteration: 1000; Percent complete: 25.0%; Average loss: 2.9542
Iteration: 1200; Percent complete: 30.0%; Average loss: 2.9113
Iteration: 1400; Percent complete: 35.0%; Average loss: 2.8514
Iteration: 1600; Percent complete: 40.0%; Average loss: 2.7897
Iteration: 1800; Percent complete: 45.0%; Average loss: 2.7300
Iteration: 2000; Percent complete: 50.0%; Average loss: 2.6705
Iteration: 2200; Percent complete: 55.0%; Average loss: 2.6352
Iteration: 2400; Percent complete: 60.0%; Average loss: 2.5717
Iteration: 2600; Percent complete: 65.0%; Average loss: 2.5346
Iteration: 2800; Percent complete: 70.0%; Average loss: 2.4772
Iteration: 3000; Percent complete: 75.0%; Average loss: 2.4238
Iteration: 3200; Percent complete: 80.0%; Average loss: 2.3706
Iteration: 3400; Percent complete: 85.0%; Average loss: 2.3249
Iteration: 3600; Percent complete: 90.0%; Average loss: 2.2789
Iteration: 3800; Percent complete: 95.0%; Average loss: 2.2169
Iteration: 4000; Percent complete: 100.0%; Average loss: 2.1741

Evaluation

# Set dropout layers to eval mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = GreedySearchDecoder(encoder, decoder)

# Begin chatting (uncomment and run the following line to begin)
# evaluateInput(encoder, decoder, searcher, voc)