LSTM-Based Poetry Generation Using NLP in Python

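One of the main tasks in conversational AI is natural language generation (NLG): using a model to produce natural language. In this article, we implement NLG by building an LSTM-based poetry generator.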

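Dataset

The dataset used to build the model was obtained from Kaggle. It is a compilation of poems by many poets, stored as a text file. We can easily use this data to generate embeddings and then train the LSTM model. You can find the dataset here.

Below is an excerpt from the dataset.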

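Building the Text Generator

The text generator can be built in the following simple steps.

Step 1. Import the necessary libraries

First, we need to import the necessary libraries. We will use TensorFlow and Keras to build the bidirectional LSTM.

If any of the libraries imported below is not installed, install it from the terminal with the pip install [package-name] command.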

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow.keras.utils as ku 
from wordcloud import WordCloud
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers

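Step 2. Load the dataset and perform exploratory data analysis

Now we load our dataset by reading the text file with Python's built-in open(). We also perform some exploratory data analysis to understand the data better. Since we are dealing with text data, a convenient way to do this is to generate a word cloud.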

# Reading the text data file (the context manager closes the file afterwards)
with open('poem.txt', encoding="utf8") as f:
    data = f.read()
  
# EDA: Generating WordCloud to visualize
# the text
wordcloud = WordCloud(max_font_size=50,
                      max_words=100,
                      background_color="black").generate(data)
  
# Plotting the WordCloud
plt.figure(figsize=(8, 4))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig("WordCloud.png")
plt.show()

Output:

[Image: word cloud generated from the poem text]
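A word cloud is a visual rendering of word frequencies, so the same information can be inspected as plain numbers. The following is a minimal sketch using only the standard library; `sample_text` is a hypothetical stand-in for the `data` string, and the regular expression used to split words is a simplifying assumption rather than what WordCloud does internally.

```python
from collections import Counter
import re

# Hypothetical sample; in the article this would be the `data` string.
sample_text = "Stay, I said to the cut flowers. Stay, I said to the spider."

# Lowercase the text and treat runs of letters/apostrophes as words
# (a simplifying assumption about tokenization).
words = re.findall(r"[a-z']+", sample_text.lower())
word_counts = Counter(words)

# The five most frequent words with their counts
top_words = word_counts.most_common(5)
print(top_words)
```

Words with the highest counts here are the ones a word cloud would draw largest.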

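Step 3. Create the corpus

At this point, all of our data sits in one large text file. However, feeding the entire text to the model as a single input is not advisable, because it would lead to lower accuracy. Instead, we split the text into lines, and these lines are later used to generate the text embeddings for our model.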

# Generating the corpus by 
# splitting the text into lines
corpus = data.lower().split("\n")
print(corpus[:10])

Output:

['stay, i said',
 'to the cut flowers.',
 'they bowed',
 'their heads lower.',
 'stay, i said to the spider,',
 'who fled.',
 'stay, leaf.',
 'it reddened,',
 'embarrassed for me and itself.',
 'stay, i said to my body.']
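Splitting on "\n" also yields empty strings wherever the file has blank lines, for example between stanzas. The sketch below, on a hypothetical miniature of the file, shows one way to drop them. This filtering is an optional addition and not part of the original pipeline.

```python
# Hypothetical miniature of the poem file, with a blank line between stanzas
# and a trailing newline.
sample_data = "Stay, I said\nto the cut flowers.\n\nThey bowed\n"

# Same split as in the article
raw_lines = sample_data.lower().split("\n")

# Drop lines that are empty or contain only whitespace
clean_lines = [line.strip() for line in raw_lines if line.strip()]

print(len(raw_lines), len(clean_lines))
print(clean_lines)
```

Empty lines are harmless for the later steps, because a line with no tokens produces no training sequences, but filtering them keeps the corpus easier to inspect.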

Step 4. Fit the tokenizer on the corpus

To generate embeddings later, we need to fit a TensorFlow Tokenizer on the entire corpus so that it learns the vocabulary.

# Fitting the Tokenizer on the Corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
  
# Vocabulary count of the corpus
total_words = len(tokenizer.word_index)
  
print("Total Words:", total_words)

Output:

Total Words: 3807
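To make the vocabulary-learning step more concrete, here is a simplified pure-Python imitation of what fitting a tokenizer does: strip punctuation, lowercase, count words, and assign integer indices starting at 1 with the most frequent word first. The helper names (`split_words`, `build_word_index`, `to_sequence`) and the toy corpus are hypothetical, and `FILTERS` is only an approximation of the Keras default filter set; this is a sketch of the idea, not the Keras implementation.

```python
from collections import Counter

# Characters replaced by spaces before splitting (approximates the Keras default).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_STRIP = str.maketrans({ch: " " for ch in FILTERS})


def split_words(line):
    """Lowercase a line, remove punctuation, and split it on whitespace."""
    return line.lower().translate(_STRIP).split()


def build_word_index(lines):
    """Map each word to an integer rank: most frequent first, starting at 1."""
    counts = Counter()
    for line in lines:
        counts.update(split_words(line))
    # sorted() is stable, so equally frequent words keep first-seen order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequence(line, word_index):
    """Convert a line into a list of word indices, skipping unknown words."""
    return [word_index[w] for w in split_words(line) if w in word_index]


toy_corpus = ["stay, i said", "to the cut flowers.", "stay, i said to the spider,"]
word_index = build_word_index(toy_corpus)
print(len(word_index))
print(to_sequence("stay, i said", word_index))
```

Because indexing starts at 1, index 0 stays free for padding. That is why the later steps use total_words+1 as the vocabulary size.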

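Step 5. Generate the embeddings (vectorization)

We now generate embeddings for every sentence in the corpus. Embeddings, as the term is used here, are numeric vector representations of our text. This step is required because machine learning and deep learning models cannot consume unstructured text directly. First, we use the Keras Tokenizer method texts_to_sequences() to convert each line into a list of integer word indices, and from each line we build every prefix of two or more tokens (n-gram sequences). Then we compute the length of the longest sequence. Finally, we left-pad all sequences with zeros to that maximum length so that they are equally long. The last token of each padded sequence becomes the label, and the tokens before it become the predictors.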

# Converting the text into embeddings
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
  
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
  
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences,
                                         maxlen=max_sequence_len,
                                         padding='pre'))
predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
label = ku.to_categorical(label, num_classes=total_words+1)

Our text embeddings (the padded integer sequences) look like this:

array([[   0,    0,    0, ...,    0,    0,  266],
       [   0,    0,    0, ...,    0,  266,    3],
       [   0,    0,    0, ...,    0,    0,    4],
       ...,
       [   0,    0,    0, ...,    8, 3807,   15],
       [   0,    0,    0, ..., 3807,   15,    4],
       [   0,    0,    0, ...,   15,    4,  203]], dtype=int32)
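The prefix, padding, and predictor/label split can be traced by hand on a tiny example. The sketch below reimplements those steps in plain Python on two hypothetical token lists; `make_prefix_sequences` and `pad_pre` are illustrative helpers, not Keras functions.

```python
def make_prefix_sequences(token_list):
    """Return every prefix of the token list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]


def pad_pre(sequence, maxlen):
    """Left-pad with zeros; if too long, keep only the last `maxlen` tokens."""
    trimmed = sequence[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed


# Hypothetical tokenized lines (word indices)
token_lines = [[1, 2, 3], [4, 5, 6, 7]]

sequences = [seq for line in token_lines for seq in make_prefix_sequences(line)]
max_len = max(len(seq) for seq in sequences)
padded = [pad_pre(seq, max_len) for seq in sequences]

# All tokens but the last are the input; the last token is the word to predict
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

for x, y in zip(predictors, labels):
    print(x, "->", y)
```

Each line of n tokens contributes n-1 training examples, and padding on the left keeps the word to predict in the final position of every row.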

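Step 6. Build the bidirectional LSTM model

We have now completed all the preprocessing steps needed to feed text into our model, so it is time to build the model itself. Since this is a text generation use case, we create a bidirectional LSTM model, because meaning, which depends on the surrounding context, plays an important role here.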

# Building a Bi-Directional LSTM Model
model = Sequential()
model.add(Embedding(total_words+1, 100, 
                    input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150, return_sequences=True)))
model.add(Dropout(0.2))
model.add(LSTM(100))
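# Note: the original expression `total_words+1/2` evaluates to total_words + 0.5,
# because division binds tighter than addition. The summary below shows the layer
# ended up with total_words units, so that value is written out explicitly here.
model.add(Dense(total_words, activation='relu',
                kernel_regularizer=regularizers.l2(0.01)))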
model.add(Dense(total_words+1, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
print(model.summary())

The model summary is shown below.

Model: "sequential"
=================================================================
 Layer (type)                    Output Shape        Param #
=================================================================
 embedding (Embedding)           (None, 15, 100)     380800
 bidirectional (Bidirectional)   (None, 15, 300)     301200
 dropout (Dropout)               (None, 15, 300)     0
 lstm_1 (LSTM)                   (None, 100)         160400
 dense (Dense)                   (None, 3807)        384507
 dense_1 (Dense)                 (None, 3808)        14500864
=================================================================
Total params: 15,727,771
Trainable params: 15,727,771
Non-trainable params: 0
None
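The parameter counts in the summary can be checked with a few lines of arithmetic. The sketch below uses the standard formulas for embedding, LSTM, and dense layers (weights plus biases); the helper function names are hypothetical, while the layer sizes are taken from the summary above.

```python
def embedding_params(vocab_size, dim):
    """One vector of length `dim` per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (input_dim + units + 1) * units


def dense_params(input_dim, units):
    """A weight per input-output pair plus one bias per output unit."""
    return (input_dim + 1) * units


total_words = 3807

layer_params = {
    "embedding": embedding_params(total_words + 1, 100),
    # A bidirectional wrapper holds two LSTMs (forward and backward)
    "bidirectional": 2 * lstm_params(100, 150),
    # Input size is 300 because both directions' 150 outputs are concatenated
    "lstm_1": lstm_params(300, 100),
    "dense": dense_params(100, 3807),
    "dense_1": dense_params(3807, total_words + 1),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:15s}{count:>12,d}")
print(f"{'total':15s}{total_params:>12,d}")
```

The last dense layer accounts for most of the parameters, since it connects 3807 hidden units to all 3808 output classes.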

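The model follows a next-word prediction approach: we feed it a seed text, and it generates a poem by repeatedly predicting the word that comes next. That is why the output layer uses the softmax activation function, which is commonly used for multi-class classification; here, every word in the vocabulary is one class.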
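As a rough illustration of how softmax turns raw scores into a probability for each class, and how categorical cross-entropy scores a prediction against the true word, consider the following sketch. The three scores are hypothetical values for a three-word vocabulary, not output from the model in this article.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum avoids overflow without changing the result
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Loss for one example: negative log-probability of the true class."""
    return -math.log(probs[target_index])


# Hypothetical raw scores for a vocabulary of three words
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)

predicted_index = probs.index(max(probs))
loss_if_correct = cross_entropy(probs, 0)  # true word is the top-scoring one
loss_if_wrong = cross_entropy(probs, 2)    # true word is the lowest-scoring one

print(probs, predicted_index)
print(loss_if_correct, loss_if_wrong)
```

The loss is small when the true word receives a high probability and grows as that probability shrinks, which is the signal training uses to adjust the weights.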

Step 7. Train the model

With the model architecture in place, we now train it on the preprocessed text. Here, the model is trained for 150 epochs.

history = model.fit(predictors, label, epochs=150, verbose=1)

The last few training epochs are shown below.

Epoch 145/150
510/510 [==============================] - 132s 258ms/step - loss: 3.3349 - accuracy: 0.8555
Epoch 146/150
510/510 [==============================] - 130s 254ms/step - loss: 3.2653 - accuracy: 0.8561
Epoch 147/150
510/510 [==============================] - 129s 253ms/step - loss: 3.1789 - accuracy: 0.8696
Epoch 148/150
510/510 [==============================] - 127s 250ms/step - loss: 3.1063 - accuracy: 0.8727
Epoch 149/150
510/510 [==============================] - 128s 251ms/step - loss: 3.0314 - accuracy: 0.8787
Epoch 150/150

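We see that an accuracy of about 87% has been reached, which is quite good. Note that this is accuracy on the training data, since no validation data is passed to fit().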

It is recommended that you train the model on a GPU-enabled machine. If your system does not have a GPU, you can use Google Colab or Kaggle notebooks.
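To get a feel for the cost of training, the logged numbers allow a back-of-the-envelope estimate. The sketch below assumes every epoch takes about as long as the five logged ones, and that the default Keras batch size of 32 was in effect, since fit() is called without a batch_size argument.

```python
# Durations (seconds) of epochs 145-149, taken from the training log above
logged_epoch_seconds = [132, 130, 129, 127, 128]
epochs = 150
steps_per_epoch = 510   # the "510/510" shown in the log
batch_size = 32         # Keras default when batch_size is not passed to fit()

# Assumption: all 150 epochs take roughly as long as the logged ones
mean_epoch_seconds = sum(logged_epoch_seconds) / len(logged_epoch_seconds)
total_hours = mean_epoch_seconds * epochs / 3600

# The final batch may be partial, so the sample count is only bounded
min_samples = (steps_per_epoch - 1) * batch_size + 1
max_samples = steps_per_epoch * batch_size

print(f"about {total_hours:.1f} hours for {epochs} epochs")
print(f"between {min_samples} and {max_samples} training sequences")
```

Under these assumptions the full run takes more than five hours on the hardware that produced the log, which is why slower machines make this model impractical to train.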

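Step 8. Generate text using the trained model

In the final step, we use our model to generate poetry. As mentioned earlier, the model is based on next-word prediction, so we need to supply it with some seed text.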
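The iterative scheme itself is easy to see in isolation: predict one word, append it to the text, and feed the longer text back in. The sketch below replaces the trained network with a hypothetical lookup table (`TOY_MODEL`) that gives next-word probabilities based only on the last word. It illustrates the loop, not the model's behavior.

```python
# Hypothetical stand-in for the trained model: for each last word,
# a probability table over candidate next words.
TOY_MODEL = {
    "the": {"world": 0.6, "sea": 0.4},
    "world": {"is": 0.7, "sleeps": 0.3},
    "is": {"wide": 0.8, "the": 0.2},
    "wide": {"and": 0.9, "the": 0.1},
    "and": {"the": 1.0},
}


def predict_next(text):
    """Greedy choice: return the most probable word after the last word."""
    last_word = text.lower().split()[-1]
    distribution = TOY_MODEL[last_word]
    return max(distribution, key=distribution.get)


def generate(seed_text, next_words):
    """Extend the seed one word at a time, feeding the result back in."""
    text = seed_text
    for _ in range(next_words):
        text += " " + predict_next(text)
    return text


poem = generate("the world", 5)
print(poem)  # the world is wide and the world
```

Always taking the single most probable word, as done here and with argmax in the article's loop, can make the output fall into repeating phrases. Sampling from the predicted distribution is a common alternative, though it is not used in this article.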

seed_text = "The world"
next_words = 25

for _ in range(next_words):
    # Encode the text generated so far and left-pad it to the model's input length
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences(
        [token_list], maxlen=max_sequence_len-1,
        padding='pre')
    # Greedy decoding: pick the index of the most probable next word
    predicted = int(np.argmax(model.predict(token_list,
                                            verbose=0), axis=-1)[0])
    # Map the index back to its word (index 0 is padding and has no word)
    output_word = tokenizer.index_word.get(predicted, "")

    seed_text += " " + output_word

print(seed_text)

Output: