How can the Iliad dataset be prepared for training using Python?

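TensorFlow is a machine learning framework provided by Google. It is an open-source framework used with Python to implement algorithms, deep learning applications, and more. It is used for both research and production.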

The 'tensorflow' package can be installed on Windows with the following command:

pip install tensorflow

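A tensor is the data structure used in TensorFlow. Tensors are carried along the edges of a flow diagram known as the 'data flow graph', connecting its operations. A tensor is simply a multidimensional array or list.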
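To make the "multidimensional array or list" description concrete, here is a minimal sketch in plain Python rather than TensorFlow. The helper `shape_of` is a hypothetical name introduced only for illustration, and it assumes regular (non-ragged) nested lists. TensorFlow tensors expose the same two properties through `tf.Tensor.shape` and `tf.rank`.

```python
def shape_of(nested):
    """Return the shape of a regular (non-ragged) nested list as a tuple."""
    shape = []
    current = nested
    # Descend along the first element of each level, recording its length.
    while isinstance(current, list):
        shape.append(len(current))
        if not current:
            break
        current = current[0]
    return tuple(shape)


scalar = 7                        # rank 0: a single value
vector = [1, 2, 3]                # rank 1: a list of values
matrix = [[1, 2, 3], [4, 5, 6]]   # rank 2: a list of lists

for value in (scalar, vector, matrix):
    shape = shape_of(value)
    # The rank is the number of dimensions, i.e. the length of the shape.
    print("shape:", shape, "rank:", len(shape))
```

The rank is simply how many levels of nesting there are, and the shape records the length at each level.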

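We will use the Iliad dataset, which contains text from three translations of the work, by William Cowper, Edward (Earl of Derby), and Samuel Butler. The model is trained to identify the translator when given a single line of text. The text files used have been preprocessed, which includes removing the document header and footer, line numbers, and chapter titles.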
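The `tokenize(text, unused_label)` function in the example takes two arguments, which suggests that each dataset element is a (text, label) pair, with the label identifying the translator. The sketch below shows that pairing in plain Python under that assumption. The sample lines are placeholders rather than real lines from the dataset, and `label_lines` is a hypothetical helper, not part of TensorFlow.

```python
# Translator names come from the text above; integer labels are an assumption.
TRANSLATORS = ["William Cowper", "Edward, Earl of Derby", "Samuel Butler"]

# Placeholder lines only: these are NOT real lines from the dataset.
placeholder_lines = {
    0: ["<a line from Cowper's translation>", "<another Cowper line>"],
    1: ["<a line from Derby's translation>"],
    2: ["<a line from Butler's translation>"],
}


def label_lines(lines_by_translator):
    """Flatten {label: [lines]} into a list of (line, label) pairs."""
    labeled = []
    for label, lines in sorted(lines_by_translator.items()):
        for line in lines:
            labeled.append((line, label))
    return labeled


all_pairs = label_lines(placeholder_lines)
for line, label in all_pairs:
    print(label, TRANSLATORS[label], line)
```

A classifier trained on such pairs learns to predict the integer label from the line of text alone.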

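We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser, requires zero configuration, and offers free access to GPUs (Graphical Processing Units). Colaboratory is built on top of Jupyter Notebook.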


Example

The following is the code snippet:

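import tensorflow_text as tf_text

# `all_labeled_data` is assumed to be the tf.data.Dataset of (text, label)
# pairs built in an earlier step of the source tutorial.
print("Prepare the dataset for training")
tokenizer = tf_text.UnicodeScriptTokenizer()
print("Defining a function named 'tokenize' to tokenize the text data")
def tokenize(text, unused_label):
    # Lower-case the text, then split it into tokens.
    lower_case = tf_text.case_fold_utf8(text)
    return tokenizer.tokenize(lower_case)
# Apply the tokenizer to every element; the label is dropped here.
tokenized_ds = all_labeled_data.map(tokenize)
print("Iterate over the dataset and print a few samples")
for text_batch in tokenized_ds.take(6):
    print("Tokens: ", text_batch.numpy())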

Code credit: https://www.tensorflow.org/tutorials/load_data/text

Output

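Prepare the dataset for training
Defining a function named 'tokenize' to tokenize the text data
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: batch_gather (from
tensorflow.python.ops.array_ops) is deprecated and will be removed after 2017-10-25.
Instructions for updating:
`tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims=-1` instead.
Iterate over the dataset and print a few samples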
Tokens:  [b'but' b'i' b'have' b'now' b'both' b'tasted' b'food' b',' b'and' b'given']
Tokens:  [b'all' b'these' b'shall' b'now' b'be' b'thine' b':' b'but' b'if' b'the'
 b'gods']
Tokens:  [b'their' b'spiry' b'summits' b'waved' b'.' b'there' b',' b'unperceived']
Tokens:  [b'"' b'i' b'pray' b'you' b',' b'would' b'you' b'show' b'your' b'love'
 b',' b'dear' b'friends' b',']
Tokens:  [b'entering' b'beneath' b'the' b'clavicle' b'the' b'point']
Tokens:  [b'but' b'grief' b',' b'his' b'father' b'lost' b',' b'awaits' b'him'
 b'now' b',']
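
The token lists above show two effects of the `tokenize` function: text is lower-cased, and punctuation becomes separate tokens. The following is a rough approximation of that behavior using only Python's standard library. It is not the `UnicodeScriptTokenizer` algorithm, which splits on Unicode script boundaries; `approx_tokenize` is a hypothetical helper that only separates runs of letters from runs of other non-whitespace characters. The sample line is chosen to resemble the second token list, and its capitalization is a guess.

```python
import re

# Match either a run of letters, or a run of other non-whitespace characters
# (punctuation, digits, underscores). Whitespace is skipped entirely.
_TOKEN_PATTERN = re.compile(r"[^\W\d_]+|(?:[^\w\s]|[\d_])+")


def approx_tokenize(text):
    """Lower-case the text and split it into word and punctuation tokens."""
    lowered = text.casefold()
    return _TOKEN_PATTERN.findall(lowered)


# Assumed sample input; the original casing in the dataset is not known.
sample = "All these shall now be thine: but if the gods"
print(approx_tokenize(sample))
```

Unlike the TensorFlow version, this sketch returns Python strings rather than byte strings, and it may differ from `UnicodeScriptTokenizer` on text that mixes several scripts.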