How to Download and Explore the Iliad Dataset with Python

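TensorFlow is a free, open-source library for machine learning and artificial intelligence, widely used for training and deploying neural networks. It was developed by the Google Brain team and runs on a wide range of platforms. In this tutorial, we will learn how to download, load, and explore the well-known Iliad dataset.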

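The Iliad dataset contains several English translations of the same text, Homer's Iliad, each by a different translator. The TensorFlow team has preprocessed these files so that they focus on the text of each translation. The dataset is available at the following URL: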

https://storage.googleapis.com/download.tensorflow.org/data/illiad/

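Example: In the example below, we use the work of three translators: William Cowper, Edward, Earl of Derby, and Samuel Butler. With the help of TensorFlow, we load their translations and label each line of text with the translator it came from.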

Install the TensorFlow Text package:

pip install "tensorflow-text==2.8.*"

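Download and load the Iliad dataset

Each file needs its own label, so we apply the Dataset.map function to every per-file dataset. This returns (example, label) pairs.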

import pathlib
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization
import tensorflow_datasets as tfds
import tensorflow_text as tf_text
  
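print("Welcome to GeeksforGeeks")
print("Loading the Iliad dataset")

# Base URL of the translation files (the path itself is spelled "illiad").
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

# Download each file into the local Keras cache; get_file returns the local path.
for name in FILE_NAMES:
    text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

# All three files are cached in the same directory.
parent_dir = pathlib.Path(text_dir).parent

def labeler(example, index):
    # Pair a line of text with its file index as an int64 label.
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []

# One line-by-line dataset per file, labeled 0 (Cowper), 1 (Derby), 2 (Butler).
for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

labeled_data_sets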

Output:

[<MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>,
 <MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>,
 <MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>]
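To see what the labeling step produces without downloading anything, the following is a minimal sketch that applies the same labeler function to small in-memory datasets. The fake_files list and its sentences are hypothetical stand-ins for the real translation files, and Dataset.from_tensor_slices is used here in place of TextLineDataset purely for illustration.

```python
import tensorflow as tf

# Hypothetical stand-ins for the lines of two translation files.
fake_files = [
    ["first line of file A", "second line of file A"],
    ["first line of file B"],
]

def labeler(example, index):
    # Same idea as in the tutorial: pair each line with an int64 label.
    return example, tf.cast(index, tf.int64)

labeled = []
for i, lines in enumerate(fake_files):
    ds = tf.data.Dataset.from_tensor_slices(lines)
    # map() traces the function right away, so each dataset keeps its own i.
    labeled.append(ds.map(lambda ex: labeler(ex, i)))

# Collect (text, label) pairs as plain Python values for inspection.
pairs = [
    (text.numpy().decode("utf-8"), int(label.numpy()))
    for ds in labeled
    for text, label in ds
]
print(pairs)
```

Each element is a (string, int64) pair, which matches the element_spec shown in the output above.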

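Concatenate and shuffle the datasets. The Dataset.concatenate function joins the labeled datasets into one, and the shuffle function randomizes the order of its elements. We then print a few examples.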

BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000
  
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)
  
all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)
for text, label in all_labeled_data.take(5):
    print("Sentence: ", text.numpy())
    print("Label:", label.numpy())

Output:

Sentence:  b"Of brass, and color'd with a ring of gold."
Label: 0
Sentence:  b'drove the horses in among the others.'
Label: 2
Sentence:  b'Into the boundless ether. Reaching soon'
Label: 0
Sentence:  b"Drive to the ships, for pain weigh'd down his soul."
Label: 1
Sentence:  b"Not one is station'd to protect the camp."
Label: 1
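The effect of reshuffle_each_iteration=False can be checked on toy data. The sketch below is illustrative only: the two integer ranges and their labels are hypothetical stand-ins for the labeled translation lines. It concatenates two labeled datasets, shuffles them once, and iterates twice to confirm that the order stays the same between passes.

```python
import tensorflow as tf

# Hypothetical toy data: 0-4 carry label 0, 5-9 carry label 1.
first = tf.data.Dataset.range(0, 5).map(
    lambda x: (x, tf.constant(0, tf.int64)))
second = tf.data.Dataset.range(5, 10).map(
    lambda x: (x, tf.constant(1, tf.int64)))

# Join the datasets end to end, then shuffle with a buffer that
# covers every element.
combined = first.concatenate(second)
shuffled = combined.shuffle(10, reshuffle_each_iteration=False)

# Iterate twice over the same shuffled dataset.
first_pass = [(int(x), int(y)) for x, y in shuffled.as_numpy_iterator()]
second_pass = [(int(x), int(y)) for x, y in shuffled.as_numpy_iterator()]

print(first_pass)
print("Same order on both passes:", first_pass == second_pass)
```

A fixed order matters when the shuffled dataset is later divided with take and skip, for example into training and validation parts: if the order changed on every pass, the two parts could overlap.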