How can TensorFlow and Python be used to split Unicode strings and specify byte offsets?

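A Unicode string can be split into characters with the 'unicode_split' method, and the byte offset of each character can be obtained with the 'unicode_decode_with_offsets' method. Both are found in the 'tf.strings' module of the 'tensorflow' package.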

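Unicode strings are first represented in Python and then manipulated with the Unicode equivalents of standard string operations. With those equivalents, a Unicode string can also be separated into tokens based on script detection.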

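We use Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires no configuration, and provides free access to GPUs (Graphical Processing Units). Colaboratory is built on top of Jupyter Notebook.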

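import tensorflow as tf

# Sample string as defined in the cited TensorFlow tutorial
thanks = u'Thanks 😊'.encode('UTF-8')
print("Split unicode strings")
# One substring per character; the result is computed but not printed
tf.strings.unicode_split(thanks, 'UTF-8').numpy()
# Codepoints plus the starting byte offset of each character
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')
print("Printing byte offset for characters")
for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
    print("At byte offset {}: codepoint {}".format(offset, codepoint))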

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode


Output

Split unicode strings
Printing byte offset for characters
At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882

Explanation

The tf.strings.unicode_split operation splits a Unicode string into substrings of individual characters.
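As a rough illustration of what such a split produces, the following pure-Python sketch mimics the result without TensorFlow. The helper name split_utf8_chars is hypothetical and not part of any library, and it assumes the input is valid UTF-8.

```python
def split_utf8_chars(data: bytes) -> list:
    """Return one UTF-8 byte string per Unicode character in `data`."""
    # Decode once so that iteration is per character, not per byte.
    text = data.decode("utf-8")
    # Re-encode each character on its own to get its byte substring.
    return [ch.encode("utf-8") for ch in text]


# Same sample string as in the cited TensorFlow tutorial.
thanks = "Thanks 😊".encode("utf-8")
chars = split_utf8_chars(thanks)
print(chars)
```

The ASCII letters each take one byte, while the emoji becomes a single four-byte substring. Joining the pieces back together reproduces the original byte string.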
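It is often useful to align the character tensor produced by tf.strings.unicode_decode with the original string.
To do that, the byte offset at which each character begins must be known.
The tf.strings.unicode_decode_with_offsets method is similar to unicode_decode, except that it also returns a second tensor containing the starting byte offset of each character.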
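To show where such offsets come from, here is a minimal pure-Python sketch that computes codepoints and starting byte offsets by hand and then uses the offsets to slice the original bytes. The function name decode_with_offsets is hypothetical; it only approximates the TensorFlow operation for valid UTF-8 input and returns plain lists rather than tensors.

```python
def decode_with_offsets(text: str):
    """Return (codepoints, offsets); offsets are UTF-8 byte positions."""
    codepoints, offsets = [], []
    position = 0
    for ch in text:
        codepoints.append(ord(ch))           # Unicode codepoint of the character
        offsets.append(position)             # byte index where it starts
        position += len(ch.encode("utf-8"))  # advance by its encoded width
    return codepoints, offsets


text = "🎈🎉🎊"
codepoints, offsets = decode_with_offsets(text)

# Use the offsets to cut the encoded bytes back into characters.
encoded = text.encode("utf-8")
ends = offsets[1:] + [len(encoded)]
pieces = [encoded[start:end].decode("utf-8") for start, end in zip(offsets, ends)]

for codepoint, offset, piece in zip(codepoints, offsets, pieces):
    print("At byte offset {}: codepoint {} ({})".format(offset, codepoint, piece))
```

Each of these emoji occupies four bytes in UTF-8, which is why the offsets advance in steps of 4 (0, 4, 8).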