Python 迭代器和生成器|极客教程

Python 迭代器和生成器，迭代器是一个对象，它使程序员可以遍历集合的所有元素，而不管其具体实现如何。

在 Python 中，迭代器是实现迭代器协议的对象。迭代器协议包含两种方法。 __iter__()方法必须返回迭代器对象，而next()方法必须返回序列中的下一个元素。

迭代器具有以下优点：

清洁代码
迭代器可以处理无限序列
迭代器节省资源

Python 有几个内置对象，它们实现了迭代器协议。例如列表，元组，字符串，字典或文件。

iterator.py

#!/usr/bin/env python

# iterator.py

str = "formidable"

for e in str:
   print(e, end=" ")

print()

it = iter(str)

print(it.next())
print(it.next())
print(it.next())

print(list(it))

在代码示例中，我们显示了字符串上的内置迭代器。在 Python 中，字符串是不可变的字符序列。 iter()函数返回对象的迭代器。我们还可以在迭代器上使用list()或tuple()函数。

$ ./iterator.py 
f o r m i d a b l e
f
o
r
['m', 'i', 'd', 'a', 'b', 'l', 'e']

Python 阅读一行

通过节省系统资源，我们的意思是在使用迭代器时，我们可以获取序列中的下一个元素，而无需将整个数据集保留在内存中。

read_data.py

#!/usr/bin/env python

# read_data.py

with open('data.txt', 'r') as f:

    while True:

        line = f.readline()

        if not line: 
            break

        else: 
            print(line.rstrip())

此代码打印data.txt文件的内容。除了使用 while 循环外，我们还可以应用迭代器来简化我们的任务。

read_data_iterator.py

#!/usr/bin/env python

# read_data_iterator.py

with open('data.txt', 'r') as f:

    for line in f:
        print(line.rstrip())

open()函数返回一个文件对象，它是一个迭代器。我们可以在 for 循环中使用它。通过使用迭代器，代码更清晰。

Python 迭代器协议

在下面的示例中，我们创建一个实现迭代器协议的自定义对象。

iterator_protocol.py

#!/usr/bin/env python

# iterator_protocol.py

class Seq:

   def __init__(self):

      self.x = 0

   def __next__(self):

      self.x += 1
      return self.x**self.x

   def __iter__(self):

      return self

s = Seq()
n = 0

for e in s:

   print(e)
   n += 1

   if n > 10:
      break

在代码示例中，我们创建了一个数字序列 1，4，27，256，…。这表明使用迭代器，我们可以处理无限序列。

def __iter__(self):

    return self

for 语句在容器对象上调用__iter__()函数。该函数返回一个定义方法__next__()的迭代器对象，该方法一次访问一个容器中的元素。

 def next(self):
    self.x += 1
    return self.x**self.x

next()方法返回序列的下一个元素。

if n > 10:
    break

因为我们正在处理无限序列，所以我们必须中断 for 循环。

$ ./iterator.py 
1
4
27
256
3125
46656
823543
16777216
387420489
10000000000
285311670611

StopIteration

循环可以以其他方式中断。在类定义中，我们必须引发一个StopIteration异常。在以下示例中，我们重做上一个示例。

stopiter.py

#!/usr/bin/env python

# stopiter.py

class Seq14:

   def __init__(self):
      self.x = 0

   def __next__(self):

      self.x += 1

      if self.x > 14:
         raise StopIteration

      return self.x ** self.x

   def __iter__(self):
      return self

s = Seq14()

for e in s:
   print(e)

该代码示例将打印序列的前 14 个数字。

if self.x > 14:
    raise StopIteration

StopIteration异常将停止 for 循环。

$ ./stop_iter.py 
1
4
27
256
3125
46656
823543
16777216
387420489
10000000000
285311670611
8916100448256
302875106592253
11112006825558016

Python 生成器

Generator 是一个特殊的例程，可用于控制循环的迭代行为。生成器类似于返回数组的函数。生成器具有参数，可以调用它并生成数字序列。但是，与返回整个数组的函数不同，生成器一次生成一个值。这需要较少的内存。

Python 中的生成器：

用 def 关键字定义
使用yield关键字
可以使用几个yield关键字
返回一个迭代器

让我们看一个生成器示例。

simple_generator.py

#!/usr/bin/env python

# simple_generator.py

def gen():

   x, y = 1, 2
   yield x, y

   x += 1
   yield x, y

g = gen()

print(next(g))
print(next(g))

try:
   print(next(g))

except StopIteration:
   print("Iteration finished")

该程序将创建一个非常简单的生成器。

def gen():

   x, y = 1, 2
   yield x, y

   x += 1
   yield x, y

就像正常函数一样，生成器使用def关键字定义。我们在生成器主体内使用两个yield关键字。 yield关键字退出生成器并返回一个值。下次调用迭代器的next()函数时，我们在yield关键字之后的行继续。请注意，局部变量会在整个迭代过程中保留。当没有余量可产生时，将引发StopIteration异常。

$ ./generator.py 
(1, 2)
(2, 2)
Iteration finished

在以下示例中，我们计算斐波纳契数。序列的第一个数字为 0，第二个数字为 1，并且每个后续数字等于序列本身前两个数字的和。

fibonacci_gen.py

#!/usr/bin/env python

# fibonacci_gen.py

import time

def fib():

   a, b = 0, 1

   while True:
      yield b

      a, b = b, a + b

g = fib()

try:
   for e in g:
      print(e)

      time.sleep(1)

except KeyboardInterrupt:
   print("Calculation stopped")

该脚本将 Fibonacci 数字连续打印到控制台。用 Ctrl + C 组合键终止。

Python 生成器表达式

生成器表达式类似于列表推导。区别在于生成器表达式返回的是生成器，而不是列表。

generator_expression.py

#!/usr/bin/env python

# generator_expression.py

n = (e for e in range(50000000) if not e % 3)

i = 0

for e in n:
    print(e)

    i += 1

    if i > 100:
        raise StopIteration

该示例计算的值可以除以 3，而没有余数。

n = (e for e in range(50000000) if not e % 3)

将使用圆括号创建生成器表达式。在这种情况下，创建列表推导式将非常低效，因为该示例将不必要地占用大量内存。为此，我们创建了一个生成器表达式，该表达式根据需要延迟生成值。

i = 0

for e in n:
    print(e)

    i += 1

    if i > 100:
        raise StopIteration

在 for 循环中，我们使用生成器生成 100 个值。我们这样做时并没有大量使用内存。

在下一个示例中，我们使用生成器表达式在 Python 中创建类似 grep 的实用程序。

roman_empire.txt

The Roman Empire (Latin: Imperium Rōmānum; Classical Latin: [ɪmˈpɛ.ri.ũː roːˈmaː.nũː] 
Koine and Medieval Greek: Βασιλεία τῶν Ῥωμαίων, tr. Basileia tōn Rhōmaiōn) was the 
post-Roman Republic period of the ancient Roman civilization, characterized by government 
headed by emperors and large territorial holdings around the Mediterranean Sea in Europe, 
Africa and Asia. The city of Rome was the largest city in the world c. 100 BC – c. AD 400, 
with Constantinople (New Rome) becoming the largest around AD 500,[5][6] and the Empire's 
populace grew to an estimated 50 to 90 million inhabitants (roughly 20% of the world's 
population at the time).[n 7][7] The 500-year-old republic which preceded it was severely 
destabilized in a series of civil wars and political conflict, during which Julius Caesar 
was appointed as perpetual dictator and then assassinated in 44 BC. Civil wars and executions 
continued, culminating in the victory of Octavian, Caesar's adopted son, over Mark Antony and 
Cleopatra at the Battle of Actium in 31 BC and the annexation of Egypt. Octavian's power was 
then unassailable and in 27 BC the Roman Senate formally granted him overarching power and 
the new title Augustus, effectively marking the end of the Roman Republic.

我们使用此文本文件。

generator_expression.py

#!/usr/bin/env python

# gen_grep.py

import sys

def grep(pattern, lines):
    return ((line, lines.index(line)+1) for line in lines if pattern in line)

file_name = sys.argv[2]
pattern = sys.argv[1]

with open(file_name, 'r') as f:
    lines = f.readlines()

    for line, n in grep(pattern, lines):
        print(n, line.rstrip())

该示例从文件中读取数据，并打印包含指定模式及其行号的行。

def grep(pattern, lines):
    return ((line, lines.index(line)+1) for line in lines if pattern in line)

类似于 grep 的实用程序使用此生成器表达式。该表达式遍历行列表并选择包含模式的行。它计算列表中该行的索引，即它在文件中的行号。

with open(file_name, 'r') as f:
    lines = f.readlines()

    for line, n in grep(pattern, lines):
        print(n, line.rstrip())

我们打开文件进行读取，然后对数据调用grep()函数。该函数返回一个生成器，该生成器通过 for 循环遍历。

$ ./gen_grep.py Roman roman_empire.txt 
1 The Roman Empire (Latin: Imperium Rōmānum; Classical Latin: [ɪmˈpɛ.ri.ũː roːˈmaː.nũː]
3 post-Roman Republic period of the ancient Roman civilization, characterized by government
13 then unassailable and in 27 BC the Roman Senate formally granted him overarching power and
14 the new title Augustus, effectively marking the end of the Roman Republic.

文件中有四行包含“罗马”字。