Python generator vs iterator

Question:

So I have a custom iterator, represented by the class CustomIterator below. The last print shows the total size of all the data it uses when getting one character from some string: 1170 bytes.

import sys

s = "some string"  # example input (the original value of s was not shown)

class CustomIterator:
    def __init__(self, collection: str):
        self.__position = 0
        self.__collection = collection

    def __iter__(self):  
        return self

    def __next__(self):   
        while self.__position < len(self.__collection):  
            char = self.__collection[self.__position]    
            self.__position += 1                         
            return char.upper()                         
        raise StopIteration                            

iterator = CustomIterator(s)                            
print(f'{sys.getsizeof(CustomIterator) + sys.getsizeof(iterator) + sys.getsizeof(next(iterator))}') #1170

I also have a generator, represented by a function with the yield operator below. The last print here means the same as for the iterator: 154 bytes.

#Generator
def generator(s: str):
    for char in s:
        yield char.upper()

g = generator(s)
print(f'{sys.getsizeof(generator(s)) + sys.getsizeof(next(g))}') #154

Both pieces of code produce the same results. So how exactly does the yield operator work in Python? I supposed it inherits the next method from a base iterator and overrides it. Is that right? If so, shouldn't the resource requirements be the same for both?

I tried to find the answer in the docs and in some articles on Google.

Asked By: Leo


Answers:

You are correct in several aspects, including when you ask "If so, shouldn't the resource requirements be the same for both?" – yes, both forms use essentially the same resources, and whether one happens to be more performant is due more to implementation details than to anything fundamental.

For example, since pure-Python iterators written as a class implementing __next__ require Python function calls, they are likely slower than a generator function with yield up to Python 3.10, but not necessarily on Python 3.11, where the overhead of function calls was reduced. On the other hand, the PyPy implementation should show no difference between Python code written by the user in a __next__ method and the internal code the runtime executes for generator functions.

In an ideal world, they would perform the same. And the "big O" algorithmic factors for both forms are certainly the same in any (reasonable) Python implementation.

This leaves us with the differences in form:
Indeed, when one writes a function whose body includes the yield keyword (even in an unreachable section of the code), the function is, at compile time, created as a "generator function": this means that when it is called, none of the visible code in it is executed. Instead, Python creates and returns an instance of a "generator object". This object features the methods __next__, send and throw, which can subsequently be used to drive the generator:
in this sense a generator works the same as a user-implemented iterator.
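A minimal sketch of this behavior (names are illustrative): calling a generator function runs none of its body, and the returned object carries __next__, send and throw:

```python
def gen():
    print("body started")       # not executed until the first next()
    x = yield 1                 # send() delivers a value into x
    yield x * 2

g = gen()                       # nothing printed: the body has not run yet
print(hasattr(g, "__next__"), hasattr(g, "send"), hasattr(g, "throw"))
# True True True
print(next(g))                  # runs the body up to the first yield -> 1
print(g.send(10))               # resumes, binding 10 to x -> 20
```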

As for the output of sys.getsizeof: this is certainly a thing that should not concern you. The output of this function is not a reliable metric, as it does not include the sizes of any referenced objects. An instance of a user class will typically have an associated full-size dictionary, for example (although this has also been optimized in recent CPython releases). All in all, the difference in total bytes between a generator created by a generator function and an iterator created by a user class might be a couple hundred bytes in favor of the generator function: but this won't make any difference in most workflows, unless one is creating hundreds (or, for large server processes, tens of thousands) of generators to be used in parallel (i.e. creating new ones before older ones have been consumed and removed from memory).
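As an illustration (exact byte counts vary across Python versions and platforms), sys.getsizeof is shallow: it reports only the container itself, not what it references:

```python
import sys

small = ["a"]
big = ["a" * 1_000_000]     # one reference to a megabyte-sized string

# Both lists hold exactly one reference, so their *shallow* sizes are equal,
# even though the second list keeps a huge string alive.
print(sys.getsizeof(small) == sys.getsizeof(big))   # True
print(sys.getsizeof(big[0]) > 1_000_000)            # True: the string itself is big
```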

And even then, the user class could be optimized (with the use of __slots__ and other techniques).

In your comparison, in particular:


print(f'{sys.getsizeof(CustomIterator) + sys.getsizeof(iterator) + sys.getsizeof(next(iterator))}') #1170

You are getting the size of the class object itself – it is an instance of type, and certainly uses some extra memory (sys.getsizeof(CustomIterator)) – but the roughly 1000 extra bytes you see are not much: this amount is allocated exactly once (*) and remains in use for the lifetime of your process. Each iterator instance uses a further amount of memory, which is freed when the iterator is no longer used.

As for the internal state of a generator created by a generator function, which seems to be the other thing that concerns you: it is of course not magic – it is maintained in an introspectable object called an "execution frame". The generator object returned when you call a generator function has a .gi_frame attribute, and you can inspect its internal local variables at .gi_frame.f_locals. The same state keeping takes place, in a nested way, when you run for char in s:. The difference there is that the for statement creates an iterator over s which is not directly accessible from Python code. But you could do iter_s = iter(s), then for char in iter_s:, and see some of the state you want in the iter_s object (this won't expose internal state, like the variable used as a counter, to Python – but the __next__ method is there).
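For example (a minimal sketch; gi_frame is a CPython implementation detail), you can see the generator's saved locals through its suspended frame:

```python
def upper_gen(s):
    for char in s:
        yield char.upper()

g = upper_gen("abc")
next(g)                              # advance once so the locals are populated
# The paused state lives in an ordinary frame object:
print(g.gi_frame.f_locals["char"])   # 'a'
print(g.gi_frame.f_locals["s"])      # 'abc'
```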

(*) If you happen to put your class statement, with its body and all, inside a loop or a function, it will be executed again each time that code runs – but that would simply be incorrect code.

Answered By: jsbueno