Chunking API response cuts off required data

Question:

I am reading an API response in chunks using the following code:

d = zlib.decompressobj(zlib.MAX_WBITS|16)  # for gzip
for i in range(0, len(data), 4096):
    chunk = data[i:i+4096]
    # print(chunk)
    str_chunk = d.decompress(chunk)
    str_chunk = str_chunk.decode()
    # print(str_chunk)
    if '"@odata.nextLink"' in str_chunk:
        ab = '{' + str_chunk[str_chunk.index('"@odata.nextLink"'):len(str_chunk)+1]
        ab = ast.literal_eval(ab)
        url = ab['@odata.nextLink']
        return url

An example of this working is:
"@odata.nextLink":"someurl?$count=true

It works in most cases, but sometimes this key-value pair gets cut off and appears like this:
"@odata.nextLink":"someurl?$coun

I can play around with the chunk size in the line for i in range(0, len(data), 4096), but that doesn't guarantee the data never gets cut off, since the data size can be different for each page.

How can I ensure that this key-value pair is never cut off? Also, note that this key-value pair is the last line / last key-value pair of the API response.

P.S.: I can’t play around with API request parameters.

I even tried reading the data backwards, but this gives an incorrect-header error:

for i in range(len(data), 0, -4096):
    chunk = data[i - 4096:i]
    str_chunk = d.decompress(chunk)
    str_chunk = str_chunk.decode()
    if '"@odata.nextLink"' in str_chunk:
        ab = '{' + str_chunk[str_chunk.index('"@odata.nextLink"'):len(str_chunk)+1]
        ab = ast.literal_eval(ab)
        url = ab['@odata.nextLink']
        #print(url)
        return url

The above produces the following error, which is really strange:

str_chunk = d.decompress(chunk)
zlib.error: Error -3 while decompressing data: incorrect header check
Asked By: qwerty


Answers:

str_chunk is a contiguous sequence of bytes from the API response that can start anywhere in the response, and end anywhere in the response. Of course it will sometimes end in the middle of some semantic content.

(New information from a comment that the OP neglected to put in the question; in fact, it is still not in the question. The OP requires that the entire uncompressed content not be saved in memory.)

If "@odata.nextLink" is a reliable marker for what you’re looking for, then keep the last two decompressed chunks, concatenate those, then look for that marker. Once found, continue to read more chunks, concatenating them, until you have the full content you’re looking for.

Answered By: Mark Adler

If the approach that Mark suggested in his answer is sufficient for you, it’s probably a good compromise and there is no need to over-engineer it.

However, more generally, if you want to extract information from a stream, the "proper" way of doing it is to parse the text character by character. That way, you avoid any issues with chunk boundaries.

For example, let’s say that we want to extract values that are surrounded with @ symbols:

Lorem ipsum dolor @sit amet, consectetur@ adipiscing elit. Mauris
dapibus fermentum orci, vitae commodo odio suscipit et. Etiam
pellentesque turpis ut leo malesuada, quis scelerisque turpis condimentum.
Nulla consequat velit id pretium bibendum. Suspendisse potenti. Ut id
sagittis ante, quis tempor mauris. Sed volutpat sem a purus malesuada
varius. Pellentesque sit amet dolor at velit tristique fermentum. In
feugiat mauris ut @diam viverra aliquet.@ Morbi quis eros interdum,
lacinia mi at, suscipit lectus.

Donec in magna sed mauris auctor sollicitudin. Aenean molestie, diam sed 
aliquet malesuada, eros nunc ornare nunc, at bibendum ligula nulla et eros. 
Maecenas posuere eleifend elementum. Ut bibendum at arcu quis aliquam. Aliquam 
erat volutpat. Fusce luctus libero ac nisi lobortis lacinia. Aliquam ac rutrum 
odio. In hac habitasse platea dictumst. Vestibulum semper ullamcorper commodo. 
In hac habitasse platea dictumst. @Aenean ut pulvinar magna.@ Donec at euismod 
erat, eu iaculis metus. Proin vulputate mollis arcu, ut efficitur ligula 
fermentum et. Suspendisse tincidunt ultricies urna quis congue. Interdum et 
malesuada fames ac ante ipsum primis in faucibus. 

This can be done by creating a generator that parses the incoming stream and extracts a sequence of values:

import io
import typing

# Suppose this file is extremely long and doesn't fit into memory.
input_file = io.BytesIO(b"""
Lorem ipsum dolor @sit amet, consectetur@ adipiscing elit. Mauris
dapibus fermentum orci, vitae commodo odio suscipit et. Etiam
pellentesque turpis ut leo malesuada, quis scelerisque turpis condimentum.
Nulla consequat velit id pretium bibendum. Suspendisse potenti. Ut id
sagittis ante, quis tempor mauris. Sed volutpat sem a purus malesuada
varius. Pellentesque sit amet dolor at velit tristique fermentum. In
feugiat mauris ut @diam viverra aliquet.@ Morbi quis eros interdum,
lacinia mi at, suscipit lectus.

Donec in magna sed mauris auctor sollicitudin. Aenean molestie, diam sed 
aliquet malesuada, eros nunc ornare nunc, at bibendum ligula nulla et eros. 
Maecenas posuere eleifend elementum. Ut bibendum at arcu quis aliquam. Aliquam 
erat volutpat. Fusce luctus libero ac nisi lobortis lacinia. Aliquam ac rutrum 
odio. In hac habitasse platea dictumst. Vestibulum semper ullamcorper commodo. 
In hac habitasse platea dictumst. @Aenean ut pulvinar magna.@ Donec at euismod 
erat, eu iaculis metus. Proin vulputate mollis arcu, ut efficitur ligula 
fermentum et. Suspendisse tincidunt ultricies urna quis congue. Interdum et 
malesuada fames ac ante ipsum primis in faucibus.
""")

# This is a generator function which is essentially a custom iterator.
def extract_marked_values(raw_input_stream: typing.BinaryIO):
    # This will make the stream buffered and ensures that 'read(1)' is not extremely slow.
    # On top of that, it decodes the stream into UTF-8, thus the result is of type 'str' and not 'bytes'.
    text_input_stream = io.TextIOWrapper(raw_input_stream, encoding="utf-8")

    # Go through the text character by character and parse it.
    # Once a complete value return it with 'yield'.
    current_value: typing.Optional[str] = None
    while character := text_input_stream.read(1):
        if current_value is None:
            if character == "@":
                current_value = ""
        else:
            if character == "@":
                yield current_value
                current_value = None
            else:
                current_value += character

for value in extract_marked_values(input_file):
    print(value)

The trick here is that the parser is able to go character by character, so it doesn’t have to care about the boundaries between the chunks. (The chunks still exist; TextIOWrapper will internally read the input in chunks.)

You can generalize this to your problem. If your syntax is very complex, you can break it up into multiple steps: first extract the relevant substring, then, in a second step, extract the information from it.
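Applied to the question, a sketch of that two-step idea could look like this (extract_next_link and the use of gzip.GzipFile are my own illustration, assuming a gzip-compressed UTF-8 response whose URL contains no quote characters):

import gzip
import io
import typing

def extract_next_link(raw_compressed_stream: typing.BinaryIO):
    # gzip.GzipFile decompresses lazily, so the full body never has to be
    # held in memory at once; TextIOWrapper turns it into characters.
    text_stream = io.TextIOWrapper(
        gzip.GzipFile(fileobj=raw_compressed_stream), encoding="utf-8")

    marker = '"@odata.nextLink":"'
    window = ""   # rolling window used to spot the marker
    url = None    # collected URL characters once the marker has been seen
    while character := text_stream.read(1):
        if url is None:
            # Step 1: slide a window over the input until it equals the marker.
            window = (window + character)[-len(marker):]
            if window == marker:
                url = ""
        elif character == '"':   # Step 2 done: closing quote of the URL value
            return url
        else:
            url += character
    return None

# Usage with the compressed bytes from the question:
# url = extract_next_link(io.BytesIO(data))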


When parsing more complex input, you don’t necessarily need to write the code to process each character one by one. Instead, you can create abstractions to help.

For example, you could write a Lexer class that wraps the stream and provides methods like lexer.try_consume("<marker>") or something similar.
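As a rough sketch (the class and its method names are hypothetical, not from an existing library), such a helper could look something like this:

import io

class Lexer:
    # Buffers a text stream so a parser can match markers without having to
    # worry about chunk boundaries itself.
    def __init__(self, text_stream, chunk_size=4096):
        self.stream = text_stream
        self.chunk_size = chunk_size
        self.buffer = ""

    def try_consume(self, marker):
        # Consume 'marker' if the upcoming input starts with it.
        while len(self.buffer) < len(marker):
            chunk = self.stream.read(self.chunk_size)
            if not chunk:
                break
            self.buffer += chunk
        if self.buffer.startswith(marker):
            self.buffer = self.buffer[len(marker):]
            return True
        return False

    def skip_until(self, marker):
        # Discard input until 'marker' has been consumed; False if it never appears.
        while marker not in self.buffer:
            chunk = self.stream.read(self.chunk_size)
            if not chunk:
                return False
            self.buffer = self.buffer[-len(marker):] + chunk
        self.buffer = self.buffer.split(marker, 1)[1]
        return True

    def read_until(self, sentinel):
        # Return everything up to 'sentinel', consuming the sentinel as well.
        while sentinel not in self.buffer:
            chunk = self.stream.read(self.chunk_size)
            if not chunk:
                raise ValueError("sentinel not found before end of stream")
            self.buffer += chunk
        value, self.buffer = self.buffer.split(sentinel, 1)
        return value

# Example: pull the nextLink URL out of a (here uncompressed) text stream.
lexer = Lexer(io.StringIO('{"value": [1, 2], "@odata.nextLink":"someurl?$count=true"}'))
if lexer.skip_until('"@odata.nextLink":"'):
    print(lexer.read_until('"'))  # -> someurl?$count=true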

Answered By: asynts