How can I choose the line separator when reading a file?
Question:
I am trying to read a file which contains one single 2.9 GB long line separated by commas. This code would read the file line by line, with each print stopping at '\n':
with open('eggs.txt', 'rb') as file:
    for line in file:
        print(line)
How can I instead iterate over "lines" that stop at ', '
(or any other character/string)?
Answers:
I don’t think there is a built-in way to achieve this. You will have to use file.read(block_size)
to read the file block by block, split each block at commas, and rejoin strings that go across block boundaries manually.
Note that you still might run out of memory if you don’t encounter a comma for a long time. (The same problem applies to reading a file line by line, when encountering a very long line.)
Here’s an example implementation:
def split_file(file, sep=",", block_size=16384):
    last_fragment = ""
    while True:
        block = file.read(block_size)
        if not block:
            break
        block_fragments = iter(block.split(sep))
        last_fragment += next(block_fragments)
        for fragment in block_fragments:
            yield last_fragment
            last_fragment = fragment
    yield last_fragment
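For example, split_file can be exercised like this (a minimal sketch; an in-memory StringIO and made-up sample data stand in for the real 2.9 GB file):

```python
import io

# split_file as defined in the answer above, repeated so this
# sketch runs on its own.
def split_file(file, sep=",", block_size=16384):
    last_fragment = ""
    while True:
        block = file.read(block_size)
        if not block:
            break
        block_fragments = iter(block.split(sep))
        last_fragment += next(block_fragments)
        for fragment in block_fragments:
            yield last_fragment
            last_fragment = fragment
    yield last_fragment

# An in-memory file stands in for eggs.txt here.
parts = list(split_file(io.StringIO("spam,eggs,ham")))
print(parts)  # ['spam', 'eggs', 'ham']
```

A small block_size forces fragments to span block boundaries, which is exactly the case the rejoining logic handles.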
Read the file a character at a time, and assemble the comma-separated lines:
def commaBreak(filename):
    word = ""
    with open(filename) as f:
        while True:
            char = f.read(1)
            if not char:
                print("End of file")
                yield word
                break
            elif char == ',':
                yield word
                word = ""
            else:
                word += char
You may choose to do something like this with a larger number of characters, e.g. 1000, read at a time.
with open('eggs.txt') as file:  # text mode, so each line is already a str
    for line in file:
        words = line.split(', ')
        for word in words:
            print(word)
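A variant of commaBreak that reads a fixed-size chunk at a time, as suggested above, might look like this (a sketch; the function name and the chunk_size default are illustrative):

```python
def comma_break_chunked(filename, chunk_size=1000):
    # Like commaBreak, but reads chunk_size characters per read()
    # call, carrying any partial word over into the next chunk.
    word = ""
    with open(filename) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts = chunk.split(',')
            word += parts[0]
            for part in parts[1:]:
                yield word
                word = part
    yield word
```

Because words are only yielded at commas, a chunk that ends mid-word simply leaves the partial word to be completed by the next read.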
Using buffered reading from the file (Python 3):
buffer_size = 2**12
delimiter = ','

with open(filename, 'r') as f:
    # remember the characters after the last delimiter in the previously processed chunk
    remaining = ""
    while True:
        # read the next chunk of characters from the file
        chunk = f.read(buffer_size)
        # end the loop if the end of the file has been reached
        if not chunk:
            break
        # add the remaining characters from the previous chunk,
        # split according to the delimiter, and keep the remaining
        # characters after the last delimiter separately
        *lines, remaining = (remaining + chunk).split(delimiter)
        # print the parts up to each delimiter one by one
        for line in lines:
            print(line, end=delimiter)
    # print the characters after the last delimiter in the file
    if remaining:
        print(remaining, end='')
Note that, as currently written, this just reproduces the original file's contents exactly. That is easy to change, e.g. by adjusting the end=delimiter parameter passed to the print() function in the loop.
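The same buffering logic can also be wrapped in a generator, so callers iterate over the parts instead of printing them (a sketch; the function name iter_parts is illustrative, and a StringIO stands in for the real file):

```python
import io

def iter_parts(f, delimiter=',', buffer_size=2**12):
    # Yield the delimiter-separated parts of an open text file,
    # reading buffer_size characters at a time.
    remaining = ""
    while True:
        chunk = f.read(buffer_size)
        if not chunk:
            break
        *lines, remaining = (remaining + chunk).split(delimiter)
        yield from lines
    if remaining:
        yield remaining

print(list(iter_parts(io.StringIO("a,b,c"))))  # ['a', 'b', 'c']
```

This keeps the memory footprint bounded by buffer_size plus the length of the longest part, just like the print-based version.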
This yields one character from the file at a time, which means the whole file is never held in memory.
def lazy_read():
    # Open in text mode so each item is a str; in binary mode the
    # comparison below would compare bytes against ',' and never match.
    with open('eggs.txt') as file:
        item = file.read(1)
        while item:
            if item == ',':
                # stop at the first comma; returning ends the generator
                # (raising StopIteration inside a generator is an error
                # since Python 3.7, per PEP 479)
                return
            yield item
            item = file.read(1)

print(''.join(lazy_read()))