How can I choose the line separator when reading a file?
Question:
I am trying to read a file which contains one single 2.9 GB long line separated by commas. This code would read the file line by line, with each print stopping at '\n':
with open('eggs.txt', 'rb') as file:
    for line in file:
        print(line)
How can I instead iterate over "lines" that stop at ', '
(or any other character/string)?
Answers:
I don’t think there is a built-in way to achieve this. You will have to use file.read(block_size)
to read the file block by block, split each block at commas, and rejoin strings that go across block boundaries manually.
Note that you still might run out of memory if you don’t encounter a comma for a long time. (The same problem applies to reading a file line by line, when encountering a very long line.)
Here’s an example implementation:
def split_file(file, sep=",", block_size=16384):
    last_fragment = ""
    while True:
        block = file.read(block_size)
        if not block:
            break
        block_fragments = iter(block.split(sep))
        last_fragment += next(block_fragments)
        for fragment in block_fragments:
            yield last_fragment
            last_fragment = fragment
    yield last_fragment
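For example, split_file can be exercised like this (a minimal sketch; an in-memory StringIO and made-up sample data stand in for the real 2.9 GB file):

```python
import io

# split_file as defined in the answer above, repeated so this
# sketch runs on its own.
def split_file(file, sep=",", block_size=16384):
    last_fragment = ""
    while True:
        block = file.read(block_size)
        if not block:
            break
        block_fragments = iter(block.split(sep))
        last_fragment += next(block_fragments)
        for fragment in block_fragments:
            yield last_fragment
            last_fragment = fragment
    yield last_fragment

# An in-memory file stands in for eggs.txt here.
parts = list(split_file(io.StringIO("spam,eggs,ham")))
print(parts)  # ['spam', 'eggs', 'ham']
```

A small block_size forces fragments to span block boundaries, which is exactly the case the rejoining logic handles.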
Read the file a character at a time, and assemble the comma-separated lines:
def commaBreak(filename):
    word = ""
    with open(filename) as f:
        while True:
            char = f.read(1)
            if not char:
                print("End of file")
                yield word
                break
            elif char == ',':
                yield word
                word = ""
            else:
                word += char
You may choose to do something like this with a larger number of characters, e.g. 1000, read at a time.
with open('eggs.txt') as file:  # text mode, so each line is already a str
    for line in file:
        words = line.split(', ')
        for word in words:
            print(word)
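A variant of commaBreak that reads a fixed-size chunk at a time, as suggested above, might look like this (a sketch; the function name and the chunk_size default are illustrative):

```python
def comma_break_chunked(filename, chunk_size=1000):
    # Like commaBreak, but reads chunk_size characters per read()
    # call, carrying any partial word over into the next chunk.
    word = ""
    with open(filename) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts = chunk.split(',')
            word += parts[0]
            for part in parts[1:]:
                yield word
                word = part
    yield word
```

Because words are only yielded at commas, a chunk that ends mid-word simply leaves the partial word to be completed by the next read.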
Using buffered reading from the file (Python 3):
buffer_size = 2**12
delimiter = ','

with open(filename, 'r') as f:
    # remember the characters after the last delimiter in the previously processed chunk
    remaining = ""
    while True:
        # read the next chunk of characters from the file
        chunk = f.read(buffer_size)
        # end the loop if the end of the file has been reached
        if not chunk:
            break
        # add the remaining characters from the previous chunk,
        # split according to the delimiter, and keep the remaining
        # characters after the last delimiter separately
        *lines, remaining = (remaining + chunk).split(delimiter)
        # print the parts up to each delimiter one by one
        for line in lines:
            print(line, end=delimiter)
    # print the characters after the last delimiter in the file
    if remaining:
        print(remaining, end='')
Note that, as currently written, this just reproduces the original file's contents exactly. That is easy to change, e.g. by adjusting the end=delimiter parameter passed to the print() function in the loop.
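The same buffering logic can also be wrapped in a generator, so callers iterate over the parts instead of printing them (a sketch; the function name iter_parts is illustrative, and a StringIO stands in for the real file):

```python
import io

def iter_parts(f, delimiter=',', buffer_size=2**12):
    # Yield the delimiter-separated parts of an open text file,
    # reading buffer_size characters at a time.
    remaining = ""
    while True:
        chunk = f.read(buffer_size)
        if not chunk:
            break
        *lines, remaining = (remaining + chunk).split(delimiter)
        yield from lines
    if remaining:
        yield remaining

print(list(iter_parts(io.StringIO("a,b,c"))))  # ['a', 'b', 'c']
```

This keeps the memory footprint bounded by buffer_size plus the length of the longest part, just like the print-based version.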
This yields one character from the file at a time, which means the whole file is never held in memory.
def lazy_read():
    # Open in text mode so each item is a str; in binary mode the
    # comparison below would compare bytes against ',' and never match.
    with open('eggs.txt') as file:
        item = file.read(1)
        while item:
            if item == ',':
                # stop at the first comma; returning ends the generator
                # (raising StopIteration inside a generator is an error
                # since Python 3.7, per PEP 479)
                return
            yield item
            item = file.read(1)

print(''.join(lazy_read()))