Using a Python generator to process large text files
Question:
I’m new to using generators and have read around a bit, but I need some help processing large text files in chunks. I know this topic has been covered, but the example code comes with very limited explanation, which makes it difficult to modify if one doesn’t understand what is going on.
My problem is fairly simple: I have a series of large text files containing human genome sequencing data in the following format:
chr22 1 0
chr22 2 0
chr22 3 1
chr22 4 1
chr22 5 1
chr22 6 2
The files range between 1 GB and ~20 GB, which is too big to read into RAM, so I would like to read the lines in chunks/bins of, say, 10,000 lines at a time so that I can perform calculations on the final column in these bin sizes.
Based on this link here I have written the following:
def read_large_file(file_object):
    """A generator function to read a large file lazily."""
    bin_size = 5000
    start = 0
    end = start + bin_size
    # Read a block from the file: data
    while True:
        data = file_object.readlines(end)
        if not data:
            break
        start = start + bin_size
        end = end + bin_size
        yield data
def process_file(path):
    try:
        # Open a connection to the file
        with open(path) as file_handler:
            # Create a generator object for the file: gen_file
            for block in read_large_file(file_handler):
                print(block)
                # process block
    except (IOError, OSError):
        print("Error opening / processing file")
        return

if __name__ == '__main__':
    path = 'C:/path_to/input.txt'
    process_file(path)
Within the processing loop I expected the returned ‘block’ object to be a list 10,000 elements long, but it isn’t: the first list is 843 elements, the second is 2394. I want to get back ‘N’ lines per block, but I am very confused by what is happening here.
This solution here seems like it could help, but again I don’t understand how to modify it to read N lines at a time. This here also looks like a really great solution, but again there isn’t enough background explanation for me to understand it well enough to modify the code.
Any help would be really appreciated.
Answers:
Instead of playing with offsets in the file, try to build and yield lists of 10000 elements from a loop:
def read_large_file(file_handler, block_size=10000):
    block = []
    for line in file_handler:
        block.append(line)
        if len(block) == block_size:
            yield block
            block = []
    # don't forget to yield the last block
    if block:
        yield block

with open(path) as file_handler:
    for block in read_large_file(file_handler):
        print(block)
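To connect this back to the original goal (calculations on the final column per bin), here is a sketch of how the blocks might be consumed; the generator is repeated so the snippet is self-contained, and `mean_of_last_column` is my own illustrative helper, not part of the answer above:

```python
import io

def read_large_file(file_handler, block_size=10000):
    """Yield lists of up to block_size lines from an open file."""
    block = []
    for line in file_handler:
        block.append(line)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block  # the final, possibly shorter, block

def mean_of_last_column(block):
    """Average the final whitespace-separated field of each line."""
    values = [float(line.split()[-1]) for line in block]
    return sum(values) / len(values)

# With the six sample lines from the question and block_size=3,
# this yields two blocks with means 1/3 and 4/3.
text = "chr22 1 0\nchr22 2 0\nchr22 3 1\nchr22 4 1\nchr22 5 1\nchr22 6 2\n"
for block in read_large_file(io.StringIO(text), block_size=3):
    print(mean_of_last_column(block))
```

The same loop works unchanged on a real file handle from `open(path)`.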
In case it helps anyone else with a similar problem, here is a solution based on here:
import pandas as pd

def process_file(path, binSize):
    for chunk in pd.read_csv(path, sep='\t', chunksize=binSize):
        print(chunk)
        print(chunk.iloc[:, 2])  # get 3rd col (.ix is removed in modern pandas)
        # Do something with chunk....

if __name__ == '__main__':
    path = 'path_to/infile.txt'
    binSize = 5000
    process_file(path, binSize)
Not a proper answer but finding out the why of this behaviour takes approximately 27 seconds:
(blook)bruno@bigb:~/Work/blookup/src/project$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
pythonrc start
pythonrc done
>>> help(file.readlines)
Help on method_descriptor:
readlines(...)
readlines([size]) -> list of strings, each a line from the file.
Call readline() repeatedly and return a list of the lines so read.
The optional size argument, if given, is an approximate bound on the
total number of bytes in the lines returned.
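In other words, the argument to readlines() is a byte hint, not a line count, which is exactly why the blocks in the question came back with odd lengths. A quick demonstration (the file contents are invented for illustration; each line is 10 characters):

```python
import io

# Four 10-character lines, mimicking the question's data.
f = io.StringIO("chr22 1 0\nchr22 2 0\nchr22 3 1\nchr22 4 1\n")

# Ask for "about 15 characters" worth of lines: readlines stops
# after the line that crosses the 15-character hint, so we get
# 2 lines back, not 15 lines.
first = f.readlines(15)
second = f.readlines(15)
print(first)   # ['chr22 1 0\n', 'chr22 2 0\n']
print(second)  # ['chr22 3 1\n', 'chr22 4 1\n']
```

So passing a growing `end` value to readlines() just changes the approximate byte budget per call, which is why the block lengths looked arbitrary.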
I understand that not everyone here is a professional programmer, and of course the documentation is not always enough to solve a problem (and I happily answer those kinds of questions), but the number of questions whose answer is written in plain letters at the start of the docs becomes a bit annoying.
— Adding on to the answer given —
When I was reading a file in chunks, I ran into an issue: my use case processed the data line by line, but because the file was read in fixed-size chunks, a chunk would sometimes end with a partial line, which broke my code (it expected complete lines).
After reading around, I learned I could overcome this by keeping track of the last bit of each chunk: if the chunk ends with a \n it ends on a complete line; otherwise I store the partial last line in a variable and concatenate it with the start of the next chunk. With this, I was able to get past the issue.
Sample code:
# in this function I am reading the file in chunks
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# file where I am writing my final output
write_file = open('split.txt', 'w')
# variable I am using to store the last partial line from the chunk
placeholder = ''
file_count = 1

try:
    with open('/Users/rahulkumarmandal/Desktop/combined.txt') as f:
        for piece in read_in_chunks(f):
            line_by_line = piece.split('\n')
            for one_line in line_by_line:
                # if placeholder is set, the last chunk ended with a partial
                # line that we need to concatenate with the current one
                if placeholder:
                    # concatenate the previous partial line with the current one
                    one_line = placeholder + one_line
                    # then reset the placeholder so the next partial line
                    # from a later chunk can be stored in it
                    placeholder = ''
                # further logic that revolves around my specific use case
                segregated_data = one_line.split('~')
                if len(segregated_data) < 18:
                    placeholder = one_line
                    continue
                else:
                    placeholder = ''
                if segregated_data[2] == '2020' and segregated_data[3] == '2021':
                    # write this
                    data = "~".join(segregated_data)
                    write_file.write(data)
                    write_file.write('\n')
                    print(write_file.tell())
                elif segregated_data[2] == '2021' and segregated_data[3] == '2022':
                    # write this
                    data = "-".join(segregated_data)
                    write_file.write(data)
                    write_file.write('\n')
                    print(write_file.tell())
except Exception as e:
    print('error is', e)
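Stripped of the use-case-specific filtering, the same carry-over idea can be sketched as a small generator that yields only complete lines (the function name and test data here are my own, not from the code above):

```python
def lines_from_chunks(file_object, chunk_size=1024):
    """Yield complete lines from a file read in fixed-size chunks,
    carrying any trailing partial line over into the next chunk."""
    leftover = ''
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        chunk = leftover + chunk
        lines = chunk.split('\n')
        # The last element is a partial line (or '' if the chunk
        # ended exactly on a newline); hold it back for next time.
        leftover = lines.pop()
        for line in lines:
            yield line
    if leftover:
        yield leftover  # final line with no trailing newline
```

Usage is the same as any generator, e.g. `for line in lines_from_chunks(f, 1024): ...`; every yielded line is complete regardless of where chunk boundaries fall.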