Is it better to open/close a file every time vs keeping it open until the process is finished?
Question:
I have a text file that is about 50 GB in size. I am checking the first few characters of each line and writing each line to a separate file chosen by that prefix.
For example, my input contains:
cow_ilovecow
dog_whreismydog
cat_thatcatshouldgotoreddit
dog_gotitfromshelter
...............
I want to sort them into about 200 categories (cow, dog, cat, etc.):
if writeflag == 1:
    writefile1 = open(writefile, "a")  # writefile is somedir/dog.txt....
    writefile1.write(remline + "\n")
    #writefile1.close()
What is the best approach: should I close the file each time? And if I keep it open, is writefile1 = open(writefile, "a") doing the right thing?
Answers:
Keep it open the whole time! Closing after every write tells the system you are done, so it may flush the data to disk instead of buffering it, and for obvious reasons n disk writes are much more expensive than one.
If you want to append to the file rather than overwrite it, then yes, "a" is the correct mode.
Use the with statement: do all the operations inside the with block, and it will keep the files open for you and close them automatically once you leave the block.
with open(inputfile) as f1, open('dog.txt', 'a') as f2, open('cat.txt', 'a') as f3:
    # do something here
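With around 200 categories you cannot comfortably list every file in a single with statement; contextlib.ExitStack can manage an arbitrary number of handles and still close them all for you. A minimal runnable sketch — the input data and filenames below are made up to mirror the question, not taken from the asker's code:

```python
import contextlib

# Hypothetical sample input mirroring the question's format.
with open("input.txt", "w") as f:
    f.write("cow_ilovecow\n"
            "dog_whreismydog\n"
            "cat_thatcatshouldgotoreddit\n"
            "dog_gotitfromshelter\n")

categories = ["cow", "dog", "cat"]  # in practice ~200 names

with contextlib.ExitStack() as stack:
    # Open one append-mode handle per category; ExitStack closes them all
    # when the with block exits, even on an exception.
    outputs = {
        name: stack.enter_context(open(f"{name}.txt", "a"))
        for name in categories
    }
    with open("input.txt") as f:
        for line in f:
            prefix, _, rest = line.partition("_")
            if prefix in outputs:
                outputs[prefix].write(rest)
```

Every file is opened exactly once and the OS-level buffering does its job, while ExitStack guarantees cleanup just like a plain with statement would.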
EDIT:
If you know all the possible filenames before your code runs, then with is the better option. If you don't, use your approach, but instead of closing the file each time you can flush the data to it with writefile1.flush().
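A minimal sketch of that open-once, flush-instead-of-close pattern; write_record and the per-category filenames here are illustrative, not from the question:

```python
open_files = {}  # category -> open file handle, filled in lazily

def write_record(category, text):
    """Lazily open the per-category file and flush after each write."""
    if category not in open_files:
        open_files[category] = open(f"{category}.txt", "a")
    f = open_files[category]
    f.write(text + "\n")
    f.flush()  # hand the data to the OS without giving up the handle

write_record("dog", "gotitfromshelter")
write_record("cat", "thatcatshouldgotoreddit")

# Close everything once, at the end of processing.
for f in open_files.values():
    f.close()
```

flush() pushes Python's buffer to the operating system, so the data survives a crash of your process, while the expensive open/close cycle happens only once per category.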
I/O operations are time-consuming, and so are open and close calls. It's much faster if you open both files (the input and the output) once, accumulate the processed text in a memory buffer of, say, 10 MB, and then write that buffer to the output file. For example:
files = {}      # category -> open file handle
filenames = {}  # category -> path, e.g. filenames['dog'] = 'somedir/dog.txt'

with open(inputfile) as f:
    files['dog'] = None
    buffer = ''
    ...
    # maybe there is a loop here
    if writeflag:
        if files['dog'] is None:
            files['dog'] = open(filenames['dog'], 'a')
        buffer += remline + '\n'
        if len(buffer) > 1024 * 1024 * 10:  # 10 MB of text
            files['dog'].write(buffer)
            buffer = ''

if buffer:  # write whatever is left in the buffer
    files['dog'].write(buffer)
for v in files.values():
    v.close()
You should definitely open/close the file as little as possible, because even compared with reading and writing, opening and closing a file is far more expensive. Consider these two code blocks:
f = open('test1.txt', 'w')
for i in range(1000):
    f.write('\n')
f.close()
and
for i in range(1000):
    f = open('test2.txt', 'a')
    f.write('\n')
    f.close()
The first one takes 0.025s while the second one takes 0.309s
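The comparison can be reproduced with the standard timeit module; the exact numbers will vary by machine and filesystem, but the reopen-every-time version is consistently slower:

```python
import timeit

def keep_open(n=1000):
    # Open once, write n lines, close once.
    with open("test1.txt", "w") as f:
        for _ in range(n):
            f.write("\n")

def reopen_each_time(n=1000):
    # Open and close the file for every single line.
    for _ in range(n):
        with open("test2.txt", "a") as f:
            f.write("\n")

t1 = timeit.timeit(keep_open, number=5)
t2 = timeit.timeit(reopen_each_time, number=5)
print(f"keep open: {t1:.3f}s, reopen each time: {t2:.3f}s")
```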