What is the use of buffering in Python's built-in open() function?
Question:
Python Documentation : https://docs.python.org/2/library/functions.html#open
open(name[, mode[, buffering]])
The above documentation says: “The optional buffering argument specifies the file’s desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default. If omitted, the system default is used.”
When I use
filedata = open("file.txt", "r", 0)
or
filedata = open("file.txt", "r", 1)
or
filedata = open("file.txt", "r", 2)
or
filedata = open("file.txt", "r", -1)
or
filedata = open("file.txt", "r")
the output does not change; each of the variants above prints at the same speed.
output:
Mr. Bean is a British television programme series of fifteen 25-minute episodes written by Robin Driscoll and starring Rowan Atkinson as the title character. Different episodes were also written by Robin Driscoll and Richard Curtis, and one by Ben Elton. Thirteen of the episodes were broadcast on ITV, from the pilot on 1 January 1990, until “Goodnight Mr. Bean” on 31 October 1995. A clip show, “The Best Bits of Mr. Bean”, was broadcast on 15 December 1995, and one episode, “Hair by Mr. Bean of London”, was not broadcast until 2006 on Nickelodeon.
So how is the buffering parameter in the open() function useful? What value of the buffering parameter is best to use?
Answers:
Enabling buffering means that you’re not directly interfacing with the OS’s representation of a file, or its file system API. Instead, a chunk of data is read from the raw OS file stream into a buffer until it is consumed, at which point more data is fetched into the buffer. In terms of the objects you get, you’ll get a BufferedIOBase object wrapping an underlying RawIOBase (which represents the raw file stream).
What is the benefit of this? Well, interfacing with the raw stream might have high latency, because the operating system has to fool around with physical objects like the hard disk, and this may not be acceptable in all cases. Let’s say you want to read three letters from a file every 5 ms and your file is on a crusty old hard disk, or even a network file system. Instead of trying to read from the raw file stream every 5 ms, it is better to load a bunch of bytes from the file into a buffer in memory, then consume it at will.
What size of buffer you choose will depend on how you’re consuming the data. For the example above, a buffer size of 1 char would be awful, 3 chars would be alright, and any large multiple of 3 chars that doesn’t cause a noticeable delay for your users would be ideal.
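The scenario above can be sketched as follows. The file name "sample.txt" and the 4 KiB buffer size are just illustrative choices; the point is that each small read() is served from the in-memory buffer rather than being a separate trip to the OS.

```python
# Create a small sample file to read back.
with open("sample.txt", "w") as f:
    f.write("abc" * 1000)  # 3000 characters

# Read three characters at a time through a ~4 KiB buffer. Most calls to
# read(3) are answered from the buffer; the OS is only touched when the
# buffer needs refilling.
chunks = []
with open("sample.txt", "r", 4096) as f:
    chunk = f.read(3)
    while chunk:
        chunks.append(chunk)
        chunk = f.read(3)

print(len(chunks))  # → 1000 reads of 3 characters each
```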
You can also check the default buffer size via the read-only DEFAULT_BUFFER_SIZE attribute of the io module:
import io
print(io.DEFAULT_BUFFER_SIZE)
As described here, buffering is the process of storing a chunk of a file in temporary memory until the file loads completely. In Python, different values can be given: if buffering is set to 0, buffering is off; if it is set to 1, the file is line buffered; any larger value sets the buffer size in bytes.
With buffering set to -1 my file write took 13 minutes. With buffering set to 2**10 my file write took 7 seconds. So, the purpose of buffering is to speed up your program.
What is perhaps important from a practical point of view is that the buffering parameter determines when the data you are sending to the stream is actually saved to disk.
When you open a file without the buffering parameter and write some data to it, you will see the data is written only after the with open(...) as foo: block is exited (or when the file’s close() method is called), or when some system-determined default buffer size is reached. But if you set the buffering parameter, the data is written as soon as the buffer fills to that size.
Thus using e.g. open('file.txt', 'w', buffering=1) is useful when you have a long-running application that writes data to a file and you want it saved after each line, not only when the application exits. Otherwise a crash, a power outage, etc. could cause the data to be lost.
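The line-buffered behaviour can be demonstrated like this. The file name "app.log" is just illustrative; the key point is that with buffering=1 (text mode) each completed line is flushed to the OS as soon as the newline is written, so another handle can see it while the writer is still open.

```python
# Line-buffered writer: the newline triggers a flush to the OS.
log = open("app.log", "w", buffering=1)
log.write("job started\n")       # visible to other readers immediately

# A second handle on the same file already sees the first line.
with open("app.log") as reader:
    print(reader.read())         # → "job started\n"

log.write("job finished\n")
log.close()
```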
See also: How often does python flush to a file?