convert io.StringIO to io.BytesIO
Question:
Original question: I have a StringIO object. How can I convert it into a BytesIO?
Update: the more general question is how to convert a text (decoded) file-like object into a binary (encoded) file-like object in Python 3.
The naive approach I have is:
import io
sio = io.StringIO('wello horld')
bio = io.BytesIO(sio.read().encode('utf8'))
print(bio.read()) # prints b'wello horld'
Is there a more efficient and elegant way of doing this? The code above reads everything into memory and encodes it in one go, instead of streaming the data in chunks.
For example, for the reverse question (BytesIO -> StringIO) there exists a class, io.TextIOWrapper, which does exactly that (see this answer).
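For reference, the reverse direction mentioned above can be sketched with the standard library alone. This is a minimal illustration, assuming UTF-8 input; the sample bytes are made up for the example.

```python
import io

# Wrap a binary stream so that it can be read as text, decoding lazily.
bio = io.BytesIO("wello horld\nsecond line\n".encode("utf-8"))
text = io.TextIOWrapper(bio, encoding="utf-8")

first = text.readline()  # decoded on demand, one line at a time
rest = text.read()       # whatever is left in the stream
print(repr(first), repr(rest))
```

The standard library has no equivalent wrapper for the text-to-bytes direction, which is what the answers below work around.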
Answers:
It could be a generally useful tool to convert a character stream into a byte stream, so here goes:
import io

class EncodeIO(io.BufferedIOBase):
    def __init__(self, s, e='utf-8'):
        self.stream = s    # not raw, since it isn't
        self.encoding = e
        self.buf = b""     # encoded but not yet returned

    def _read(self, s):
        return self.stream.read(s).encode(self.encoding)

    def read(self, size=-1):
        b = self.buf
        self.buf = b""
        if size is None or size < 0:
            return b + self._read(None)
        ret = []
        while True:
            n = len(b)
            if size < n:
                b, self.buf = b[:size], b[size:]
                n = size
            ret.append(b)
            size -= n
            if not size:
                break
            b = self._read(min((size + 1024) // 2, size))
            if not b:
                break
        return b"".join(ret)

    read1 = read
Obviously, write could be defined symmetrically to decode input and send it to the underlying stream, although then you have to deal with having enough bytes for only part of a character.
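One possible way to handle the partial-character problem on the write side is an incremental decoder from the codecs module, which holds back an incomplete byte sequence until the rest arrives. The class name DecodeIO and its finish() method are hypothetical, chosen only for this sketch; they are not part of the answer above.

```python
import codecs
import io

class DecodeIO(io.BufferedIOBase):
    """Hypothetical write-side counterpart: accepts bytes, forwards text."""

    def __init__(self, stream, encoding="utf-8", errors="strict"):
        self.stream = stream
        # The incremental decoder buffers an incomplete trailing sequence.
        self._decoder = codecs.getincrementaldecoder(encoding)(errors)

    def writable(self):
        return True

    def write(self, b):
        # Text is emitted only once a complete character is available.
        self.stream.write(self._decoder.decode(bytes(b)))
        return len(b)

    def finish(self):
        # Raises UnicodeDecodeError (with errors="strict") on a truncated tail.
        self.stream.write(self._decoder.decode(b"", final=True))

sio = io.StringIO()
out = DecodeIO(sio)
out.write(b"\xc3")            # first byte of 'ä': nothing is written yet
partial = sio.getvalue()
out.write(b"\xa4 ok")         # second byte completes the character
out.finish()
print(repr(partial), repr(sio.getvalue()))
```

Calling finish() at the end is what surfaces input that stops in the middle of a character.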
bio from your example is an _io.BytesIO class object.
You have called the read() function twice.
I came up with a bytes conversion and a single read() call:
import io
sio = io.StringIO('wello horld')
b = bytes(sio.read(), encoding='utf-8')
print(b)
But the second variant should be even faster:
sio = io.StringIO('wello horld')
b = sio.read().encode()
print(b)
As some pointed out, you need to do the encoding/decoding yourself.
However, you can achieve this in an elegant way by implementing your own TextIOWrapper for string => bytes.
Here is such a sample:
class BytesIOWrapper:
    def __init__(self, string_buffer, encoding='utf-8'):
        self.string_buffer = string_buffer
        self.encoding = encoding

    def __getattr__(self, attr):
        return getattr(self.string_buffer, attr)

    def read(self, size=-1):
        content = self.string_buffer.read(size)
        return content.encode(self.encoding)

    def write(self, b):
        content = b.decode(self.encoding)
        return self.string_buffer.write(content)
Which produces an output like this:
In [36]: bw = BytesIOWrapper(StringIO("some lengt˙˚hyÔstring in here"))
In [37]: bw.read(15)
Out[37]: b'some lengt\xcb\x99\xcb\x9ahy\xc3\x94'
In [38]: bw.tell()
Out[38]: 15
In [39]: bw.write(b'ME')
Out[39]: 2
In [40]: bw.seek(15)
Out[40]: 15
In [41]: bw.read()
Out[41]: b'MEring in here'
Hope it clears your thoughts!
@foobarna's answer can be improved by inheriting from an io base class:
import io

sio = io.StringIO('wello horld')

class BytesIOWrapper(io.BufferedReader):
    """Wrap a buffered bytes stream over TextIOBase string stream."""

    def __init__(self, text_io_buffer, encoding=None, errors=None, **kwargs):
        super(BytesIOWrapper, self).__init__(text_io_buffer, **kwargs)
        self.encoding = encoding or text_io_buffer.encoding or 'utf-8'
        self.errors = errors or text_io_buffer.errors or 'strict'

    def _encoding_call(self, method_name, *args, **kwargs):
        raw_method = getattr(self.raw, method_name)
        val = raw_method(*args, **kwargs)
        return val.encode(self.encoding, errors=self.errors)

    def read(self, size=-1):
        return self._encoding_call('read', size)

    def read1(self, size=-1):
        return self._encoding_call('read1', size)

    def peek(self, size=-1):
        return self._encoding_call('peek', size)

bio = BytesIOWrapper(sio)
print(bio.read())  # b'wello horld'
Interestingly, although the question seems reasonable, it is not easy to think of a practical reason to convert a StringIO into a BytesIO. Both are basically buffers, and you usually need only one of them to do further manipulation of either the bytes or the text.
I may be wrong, but I think your question is actually how to use a BytesIO instance when the code you want to pass it to expects a text file.
In that case, it is a common question, and the solution is the codecs module.
The two usual cases of using it are the following:
Compose a File Object to Read
In [16]: import codecs, io
In [17]: bio = io.BytesIO(b'qwe\nasd\n')
In [18]: StreamReader = codecs.getreader('utf-8')  # here you pass the encoding
In [19]: wrapper_file = StreamReader(bio)
In [20]: print(repr(wrapper_file.readline()))
'qwe\n'
In [21]: print(repr(wrapper_file.read()))
'asd\n'
In [26]: bio.seek(0)
Out[26]: 0
In [27]: for line in wrapper_file:
    ...:     print(repr(line))
    ...:
'qwe\n'
'asd\n'
Compose a File Object to Write To
In [28]: bio = io.BytesIO()
In [29]: StreamWriter = codecs.getwriter('utf-8') # here you pass the encoding
In [30]: wrapper_file = StreamWriter(bio)
In [31]: print('жаба', 'цап', file=wrapper_file)
In [32]: bio.getvalue()
Out[32]: b'\xd0\xb6\xd0\xb0\xd0\xb1\xd0\xb0 \xd1\x86\xd0\xb0\xd0\xbf\n'
In [33]: repr(bio.getvalue().decode('utf-8'))
Out[33]: "'жаба цап\\n'"
I had the exact same need, so I created an EncodedStreamReader class in the nr.utils.io package. It also solves the issue of reading the number of bytes requested, rather than the number of characters, from the wrapped stream.
$ pip install 'nr.utils.io>=0.1.0,<1.0.0'
Example usage:
import io
from nr.utils.io.readers import EncodedStreamReader
fp = EncodedStreamReader(io.StringIO('ä'), 'utf-8')
assert fp.read(1) == b'\xc3'
assert fp.read(1) == b'\xa4'
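If adding a dependency is not an option, similar byte-exact reads can probably be had from the standard library by implementing a small raw stream and letting io.BufferedReader do the buffering. The following is only a sketch under that assumption: TextToBytesRaw and its chunk_chars parameter are hypothetical names, not part of nr.utils.io or of the io module.

```python
import codecs
import io

class TextToBytesRaw(io.RawIOBase):
    """Hypothetical helper: expose a text stream as a raw byte stream."""

    def __init__(self, text_stream, encoding="utf-8", errors="strict",
                 chunk_chars=4096):
        super().__init__()
        self._text = text_stream
        self._encoder = codecs.getincrementalencoder(encoding)(errors)
        self._chunk_chars = chunk_chars  # characters pulled per refill
        self._pending = b""              # encoded bytes not yet handed out

    def readable(self):
        return True

    def readinto(self, buffer):
        # Refill until there are bytes to return or the text is exhausted.
        while not self._pending:
            chunk = self._text.read(self._chunk_chars)
            if not chunk:
                self._pending = self._encoder.encode("", final=True)
                if not self._pending:
                    return 0  # end of stream
                break
            self._pending = self._encoder.encode(chunk)
        n = min(len(buffer), len(self._pending))
        buffer[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n

# BufferedReader supplies read(), read1(), peek() and readline() on top.
fp = io.BufferedReader(TextToBytesRaw(io.StringIO("ä")))
first, second, end = fp.read(1), fp.read(1), fp.read(1)
print(first, second, end)
```

Because only readinto() is implemented by hand, the text is encoded chunk by chunk and a multi-byte character can be split across reads, as in the EncodedStreamReader example above.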