Python string prints as [u'String']
Question:
This will surely be an easy one but it is really bugging me.
I have a script that reads in a webpage and uses Beautiful Soup to parse it. From the soup I extract all the links as my final goal is to print out the link.contents.
All of the text that I am parsing is ASCII. I know that Python treats strings as unicode, and I am sure this is very handy, just of no use in my wee script.
Every time I print a variable that holds ‘String’, I get [u'String'] printed to the screen. Is there a simple way of getting this back into just ASCII, or should I write a regex to strip it?
Answers:
Do you really mean u'String'?
In any event, can’t you just do str(string) to get a string rather than a unicode-string? (This should be different for Python 3, for which all strings are unicode.)
Use dir or type on the ‘string’ to find out what it is. I suspect that it’s one of BeautifulSoup’s tag objects, which prints like a string but really isn’t one. Otherwise, it’s inside a list and you need to convert each string separately.
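A minimal sketch of that inspection step, written so it also runs on Python 3. The names value and describe are hypothetical, and value merely stands in for whatever the scraping code returned:

```python
# Hypothetical stand-in for the object that prints as [u'String'] on Python 2.
value = [u"String"]


def describe(obj):
    """Return the type name of obj and, for lists/tuples, the type names of its items."""
    if isinstance(obj, (list, tuple)):
        inner = ", ".join(type(item).__name__ for item in obj)
        return "%s of [%s]" % (type(obj).__name__, inner)
    return type(obj).__name__


print(describe(value))     # the container and what it holds
print(describe(value[0]))  # the single element on its own
```

If the first line reports a list, the brackets and quotes in the output come from the list, not from the string inside it.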
In any case, why are you objecting to using Unicode? Any specific reason?
[u'ABC'] would be a one-element list of unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single unicode string, and then convert that to ASCII.
I don’t know exactly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your text is really only ASCII, you would use this:
soup[0].encode("ascii")
However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it’s latin-1 or utf-8.
soup[0].encode("latin-1")
soup[0].encode("utf-8")
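Or you can ask Beautiful Soup what the original encoding was and get the text back in that encoding (the attribute is originalEncoding in Beautiful Soup 3; Beautiful Soup 4 renamed it to original_encoding):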
soup[0].encode(soup.originalEncoding)
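To see why the choice of encoding matters, here is a small sketch (Python 3 syntax, with a made-up sample string rather than real Beautiful Soup output) comparing the three encodings mentioned above on text that contains one non-ASCII character:

```python
text = u"caf\u00e9"  # hypothetical sample: "cafe" with an accented final letter

# UTF-8 and Latin-1 can both represent the character, but as different bytes.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# ASCII cannot represent it, so encoding raises instead of returning bytes.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(len(utf8_bytes), len(latin1_bytes), ascii_ok)
```

The same four characters become five bytes in UTF-8 and four in Latin-1, so whatever reads the bytes needs to know which encoding was used.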
You probably have a list containing one unicode string. The repr of this is [u'String'].
You can convert this to a list of byte strings using any variation of the following:
# Functional style.
print map(lambda x: x.encode('ascii'), my_list)
# List comprehension.
print [x.encode('ascii') for x in my_list]
# Interesting if my_list may be a tuple or a string.
print type(my_list)(x.encode('ascii') for x in my_list)
# What do I care about the brackets anyway?
print ', '.join(repr(x.encode('ascii')) for x in my_list)
# That's actually not a good way of doing it.
print ' '.join(repr(x).lstrip('u')[1:-1] for x in my_list)
If accessing/printing single element lists (e.g., sequentially or filtered):
my_list = [u'String'] # sample element
my_list = [str(my_list[0])]
Pass the element to str() and the u'' prefix disappears from the output. Printing the element itself, rather than the list, also drops the u'' prefix.
encode("latin-1") helped me in my case:
facultyname[0].encode("latin-1")
[u'String'] is a text representation of a list that contains a Unicode string on Python 2.
If you run print(some_list) then it is equivalent to
print '[%s]' % ', '.join(map(repr, some_list))
i.e., to create a text representation of a Python object with the type list, the repr() function is called for each item.
Don’t confuse a Python object and its text representation: repr('a') != 'a', and even the text representation of the text representation differs: repr(repr('a')) != repr('a').
repr(obj) returns a string that contains a printable representation of an object. Its purpose is to be an unambiguous representation of an object that can be useful for debugging, in a REPL. Often eval(repr(obj)) == obj.
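The claims above can be checked directly. This is a small sketch with hypothetical sample data; it runs on Python 3, where the u prefix is accepted in source code but no longer shown by repr():

```python
some_list = [u"String", u"Other"]  # hypothetical sample data

# What printing the list shows: repr() of every item, joined inside brackets.
by_hand = "[%s]" % ", ".join(map(repr, some_list))
assert by_hand == str(some_list)

# An object and its text representation are different things.
assert repr("a") != "a"
assert repr(repr("a")) != repr("a")

# For simple literals like these, the representation round-trips through eval().
assert eval(repr(some_list)) == some_list

print(by_hand)
```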
To avoid calling repr(), you could print list items directly (if they are all Unicode strings), e.g.: print ",".join(some_list). It prints a comma-separated list of the strings: String
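As a short illustration of printing the items rather than the list, here is a sketch in Python 3 syntax (hence print() as a function); the sample data is hypothetical:

```python
some_list = [u"String", u"Another"]  # hypothetical sample data

# Joining yields one text string, so no brackets or quotes appear when printed.
joined = u",".join(some_list)
print(joined)

# Printing each item on its own line avoids repr() in the same way.
for item in some_list:
    print(item)
```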
Do not encode a Unicode string to bytes using a hardcoded character encoding; print Unicode directly instead. Otherwise, the code may fail because the encoding can’t represent all the characters, e.g., if you try to use the 'ascii' encoding with non-ASCII characters. Or the code silently produces mojibake (corrupted data is passed further in a pipeline) if the environment uses an encoding that is incompatible with the hardcoded encoding.
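The silent failure mode is easy to reproduce. The sketch below uses a made-up sample string and simulates a program that hardcodes UTF-8 while the consumer of its output assumes Latin-1:

```python
text = u"caf\u00e9"  # hypothetical sample with one non-ASCII character

# Bytes written as UTF-8 but decoded as Latin-1: no exception, just wrong text.
garbled = text.encode("utf-8").decode("latin-1")

print(len(text), len(garbled))  # the garbled copy is one character longer
```

No error is raised anywhere, which is what makes this kind of corruption hard to notice further down a pipeline.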
import json, ast
r = {u'name': u'A', u'primary_key': 1}
ast.literal_eval(json.dumps(r))
returns (as displayed at the Python 2 interactive prompt)
{'name': 'A', 'primary_key': 1}
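One caveat with this round trip, sketched here under the assumption that the data may contain None: it relies on the JSON text also being a valid Python literal, which holds for strings and numbers but not for JSON's null, true, and false. The names with_none and round_trips are hypothetical:

```python
import ast
import json

r = {u"name": u"A", u"primary_key": 1}
# Strings and integers survive the JSON -> Python literal round trip.
assert ast.literal_eval(json.dumps(r)) == {"name": "A", "primary_key": 1}

with_none = {u"name": None}
encoded = json.dumps(with_none)  # None is serialized as the JSON token null

try:
    ast.literal_eval(encoded)
    round_trips = True
except ValueError:
    # null is not a Python literal, so literal_eval rejects it.
    round_trips = False

print(encoded, round_trips)
```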