Delete weird ANSI character and convert accented ones using Python

Question:

I’ve downloaded a bunch of Spanish tweets using the Twitter API, but some of them have strange ANSI characters that I don’t want there. I have around 18000 files and I want to remove those characters. I have all my files encoded as UTF-8.
For example:

b'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy.'

If they are accented characters (we have plenty in Spanish) I want to delete the accented letter and replace it with its non-accented version. That’s because I’m doing some text mining analysis afterwards and I want to unify the words, since some people may not use accents.
The b means it is in bytes mode, I think.

In the case above, if I run the following in Python:

print(u'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy con @Colegas')

And I get this in the terminal:

Me quedo con una frase de nuestra reuniÃ³n de hoy con @Colegas

Which I don’t like because it’s not an accent used in Spanish. There should be the character ó. I don’t get why it is not getting it right.
I also would like the b at the beginning of the files to disappear.
To encode the files I used the following:

f.write(str(FILE.encode('utf-8','strict')))

There I create my files from some json in UTF-8 which contains a lot of keys for each tweet. Maybe I should change it or I’m doing something wrong there.

In some cases there’s also a problem when trying to get the characters in the python terminal. For instance:

print(u'\uD83D\uDC1F')

I think that’s because python can’t represent those characters (� in the example above). Is that so? I would also want to remove them.

Sorry if there’s some English mistakes and feel free to ask if something is not clear.

Thanks in advance.

EDIT: I’m using Python 3.4

Asked By: Ignacio


Answers:

First of all: you need to be 100% sure which encoding Twitter uses for those characters. If you are sure that it is "ANSI" (for Spanish text that would normally mean Latin-1), then everything you get from Twitter as bytes needs to go through decode:

a = getStufFromTwitter()  # raw bytes parsed from Twitter
myStr = a.decode('latin-1')

The .decode('latin-1') call tells Python that the bytes coming from the outside are written in Latin-1 and that it should convert them to Unicode text.

Then, whenever you want to reuse myStr in any part of your program (especially if you want to write it somewhere), the text has to be encoded again. In your case that will be:

with open('myfile.txt', 'w', encoding='utf-8') as f:
    f.write(myStr)  # open() encodes the text as UTF-8 on write

This should work. However, it would be easier to help you if we could see more of the code. Python has some very tricky particularities here (are you using Python 2.7? If yes, add the following at the beginning of every script):

from __future__ import unicode_literals 

Once again, it is a very tricky part of python.

Answered By: Dirty_Fox

You are mixing apples and oranges. b'reuni\xc3\xb3n' is the UTF-8 encoding of u'reuni\u00f3n' which of course is reunión in human-readable format.

>>> print b'reuni\xc3\xb3n'.decode('utf-8')
reunión
>>> repr(b'reuni\xc3\xb3n'.decode('utf-8'))
"u'reuni\xf3n'"

There is no “ANSI” here (it’s a misnomer anyway; commonly it is used to refer to Windows character encodings, but not necessarily the one you expect).
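
To make the mismatch concrete, here is a minimal Python 3 sketch of what most likely happened in the question: the two UTF-8 bytes for ó were each treated as a separate character, which is exactly what a Latin-1 decode does. The variable names are only illustrative.

```python
# The UTF-8 bytes for "reunión": ó is encoded as the two bytes 0xC3 0xB3.
raw = b'reuni\xc3\xb3n'

# Correct: interpret the bytes as UTF-8.
good = raw.decode('utf-8')

# What the question did in effect: treat each byte as one Latin-1 character.
mojibake = raw.decode('latin-1')

# Text that was already mangled this way can be repaired by reversing the step.
repaired = mojibake.encode('latin-1').decode('utf-8')

print(good)      # reunión
print(mojibake)  # reuniÃ³n
print(repaired)  # reunión
```

The repair step only works when the text was mangled in exactly this way, so fixing the decoding at the source is preferable where possible.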

As for how to remove the accents from accented characters, the short version is to normalize to a decomposed Unicode form (NFD, or NFKD if compatibility characters should be folded as well), then discard any code points that are combining marks. This is covered e.g. in "What is the best way to remove accents in a Python unicode string?"; to make this answer self-contained, here is the gist of one of the answers to that question, but do read all of them, if only to decide which suits your use case best.

import unicodedata
stripped = u"".join([c for c in unicodedata.normalize('NFKD', input_str)
    if not unicodedata.combining(c)])
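
For Python 3, the same idea can be wrapped in a small function. This is a sketch rather than the definitive approach; strip_accents is a hypothetical helper name, not something from the standard library.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks removed (hypothetical helper name)."""
    # NFKD splits 'ó' into 'o' followed by U+0301 COMBINING ACUTE ACCENT.
    decomposed = unicodedata.normalize('NFKD', text)
    # combining() returns 0 for base characters and non-zero for the marks.
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents('reunión'))  # reunion
print(strip_accents('año'))      # ano
```

Note that ñ is also decomposed into n plus a combining tilde, so año becomes ano. For Spanish text mining that merges two different words, so it may be worth protecting ñ before stripping.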
Answered By: tripleee

One of the patterns with handling incoming text (in the form of bytes) in Python 3 is to decode them immediately when received.

So you get from twitter;

In [1]: tweetbytes = b'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy.'

And you do;

In [2]: tweet = tweetbytes.decode('utf-8')

Remember the acronym BADTIE; Bytes Are Decoded, Text Is Encoded.

Now it is text;

In [3]: type(tweet)
Out[3]: str

And you can use it as such;

In [4]: print(tweet)
Me quedo con una frase de nuestra reunión de hoy.
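
The b at the beginning of the files most likely comes from calling str() on the encoded bytes before writing. A minimal sketch of the alternative, assuming a throwaway temporary path (hypothetical, chosen only for the example): keep the tweet as text and let open() do the encoding.

```python
import os
import tempfile

tweetbytes = b'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy.'

# str() on bytes gives the repr, including the b'' wrapper and \x escapes.
wrong = str(tweetbytes)

# Decode once, then let open() encode the text when writing.
tweet = tweetbytes.decode('utf-8')
path = os.path.join(tempfile.mkdtemp(), 'tweet.txt')  # hypothetical output path
with open(path, 'w', encoding='utf-8') as f:
    f.write(tweet)

# Reading back with the same encoding returns the original text.
with open(path, encoding='utf-8') as f:
    restored = f.read()

print(wrong.startswith("b'"))  # True
print(restored == tweet)       # True
```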
Answered By: Roland Smith