Some annoying characters are not normalised by unicodedata

Question:

I have a Python string that looks as shown below. The string comes from the SEC filing of a US public company. I am trying to remove some annoying characters from it using the unicodedata.normalize function, but it does not remove all of them. What could be the reason for this behavior?

from unicodedata import normalize
s = '[email protected]\nFacsimile\nNo.:\xa0 312-233-2266\n\xa0\nJPMorgan Chase Bank,\nN.A., as Administrative Agent\n10 South Dearborn, Floor 7th\nIL1-0010\nChicago, IL 60603-2003\nAttention:\xa0 Hiral Patel\nFacsimile No.:\xa0 312-385-7096\n\xa0\nLadies and Gentlemen:\n\xa0\nReference is made to the\nCredit Agreement, dated as of May\xa07, 2010 (as the same may be amended,\nrestated, supplemented or otherwise modified from time to time, the \x93Credit Agreement\x94), by and among\nHawaiian Electric Industries,\xa0Inc., a Hawaii corporation (the \x93Borrower\x94), the Lenders from time to\ntime party thereto and JPMorgan Chase Bank, N.A., as issuing bank and\nadministrative agent (the \x93Administrative Agent\x94).'

normalize('NFKC', s)
'[email protected]\nFacsimile\nNo.:  312-233-2266\n \nJPMorgan Chase Bank,\nN.A., as Administrative Agent\n10 South Dearborn, Floor 7th\nIL1-0010\nChicago, IL 60603-2003\nAttention:  Hiral Patel\nFacsimile No.:  312-385-7096\n \nLadies and Gentlemen:\n \nReference is made to the\nCredit Agreement, dated as of May 7, 2010 (as the same may be amended,\nrestated, supplemented or otherwise modified from time to time, the \x93Credit Agreement\x94), by and among\nHawaiian Electric Industries, Inc., a Hawaii corporation (the \x93Borrower\x94), the Lenders from time to\ntime party thereto and JPMorgan Chase Bank, N.A., as issuing bank and\nadministrative agent (the \x93Administrative Agent\x94).'

As one can see from the output, the character \xa0 is handled properly, but characters like \x92, \x93 and \x94 are not normalized and remain as-is in the result string.

Asked By: Ruchit


Answers:

Your data was decoded as ISO-8859-1 (aka latin1), but in that encoding the bytes 0x92–0x94 map to C1 control characters. In Windows-1252 (aka cp1252) they are so-called smart quotes:

>>> '\x92\x93\x94'.encode('latin1').decode('cp1252')
'’“”'

They also don’t change when normalized, but at least they display correctly if decoded properly:

>>> import unicodedata as ud
>>> ud.normalize('NFKC', '\x92\x93\x94'.encode('latin1').decode('cp1252'))
'’“”'
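If the goal is to recover the intended text for the whole string, the same latin1-to-cp1252 round trip can be applied before normalizing. A minimal sketch; the helper name and the shortened sample string are illustrative, not from the question:

```python
import unicodedata

def fix_cp1252_mojibake(text: str) -> str:
    """Re-encode text that was mis-decoded as latin1 and decode it as cp1252."""
    return text.encode('latin1').decode('cp1252')

# Shortened stand-in for the question's string.
s = 'the \x93Credit Agreement\x94, dated as of May\xa07, 2010'
fixed = unicodedata.normalize('NFKC', fix_cp1252_mojibake(s))
print(fixed)  # the “Credit Agreement”, dated as of May 7, 2010
```

This recovers the curly quotes and turns the no-break space into a plain space in one pass.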
>>> print(s.encode('latin1').decode('cp1252'))
[email protected]
Facsimile
No.:  312-233-2266
 
JPMorgan Chase Bank,
N.A., as Administrative Agent
10 South Dearborn, Floor 7th
IL1-0010
Chicago, IL 60603-2003
Attention:  Hiral Patel
Facsimile No.:  312-385-7096
 
Ladies and Gentlemen:
 
Reference is made to the
Credit Agreement, dated as of May 7, 2010 (as the same may be amended,
restated, supplemented or otherwise modified from time to time, the “Credit Agreement”), by and among
Hawaiian Electric Industries, Inc., a Hawaii corporation (the “Borrower”), the Lenders from time to
time party thereto and JPMorgan Chase Bank, N.A., as issuing bank and
administrative agent (the “Administrative Agent”).

Note the \xa0 code point is U+00A0 (NO-BREAK SPACE) and has a compatibility decomposition to SPACE, so the NFKC form replaces it:

>>> ud.name('\xa0')
'NO-BREAK SPACE'
>>> ud.normalize('NFKC', '\xa0')
' '
>>> ud.name(ud.normalize('NFKC', '\xa0'))
'SPACE'

It prints correctly without normalization:

>>> print('hello\xa0there')
hello there
Answered By: Mark Tolonen

unicodedata.normalize is not meant to "remove […] characters".
It is there so that Unicode strings that might be equivalent, but written with different representations, can be brought to a uniform representation; it will not mutilate the text to drop characters that "don't look good". What happens with \xa0 (NO-BREAK SPACE) in particular is that it is compatibility-equivalent to a common plain space (\x20) in the NFKC and NFKD forms, and is therefore replaced by one.
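A few characters that do have compatibility decompositions show the intended behaviour, while \x93 has none and passes through unchanged (a small illustration, not tied to the question's data):

```python
import unicodedata

# Characters with compatibility decompositions are rewritten by NFKC:
print(unicodedata.normalize('NFKC', '\ufb01'))  # the 'fi' ligature prints as 'fi'
print(unicodedata.normalize('NFKC', '\u00b2'))  # superscript two prints as '2'

# \x93 is a C1 control with no decomposition, so NFKC leaves it alone:
print(unicodedata.normalize('NFKC', '\x93') == '\x93')  # True
```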

That said, it looks like the application that generated the data you are consuming included these characters with semantic purpose; their meaning is described at C0 and C1 control codes – Wikipedia. If you just want to discard that information while preserving the other non-ASCII characters in your text, replacing all characters in the C1 block range, after normalizing, will do the job. re.sub is convenient here because it accepts a character range:

import re
...
s1 = normalize("NFKC", s)
s2 = re.sub("[\x80-\x9f]", "", s1)
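Applied to a shortened stand-in for the question's data (an assumed sample), the two steps drop the stray C1 bytes while keeping legitimate non-ASCII text such as accented letters:

```python
import re
import unicodedata

sample = 'the \x93Credit Agreement\x94, dated as of May\xa07, 2010 \u2013 caf\xe9'
step1 = unicodedata.normalize('NFKC', sample)  # \xa0 becomes a plain space
cleaned = re.sub('[\x80-\x9f]', '', step1)     # remove C1 control characters
print(cleaned)  # the Credit Agreement, dated as of May 7, 2010 – café
```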

If, on the other hand, you want to simply drop all non-ASCII characters (not recommended unless your source contains only ASCII plus control characters), just encode the text to ASCII with "ignore" as the error policy and decode it back:

s2 = s1.encode("ASCII", errors="ignore").decode()
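For comparison, the ASCII round trip drops every non-ASCII character, including ones the regex approach would have kept (illustration only):

```python
s1 = 'caf\xe9 \x93quoted\x94'
s2 = s1.encode("ASCII", errors="ignore").decode()
print(s2)  # caf quoted
```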
Answered By: jsbueno