In Python, how to use re.sub() to replace all literal Unicode spaces?
Question:
In Python, when I use readlines() to read from a text file, something that was originally a space becomes a literal Unicode escape, as shown below, where \u2009 is a space in the original text file.
So, I’m using re.sub() to replace these literal Unicode spaces with a normal space.
My code is as follows:
import re
x = "Significant increases in all the lipoprotein fractions were observed in infected untreated mice compared with normal control mice. Treatment with 100 and 250\u2009mg/kg G. lucidum extract produced significant reduction in serum total cholesterol (TC) and low-density cholesterol (LDL-C) contents compared with 500\u2009mg/kg G. lucidum and CQ."
x = re.sub(r'[\x0b\x0c\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]', " ", x)
I don’t know whether this is right.
Although the program appears to work, I’m not sure, because I don’t understand regular expressions well enough.
Answers:
Quick solution:
x = " ".join(x.split())
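As a rough illustration of the quick solution, the sketch below runs it on a made-up sample string (the sample text and variable names are assumptions, not taken from the question). Note that str.split() with no arguments splits on every run of whitespace, so newlines, tabs and repeated spaces are collapsed to a single space as well, which may or may not be what you want.

```python
# Hypothetical sample: a thin space (U+2009), a no-break space (U+00A0),
# a newline and a run of ordinary spaces.
text = "100 and 250\u2009mg/kg\u00a0G. lucidum\n  extract"

# split() with no arguments splits on any run of Unicode whitespace;
# joining with " " leaves exactly one ordinary space between the words.
normalized = " ".join(text.split())

print(normalized)
```

If line breaks in the input need to survive, this approach is probably too aggressive and a targeted re.sub() is the safer choice.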
re.sub(r"[^\S \t\n\r\f\v]", ' ', x)
should do the trick (based on the docs.python.org page for re, "Regular expression operations"):
You know that [] is used to indicate a set of characters, and characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched.
The regex pattern [^\S \t\n\r\f\v] reads as:
^ (U+005E, Circumflex Accent) Not (
\S (not a whitespace) or
' ' (Space) or
\t (Character Tabulation) or
\n (Line Feed (LF)) or
\r (Carriage Return (CR)) or
\f (Form Feed (FF)) or
\v (Line Tabulation)
)
Distributing the outer not (i.e., the complementing ^ in the character class) with De Morgan’s law, this is equivalent to “whitespace except any of [ \t\n\r\f\v].”
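The De Morgan reading can be checked empirically. The sketch below (variable names are my own) scans every code point and compares what the complemented class matches against "everything \s matches, minus the six ASCII whitespace characters". It is only a sanity check under the assumption that the pattern is applied to a str, not bytes.

```python
import re
import sys

# The pattern from the answer, written as a raw string.
pattern = re.compile(r"[^\S \t\n\r\f\v]")

# One string containing every code point exactly once.
all_chars = "".join(map(chr, range(sys.maxunicode + 1)))

# Characters matched by the complemented class.
matched = set(pattern.findall(all_chars))

# "Whitespace except any of [ \t\n\r\f\v]", computed directly.
expected = set(re.findall(r"\s", all_chars)) - set(" \t\n\r\f\v")

print(matched == expected)
print(sorted(f"U+{ord(c):04X}" for c in matched))
```

The second print lists the code points that would actually be replaced on the Python version being run.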
Including both \r and \n in the pattern correctly handles all of Unix (LF), classic Mac OS (CR), and Windows-ish (CR+LF) newline conventions.
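A small sketch of that behaviour, using an invented sample string with a Windows-style line ending and a tab (the sample and the name UNICODE_SPACES are assumptions for illustration): only the non-ASCII spaces are rewritten, while the line endings and the tab pass through unchanged.

```python
import re

# Whitespace other than space, tab, LF, CR, FF and VT.
UNICODE_SPACES = re.compile(r"[^\S \t\n\r\f\v]")

# Hypothetical input: thin space, CR+LF, no-break space, tab, trailing LF.
sample = "100 and 250\u2009mg/kg\r\nG.\u00a0lucidum\tand CQ.\n"

cleaned = UNICODE_SPACES.sub(" ", sample)

print(repr(cleaned))
```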
A space itself is also included (we do not need to translate a space to a space)…
\s
For Unicode (str) patterns:
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages)…
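To relate this back to the question, the sketch below checks the asker's hand-written list against \s, and then against the shorter pattern from this answer. The variable names are mine. As far as I can tell, the only practical difference is that the asker's class also replaces \x0b (line tabulation) and \x0c (form feed), which the shorter pattern deliberately leaves alone.

```python
import re

# The characters listed explicitly in the question's character class.
asker_chars = (
    "\x0b\x0c\x1c\x1d\x1e\x1f\x85\xa0\u1680"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\u2028\u2029\u202f\u205f\u3000"
)

whitespace = re.compile(r"\s")
answer_pattern = re.compile(r"[^\S \t\n\r\f\v]")

# Characters from the question's list that \s does NOT match.
not_matched = [c for c in asker_chars if not whitespace.fullmatch(c)]

# Characters from the question's list that the answer's pattern leaves alone.
kept = [c for c in asker_chars if not answer_pattern.fullmatch(c)]

print(not_matched)
print(kept)
```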
Partially applied the following answer: Regex – Match whitespace but not newlines