Removing \xf characters
Question:
I am trying to remove all
\xf0\x9f\x93\xa2, \xf0\x9f\x95\x91\n, \xe2\x80\xa6, \xe2\x80\x99t
type characters from the strings below in Python.
Text
_____________________________________________________
"b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6
"b'I doubt if climate emergency 8s real, I think people will look ba\xe2\x80\xa6 '
"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see how cheap to overtourism in the alan alps can h\xe2\x80\xa6"
"b'Climate Change Poses a WidelllThreat to National Security "
"b""This doesn't feel like targeted propaganda at all. I mean states\xe2\x80\xa6"
"b'berates climate change activist who confronted her in airport\xc2\xa0
The above content is a column in a pandas DataFrame.
I have tried
string.encode('ascii', errors='ignore')
and regex, but without luck. Any suggestions would be helpful.
Answers:
Try decoding the bytes.
text = b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6'.decode("utf8")
print(text)
>> Hello! 📢 End Climate Silence is looking for volunteers!

1-2 hours per week. 🕑

Experience doing digital research…
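If the goal is to drop the emoji and typographic characters rather than display them, one possible follow-up is to decode first and strip everything outside ASCII afterwards. This is only a sketch and not part of the answer above; the helper name strip_non_ascii is made up for illustration.

```python
def strip_non_ascii(raw: bytes) -> str:
    """Decode UTF-8 bytes, then drop every non-ASCII character."""
    text = raw.decode("utf8")
    # errors="ignore" silently discards characters that ASCII cannot represent
    return text.encode("ascii", errors="ignore").decode("ascii")


raw = b"No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see how"
print(strip_non_ascii(raw))
```

Note that this also removes the curly apostrophes, so "doesn’t" becomes "doesnt". Whether that is acceptable depends on what the cleaned text is used for.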
The data in your question suggests that the repr of each bytestring was stored as a Unicode string, so the \xNN escapes are literal text rather than bytes. To decode them, you first need to turn the strings back into bytestrings.
So, in your case, you can use
x = "b""This doesn't feel like targeted propaganda at all. I mean states\\xe2\\x80\\xa6"
x = x.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
print(x)
# => bThis doesn't feel like targeted propaganda at all. I mean states…
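The encode/decode chain can be wrapped in a small function so that each step is visible and it can be reused per row. This is a sketch under the assumption that every value contains literal \xNN escapes and no characters above U+00FF; the function name unescape_bytes_repr is hypothetical.

```python
def unescape_bytes_repr(s: str) -> str:
    """Turn a string holding literal \\xNN escapes of UTF-8 bytes into real text."""
    # 1. latin1 maps each character to exactly one byte (fails above U+00FF)
    as_bytes = s.encode("latin1")
    # 2. unicode-escape interprets literal \xNN and \n sequences
    unescaped = as_bytes.decode("unicode-escape")
    # 3. latin1 again recovers the raw bytes those escapes stood for
    raw = unescaped.encode("latin1")
    # 4. utf8 decodes those bytes into the intended characters
    return raw.decode("utf8")


samples = [
    r"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see how",
    r"b'I doubt if climate emergency 8s real, I think people will look ba\xe2\x80\xa6 '",
]
cleaned = [unescape_bytes_repr(s) for s in samples]
for line in cleaned:
    print(line)
```

With a DataFrame, the same function could probably be applied with df["Text"].apply(unescape_bytes_repr), assuming the column is named Text as in the question. The leading b' is left in place here, and a value that was cut off in the middle of an escape or of a multi-byte character would raise a decoding error.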
These are hexadecimal escape sequences that appear after encoding.
Every occurrence has the form \xAB, where A and B are each one of [0123456789abcdefABCDEF]. Try a regex with the pattern \\x[0-9a-fA-F][0-9a-fA-F] (the backslash is doubled so that it matches a literal backslash).
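A minimal sketch of the regex approach, assuming the backslashes are literal characters in the stored strings (if the values were real bytes or already decoded text, the pattern would match nothing). The names HEX_ESCAPE and remove_hex_escapes are illustrative only.

```python
import re

# A literal backslash, an "x", and exactly two hex digits, e.g. \xe2
HEX_ESCAPE = re.compile(r"\\x[0-9a-fA-F]{2}")


def remove_hex_escapes(s: str) -> str:
    """Delete every literal \\xNN sequence from the string."""
    return HEX_ESCAPE.sub("", s)


sample = r"b'berates climate change activist who confronted her in airport\xc2\xa0"
print(remove_hex_escapes(sample))
```

This deletes the escapes instead of decoding them, so an apostrophe stored as \xe2\x80\x99 disappears entirely, and literal \n sequences are left untouched.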