Removing \xf characters

Question:

I am trying to remove all

\xf0\x9f\x93\xa2, \xf0\x9f\x95\x91\n, \xe2\x80\xa6, \xe2\x80\x99t

type characters from the strings below in Python

    Text
  _____________________________________________________
"b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6

"b'I doubt if climate emergency 8s real, I think people will look ba\xe2\x80\xa6 '

"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see how cheap to overtourism in the alan alps can h\xe2\x80\xa6"

"b'Climate Change Poses a WidelllThreat to National Security "

"b""This doesn't feel like targeted propaganda at all. I mean states\xe2\x80\xa6"

"b'berates climate change activist who confronted her in airport\xc2\xa0 

The above content is a column in a pandas DataFrame.

I have tried

string.encode('ascii', errors='ignore')

and regex, but without luck. Any suggestions would be helpful.

Asked By: shan


Answers:

Try decoding the bytes.

text = b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6'.decode("utf8")
print(text)
>> Hello! 📢 End Climate Silence is looking for volunteers! 

1-2 hours per week. 🕑

Experience doing digital research…
Answered By: Bendik Knapstad
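
As a small sketch of the decoding idea, assuming the value really is a bytes object and not a string that only looks like one: decoding as UTF-8 turns the escape sequences into the characters they stand for, and an ASCII round trip with errors='ignore' then drops them. The sample value is shortened from the question.

```python
# Assumes `raw` is a real bytes object, not a str that merely looks like one.
raw = b'Hello! \xf0\x9f\x93\xa2 End Climate Silence\xe2\x80\xa6'

# UTF-8 decoding turns the byte sequences into the emoji and the ellipsis.
decoded = raw.decode('utf8')

# Re-encoding to ASCII with errors='ignore' silently drops every non-ASCII character.
ascii_only = decoded.encode('ascii', errors='ignore').decode('ascii')

print(ascii_only)
```

The space that surrounded the emoji stays behind, so a later whitespace cleanup may still be wanted.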

The data in your question looks like the text representation of bytestrings that has been stored as ordinary Unicode strings. To decode those bytestrings, you first need to encode the strings back into bytes.

So, in your case, you can use

# The doubled backslashes keep \xe2\x80\xa6 as literal text, as it appears in the column
x = "b""This doesn't feel like targeted propaganda at all. I mean states\\xe2\\x80\\xa6"
x = x.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
print(x)
# => bThis doesn't feel like targeted propaganda at all. I mean states…

See this Python demo.

Answered By: Wiktor Stribiżew
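
One way the chain above might be applied to several values and then used to remove the non-ASCII characters, as the question asks. This is only a sketch: the helper names unescape and strip_non_ascii are made up for illustration, the rows are shortened from the question, and it assumes every value contains only Latin-1 characters and complete UTF-8 sequences (a truncated sequence would raise UnicodeDecodeError).

```python
def unescape(value: str) -> str:
    """Hypothetical helper: turn literal \\xNN text into the characters it encodes."""
    # str -> bytes, interpret the backslash escapes, then back to the raw bytes
    raw = value.encode('latin1').decode('unicode-escape').encode('latin1')
    # The raw bytes are UTF-8, so decode them as such
    return raw.decode('utf8')


def strip_non_ascii(value: str) -> str:
    """Hypothetical helper: unescape, then drop every non-ASCII character."""
    return unescape(value).encode('ascii', errors='ignore').decode('ascii')


# Sample rows with literal backslashes, as they would appear in the column
rows = [
    "b'No, thankfully it doesn\\xe2\\x80\\x99t.'",
    "b'Hello! \\xf0\\x9f\\x93\\xa2 End Climate Silence'",
]

cleaned = [strip_non_ascii(row) for row in rows]
print(cleaned)
```

With pandas, the same function could be passed to Series.map, for example df['Text'].map(strip_non_ascii), where df and the column name Text are taken from the question and are assumptions here. The leading b' and trailing quote are left in place because the quoting in the sample data is inconsistent.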

These are hexadecimal escape sequences that appear after encoding.
Every occurrence of the form \xAB, where A and B are each one of [0123456789abcdefABCDEF], can be treated this way. Try a regex with the following pattern, in which the backslash is escaped so that it matches a literal backslash: \\x[0123456789abcdefABCDEF][0123456789abcdefABCDEF]

Answered By: Mann Jain
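
A minimal sketch of the regex approach, assuming the strings contain literal backslashes followed by x and two hex digits. The name HEX_ESCAPE and the sample value are illustrative only, and the character class is written in the shorter [0-9a-fA-F] form.

```python
import re

# Matches a literal backslash, an 'x', and exactly two hex digits, e.g. \xf0
HEX_ESCAPE = re.compile(r'\\x[0-9a-fA-F]{2}')

# Literal backslashes, as the text would appear in the column
sample = "b'I mean states\\xe2\\x80\\xa6'"

without_escapes = HEX_ESCAPE.sub('', sample)
print(without_escapes)
```

This removes the sequences without decoding them, so a word such as doesn\xe2\x80\x99t loses its apostrophe, and literal \n sequences are left untouched and would need a separate replacement.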