Removing \xf characters

Question

I am trying to remove all

\xf0\x9f\x93\xa2, \xf0\x9f\x95\x91\n, \xe2\x80\xa6, \xe2\x80\x99t

type characters from the strings below in Python:

    Text
  _____________________________________________________
"b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6

"b'I doubt if climate emergency 8s real, I think people will look ba\xe2\x80\xa6 '

"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see how cheap to overtourism in the alan alps can h\xe2\x80\xa6"

"b'Climate Change Poses a WidelllThreat to National Security "

"b""This doesn't feel like targeted propaganda at all. I mean states\xe2\x80\xa6"

"b'berates climate change activist who confronted her in airport\xc2\xa0

The above content is a column in a pandas DataFrame.

I have tried

string.encode('ascii', errors='ignore')

and regex, but without luck. Any suggestions would be helpful.

Asked By: shan


Answer 1

Try decoding the bytes.

text = b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6'.decode("utf8")
print(text)
>> Hello! 📢 End Climate Silence is looking for volunteers! 

1-2 hours per week. 🕑

Experience doing digital research…
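
If the goal is to drop the emoji and ellipsis rather than display them, one option is to decode as ASCII and ignore everything else. This is only a sketch, and it assumes the value really is a bytes object, not a string that merely looks like one. The helper name strip_non_ascii is made up for illustration.

```python
def strip_non_ascii(raw: bytes) -> str:
    """Return the ASCII portion of raw, dropping all other bytes."""
    # Every byte of a multi-byte UTF-8 sequence is >= 0x80, so ignoring
    # non-ASCII bytes removes emoji, curly quotes, ellipses, etc.
    return raw.decode("ascii", errors="ignore")


raw = b"Hello! \xf0\x9f\x93\xa2 End Climate Silence\xe2\x80\xa6"
print(strip_non_ascii(raw))
```

Note that the spaces on either side of a removed character are kept, so the result can contain double spaces.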

Answered By: Bendik Knapstad

Answer 2

The data in your question looks like the printed representations of bytestrings that have been stored as Unicode strings. To decode them, you first need to turn those strings back into bytestrings.

So, in your case, you can use

x = "b""This doesn't feel like targeted propaganda at all. I mean states\\xe2\\x80\\xa6"
x = x.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
print(x)
# => bThis doesn't feel like targeted propaganda at all. I mean states…

See this Python demo.
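
The same chain can be wrapped in a small function so it is easy to apply to many values. This is a hedged sketch, not part of the original answer: the name decode_escaped is hypothetical, the input is assumed to contain only ASCII characters plus literal \xNN escapes, and errors="ignore" is an added assumption so that escape sequences cut off mid-character do not raise.

```python
def decode_escaped(s: str) -> str:
    """Turn literal '\\xNN' escapes in a str back into the text they encode."""
    # str -> bytes, interpret the escapes, then back to the raw bytes
    raw = s.encode("latin1").decode("unicode-escape").encode("latin1")
    # Assumption: silently skip incomplete UTF-8 sequences
    return raw.decode("utf8", errors="ignore")


sample = "I think people will look ba\\xe2\\x80\\xa6 "
print(decode_escaped(sample))
```

With a DataFrame, something like df["Text"].apply(decode_escaped) should work, where df and the column name "Text" are assumed from the question. Be aware that unicode-escape also turns a literal \n into a real newline.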

Answered By: Wiktor Stribiżew

Answer 3

These are hexadecimal escape sequences left over from encoding.
Every occurrence of the form \xAB, where A and B are hexadecimal digits (0-9, a-f, A-F), is one of them. Try a regex with the pattern \\x[0-9a-fA-F][0-9a-fA-F] (the backslash is doubled so that it matches a literal backslash).
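
A minimal sketch of that regex approach, assuming the strings contain a literal backslash followed by x and two hex digits. The names HEX_ESCAPE, sample, and cleaned are illustrative only.

```python
import re

# A literal backslash, then 'x', then exactly two hexadecimal digits.
HEX_ESCAPE = re.compile(r"\\x[0-9a-fA-F]{2}")

sample = "No, thankfully it doesn\\xe2\\x80\\x99t. Can\\xe2\\x80\\x99t see\\xe2\\x80\\xa6"
cleaned = HEX_ESCAPE.sub("", sample)
print(cleaned)
```

This deletes the escapes outright, so characters such as the apostrophe in "doesn't" are lost rather than restored, and literal \n sequences are left in place because they do not match the pattern.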

Answered By: Mann Jain
