I know similar questions have been asked before, but so far I haven't been able to solve my problem, so apologies in advance.
I have a JSON file ('test.json') with text in it. The text looks like this:
"... >>rn>> This is a test.>rn> rn-- rnMit freundlichen GrüssenrnrnMike Klence ..."
The overall output should be the plain text:
"... This is a test. Mit freundlichen Grüssen Mike Klence ..."
With BeautifulSoup I managed to remove the HTML tags, but the >, r, n and -- characters still remain in the text. So I tried the following code:
    import codecs
    from bs4 import BeautifulSoup

    with codecs.open('test.json', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'lxml')

    invalid_tags = ['r', 'n', '<', '>']
    for tag in invalid_tags:
        for match in soup.find_all(tag):
            match.replace_with()

    print(soup.get_text())
But it doesn't do anything with the text in the file. I tried different variations, but nothing seems to change at all.
How can I get my code to work properly?
Or, if there is another, easier or faster way, I would be thankful to read about those approaches as well.
By the way, I am using Python 3.6 on Anaconda.
Thank you very much in advance for your help.
You could do this with Python's built-in str.replace:

    with open('test.json', 'r', encoding='utf-8') as f:
        content = f.read()

    # Decode the umlaut entity first, before ';' is stripped below
    content = content.replace('&uuml;', 'ü')

    # Note: this also removes every '-' and ';' that belongs to the text
    invalid_tags = ['\r', '\n', '<', '>', '-', ';']
    for invalid_tag in invalid_tags:
        content = content.replace(invalid_tag, '')

    print(content)

Output:

    ... This is a test. Mit freundlichen GrüssenMike Klence ...
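Instead of replacing entity fragments by hand, it may be simpler to let the standard library decode the file. The sketch below assumes the "rn" markers are really \r\n escapes inside a JSON string and that the text contains HTML entities such as &gt; and &uuml;. The key name "text" and the sample payload are made up for illustration, so adapt them to the real file.

```python
import html
import json

# Hypothetical payload: the key "text" and the escaped content are assumptions
raw_json = (
    r'{"text": "... &gt;&gt;\r\n&gt;&gt; This is a test.&gt;\r\n&gt; \r\n-- '
    r'\r\nMit freundlichen Gr&uuml;ssen\r\n\r\nMike Klence ..."}'
)

def clean_text(text):
    # Decode HTML entities: &gt; becomes '>', &uuml; becomes 'ü'
    text = html.unescape(text)
    # Turn the quote markers into spaces
    text = text.replace('<', ' ').replace('>', ' ')
    # Drop lines that consist only of the signature delimiter "--"
    lines = [line for line in text.splitlines() if line.strip() != '--']
    # Join the remaining lines and collapse runs of whitespace
    return ' '.join(' '.join(lines).split())

data = json.loads(raw_json)  # json turns the \r\n escapes into real line breaks
cleaned = clean_text(data["text"])
print(cleaned)
```

For a file on disk, json.load(f) on the opened file would take the place of json.loads(raw_json). Because line breaks are replaced by spaces rather than deleted, neighbouring words no longer run together.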
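You could also try this one-liner using a regular expression:

    import re

    string = "... >>rn>> This is a test.>rn> rn-- rnMit freundlichen GrüssenrnrnMike Klence ..."
    # Match the marker "rn" as a unit; a class like [rn<>] would also delete every letter r and n
    updatedString = ''.join(re.split(r'(?:rn|[<>])+', string))
    print(updatedString)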
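If the goal is exactly the output from the question, with single spaces between the parts, a variant of the regex idea could substitute a space for each marker and then collapse whitespace. This is only a sketch on the sample string as posted: strip_markers is a hypothetical helper, and the pattern assumes that "rn" and "--" never occur inside real words or sentences, which may not hold for other data.

```python
import re

sample = "... >>rn>> This is a test.>rn> rn-- rnMit freundlichen GrüssenrnrnMike Klence ..."

def strip_markers(text):
    # Replace each run of "rn", "--", '<' or '>' with a single space
    spaced = re.sub(r'(?:rn|--|[<>])+', ' ', text)
    # Collapse the leftover whitespace into single spaces
    return ' '.join(spaced.split())

result = strip_markers(sample)
print(result)
```

Substituting a space instead of an empty string keeps "Grüssen" and "Mike" apart, which plain deletion does not.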