Remove tags (\r, \n, <, >) from string in json-file

Question:

I know similar questions have been asked before, but so far I wasn't able to solve my problem, so apologies in advance.

I have a JSON file ('test.json') with text in it. The text appears like this:

"... >>\r\n>> This is a test.>\r\n> \r\n-- \r\nMit freundlichen Gr&uuml;ssen\r\n\r\nMike Klence ..."

The overall output should be the plain text:

"... This is a test. Mit freundlichen Grüssen Mike Klence ..."

With BeautifulSoup I managed to remove the HTML tags, but the >, \r, \n and -- characters still remain in the text. So I tried the following code:

import codecs
from bs4 import BeautifulSoup

with codecs.open('test.json', encoding = 'utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
    invalid_tags = ['r', 'n', '<', '>']
    for tag in invalid_tags: 
        for match in soup.find_all(tag):
            match.replace_with()

print(soup.get_text())

But it doesn't do anything to the text in the file. I tried different variations, but nothing seems to change at all.

How can I get my code to work properly?
Or, if there is another, easier or faster way, I would be grateful to read about those approaches as well.

By the way, I am using Python 3.6 on Anaconda.

Thank you very much in advance for your help.

Asked By: Mike Twain

||

Answers:

You could do this using Python's built-in string method replace().

with open('test.json', 'r', encoding='utf-8') as f:
    content = f.read()

# Decode the HTML entity first, before ';' is stripped below
content = content.replace('&uuml;', 'ü')

# Characters to strip from the text
invalid_tags = ['\r', '\n', '<', '>', '-', ';']
for invalid_tag in invalid_tags:
    content = content.replace(invalid_tag, '')

print(content)

Output:

...  This is a test.  Mit freundlichen GrüssenMike Klence ...
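If the text is stored as a value inside the JSON document, it may be simpler to let the standard library do the decoding instead of replacing characters by hand. The sketch below is only an illustration under assumptions: the key name "text" and the inline raw string are hypothetical stand-ins for the real contents of test.json. json.loads turns the \r\n escape sequences into real carriage return and line feed characters, and html.unescape turns entities such as &uuml; into the characters they stand for.

```python
import html
import json

# Hypothetical stand-in for the contents of test.json; the key "text" is an assumption.
raw = '{"text": "... >>\\r\\n>> This is a test.>\\r\\n> \\r\\n-- \\r\\nMit freundlichen Gr&uuml;ssen\\r\\n\\r\\nMike Klence ..."}'

# Parsing the JSON converts the \r\n escape sequences into real CR/LF characters.
data = json.loads(raw)

# html.unescape converts named entities such as &uuml; into the actual character.
text = html.unescape(data["text"])

# repr() makes the remaining control characters visible.
print(repr(text))
```

This handles every HTML entity at once rather than only &uuml;, and the remaining CR/LF and > characters can then be removed with replace() as above.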
Answered By: Filip Młynarski

You could also try this one-liner using regex.

import re

string = "... >>\r\n>> This is a test.>\r\n> \r\n-- \r\nMit freundlichen Gr&uuml;ssen\r\n\r\nMike Klence ..."
# Split on runs of \r, \n, < and >, then join the remaining pieces
updatedString = ''.join(re.split(r'[\r\n<>]+', string))

print(updatedString)
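To get closer to the output asked for in the question, with single spaces between the parts, something along the following lines might work. This is a sketch, not a definitive solution: clean_text is a hypothetical helper name, and treating a line that contains only -- as an e-mail signature separator is an assumption about the data.

```python
import html
import re

def clean_text(text):
    """Strip quote markers and line breaks from a mail-like string (illustrative helper)."""
    # Decode HTML entities such as &uuml; into real characters.
    text = html.unescape(text)
    # Drop angle brackets (quote markers and stray tag characters).
    text = re.sub(r'[<>]', '', text)
    # Drop lines that consist only of "--" (assumed to be a signature separator).
    text = re.sub(r'^[ \t]*--[ \t]*\r?$', '', text, flags=re.MULTILINE)
    # Collapse every run of whitespace, including \r and \n, into one space.
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

sample = "... >>\r\n>> This is a test.>\r\n> \r\n-- \r\nMit freundlichen Gr&uuml;ssen\r\n\r\nMike Klence ..."
result = clean_text(sample)
print(result)
```

Replacing whitespace runs with a single space, rather than deleting them, keeps words on neighbouring lines from being glued together.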
Answered By: Nazmu Masood