Remove substring of digits from string (Python)

Question:

<elem1><elem2>20,000 Leagues Under the Sea1050251</elem2></elem1>
<elem1><elem2>1002321Robinson Crusoe1050251</elem2></elem1>

I’m working with an XML file and have to insert the elements above, extracted from it, into another XML file. The problem is that I have no idea how to remove the IDs (7-digit substrings used to track position) from the strings. Removing characters between ">" and "<" by position isn’t feasible, because the text sometimes starts with an ID and sometimes with a title that begins with digits.
What I need is something that removes any 7-digit substring, and only those, from a string, but I’ve only found code that can do it for specified substrings.

Asked By: Thorwyn


Answers:

You can try with regex:

import re


string = """<elem1><elem2>20,000 Leagues Under the Sea1050251</elem2></elem1>
<elem1><elem2>1002321Robinson Crusoe1050251</elem2></elem1>"""

pattern = re.compile(r"\d{7}")  # matches 7 consecutive digits (note the backslash: \d means "digit")
result = pattern.sub("", string)  # returns a string where the matched pattern is replaced by the given string
print(result)

Output:

<elem1><elem2>20,000 Leagues Under the Sea</elem2></elem1>
<elem1><elem2>Robinson Crusoe</elem2></elem1>
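
One caveat: \d{7} also matches the first 7 digits of any longer digit run, so a 9-digit number in a title would lose part of itself. If that matters for your data, a possible variant is to add lookarounds so that only runs of exactly 7 digits are removed. This is only a sketch: the helper name strip_ids and the second sample string are made up for illustration and are not from the question.

```python
import re

# Hypothetical helper, not part of the original answer.
# (?<!\d) and (?!\d) reject a match that has a digit directly before or
# after it, so only digit runs of exactly 7 digits are removed.
ID_PATTERN = re.compile(r"(?<!\d)\d{7}(?!\d)")


def strip_ids(text: str) -> str:
    """Remove standalone 7-digit runs from text."""
    return ID_PATTERN.sub("", text)


print(strip_ids("<elem2>1002321Robinson Crusoe1050251</elem2>"))
# <elem2>Robinson Crusoe</elem2>

# Made-up sample: the 9-digit number is left alone, the trailing id is removed.
print(strip_ids("<elem2>Order 123456789 shipped1050251</elem2>"))
# <elem2>Order 123456789 shipped</elem2>
```

The trade-off is that an id glued directly to other digits (for example a title that ends in a number, followed immediately by the id) forms a longer run and is not removed by this version, so which pattern fits depends on what your titles look like.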


Answered By: Matiiss