Python and regex to remove parentheses in a file

Question:

I have an XML file with about 2000 parenthesized snippets like (texthere). I need to remove the parentheses and the text within them. I tried the following but am getting an error 🙁

import re, sys

fileName = (sys.argv[2])

with open(fileName) as f:
    input = f.read()
    output = re.sub(r'\(\w*\)', '', input)
    print fileName + " cleaned of all parenthesis"

And my error:

Traceback (most recent call last):
  File "/Users/eeamesX/work/data-scripts/removeParenFromXml.py", line 4, in <module>
    fileName = (sys.argv[2])
IndexError: list index out of range

I changed it to sys.argv[1]. Now I get no errors, but the parentheses in my file.xml still do not get removed.

Asked By: Anekdotin


Answers:

Do you have nested parens?

stuff (words (inside (other) words) eww)

Will you have multiple groups of parens?

stuff (first group) stuff (second group)

Does text within parens have spaces?

stuff (single_word)
stuff (multiple words)

A simple regex could be \(.*?\), although you’ll see that nested parens are not handled correctly (which is fine if you do NOT expect nested parens):

https://regex101.com/r/kB2lU1/1
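
As a rough illustration of that behaviour (a minimal sketch, not taken from the linked regex101 page), the non-greedy pattern can be tried on the sample strings from this answer. The variable names are hypothetical.

```python
import re

# Non-greedy: an opening paren, as few characters as possible, then a closing paren.
pattern = re.compile(r'\(.*?\)')

flat = "stuff (first group) stuff (second group)"
nested = "stuff (words (inside (other) words) eww)"

flat_result = pattern.sub('', flat)
nested_result = pattern.sub('', nested)

print(repr(flat_result))    # 'stuff  stuff '
print(repr(nested_result))  # 'stuff  words) eww)'
```

Separate groups are removed cleanly, but in the nested case the match stops at the first closing paren, so the tail of the outer group is left behind.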

Edit:

https://regex101.com/r/kB2lU1/2 may be able to handle some of those nested parens, but may still break depending on different types of edge cases.

You’ll need to specify what kinds of edge cases you expect so the answer can be better tailored to your needs.
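
If nested parens do turn out to matter, one alternative to a regex is a plain depth counter. The sketch below is an assumption about what "handling nesting" should mean here (drop everything inside any balanced group); it is not the pattern from the regex101 link, and strip_parens is a hypothetical helper name.

```python
def strip_parens(text):
    """Remove every parenthesized span, including nested ones."""
    kept = []
    depth = 0  # how many unclosed '(' we are currently inside
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
        elif depth == 0:
            # Outside all parens: keep the character (a stray ')' lands here too).
            kept.append(ch)
    return ''.join(kept)


print(repr(strip_parens("stuff (words (inside (other) words) eww)")))  # 'stuff '
```

Note that an unmatched opening paren makes this version drop everything after it, which may or may not be what you want for your data.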

Answered By: OnlineCop

Since you’re calling the script as follows:

python removeparenthesis.py filename.xml

the XML file name will appear under sys.argv[1].
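
To make the indexing concrete, here is a small sketch with a simulated argument list. input_path is a hypothetical helper, and the usage message is only an example.

```python
def input_path(argv):
    # argv[0] is the script name; argv[1] is the first argument after it.
    if len(argv) < 2:
        raise SystemExit("usage: python removeparenthesis.py filename.xml")
    return argv[1]


# Simulates: python removeparenthesis.py filename.xml
example_argv = ["removeparenthesis.py", "filename.xml"]
print(input_path(example_argv))  # filename.xml
```

In the real script you would pass sys.argv instead of example_argv. With only one argument on the command line, sys.argv[2] does not exist, which is where the IndexError comes from.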

Also, you’d need to use lazy matching in your pattern:

r'\(\w*?\)'    # notice the ?

A better pattern would be:

r'\([^)]*\)'
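
Note also that the script in the question only builds output in memory and never writes it anywhere, so the file on disk stays unchanged even when the pattern matches. A minimal sketch of the whole round trip, assuming Python 3, UTF-8 content, no nested parens, and that overwriting the file in place is acceptable (clean_file and PAREN_GROUP are hypothetical names):

```python
import re

# An opening paren, any run of characters that are not ')', then a closing paren.
PAREN_GROUP = re.compile(r'\([^)]*\)')


def clean_file(path):
    """Strip parenthesized text from the file at `path`, rewriting it in place."""
    with open(path, encoding='utf-8') as f:
        text = f.read()
    cleaned = PAREN_GROUP.sub('', text)
    # Write the result back; without this step the file is never modified.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(cleaned)
    return cleaned
```

This could be called as clean_file(sys.argv[1]) after importing sys. Keep in mind that it removes parenthesized text anywhere in the file, including inside XML attribute values, so try it on a copy first.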
Answered By: hjpotter92