Removing wrapped line returns

Question:

I want to remove the line returns of a text that is wrapped to a certain width. e.g.

import re
x = 'the meaning\nof life'
re.sub("([,\w])\n(\w)", "\1 \2", x)
'the meanin\x01 \x02f life'

I want to return the meaning of life. What am I doing wrong?

Asked By: geotheory


Answers:

You need to escape the backslashes in the replacement string, like this:

>>> import re
>>> x = 'the meaning\nof life'

>>> re.sub("([,\w])\n(\w)", "\1 \2", x)
'the meanin\x01 \x02f life'

>>> re.sub("([,\w])\n(\w)", "\\1 \\2", x)
'the meaning of life'

>>> re.sub("([,\w])\n(\w)", r"\1 \2", x)
'the meaning of life'

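If you don’t escape it, Python reads '\1' in an ordinary string literal as an octal escape, i.e. the character with code 1:

>>> '\1'
'\x01'

That’s why we need to use '\\\\' or r'\\' to represent a single literal backslash in a Python regex.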
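To make the difference concrete, here is a small sketch comparing the three ways of writing the replacement string. The variable names (pattern, plain, doubled, raw) are only illustrative and do not come from the answer.

```python
import re

x = 'the meaning\nof life'
pattern = r"([,\w])\n(\w)"

plain = "\1 \2"      # octal escapes: chr(1), a space, chr(2)
doubled = "\\1 \\2"  # backslash, '1', space, backslash, '2'
raw = r"\1 \2"       # the same five characters as `doubled`

print(len(plain), len(raw))             # 3 5
print(doubled == raw)                   # True
print(repr(re.sub(pattern, plain, x)))  # 'the meanin\x01 \x02f life'
print(repr(re.sub(pattern, raw, x)))    # 'the meaning of life'
```

Only the doubled and raw forms still contain a backslash by the time re.sub() sees them, so only they are treated as group references.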

On that point, from this answer:

If you’re putting this in a string within a program, you may actually need to use four backslashes (because the string parser will remove two of them when “de-escaping” it for the string, and then the regex needs two for an escaped regex backslash).

And from the documentation:

As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python’s usage of the same character for the same purpose in string literals.

Let’s say you want to write a RE that matches the string \section, which might be found in a LaTeX file. To figure out what to write in the program code, start with the desired string to be matched. Next, you must escape any backslashes and other metacharacters by preceding them with a backslash, resulting in the string \\section. The resulting string that must be passed to re.compile() must be \\section. However, to express this as a Python string literal, both backslashes must be escaped again.
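The \section case from the quoted passage can be checked directly. This is a minimal sketch; the sample string `latex` is made up for illustration.

```python
import re

# Hypothetical sample input containing a literal backslash.
latex = r"see \section{Intro}"

# Ordinary literal: four backslashes in the source become two in the
# string, which the regex engine reads as one literal backslash.
normal = re.compile("\\\\section")

# Raw literal: the two backslashes reach the regex engine unchanged.
raw = re.compile(r"\\section")

print(normal.pattern == raw.pattern)  # True
print(raw.search(latex).group())      # \section
```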


Alternatively, as brittenb suggested, you don’t need a regex in this case:

>>> x = 'the meaning\nof life'
>>> x.replace("\n", " ")
'the meaning of life'
Answered By: Remi Guan

Use raw string literals. Both Python string literal syntax and the regex engine interpret backslashes; \1 in an ordinary Python string literal is interpreted as an octal escape, but not in a raw string literal:

re.sub(r"([,\w])\n(\w)", r"\1 \2", x)

The alternative would be to double all backslashes so that they reach the regex engine as such.
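As a quick check of that alternative, the sketch below writes the pattern and the replacement with doubled backslashes and compares them with the raw-string versions. Recent Python versions also warn about unrecognized escapes such as \w in ordinary literals, which is another reason to prefer one of these two spellings.

```python
import re

x = 'the meaning\nof life'

# Doubled backslashes in ordinary literals...
doubled_pattern = "([,\\w])\\n(\\w)"
doubled_repl = "\\1 \\2"

# ...produce exactly the same strings as raw literals.
print(doubled_pattern == r"([,\w])\n(\w)")  # True
print(doubled_repl == r"\1 \2")             # True

print(re.sub(doubled_pattern, doubled_repl, x))  # the meaning of life
```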

See the Backslash plague section of the Python regex HOWTO.

Demo:

>>> import re
>>> x = 'the meaning\nof life'
>>> re.sub(r"([,\w])\n(\w)", r"\1 \2", x)
'the meaning of life'

It might be easier just to split on newlines; use the str.splitlines() method, then re-join with spaces using str.join():

' '.join(x.splitlines())

but admittedly this won’t distinguish between newlines between words and extra newlines elsewhere.
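To illustrate that limitation, here is a sketch with a made-up two-paragraph input. The lookaround pattern is one possible refinement, not something proposed in the answer: it replaces a newline only when it is not adjacent to another newline, so blank-line paragraph breaks survive.

```python
import re

# Hypothetical input: two wrapped paragraphs separated by a blank line.
text = "the meaning\nof life\n\nsecond paragraph\nhere"

# splitlines() + join() flattens everything; the blank line turns into
# a double space and the paragraph break is lost.
flat = ' '.join(text.splitlines())

# Assumed refinement: unwrap only single newlines.
unwrapped = re.sub(r"(?<!\n)\n(?!\n)", " ", text)

print(repr(flat))       # 'the meaning of life  second paragraph here'
print(repr(unwrapped))  # 'the meaning of life\n\nsecond paragraph here'
```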

Answered By: Martijn Pieters