How to split long regular expression rules to multiple lines in Python

Question:

Is this actually doable? I have some very long regex pattern rules that are hard to read because they don’t fit on the screen at once. Example:

test = re.compile(
    '(?P<full_path>.+):\d+:\s+warning:\s+Member\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) of (class|group|namespace)\s+(?P<class_name>.+)\s+is not documented'
        % (self.__MEMBER_TYPES),
    re.IGNORECASE)

Backslash or triple quotes won’t work.

Asked By: Makis


Answers:

You can split your regex pattern by quoting each segment. No backslashes needed.

test = re.compile(
    ('(?P<full_path>.+):\d+:\s+warning:\s+Member'
     '\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) '
     'of (class|group|namespace)\s+(?P<class_name>.+)'
     '\s+is not documented'
    ) % (self.__MEMBER_TYPES),
    re.IGNORECASE)

You can also use the raw string prefix 'r', which is advisable for patterns containing backslash sequences such as \d, but you’ll have to put it before each segment.

See the docs: String literal concatenation
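A minimal sketch of the raw-string variant, in case it helps: every segment carries its own r prefix, and the adjacent literals are joined before % is applied. MEMBER_TYPES and the sample warning line are hypothetical stand-ins (the question only shows self.__MEMBER_TYPES, not its value).

```python
import re

# Hypothetical stand-in for self.__MEMBER_TYPES (an alternation of type names).
MEMBER_TYPES = 'function|variable|typedef'

test = re.compile(
    (r'(?P<full_path>.+):\d+:\s+warning:\s+Member'
     r'\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) '
     r'of (class|group|namespace)\s+(?P<class_name>.+)'
     r'\s+is not documented'
    ) % MEMBER_TYPES,
    re.IGNORECASE)

# Hypothetical warning line in the shape the pattern expects.
line = 'src/foo.h:42: warning: Member bar (function) of class Foo is not documented'
m = test.match(line)
print(m.group('full_path'), m.group('member_name'),
      m.group('member_type'), m.group('class_name'))
```

Forgetting the r on a single segment only affects that segment, so it is worth checking each line.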

Answered By: naeg

From the docs, String literal concatenation:

Multiple adjacent string literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings, for example:

re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )

Note that this feature is defined at the syntactical level, but implemented at compile time. The ‘+’ operator must be used to concatenate string expressions at run time. Also note that literal concatenation can use different quoting styles for each component (even mixing raw strings and triple quoted strings).

Answered By: N3dst4

The Python compiler will automatically concatenate adjacent string literals. So one way you can do this is to break up your regular expression into multiple strings, one on each line, and let the Python compiler recombine them. It doesn’t matter what whitespace you have between the strings, so you can have line breaks and even leading spaces to align the fragments meaningfully.
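A small sketch to illustrate this, reusing the identifier pattern from the docs quote above: the irregular indentation between the fragments is deliberate, and the compiled pattern is still the single joined string.

```python
import re

# Adjacent literals are joined by the compiler; the spacing and line breaks
# between them do not end up in the resulting string.
identifier = re.compile(
    r'[A-Za-z_]'
        r'[A-Za-z0-9_]*'
)

same = identifier.pattern == r'[A-Za-z_][A-Za-z0-9_]*'
ok = identifier.fullmatch('my_var1') is not None    # valid identifier
bad = identifier.fullmatch('1abc') is not None      # starts with a digit
print(same, ok, bad)
```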

Answered By: Ben

Either use string concatenation as in naeg’s answer or use re.VERBOSE/re.X, but be careful: this option ignores whitespace and comments inside the pattern. You have some spaces in your regex, so those would be ignored, and you need to either escape them or use \s.

So e.g.

test = re.compile(
    """
        (?P<full_path>.+):\d+: # some comment
        \s+warning:\s+Member\s+(?P<member_name>.+) #another comment
        \s+\((?P<member_type>%s)\)\ of\ (class|group|namespace)\s+
        (?P<class_name>.+)\s+is\ not\ documented
    """ % (self.__MEMBER_TYPES),
    re.IGNORECASE | re.X)
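The whitespace caveat can be checked with a short sketch like the following (the sample strings are made up for illustration): an unescaped space disappears from a verbose pattern, while an escaped space, a space inside a character class, or \s stays significant.

```python
import re

text = 'is not documented'

# In verbose mode the unescaped spaces are dropped, so this pattern
# behaves like 'isnotdocumented'.
loose = re.compile(r'is not documented', re.X)
loose_hits_spaced = loose.search(text) is not None
loose_hits_joined = loose.search('isnotdocumented') is not None

# Three ways to keep the space significant under re.X.
kept = [
    re.search(p, text, re.X) is not None
    for p in (r'is\ not\ documented',
              r'is[ ]not[ ]documented',
              r'is\s+not\s+documented')
]
print(loose_hits_spaced, loose_hits_joined, kept)
```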
Answered By: stema

Personally, I don’t use re.VERBOSE because I don’t like escaping the blank spaces, and I don’t want to put ‘\s’ in place of blank spaces when ‘\s’ isn’t required.
The more precisely the symbols in a regex pattern describe the character sequences to be matched, the faster the regex object works. I almost never use ‘\s’.

To avoid re.VERBOSE, you can do as has already been suggested:

test = re.compile(
'(?P<full_path>.+)'
':\d+:\s+warning:\s+Member\s+' # comment
'(?P<member_name>.+)'
'\s+\('
'(?P<member_type>%s)' # comment
'\) of '
'(class|group|namespace)'
#      ^^^^^^ underlining something to point out
'\s+'
'(?P<class_name>.+)'
#      vvv overlining something important too
'\s+is not documented'
% (self.__MEMBER_TYPES),

re.IGNORECASE)

Pushing the strings to the left gives a lot of space to write comments.

But this approach is less convenient when the pattern is very long, because it isn’t possible to write:

test = re.compile(
'(?P<full_path>.+)'
':\d+:\s+warning:\s+Member\s+' # comment
'(?P<member_name>.+)'
'\s+\('
'(?P<member_type>%s)' % (self.__MEMBER_TYPES)  # !!!!!! INCORRECT SYNTAX !!!!!!!
'\) of '
'(class|group|namespace)'
#      ^^^^^^ underlining something to point out
'\s+'
'(?P<class_name>.+)'
#      vvv overlining something important too
'\s+is not documented',

re.IGNORECASE)

As a result, when the pattern is very long, the part % (self.__MEMBER_TYPES) at the end
can sit many lines away from the string '(?P<member_type>%s)' to which it applies,
and the pattern becomes harder to read.
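To make the constraint concrete, here is a small sketch with toy strings (not the original pattern): % formats the whole joined literal, and placing a % expression between two literals stops the source from parsing at all. The names joined, bad_source and is_valid are only for illustration.

```python
# Adjacent literals are merged first, so % formats the whole joined string.
joined = ('<%s>'
          '-tail') % 'x'
print(joined)

# A % expression placed between two literals breaks the implicit concatenation:
# a name followed directly by a string literal is not valid Python.
bad_source = "x = 'T'\ns = 'a%s' % x 'b'\n"
try:
    compile(bad_source, '<demo>', 'exec')
    is_valid = True
except SyntaxError:
    is_valid = False
print(is_valid)
```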

That’s why I like to use a tuple to write a very long pattern:

pat = ''.join((
'(?P<full_path>.+)',
# you can put a comment here, you see: a very very very long comment
':\d+:\s+warning:\s+Member\s+',
'(?P<member_name>.+)',
'\s+\(',
'(?P<member_type>%s)' % (self.__MEMBER_TYPES), # comment here
'\) of ',
# comment here
'(class|group|namespace)',
#       ^^^^^^ underlining something to point out
'\s+',
'(?P<class_name>.+)',
#      vvv overlining something important too
'\s+is not documented'))

This approach also lets you define the pattern as a function:

def pat(x):

    return ''.join((
'(?P<full_path>.+)',
# you can put a comment here, you see: a very very very long comment
':\d+:\s+warning:\s+Member\s+',
'(?P<member_name>.+)',
'\s+(',
'(?P<member_type>%s)' % x , # comment here
'\) of ',
# comment here
'(class|group|namespace)',
#       ^^^^^^ underlining something to point out
'\s+',
'(?P<class_name>.+)',
#      vvv overlining something important too
'\s+is not documented'))

test = re.compile(pat(self.__MEMBER_TYPES), re.IGNORECASE)
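For trying the function form outside a class, a standalone sketch might look like this. It assumes the member types arrive as a list of plain names, which the original does not state; the list, the re.escape step and the sample line are all hypothetical additions.

```python
import re

def pat(member_types):
    # member_types: iterable of literal type names (assumed input shape);
    # re.escape guards against regex metacharacters in the names.
    alternation = '|'.join(re.escape(t) for t in member_types)
    return ''.join((
        r'(?P<full_path>.+)',
        r':\d+:\s+warning:\s+Member\s+',
        r'(?P<member_name>.+)',
        r'\s+\(',
        r'(?P<member_type>%s)' % alternation,  # formatted right where it is used
        r'\) of ',
        r'(class|group|namespace)',
        r'\s+',
        r'(?P<class_name>.+)',
        r'\s+is not documented'))

test = re.compile(pat(['function', 'variable']), re.IGNORECASE)
m = test.match('a.h:7: warning: Member x (variable) of namespace ns is not documented')
print(m.groupdict())
```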
Answered By: eyquem

Use the re.X or re.VERBOSE flag. Besides saving quotes, this method is also portable to other regex implementations such as Perl.

From the docs:

re.X

re.VERBOSE

This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

This means that the two following regular expression objects that match a decimal number are functionally equal:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

Corresponds to the inline flag (?x).
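As a quick check of the equivalence, and of the inline form, something like the following can be run. The pattern c and the test strings are additions for illustration; note that the inline flag is placed at the very start of the pattern.

```python
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
# Same pattern with the verbose flag given inline instead of as an argument.
c = re.compile(r"""(?x) \d+ \. \d*   # integral part, point, fraction""")

results = {
    text: [p.fullmatch(text) is not None for p in (a, b, c)]
    for text in ('3.14', '42.', 'abc', '.5')
}
print(results)
```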


P.S. The OP said they ended up using this solution in a previous edit to the question.
