match multiple substrings using findall from re library

Question:

I have a large array that contains strings with the following format in Python

some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
'ART_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE']

I just need to extract the substrings that start with MATH, SCIENCE, and ART. This is what I'm currently using:

    my_str = re.findall('MATH_.*? ', some_array )

    if len(my_str) > 0:
        print(my_str)

    my_str = re.findall('SCIENCE_.*? ', some_array )

    if len(my_str) != 0:
        print(my_str)

    my_str = re.findall('ART_.*? ', some_array )

    if len(my_str) > 0:
        print(my_str)

It seems to work, but I was wondering if the findall function can look for more than one substring in the same line or maybe there is a cleaner way of doing it with another function. Thanks.

Asked By: pekoms


Answers:

You can use | to match multiple different strings in a regular expression.

re.findall('(?:MATH|SCIENCE|ART)_.*? ', ...)
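
Note that re.findall expects a single string, not a list, so the pattern has to be applied to each element in turn. A minimal sketch using the sample data from the question (the names pattern and matches are illustrative, not from the original post):

```python
import re

some_array = [
    'MATH_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
    'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
    'ART_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
]

# One alternation covers all three prefixes; compile once and reuse.
pattern = re.compile(r'(?:MATH|SCIENCE|ART)_.*? ')

# findall works on one string at a time, so flatten the per-string results.
matches = [m for s in some_array for m in pattern.findall(s)]
print(matches)
```

Each match keeps its trailing space because the space is part of the pattern; call .strip() on the results if that is not wanted.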

You could also use str.startswith along with a list comprehension.

res = [x for x in some_array if any(x.startswith(prefix) 
          for prefix in ('MATH', 'SCIENCE', 'ART'))]
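
As a possible simplification, str.startswith also accepts a tuple of prefixes, so the any(...) call can be dropped. Keep in mind that this filters whole strings rather than extracting the substring; the sketch below assumes the wanted part is everything up to the first space, and the HISTORY entry is a made-up example added only to show the filtering:

```python
some_array = [
    'MATH_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
    'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
    'ART_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
    'HISTORY_SOME_TEXT MORE_TEXT  SOME_VALUE',  # hypothetical non-matching entry
]

prefixes = ('MATH', 'SCIENCE', 'ART')

# startswith with a tuple is True if any of the prefixes matches.
res = [x for x in some_array if x.startswith(prefixes)]

# Assumption: the substring of interest ends at the first space.
first_tokens = [x.split(' ', 1)[0] for x in res]
print(first_tokens)
```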
Answered By: Unmitigated

You could also match optional non-whitespace characters after one of the alternatives, starting with a word boundary to prevent a partial-word match. Append a single space to the pattern if the trailing space should be part of the match, as in the example below:

\b(?:MATH|SCIENCE|ART)_\S*


Or, if only word characters (\w) should be matched:

\b(?:MATH|SCIENCE|ART)_\w*
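
To illustrate how the non-whitespace and word-character variants differ, and what the word boundary buys you, here is a small sketch. The sample string is invented for this comparison and does not come from the question:

```python
import re

# Hypothetical input: a hyphen inside the first token, and "ART_" embedded in "SMART_".
sample = 'MATH_ALGEBRA-101 SMART_VALUE'

# \S* runs up to the next whitespace, so the hyphenated part is included.
non_space = re.findall(r'\b(?:MATH|SCIENCE|ART)_\S*', sample)

# \w* stops at the first non-word character (here the hyphen).
word_only = re.findall(r'\b(?:MATH|SCIENCE|ART)_\w*', sample)

# Without \b, "ART_" also matches inside "SMART_VALUE".
no_boundary = re.findall(r'(?:MATH|SCIENCE|ART)_\w*', sample)

print(non_space)    # ['MATH_ALGEBRA-101']
print(word_only)    # ['MATH_ALGEBRA']
print(no_boundary)  # ['MATH_ALGEBRA', 'ART_VALUE']
```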

Example

import re

some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
              'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
              'ART_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE']

pattern = re.compile(r"\b(?:MATH|SCIENCE|ART)_\S* ")
for s in some_array:
    print(pattern.findall(s))

Output

['MATH_SOME_TEXT_AND_NUMBER ']
['SCIENCE_SOME_TEXT_AND_NUMBER ']
['ART_SOME_TEXT_AND_NUMBER ']
Answered By: The fourth bird