find pattern substring using python re

Question

I am trying to find all substrings within a multi string in python 3, I want to find all words in between the word ‘Colour:’:

example string:

str = """
Colour: Black
Colour: Green
Colour: Black
Colour: Red
Colour: Orange
Colour: Blue
Colour: Green
"""

I want to get all of the colours into a list like:

x = ['Black', 'Green', 'Black', 'Red', 'Orange', 'Blue', 'Green']

I want to do this using Python re

Whats the fastest way of doing this with re.search , re.findall, re.finditer or even another method.

I’ve tried doing this as a list comprehension:

z = [x.group() for x in re.finditer('Colour:(.*?)Colour:', str)]

but it returns an empty list ?

any ideas?

Asked By: milo2810

||

Answer 1

In regex, the dot . does not match new line by default. This mean your program is trying to find something like "Color: blueColor".

To overcome this, you can just do something like :

colours = re.findall(r'Colour: (.+)', str)

Note the use of re.findall to avoid using the list comprehension.

Furthermore, if the format won’t change, regex is not mandatory and you can just split each line on spaces and get the second part :

colours = [line.split()[1] for line in str.splitlines()]

Answered By: charon25

Answer 2

The lists containing the trailing spaces can be removed and split based on the user-defined variable. In your case, the Colour:.

list(filter(None, str.replace("n", "").replace(" ", "").split("Colour:")))

Result:

['Black', 'Green', 'Black', 'Red', 'Orange', 'Blue', 'Green']

Regex patterns are subjected to taking more time than dealing with strings directly.

Adding the image for reference:

Answered By: Roxy

Answer 3

Perhaps you just need a simple one-liner:

x = re.findall("Colour: (.*)",str)

This worked for your example.

(P.S. please don’t use builtin symbols like str for variable names.)

Answered By: RufusVS

Question: