Regex lambda doesn't iterate through df – prints first row's result for all

Question:

def split_it(email):
    return re.findall(r"[^@]+@[^@]+(?:\.com|\.se|\.br|\.org)", s)

df['email_list'] = df['email'].apply(lambda x: split_it(x))

This code seems to work for the first row of the df, but every other row then shows the first row's result.

Is it not iterating through all rows? Or does it print the result of row 1 on all rows?

Asked By: k_osterlund


Answers:

You do not need apply here; use Series.str.findall directly:

df['email_list'] = df['email'].str.findall(r"[^@]+@[^@]+(?:\.com|\.se|\.br|\.org)")
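
As for why every row showed the same value: the function body in the question searches `s`, not its `email` parameter. Assuming `s` is a leftover global string (the question does not show where it is defined), each call scans that one string no matter which row is passed in. A minimal sketch with made-up sample data, showing that a corrected function and `Series.str.findall` give the same result:

```python
import re

import pandas as pd

# Hypothetical sample data; the question does not show the real DataFrame.
df = pd.DataFrame({"email": ["alice@example.com", "bob@test.se", "carol@mail.org"]})

# Dots are escaped so they match a literal "." only.
PATTERN = r"[^@]+@[^@]+(?:\.com|\.se|\.br|\.org)"

def split_it(email):
    # Search the argument that was passed in, not an unrelated outer variable.
    return re.findall(PATTERN, email)

via_apply = df["email"].apply(split_it)      # row-by-row Python call
via_str = df["email"].str.findall(PATTERN)   # same result without apply

print(via_str.tolist())
```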

If there are several emails per row, you can join the results:

df['email_list'] = df['email'].str.findall(r"[^@]+@[^@]+(?:\.com|\.se|\.br|\.org)").str.join(", ")
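
A rough illustration of the joined output, using hypothetical rows where one cell holds two space-separated addresses. It uses a whitespace-excluding form of the pattern, because with plain `[^@]+` the second match would also swallow the separator in front of it:

```python
import pandas as pd

# Hypothetical data: the first cell contains two addresses.
df = pd.DataFrame({"email": ["alice@example.com bob@test.se", "carol@mail.org"]})

# \s inside the negated classes keeps whitespace out of each match.
pattern = r"[^\s@]+@[^\s@]+\.(?:com|se|br|org)"

matches = df["email"].str.findall(pattern)   # one list of matches per row
df["email_list"] = matches.str.join(", ")    # one joined string per row

print(df["email_list"].tolist())
```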

Note that the email pattern can be enhanced in many ways, but I would add \s into the negated character classes to exclude whitespace matching, and move \. outside the group to avoid repetition:

r"[^\s@]+@[^\s@]+\.(?:com|se|br|org)"
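
To see what excluding whitespace changes, the sketch below runs both patterns over one made-up string. With the original pattern the second match begins at the space; with the stricter one it begins at the address itself:

```python
import re

text = "alice@example.com bob@test.se"  # hypothetical input

original = r"[^@]+@[^@]+(?:\.com|\.se|\.br|\.org)"
improved = r"[^\s@]+@[^\s@]+\.(?:com|se|br|org)"

loose = re.findall(original, text)   # second item keeps the leading space
tight = re.findall(improved, text)   # whitespace can no longer be part of a match

print(loose)
print(tight)
```
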
Answered By: Wiktor Stribiżew