Regex lambda doesn't iterate through df – prints first row's result for all

Question

def split_it(email):
    return re.findall(r"[^@]+@[^@]+(?:\.com|\.se|\.br|\.org)", s)

df['email_list'] = df['email'].apply(lambda x: split_it(x))

This code seems to work for the first row of the df, but then prints the first row's result for every other row.

Is it not iterating through all rows? Or does it print the result of row 1 on all rows?

Asked By: k_osterlund

Answer 1

You do not need apply here; use Series.str.findall directly:

df['email_list'] = df['email'].str.findall(r"[^@]+@[^@]+(?:\.com|\.se|\.br|\.org)")
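As for why every row showed the same result: split_it in the question passes s to re.findall rather than its email parameter, so if s is an outer variable that still holds the first row's text, every call searches that same string. A minimal sketch of the corrected function next to the str.findall version, assuming a small made-up DataFrame (the addresses below are hypothetical, not from the question):

```python
import re
import pandas as pd

PATTERN = r"[^@]+@[^@]+(?:\.com|\.se|\.br|\.org)"

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame(
    {"email": ["anna@example.com", "bo@example.se", "caio@example.br"]}
)

def split_it(email):
    # Search the argument itself, not an outer variable such as `s`.
    return re.findall(PATTERN, email)

# Both approaches now produce one list of matches per row.
via_apply = df["email"].apply(split_it)
via_str = df["email"].str.findall(PATTERN)

print(via_apply.tolist())
print(via_str.tolist())
```

Passing split_it directly to apply also makes the lambda wrapper unnecessary.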

If there are several emails per row, you can join the results:

df['email_list'] = df['email'].str.findall(r"[^@]+@[^@]+(?:\.com|\.se|\.br|\.org)").str.join(", ")

Note that the email pattern can be enhanced in many ways, but I would add \s into the negated character classes to exclude whitespace matching, and move \. outside the group to avoid repetition:

r"[^\s@]+@[^\s@]+\.(?:com|se|br|org)"
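To see what excluding whitespace changes, here is a small comparison of the two patterns on a row that holds two addresses. This is only a sketch: the Series and its addresses are made up for illustration, and ORIGINAL and IMPROVED are just local names for the two patterns.

```python
import pandas as pd

ORIGINAL = r"[^@]+@[^@]+(?:\.com|\.se|\.br|\.org)"
IMPROVED = r"[^\s@]+@[^\s@]+\.(?:com|se|br|org)"

# Hypothetical row holding two space-separated addresses.
emails = pd.Series(["anna@example.com bo@example.se"])

original_matches = emails.str.findall(ORIGINAL).tolist()
improved_matches = emails.str.findall(IMPROVED).tolist()

# Joining the cleaner matches gives one string per row.
joined = emails.str.findall(IMPROVED).str.join(", ").tolist()

print(original_matches)  # the second match keeps its leading space
print(improved_matches)
print(joined)
```

With the original pattern the second match comes back as ' bo@example.se', because [^@]+ is free to consume the separating space; the whitespace-excluding version returns both addresses cleanly.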

Answered By: Wiktor Stribiżew
