Email Validation Using Regular Expressions in a Pandas DataFrame
Question:
I would like to do a simple email validation for a list import of email addresses into a database. I just want to make sure that there is content before the @ sign, an @ sign, content after the @ sign, and 2+ characters after the '.'. Here is a sample df:
import pandas as pd
import re
errors = {}
data = {'First Name': ['Sally', 'Bob', 'Sue', 'Tom', 'Will'],
        'Last Name': ['William', '', 'Wright', 'Smith', 'Thomas'],
        'Email Address': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '']}
df = pd.DataFrame(data)
This is the expression I was using to check for valid emails:
regex = re.compile(r'([A-Za-z0-9]+[.-_])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Z|a-z]{2,})+')
def isValid(email):
    if re.fullmatch(regex, email):
        pass
    else:
        return("Invalid email")
This regex is working fine but I am not sure how to easily loop through my entire df email address column. I have tried:
for col in df['Email Address'].columns:
    for i in df['Email Address'].index:
        if df.loc[i,col] = 'Invalid email'
            errors={'row':i, 'column':col, 'message': 'this is not a valid email address'
I want to write each invalid email to a dictionary named errors. With the above code I get an invalid syntax error.
Answers:
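You can iterate through rows using .iterrows() on a DataFrame. Each row is a Series, and you can access a column by label the same way you would look up a key in a dictionary. Note that isValid as written returns None for a valid address and the string "Invalid email" otherwise, so compare against that return value instead of using not:
for i, row in df.iterrows():
    # isValid returns "Invalid email" on failure and None on success
    if isValid(row['Email Address']) == "Invalid email":
        print("Invalid email")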
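To collect the failures instead of printing them, the loop can append one dictionary per invalid row to a list. The following is a minimal sketch under a few assumptions: the sample addresses are placeholders (not the asker's data), is_valid is a hypothetical helper that returns a boolean, and the pattern is the simple "something, @, something, dot, 2+ characters" check described in the question.

```python
import re
import pandas as pd

# Hypothetical sample data; these addresses are placeholders for illustration.
df = pd.DataFrame({
    "First Name": ["Sally", "Bob", "Will"],
    "Email Address": ["sally@example.com", "bob-at-example.com", ""],
})

# Simple check: content, an @ sign, content, a literal dot, 2+ characters.
EMAIL_RE = re.compile(r".+@.+\..{2,}")

def is_valid(email):
    """Return True only when the whole string matches the pattern."""
    return isinstance(email, str) and EMAIL_RE.fullmatch(email) is not None

errors = []  # one dictionary per invalid cell
for i, row in df.iterrows():
    if not is_valid(row["Email Address"]):
        errors.append({
            "row": i,
            "column": "Email Address",
            "message": "this is not a valid email address",
        })
```

A list of dictionaries is used here because a single dictionary assigned inside the loop would be overwritten on every invalid row.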
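The beautiful thing about Pandas DataFrames is that you almost never have to loop through them, and avoiding loops will usually increase your speed significantly.
df['Email Address'].str.contains(regex)
will return a boolean Series indicating whether each value in the Email Address column contains a match for the regex. Because str.contains searches anywhere in the string, use str.fullmatch (pandas 1.1 or later) when the whole value has to match.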
Check out this chapter on vectorized string operations for more.
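As a rough illustration of the vectorized approach, the boolean Series can be used as a mask to pull out the invalid rows and build the error records without an explicit row loop. This sketch assumes pandas 1.1 or later for str.fullmatch; the sample addresses and the simple pattern are placeholders chosen for the example.

```python
import pandas as pd

# Hypothetical sample column; the addresses are placeholders.
emails = pd.Series(
    ["sally@example.com", "no-at-sign.example.com", "tom@example", ""],
    name="Email Address",
)

# Content, an @ sign, content, a literal dot, 2+ characters.
pattern = r".+@.+\..{2,}"

# fullmatch requires the entire string to match; na=False treats missing values as invalid.
is_valid = emails.str.fullmatch(pattern, na=False)

# Boolean indexing keeps only the rows that failed validation.
invalid = emails[~is_valid]

errors = [
    {"row": i, "column": emails.name, "message": "this is not a valid email address"}
    for i in invalid.index
]
```

The same mask can also be stored as a new column, for example to filter the DataFrame before the database import.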
According to your description, I’d probably do
df["Email Address"].str.match(r"^.+@.+\..{2,}$")
str.match returns True for each value where the regex matches, starting from the beginning of the string.
The regex is
- ^ : the start of the string
- .+ : content before the @ sign
- @ : an @ sign
- .+ : content after the @ sign
- \. : a literal dot (escaped, because a bare . matches any character)
- .{2,} : 2+ characters after the '.'
- $ : the end of the string
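To see why the dot needs escaping, the pattern can be tried on a few strings with the standard re module. This is only a sketch; the sample strings are made up for illustration.

```python
import re

# Escaped dot: requires a literal '.' followed by 2+ characters.
escaped = re.compile(r"^.+@.+\..{2,}$")
# Unescaped dot: '.' matches any character, so no literal dot is required.
unescaped = re.compile(r"^.+@.+..{2,}$")

samples = ["user@example.com", "user@example", "@example.com", "user@example.c"]

# Map each sample to whether the escaped pattern accepts it.
results = {s: bool(escaped.match(s)) for s in samples}

# The unescaped version wrongly accepts an address with no dot after the @ sign.
accepts_without_dot = bool(unescaped.match("user@example"))
```

With the escaped pattern only the first sample passes, while the unescaped variant also lets "user@example" through.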