Easiest way to clean email addresses in Python
Question:
I am having issues with email addresses that could be converted to valid addresses with a small correction.
For example:
%[email protected], --- Not valid
'[email protected], --- Not valid
([email protected]), --- Not valid
([email protected]), --- Not valid
:[email protected], --- Not valid
//[email protected] --- Not valid
[email protected] --- valid
...
I could write "if/else" checks, but whenever a new email address arrives with a new issue, I would need to write another "if/else" and update the code every time.
What is the best way to clean up all these small issues: a Python package or a regex? Please suggest.
Answers:
Data clean-up is messy but I found the approach of defining a set of rules to be an easy way to manage this (order of the rules matters):
rules = [
    lambda s: s.replace('%20', ' '),
    lambda s: s.strip(" ,'"),
]
addresses = [
    '%[email protected],',
    '[email protected],'
]
for a in addresses:
    for r in rules:
        a = r(a)
    print(a)
and here is the resulting output:
[email protected]
[email protected]
Make sure you write a test suite that covers both invalid and valid data. It's easy to break, and you may be tweaking the rules often.
While I used lambdas for the rules above, a rule can be an arbitrarily complex function that accepts and returns a string.
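As a rough sketch of what the rule list plus a small test suite could look like with named functions instead of lambdas: the function names, the extra characters passed to strip(), and the example.com addresses below are all made up for illustration and are not from the question.

```python
def decode_spaces(s):
    """Turn URL-encoded spaces into real spaces so strip() can remove them."""
    return s.replace('%20', ' ')


def strip_wrappers(s):
    """Drop whitespace and stray punctuation from both ends only."""
    # Assumed set of junk characters; extend it as new cases show up.
    return s.strip(" ,'\"()[]<>:;/")


# Order matters: decode first, then strip.
RULES = [decode_spaces, strip_wrappers]


def clean_address(raw):
    """Apply every rule in order and return the cleaned string."""
    for rule in RULES:
        raw = rule(raw)
    return raw


# Hypothetical test data covering both dirty and already-valid input.
CASES = {
    '%20alice@example.com,': 'alice@example.com',
    "'bob@example.com,": 'bob@example.com',
    '(carol@example.com),': 'carol@example.com',
    ':erin@example.com,': 'erin@example.com',
    '//dave@example.com': 'dave@example.com',
    'frank@example.com': 'frank@example.com',
}

for raw, expected in CASES.items():
    assert clean_address(raw) == expected, (raw, clean_address(raw))
```

Adding a new kind of dirt then means adding one entry to CASES and, if the assertion fails, one rule or one character to the strip set.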
You can do this (I check whether each character in the email is alphanumeric, a dot, or an @ sign, and remove it if not):
emails = [
    '[email protected]',
    '([email protected])',
    '([email protected])',
    ':[email protected]',
    '//[email protected]',
    '[email protected]'
]
def correct_email_format(email):
    return ''.join(e for e in email if (e.isalnum() or e in ['.', '@']))
for email in emails:
    corrected_email = correct_email_format(email)
    print(corrected_email)
output:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
This actually gets complicated once you have more varied test cases, e.g. "[email protected].", ",,[email protected].@_)" or "[email protected]@@@". You can still strip the stray characters, but the clean-up cannot be limited to what sits at the beginning and end.
Note: Email addresses with numbers and _ are valid too.
emails = [
    '[email protected]',
    '([email protected])',
    '([email protected])',
    ':[email protected]',
    '//[email protected]',
    '[email protected]'
]
def clean(email):
    # Keep letters, digits, '.', '@' and '_'; drop every other character.
    return ''.join(filter(lambda x: x.isalnum() or x in ['.', '@', '_'], email))
for email in emails:
    print(clean(email))
Output:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
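Since the question also asks about regex, here is a minimal sketch of a different approach from the answers above: instead of deleting bad characters, search for the first substring that looks like an address. The pattern is a deliberately simple assumption, not a full RFC 5322 validator, and the function name and example.com inputs are hypothetical.

```python
import re

# Simplified pattern (an assumption, not a complete validator):
# local part starts with a letter or digit, then letters, digits or . _ % + -
# domain is dot-separated labels ending in an alphabetic TLD of 2+ letters.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._%+-]*"
    r"@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def extract_email(raw):
    """Return the first address-like substring of raw, or None if there is none."""
    # Decode URL-encoded spaces first; otherwise the "20" in "%20"
    # would be picked up as part of the local part.
    raw = raw.replace('%20', ' ')
    match = EMAIL_RE.search(raw)
    return match.group(0) if match else None


samples = [
    '(alice@example.com),',
    ',,bob_99@example.com.@_)',
    'carol@example.com@@@',
    '%20erin@example.com',
    'no address here',
]
for s in samples:
    print(repr(s), '->', extract_email(s))
```

Because the match stops where the address-like text ends, trailing junk such as ".@_)" or "@@@" is left out without listing every character to strip. Returning None for input with no match makes it easy to route those rows to manual review.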