Easiest way to clean email addresses in Python

Question:

I am having issues with email addresses that, with a small correction, can be converted to valid email addresses.

For example:

%[email protected], --- Not valid
'[email protected],  --- Not valid
([email protected]),  --- Not valid
([email protected]),  --- Not valid
:[email protected],  --- Not valid
//[email protected]  --- Not valid
[email protected]    ---  valid
...

I could write "if/else" checks, but whenever an email address arrives with a new kind of issue, I would have to add another branch and update the code every time.

What is the best way to clean up all these small issues: some Python package or a regex? Please suggest.

Asked By: Xi12

Answers:

Data clean-up is messy but I found the approach of defining a set of rules to be an easy way to manage this (order of the rules matters):

rules = [
        lambda s: s.replace('%20', ' '),
        lambda s: s.strip(" ,'"),
]

addresses = [
        '%[email protected],',
        '[email protected],'
]

for a in addresses:
    for r in rules:
        a = r(a)
    print(a)

and here is the resulting output:

[email protected]
[email protected]

Make sure you write a test suite that covers both invalid and valid data. It’s easy to break, and you may be tweaking the rules often.

While I used lambdas for the rules above, a rule can be an arbitrarily complex function that accepts and returns a string.
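As a rough sketch of that idea, the rules below mix a lambda with named functions. The helper names (`strip_wrappers`, `remove_whitespace`, `clean_address`), the set of stripped characters, and the sample addresses are all assumptions made up for illustration, not part of the original data:

```python
import re

def strip_wrappers(s: str) -> str:
    """Trim whitespace and stray punctuation from both ends (assumed character set)."""
    return s.strip(" ,'\"()[]<>:;/%")

def remove_whitespace(s: str) -> str:
    """Remove any whitespace left inside the string."""
    return re.sub(r"\s+", "", s)

# Order matters: decode '%20' first so the resulting space can be stripped.
RULES = [
    lambda s: s.replace("%20", " "),
    strip_wrappers,
    remove_whitespace,
]

def clean_address(raw: str) -> str:
    """Apply every rule in order and return the cleaned string."""
    for rule in RULES:
        raw = rule(raw)
    return raw

# Hypothetical inputs mapped to the address we expect after cleaning.
SAMPLES = {
    "%20user@example.com,": "user@example.com",
    "(user@example.com),": "user@example.com",
    "//user@example.com": "user@example.com",
    "user@example.com": "user@example.com",
}

for raw, expected in SAMPLES.items():
    assert clean_address(raw) == expected, (raw, clean_address(raw))
```

Keeping the samples in a dictionary like this doubles as a tiny regression test: when a new kind of malformed address shows up, add it to `SAMPLES` first, then adjust the rules until the assertions pass again.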

Answered By: Allan Wind

You can do this (I keep each character of the email only if it is alphanumeric, a dot, or an @ sign, and drop everything else):

emails = [
    '[email protected]', 
    '([email protected])', 
    '([email protected])',  
    ':[email protected]',  
    '//[email protected]',
    '[email protected]'
    ]

def correct_email_format(email):
    return ''.join(e for e in email if (e.isalnum() or e in ['.', '@']))

for email in emails:
    corrected_email = correct_email_format(email)
    print(corrected_email)

output:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
Answered By: mrCopiCat

This gets complicated once you have more varied test cases, e.g. [email protected]., ,,[email protected].@_) or [email protected]@@@. You can still strip the junk characters, but the stripping cannot be limited to the beginning and end of the string.

Note: Email addresses with numbers and _ are valid too.

emails = [
    '[email protected]', 
    '([email protected])', 
    '([email protected])',  
    ':[email protected]',  
    '//[email protected]',
    '[email protected]'
    ]

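def clean(email):
  # Keep letters, digits and '.', '@', '_'; drop everything else.
  # (Checking ord(x) >= 65 would drop digits and keep symbols such as '[' or '~'.)
  return ''.join(filter(lambda x: x.isalnum() or x in ['.', '@', '_'], email))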

for email in emails:
  print(clean(email))

Output:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
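
For the harder cases mentioned in this answer (trailing dots, repeated @ signs, junk on both sides), another option is to search for the first substring that looks like an address instead of deleting characters. The sketch below is only an illustration: the pattern is a deliberately simplified assumption, not a full RFC 5322 validator, and the function name `extract_email` and the sample strings are made up:

```python
import re
from typing import Optional

# Simplified pattern (an assumption): a local part starting with a letter or digit,
# an '@', then a domain made of at least two dot-separated labels.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._%+-]*@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
)

def extract_email(raw: str) -> Optional[str]:
    """Return the first address-like substring of raw, or None if there is none."""
    match = EMAIL_RE.search(raw)
    return match.group(0) if match else None

# Hypothetical messy inputs.
for raw in ["(user@example.com),", ",,john_doe99@example.org.@_)", "user@example.com@@@"]:
    print(extract_email(raw))
```

One caveat with this approach: junk that is itself made of letters or digits, such as a leading `%20`, would be absorbed into the local part, so it may still need to be combined with a replacement step beforehand.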