How do I replace and add specific numbers in a string in a Pandas DataFrame?
Question:
I am currently trying to clean a column of data, which contains the phone numbers of users. The phone numbers are not consistent in their format and need to be standardised.
For example:
import pandas as pd
data = {'Name': ['John', 'Dom', 'Jack', 'Sam', 'Fred', 'Harvey', 'Toby'],
        'Phone': ['+49(0) 047905356', '(0161) 496 0674', '239.711.3836', '02984 08192',
                  '(0306) 999 0871', '0121x496x0225', '+44047905356']}
df = pd.DataFrame(data)
Now I’ve tried to use the following code to remove the special characters:
df['Phone'] = df['Phone'].replace(r'\W', '', regex=True)
This works. However, for numbers that start with a + sign followed by a country code, I want to replace that prefix with ‘0’ to achieve the following:
Example of expected outputs:
Input: '+49(0) 047905356' | Expected: '047905356'
Input: '+44047905356' | Expected: '047905356'
But then I also want numbers without a ‘0’ at the beginning to include one, for example:
Input: '239.711.3836' | Expected: '02397113836'
Answers:
If you want to keep only the digits in your Phone column, you can use the regex [^0-9].
You can check whether a string starts with 0 using str.startswith().
df['Phone'] = df['Phone'].replace('[^0-9]','', regex=True)
df['start_with_0'] = df.Phone.str.startswith("0")
df['needs_0'] = df.start_with_0.replace({True:"", False:"0"})
df['Phone_new'] = df.needs_0 + df.Phone
df
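The same idea can be written without the helper columns. This is only a minimal sketch: the variable names (phones, word_only, digits, padded) are made up for illustration, and it uses Series.where to prepend the '0' only where it is missing. It also shows why \W alone is not quite enough here: letters count as word characters, so the 'x' separators survive.

```python
import pandas as pd

# Sample data from the question (illustrative Series, not the original df).
phones = pd.Series(['+49(0) 047905356', '(0161) 496 0674', '239.711.3836',
                    '02984 08192', '(0306) 999 0871', '0121x496x0225',
                    '+44047905356'])

# \W removes only non-word characters, so letters such as 'x' are kept.
word_only = phones.str.replace(r'\W', '', regex=True)

# [^0-9] (equivalent to \D for ASCII input) removes everything but digits.
digits = phones.str.replace(r'[^0-9]', '', regex=True)

# Keep values that already start with '0'; prepend '0' to the rest.
padded = digits.where(digits.str.startswith('0'), '0' + digits)

print(padded.tolist())
```

Note that this still keeps the country-code digits (49 and 44), exactly as the code above does.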
You can use a regular expression to achieve the desired result.
import re
import pandas as pd
data = {'Name': ['John', 'Dom', 'Jack', 'Sam', 'Fred', 'Harvey', 'Toby'],
        'Phone': ['+49(0) 047905356', '(0161) 496 0674', '239.711.3836', '02984 08192',
                  '(0306) 999 0871', '0121x496x0225', '+44047905356']}
df = pd.DataFrame(data)
# Remove every non-digit character; this also strips any leading '+'.
df['Phone'] = df['Phone'].replace(r'\D', '', regex=True)
# Prepend '0' to numbers that do not already start with one.
df.loc[~df['Phone'].str.startswith('0'), 'Phone'] = '0' + df['Phone']
# Insert a dot after the second and after the fourth digit.
df['Phone'] = df['Phone'].str[:2] + '.' + df['Phone'].str[2:4] + '.' + df['Phone'].str[4:]
Output:
     Name            Phone
0    John  04.90.047905356
1     Dom    01.61.4960674
2    Jack    02.39.7113836
3     Sam     02.98.408192
4    Fred    03.06.9990871
5  Harvey    01.21.4960225
6    Toby   04.40.47905356
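Neither snippet above drops the country code, which the question asks for ('+44047905356' should become '047905356'). A minimal sketch of one way to do that, assuming every international prefix is a '+' followed by exactly two digits and an optional '(0)'. That holds for the sample data but not for country codes in general. The function name standardise_phone is hypothetical.

```python
import pandas as pd

def standardise_phone(phones: pd.Series) -> pd.Series:
    """Hypothetical helper: drop a '+CC' prefix, keep digits, ensure a leading '0'."""
    # Assumption: the country code is exactly two digits, optionally followed by '(0)'.
    no_prefix = phones.str.replace(r'^\+\d{2}(?:\(0\))?\s*', '', regex=True)
    # Strip every remaining non-digit character (spaces, dots, brackets, 'x').
    digits = no_prefix.str.replace(r'\D', '', regex=True)
    # Keep numbers that already start with '0'; prepend '0' to the others.
    return digits.where(digits.str.startswith('0'), '0' + digits)

df = pd.DataFrame({
    'Name': ['John', 'Dom', 'Jack', 'Sam', 'Fred', 'Harvey', 'Toby'],
    'Phone': ['+49(0) 047905356', '(0161) 496 0674', '239.711.3836', '02984 08192',
              '(0306) 999 0871', '0121x496x0225', '+44047905356'],
})
df['Phone'] = standardise_phone(df['Phone'])
print(df)
```

The prefix is removed before the non-digit characters, because once the '+' has been stripped there is no reliable way to tell a country code apart from the rest of the number.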