Pandas – Duplicate rows with phone numbers based on type
Question:
My dataframe is in the following format:
name | phone | other_phone |
---|---|---|
alice | (111) 111-1111 | (222) 222-2222, (333) 333-3333 |
bob | (444) 444-4444 | (555) 555-5555, (666) 666-6666 |
colin | (777) 777-7777 | (888) 888-8888 |
david | (999) 999-9999 | NaN |
I want to split the other_phone column on the comma (if there is one) and duplicate rows such that the output is:
name | phone | phone_type |
---|---|---|
alice | (111) 111-1111 | work |
alice | (222) 222-2222 | other |
alice | (333) 333-3333 | other |
bob | (444) 444-4444 | work |
bob | (555) 555-5555 | other |
bob | (666) 666-6666 | other |
colin | (777) 777-7777 | work |
colin | (888) 888-8888 | other |
david | (999) 999-9999 | work |
The overall goal is to make sure no row holds more than one phone number. How could I accomplish this?
Answers:
This will take a little bit of reshaping! Specifically, we’ll need to create 2 separate DataFrames that each have the necessary "name" and "phone" columns; the "phone_type" is then derived from which DataFrame a row comes from.
- The easiest part is to create a work_phones DataFrame, since that’s already nicely represented by the data.
- A little bit trickier will be to split out the "other_phone" data. We’ll need to split these values on the comma and then explode them so that each value gets its own row. Finally, we’ll need to reach back into the source DataFrame to grab the correct "name" for each phone number.
- Finally, we stick these 2 parts on top of each other via pd.concat!
Assuming your DataFrame is stored in a variable called df…
import pandas as pd

# easy
work_phones = df[['name', 'phone']]

# little trickier
other_phones = (
    df['other_phone'].str.split(',')
    .explode()
    .to_frame('phone')
    # grab the correct name from the original DataFrame
    .join(df['name'])
)
Now that we have our parts, we just need to stack them on top of each other with pd.concat and a few arguments.
final = (
    pd.concat([work_phones, other_phones], names=['phone_type'], keys=['work', 'other'])
    .reset_index(level=0)
    # sort the data to match OP output
    .sort_values(['name', 'phone_type'], ascending=[True, False])
)
print(final)
  phone_type   name           phone
0       work  alice  (111) 111-1111
0      other  alice  (222) 222-2222
0      other  alice  (333) 333-3333
1       work    bob  (444) 444-4444
1      other    bob  (555) 555-5555
1      other    bob  (666) 666-6666
2       work  colin  (777) 777-7777
2      other  colin  (888) 888-8888
3       work  david  (999) 999-9999
3      other  david             NaN
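The result above still carries a NaN row for david, and every number that followed a comma keeps its leading space; neither appears in the desired output. One possible cleanup is sketched below. The sample frame is rebuilt from the question's table, and the name `cleaned` is only illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's table (an assumption about the
# real input: the missing value is a true NaN, not the string "NaN").
df = pd.DataFrame({
    'name': ['alice', 'bob', 'colin', 'david'],
    'phone': ['(111) 111-1111', '(444) 444-4444',
              '(777) 777-7777', '(999) 999-9999'],
    'other_phone': ['(222) 222-2222, (333) 333-3333',
                    '(555) 555-5555, (666) 666-6666',
                    '(888) 888-8888',
                    np.nan],
})

work_phones = df[['name', 'phone']]

other_phones = (
    df['other_phone'].str.split(',')
    .explode()
    .str.strip()   # remove the space left behind after each comma
    .dropna()      # david has no other phone, so drop his empty row
    .to_frame('phone')
    .join(df['name'])
)

cleaned = (
    pd.concat([work_phones, other_phones], names=['phone_type'], keys=['work', 'other'])
    .reset_index(level=0)
    .sort_values(['name', 'phone_type'], ascending=[True, False])
    .reset_index(drop=True)          # replace the repeated row labels with 0..n-1
    [['name', 'phone', 'phone_type']]  # column order used in the question
)
print(cleaned)
```

This should leave nine rows, one per phone number, with david appearing only once.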
import numpy as np
import pandas as pd
# shortened sample data
df = pd.DataFrame({'name': ['alice', 'bob', 'colin', 'david'],
                   'phone': ['111', '444', '777', '999'],
                   'other_phone': ['222, 333', '555, 666', '888', np.nan]})
print(df)
"""
    name phone other_phone
0  alice   111    222, 333
1    bob   444    555, 666
2  colin   777         888
3  david   999         NaN
"""

# split other_phone on the comma, stack the pieces into one value per row,
# and join them back onto the remaining columns by the original row label
df = (
    df.join(
        df.pop('other_phone').str.split(',', expand=True)
        .stack()
        .reset_index(level=1, drop=True)
        .rename('other_phone')
    )
)
print(df)
"""
    name phone other_phone
0  alice   111         222
0  alice   111         333
1    bob   444         555
1    bob   444         666
2  colin   777         888
3  david   999         NaN
"""
workfones = df[['name', 'phone']]
print(workfones)
"""
    name phone
0  alice   111
0  alice   111
1    bob   444
1    bob   444
2  colin   777
3  david   999
"""

homefones = df[['name', 'other_phone']].rename(columns={'other_phone': 'phone'})
print(homefones)
"""
    name phone
0  alice   222
0  alice   333
1    bob   555
1    bob   666
2  colin   888
3  david   NaN
"""
res = pd.concat([workfones, homefones], names=['phone_type'], keys=['office', 'home'])
print(res)
"""
               name phone
phone_type
office     0  alice   111
           0  alice   111
           1    bob   444
           1    bob   444
           2  colin   777
           3  david   999
home       0  alice   222
           0  alice   333
           1    bob   555
           1    bob   666
           2  colin   888
           3  david   NaN
"""

res1 = res.reset_index(level=0)
print(res1)
"""
  phone_type   name phone
0     office  alice   111
0     office  alice   111
1     office    bob   444
1     office    bob   444
2     office  colin   777
3     office  david   999
0       home  alice   222
0       home  alice   333
1       home    bob   555
1       home    bob   666
2       home  colin   888
3       home  david   NaN
"""

res2 = res1.sort_values(by=['name', 'phone_type'])
print(res2)
"""
  phone_type   name phone
0       home  alice   222
0       home  alice   333
0     office  alice   111
0     office  alice   111
1       home    bob   555
1       home    bob   666
1     office    bob   444
1     office    bob   444
2       home  colin   888
2     office  colin   777
3       home  david   NaN
3     office  david   999
"""
res3 = res2.reset_index(drop=True)
print(res3)
"""
   phone_type   name phone
0        home  alice   222
1        home  alice   333
2      office  alice   111
3      office  alice   111
4        home    bob   555
5        home    bob   666
6      office    bob   444
7      office    bob   444
8        home  colin   888
9      office  colin   777
10       home  david   NaN
11     office  david   999
"""
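In the final frame above, each office number appears once per extra number (alice's 111 is listed twice) and david keeps a NaN home row. A possible variation, sketched under the assumption that those rows are unwanted, takes the office numbers before the frame is expanded and drops the missing entries. It also strips the space left after each comma. The explicit `dropna()` is there because whether `stack()` discards missing cells on its own depends on the pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['alice', 'bob', 'colin', 'david'],
                   'phone': ['111', '444', '777', '999'],
                   'other_phone': ['222, 333', '555, 666', '888', np.nan]})

# Take the office numbers before any rows are duplicated.
workfones = df[['name', 'phone']]

# One row per extra number. dropna() removes david's missing entry and the
# padding cells that split(expand=True) creates for shorter rows.
homefones = (
    df['other_phone'].str.split(',', expand=True)
    .stack()
    .dropna()
    .str.strip()
    .reset_index(level=1, drop=True)
    .to_frame('phone')
    .join(df['name'])
    [['name', 'phone']]
)

res = (
    pd.concat([workfones, homefones], names=['phone_type'], keys=['office', 'home'])
    .reset_index(level=0)
    .sort_values(by=['name', 'phone_type'])
    .reset_index(drop=True)
)
print(res)
```

With the sample data this should give nine rows: one office row per person and one home row per extra number.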