Pandas – Duplicate rows with phone numbers based on type

Question:

My dataframe is in the following format:

name   phone           other_phone
alice  (111) 111-1111  (222) 222-2222, (333) 333-3333
bob    (444) 444-4444  (555) 555-5555, (666) 666-6666
colin  (777) 777-7777  (888) 888-8888
david  (999) 999-9999  NaN

I want to split the other_phone column on the comma (if there is one) and duplicate rows such that the output is:

name   phone           phone_type
alice  (111) 111-1111  work
alice  (222) 222-2222  other
alice  (333) 333-3333  other
bob    (444) 444-4444  work
bob    (555) 555-5555  other
bob    (666) 666-6666  other
colin  (777) 777-7777  work
colin  (888) 888-8888  other
david  (999) 999-9999  work

The overall goal is to have exactly one phone number per row. How can I accomplish this?

Asked By: KLG


Answers:

This takes a bit of reshaping. Specifically, we need to build two separate DataFrames, each holding a "name" and a "phone" column; "phone_type" is then derived from which DataFrame a row came from.

  1. The easiest part is to create a work_phones DataFrame, since that’s already nicely represented by the data.

  2. A little bit trickier will be to split out the "other_phone" data. We’ll need to split these values and then explode them to vertically stack all of the values. Finally we’ll need to reach back into the source DataFrame to grab the correct "name"s for each phone number.

  3. Finally, we stick these 2 parts on top of each other via pd.concat!
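For a reproducible starting point, the sample data from the question can be rebuilt as follows. This is only a sketch: it assumes both phone columns hold plain strings and that a missing other_phone is stored as NaN, as the question's table suggests.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks "no other phone".
df = pd.DataFrame({
    'name': ['alice', 'bob', 'colin', 'david'],
    'phone': [
        '(111) 111-1111',
        '(444) 444-4444',
        '(777) 777-7777',
        '(999) 999-9999',
    ],
    'other_phone': [
        '(222) 222-2222, (333) 333-3333',
        '(555) 555-5555, (666) 666-6666',
        '(888) 888-8888',
        np.nan,
    ],
})
```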

Assuming your DataFrame is stored in a variable called df:

import pandas as pd

# easy
work_phones = df[['name', 'phone']]

# little trickier
other_phones = (
    df['other_phone'].str.split(',')
    .explode()
    .str.strip()   # remove the space left after each comma
    .dropna()      # people without an other phone get no extra row
    .to_frame('phone')

    # grab the correct name from the original DataFrame
    .join(df['name'])
)

Now that we have both parts, we stack them on top of each other with pd.concat: keys labels each block, and names turns that label into the "phone_type" index level.

final = (
    pd.concat([work_phones, other_phones], names=['phone_type'], keys=['work', 'other'])
    .reset_index(level=0)

    # sort the data to match OP output
    .sort_values(['name', 'phone_type'], ascending=[True, False])
)

print(final)
  phone_type   name           phone
0       work  alice  (111) 111-1111
0      other  alice  (222) 222-2222
0      other  alice  (333) 333-3333
1       work    bob  (444) 444-4444
1      other    bob  (555) 555-5555
1      other    bob  (666) 666-6666
2       work  colin  (777) 777-7777
2      other  colin  (888) 888-8888
3       work  david  (999) 999-9999
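
As a sanity check against the goal stated in the question (one phone number per row), the steps can be wrapped in a helper and tested. This is a hedged sketch rather than a definitive implementation: stack_phones is a hypothetical name, not a pandas function, and it assumes other_phone holds comma-separated strings or NaN. It strips the whitespace left by the comma split and drops missing entries, so people without a second number get no "other" row.

```python
import numpy as np
import pandas as pd


def stack_phones(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per phone number, labelled 'work' or 'other'.

    Hypothetical helper for illustration; assumes the columns
    'name', 'phone' and 'other_phone' exist.
    """
    work = df[['name', 'phone']]
    other = (
        df['other_phone'].str.split(',')
        .explode()             # one list element per row, index repeated
        .str.strip()           # remove whitespace around each number
        .dropna()              # no row for a missing other_phone
        .to_frame('phone')
        .join(df['name'])      # look the name up via the repeated index
    )
    return (
        pd.concat([work, other], keys=['work', 'other'], names=['phone_type'])
        .reset_index(level=0)  # turn the key level into a column
        .sort_values(['name', 'phone_type'], ascending=[True, False])
        .reset_index(drop=True)
    )


sample = pd.DataFrame({
    'name': ['alice', 'bob', 'colin', 'david'],
    'phone': ['(111) 111-1111', '(444) 444-4444',
              '(777) 777-7777', '(999) 999-9999'],
    'other_phone': ['(222) 222-2222, (333) 333-3333',
                    '(555) 555-5555, (666) 666-6666',
                    '(888) 888-8888',
                    np.nan],
})

result = stack_phones(sample)

# The goal from the question: no cell holds more than one number.
assert not result['phone'].str.contains(',').any()
print(result)
```

If every other_phone value is missing, pandas may read the column as floats and the .str accessor will raise, so that case would need a guard of its own.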
Answered By: Cameron Riddell
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['alice', 'bob', 'colin', 'david'],
                   'phone': ['111', '444', '777', '999'],
                   'other_phone':
                   ['222, 333', '555, 666', '888', np.nan]})
print(df)
"""
    name phone other_phone
0  alice   111    222, 333
1    bob   444    555, 666
2  colin   777         888
3  david   999         NaN
"""

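# Split other_phone into one value per row. The repeated index lets
# join() duplicate name and phone for every extra number.
df = df.join(
    df.pop('other_phone')
    .str.split(',', expand=True)
    .stack()
    .dropna()      # drop missing slots explicitly rather than relying on stack()
    .str.strip()   # remove the space left after each comma
    .reset_index(level=1, drop=True)
    .rename('other_phone')
)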
print(df)
"""
    name phone other_phone
0  alice   111         222
0  alice   111         333
1    bob   444         555
1    bob   444         666
2  colin   777         888
3  david   999         NaN
"""
# Each exploded row repeats the main number, so keep one copy per person.
workfones = df[['name', 'phone']].drop_duplicates()
print(workfones)
"""
    name phone
0  alice   111
1    bob   444
2  colin   777
3  david   999
"""

# Rename other_phone to phone so both frames share the same columns,
# and drop people who have no other number.
homefones = (
    df[['name', 'other_phone']]
    .rename(columns={'other_phone': 'phone'})
    .dropna(subset=['phone'])
)
print(homefones)
"""
    name phone
0  alice   222
0  alice   333
1    bob   555
1    bob   666
2  colin   888
"""

# keys labels each block; names gives that new index level a name.
res = pd.concat([workfones, homefones], names=['phone_type'], keys=['office', 'home'])
print(res)
"""
               name phone
phone_type
office     0  alice   111
           1    bob   444
           2  colin   777
           3  david   999
home       0  alice   222
           0  alice   333
           1    bob   555
           1    bob   666
           2  colin   888
"""

# Move phone_type out of the index into a regular column.
res1 = res.reset_index(level=0)
print(res1)
"""
  phone_type   name phone
0     office  alice   111
1     office    bob   444
2     office  colin   777
3     office  david   999
0       home  alice   222
0       home  alice   333
1       home    bob   555
1       home    bob   666
2       home  colin   888
"""

# Group each person's rows together.
res2 = res1.sort_values(by=['name', 'phone_type'])
print(res2)
"""
  phone_type   name phone
0       home  alice   222
0       home  alice   333
0     office  alice   111
1       home    bob   555
1       home    bob   666
1     office    bob   444
2       home  colin   888
2     office  colin   777
3     office  david   999
"""

# Replace the repeated index with a clean 0..n-1 range.
res3 = res2.reset_index(drop=True)
print(res3)
"""
  phone_type   name phone
0       home  alice   222
1       home  alice   333
2     office  alice   111
3       home    bob   555
4       home    bob   666
5     office    bob   444
6       home  colin   888
7     office  colin   777
8     office  david   999
"""
Answered By: Soudipta Dutta