Combine unnormalized key/value columns into normalized form

Question:

I’m working on a legacy database with a table that looks like this:

Account Key1 Key2 Key3 Val1 Val2 Val3
1 Home Work 555-555-1111 555-555-2222
2 Home 555-555-3333
3 Mobile Work Home 555-555-4444 555-555-5555 555-555-6666

I’d like to transform that into a properly normalized form in Pandas. The desired output looks like this:

Account PhoneType PhoneNumber
1 Home 555-555-1111
1 Work 555-555-2222
2 Home 555-555-3333
3 Mobile 555-555-4444
3 Work 555-555-5555
3 Home 555-555-6666

The following code will create a dataframe to start with:

import pandas as pd

df = pd.DataFrame([
    {
        "Account": "1",
        "Key1": "Home",
        "Key2": "Work",
        "Val1": "555-555-1111",
        "Val2": "555-555-2222"
    },
    {
        "Account": "2",
        "Key1": "Home",
        "Val1": "555-555-3333"
    },
    {
        "Account": "3",
        "Key1": "Mobile",
        "Key2": "Work",
        "Key3": "Home",
        "Val1": "555-555-4444",
        "Val2": "555-555-5555",
        "Val3": "555-555-6666"
    }
])

What’s the cleanest / most efficient way to transform the dataframe as indicated above?

Asked By: Dacromir

||

Answers:

You could use wide-to-long for this

(
    pd.wide_to_long(df,
                ['Key','Val'],
                i='Account', j='n')
      .dropna()
      .reset_index()
      .drop(columns='n')
      .sort_values(by='Account')
)

Output

 Account     Key           Val
0       1    Home  555-555-1111
3       1    Work  555-555-2222
1       2    Home  555-555-3333
2       3  Mobile  555-555-4444
4       3    Work  555-555-5555
5       3    Home  555-555-6666

Edit: You could achieve this w/melt and lookup, but since lookup is deprecated it takes a little more effort.

import numpy as np
x = df.melt(id_vars=['Account','Key1','Key2','Key3'],
            var_name='PhoneType',
            value_name='PhoneNumber')

# This used to be what the lookup method in pandas did prior to being depricated
idx, cols =  pd.factorize(x['PhoneType'].str.replace('Val','Key'))
x['PhoneType']  = x.reindex(cols, axis=1).to_numpy()[np.arange(len(x)), idx]


x[['Account','PhoneType','PhoneNumber']].dropna().sort_values(by='Account')
Answered By: Chris

Personally, I would do the transformation before moving to pandas. Many transforms like this seem more straightforward outside of pandas.

import pandas as pd

data = [
    {
        "Account": "1",
        "Key1": "Home",
        "Key2": "Work",
        "Val1": "555-555-1111",
        "Val2": "555-555-2222"
    },
    {
        "Account": "2",
        "Key1": "Home",
        "Val1": "555-555-3333"
    },
    {
        "Account": "3",
        "Key1": "Mobile",
        "Key2": "Work",
        "Key3": "Home",
        "Val1": "555-555-4444",
        "Val2": "555-555-5555",
        "Val3": "555-555-6666"
    }
]

rows = []
for row in data:
    for n in "123":
        if f"Key{n}" in row:
            rows.append((row['Account'],row[f'Key{n}'],row[f'Val{n}']))

df = pd.DataFrame(rows, columns=["Account","PhoneType","PhoneNumber"])
print(df)

Output:

  Account PhoneType   PhoneNumber
0       1      Home  555-555-1111
1       1      Work  555-555-2222
2       2      Home  555-555-3333
3       3    Mobile  555-555-4444
4       3      Work  555-555-5555
5       3      Home  555-555-6666
Answered By: Tim Roberts

try:

df
    Account Key1    Key2    Val1            Val2            Key3    Val3
0   1       Home    Work    555-555-1111    555-555-2222    NaN     NaN
1   2       Home    NaN     555-555-3333    NaN             NaN     NaN
2   3       Mobile  Work    555-555-4444    555-555-5555    Home    555-555-6666

df['home_phone']   = np.select([df['Key1'].eq('Home'),df['Key2'].eq('Home'),df['Key3'].eq('Home')], [df['Val1'], df['Val2'], df['Val3']], None)
df['work_phone']   = np.select([df['Key1'].eq('Work'),df['Key2'].eq('Work'),df['Key3'].eq('Work')], [df['Val1'], df['Val2'], df['Val3']], None)
df['mobile_phone'] = np.select([df['Key1'].eq('Mobile'),df['Key2'].eq('Mobile'),df['Key3'].eq('Mobile')], [df['Val1'], df['Val2'], df['Val3']], None)
df = df.drop(columns=['Key1', 'Key2', 'Val1', 'Val2', 'Key3', 'Val3',])

df

    Account home_phone      work_phone      mobile_phone
0   1       555-555-1111    555-555-2222    None
1   2       555-555-3333    None            None
2   3       555-555-6666    555-555-5555    555-555-4444
Answered By: khaled koubaa
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.