Duplicating rows where a cell contains multiple pieces of data

Question

I would like to take a dataframe and duplicate certain rows.
One column, called name, may have multiple names.
An example dataframe is constructed below:

import pandas as pd

data = [
    ['Joe', '17-11-2018', '2'],
    ['Karen', '17-11-2018', '4'],
    ['Bill, Avery', '17-11-2018', '6'],
    ['Sam', '18-11-2018', '4'],
    ['Alex, Frank', '18-11-2018', '6'],
    ['Chris', '18-11-2018', '8'],
]
df = pd.DataFrame(data, columns = ['name','date','number'])

This yields the following dataframe:

          name        date number
0          Joe  17-11-2018      2
1        Karen  17-11-2018      4
2  Bill, Avery  17-11-2018      6
3          Sam  18-11-2018      4
4  Alex, Frank  18-11-2018      6
5        Chris  18-11-2018      8

I would like to take all rows where there are multiple names (comma-separated) and duplicate them for each individual name. The resulting dataframe should look like this:

    name        date number
0    Joe  17-11-2018      2
1  Karen  17-11-2018      4
2   Bill  17-11-2018      6
3  Avery  17-11-2018      6
4    Sam  18-11-2018      4
5   Alex  18-11-2018      6
6  Frank  18-11-2018      6
7  Chris  18-11-2018      8

Asked By: Jack Walsh

Answer 1

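After str.split, this becomes an unnesting problem. Splitting on ', ' (comma plus space) avoids leaving a leading space on the second name. The unnesting helper is defined below the output.

df['name'] = df.name.str.split(', ')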

unnesting(df,['name'])
Out[97]: 
     name        date number
0     Joe  17-11-2018      2
1   Karen  17-11-2018      4
2    Bill  17-11-2018      6
2   Avery  17-11-2018      6
3     Sam  18-11-2018      4
4    Alex  18-11-2018      6
4   Frank  18-11-2018      6
5   Chris  18-11-2018      8

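import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each row label once per element of the first list column.
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # drop(columns=...) replaces the positional axis argument removed in pandas 2.0.
    return df1.join(df.drop(columns=explode), how='left')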
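Note that the output above keeps the original row labels (2 and 4 appear twice), while the question asks for rows numbered 0 to 7. A minimal sketch of renumbering, assuming a recent pandas and using a shortened two-row version of the question's data:

```python
import numpy as np
import pandas as pd


def unnesting(df, explode):
    # Repeat each row label once per element of the first list column.
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every list column into one long column.
    df1 = pd.concat(
        [pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode],
        axis=1,
    )
    df1.index = idx
    # Join the untouched columns back on the repeated labels.
    return df1.join(df.drop(columns=explode), how="left")


# Shortened sample (two rows taken from the question's data).
df = pd.DataFrame(
    [["Joe", "17-11-2018", "2"], ["Bill, Avery", "17-11-2018", "6"]],
    columns=["name", "date", "number"],
)
df["name"] = df["name"].str.split(", ")

out = unnesting(df, ["name"])        # labels 0, 1, 1
flat = out.reset_index(drop=True)    # labels 0, 1, 2
print(flat)
```

reset_index(drop=True) discards the repeated labels rather than keeping them as a new column.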

Answered By: BENY

Answer 2

Jack, I don't use dataframes much, but the following code should work if it runs before the df = pd.DataFrame(...) line, which should then be given new_data instead of data:

new_data = []
for item in data:
    # One output row per comma-separated name; the other fields are copied as-is.
    for name in item[0].split(", "):
        new_data.append([name, item[1], item[2]])

Answered By: TechPerson

Answer 3

Update 2023

The other answers use older methods. pandas now has an explode method (added in pandas 0.25, with the ignore_index argument added in 1.1) that unnests a column of lists. So the modern way to do this is:

df.assign(name=df["name"].str.split(", ")).explode("name", ignore_index=True)

    name        date number
0    Joe  17-11-2018      2
1  Karen  17-11-2018      4
2   Bill  17-11-2018      6
3  Avery  17-11-2018      6
4    Sam  18-11-2018      4
5   Alex  18-11-2018      6
6  Frank  18-11-2018      6
7  Chris  18-11-2018      8
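If the spacing around the commas is not consistent, splitting on ", " will miss some names. A minimal sketch of one way to handle that, assuming pandas 1.1 or later: split on the bare comma, explode, then strip whitespace. The messy sample row and the name "Dana" are made up for illustration.

```python
import pandas as pd

# Hypothetical sample with irregular spacing around the commas.
df = pd.DataFrame(
    [["Joe", "2"], ["Bill,Avery ,  Dana", "6"]],
    columns=["name", "number"],
)

out = (
    df.assign(name=df["name"].str.split(","))        # lists of raw pieces
      .explode("name", ignore_index=True)            # one row per piece, index 0..n-1
      .assign(name=lambda d: d["name"].str.strip())  # trim stray spaces
)
print(out)
```

Stripping after the explode keeps the split itself simple and works whatever amount of whitespace surrounds each name.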

Old answer

For a string with a separator you can use the following function found in this answer:

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

explode_str(df, 'name', ', ')  # separator includes the space, so names carry no leading blank

     name        date number
0     Joe  17-11-2018      2
1   Karen  17-11-2018      4
2    Bill  17-11-2018      6
2   Avery  17-11-2018      6
3     Sam  18-11-2018      4
4    Alex  18-11-2018      6
4   Frank  18-11-2018      6
5   Chris  18-11-2018      8
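To see why explode_str works, it can help to look at the intermediate values. This is a sketch on a shortened three-row sample, with the steps of the function pulled out into separate variables (the names counts, i and names are mine, not from the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["Joe", "2"], ["Bill, Avery", "6"], ["Sam", "4"]],
    columns=["name", "number"],
)
sep = ", "
s = df["name"]

# Number of names per row: separators found, plus one.
counts = s.str.count(sep) + 1
# Positional row numbers, each repeated once per name in that row.
i = np.arange(len(s)).repeat(counts)
# Join every cell into one long string, then split it back into single names.
names = sep.join(s).split(sep)

# Pick the repeated rows by position and overwrite the name column.
out = df.iloc[i].assign(name=names)
print(out)
```

One caveat: Series.str.count treats its argument as a regular expression, so a separator containing regex metacharacters such as "|" would need escaping (for example with re.escape) before being counted.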

Answered By: Erfan

Answer 4

I believe I read something like this.
As soon as I locate the link, I will share it.

import pandas as pd

data = [
    ['J', '17-11-2018', '2'],
    ['K', '17-11-2018', '4'],
    ['B, A', '17-11-2018', '6'],
    ['S', '18-11-2018', '4'],
    ['L, F', '18-11-2018', '6'],
    ['C', '18-11-2018', '8'],
]
df = pd.DataFrame(data, columns = ['P','Q','R'])
print(df)

"""
      P           Q  R
0     J  17-11-2018  2
1     K  17-11-2018  4
2  B, A  17-11-2018  6
3     S  18-11-2018  4
4  L, F  18-11-2018  6
5     C  18-11-2018  8
                    
"""


# Split on ', ' (comma plus space) so the second name has no leading blank.
m1 = (lambda col: pd.Series(col).str.split(', '))
aa = df.set_index(['R','Q']).apply(m1)
print(aa)
"""
                   P
R Q                 
2 17-11-2018     [J]
4 17-11-2018     [K]
6 17-11-2018  [B, A]
4 18-11-2018     [S]
6 18-11-2018  [L, F]
8 18-11-2018     [C]

"""
res = aa.explode('P').reset_index().reindex(df.columns,axis=1)
print(res)

"""
  P           Q  R
0   J  17-11-2018  2
1   K  17-11-2018  4
2   B  17-11-2018  6
3   A  17-11-2018  6
4   S  18-11-2018  4
5   L  18-11-2018  6
6   F  18-11-2018  6
7   C  18-11-2018  8

"""

Answered By: Soudipta Dutta
