Duplicating rows where a cell contains multiple pieces of data
Question:
I would like to take a dataframe and duplicate certain rows.
One column, called name, may contain multiple names.
An example dataframe is constructed below:
import pandas as pd

data = [
    ['Joe', '17-11-2018', '2'],
    ['Karen', '17-11-2018', '4'],
    ['Bill, Avery', '17-11-2018', '6'],
    ['Sam', '18-11-2018', '4'],
    ['Alex, Frank', '18-11-2018', '6'],
    ['Chris', '18-11-2018', '8'],
]
df = pd.DataFrame(data, columns=['name', 'date', 'number'])
This yields the following dataframe:
          name        date number
0          Joe  17-11-2018      2
1        Karen  17-11-2018      4
2  Bill, Avery  17-11-2018      6
3          Sam  18-11-2018      4
4  Alex, Frank  18-11-2018      6
5        Chris  18-11-2018      8
I would like to take all rows where there are multiple names (comma-separated) and duplicate them for each individual name. The resulting dataframe should look like this:
    name        date number
0    Joe  17-11-2018      2
1  Karen  17-11-2018      4
2   Bill  17-11-2018      6
3  Avery  17-11-2018      6
4    Sam  18-11-2018      4
5   Alex  18-11-2018      6
6  Frank  18-11-2018      6
7  Chris  18-11-2018      8
Answers:
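After str.split, this becomes an unnesting problem (unnesting is the helper function defined further down):
# Split on ', ' so the resulting names carry no leading space
df['name'] = df['name'].str.split(', ')
unnesting(df, ['name'])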
Out[97]:
    name        date number
0    Joe  17-11-2018      2
1  Karen  17-11-2018      4
2   Bill  17-11-2018      6
2  Avery  17-11-2018      6
3    Sam  18-11-2018      4
4   Alex  18-11-2018      6
4  Frank  18-11-2018      6
5  Chris  18-11-2018      8
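def unnesting(df, explode):
    # Repeat each index label once per element of the first list column
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every list column into one long column
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # Use columns=...; newer pandas versions no longer accept the axis positionally
    return df1.join(df.drop(columns=explode), how='left')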
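For reference, here is a self-contained sketch of the same unnesting approach run against the question's data. It assumes a recent pandas, so it uses drop(columns=...), and it adds a reset_index(drop=True) at the end, which is my own addition to get the 0 to 7 index the question asks for instead of the repeated labels shown in the output above.

```python
import numpy as np
import pandas as pd


def unnesting(df, explode):
    # Repeat each index label once per element of the first list column.
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every list column into one long column.
    df1 = pd.concat(
        [pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode],
        axis=1,
    )
    df1.index = idx
    # Re-attach the untouched columns by index label.
    return df1.join(df.drop(columns=explode), how="left")


data = [
    ["Joe", "17-11-2018", "2"],
    ["Karen", "17-11-2018", "4"],
    ["Bill, Avery", "17-11-2018", "6"],
    ["Sam", "18-11-2018", "4"],
    ["Alex, Frank", "18-11-2018", "6"],
    ["Chris", "18-11-2018", "8"],
]
df = pd.DataFrame(data, columns=["name", "date", "number"])
df["name"] = df["name"].str.split(", ")

# reset_index(drop=True) replaces the repeated labels with 0..n-1.
result = unnesting(df, ["name"]).reset_index(drop=True)
print(result)
```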
Jack. I don’t use dataframes much, but the following code should work if you run it before df = pd.DataFrame(data, columns = ['name','date','number']) and then build the dataframe from new_data instead of data:
new_data = []
for item in data:
    # One output row per name; rows with a single name pass through unchanged.
    # Looping over the split also handles cells with more than two names.
    for name in item[0].split(", "):
        new_data.append([name, item[1], item[2]])
Update 2023
The other answers use older methods. pandas now has the Series.explode method (and DataFrame.explode), which unnests a list-valued column. So the modern way to do this is:
df.assign(name=df["name"].str.split(", ")).explode("name", ignore_index=True)
    name        date number
0    Joe  17-11-2018      2
1  Karen  17-11-2018      4
2   Bill  17-11-2018      6
3  Avery  17-11-2018      6
4    Sam  18-11-2018      4
5   Alex  18-11-2018      6
6  Frank  18-11-2018      6
7  Chris  18-11-2018      8
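The one-liner above assumes every separator is exactly a comma followed by one space. If the real data is less tidy, one option is to split on the bare comma and strip whitespace afterwards. The sketch below uses made-up rows with inconsistent spacing to illustrate this; the input is hypothetical and not from the question.

```python
import pandas as pd

# Hypothetical input with inconsistent spacing around the commas.
df = pd.DataFrame(
    [
        ["Joe", "17-11-2018", "2"],
        ["Bill,Avery", "17-11-2018", "6"],
        ["Alex ,  Frank", "18-11-2018", "6"],
    ],
    columns=["name", "date", "number"],
)

# Split on the comma alone, explode to one row per name,
# then strip any whitespace left around each name.
out = df.assign(name=df["name"].str.split(",")).explode("name", ignore_index=True)
out["name"] = out["name"].str.strip()
print(out)
```

Stripping after the explode keeps the split pattern simple and does not depend on regular-expression support in str.split.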
Old answer
For a string with a separator you can use the following function found in this answer:
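def explode_str(df, col, sep):
    s = df[col]
    # Row i is repeated once per piece: (number of separators) + 1
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # Join all cells, split again, and assign the flat list to the repeated rows
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

# Pass ', ' (comma plus space) so the names carry no leading space
explode_str(df, 'name', ', ')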
    name        date number
0    Joe  17-11-2018      2
1  Karen  17-11-2018      4
2   Bill  17-11-2018      6
2  Avery  17-11-2018      6
3    Sam  18-11-2018      4
4   Alex  18-11-2018      6
4  Frank  18-11-2018      6
5  Chris  18-11-2018      8
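One caveat with explode_str: Series.str.count treats its argument as a regular expression, so a separator such as "|" would be miscounted. A possible variant, sketched below, escapes the separator with re.escape before counting. The pipe-separated data is a hypothetical example, and the reset_index(drop=True) call is an addition to get a fresh 0..n-1 index.

```python
import re

import numpy as np
import pandas as pd


def explode_str(df, col, sep):
    s = df[col]
    # str.count takes a regex, so escape the separator to count it literally.
    repeats = s.str.count(re.escape(sep)) + 1
    # Positional row numbers, each repeated once per piece in that row.
    i = np.arange(len(s)).repeat(repeats)
    # Join all cells, split again, and assign the flat list to the repeated rows.
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})


# Hypothetical data using "|" as the separator.
df = pd.DataFrame(
    {"name": ["Joe", "Bill|Avery", "Sam"], "number": ["2", "6", "4"]}
)
result = explode_str(df, "name", "|").reset_index(drop=True)
print(result)
```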
I believe I read something like this.
As soon as I locate the link, I will share it.
import pandas as pd
data = [
    ['J', '17-11-2018', '2'],
    ['K', '17-11-2018', '4'],
    ['B, A', '17-11-2018', '6'],
    ['S', '18-11-2018', '4'],
    ['L, F', '18-11-2018', '6'],
    ['C', '18-11-2018', '8'],
]
df = pd.DataFrame(data, columns=['P', 'Q', 'R'])
print(df)
"""
      P           Q  R
0     J  17-11-2018  2
1     K  17-11-2018  4
2  B, A  17-11-2018  6
3     S  18-11-2018  4
4  L, F  18-11-2018  6
5     C  18-11-2018  8
"""
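# Split each cell of P into a list; ', ' keeps the pieces free of leading spaces
m1 = lambda col: col.str.split(', ')
# Move R and Q into the index so only P is split, then exploded
aa = df.set_index(['R', 'Q']).apply(m1)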
print(aa)
"""
                   P
R Q
2 17-11-2018     [J]
4 17-11-2018     [K]
6 17-11-2018  [B, A]
4 18-11-2018     [S]
6 18-11-2018  [L, F]
8 18-11-2018     [C]
"""
res = aa.explode('P').reset_index().reindex(df.columns,axis=1)
print(res)
"""
P Q R
0 J 17-11-2018 2
1 K 17-11-2018 4
2 B 17-11-2018 6
3 A 17-11-2018 6
4 S 18-11-2018 4
5 L 18-11-2018 6
6 F 18-11-2018 6
7 C 18-11-2018 8
"""