merge consecutive matching rows
Question:
I want to merge all consecutive rows by matching all ‘X’ fields and concatenating the ‘Y’ field.
Below is sample data –
[Y X1 X2 X3 X4 X5
A NaN -3810 TRUE None None
B NaN -3810 TRUE None None
C NaN -3810 TRUE None None
D NaN -3810 None None None
E NaN -3810 None None None
F NaN -3810 None None None
G NaN -3810 None None None
H NaN -3810 TRUE None None
I NaN 2540 TRUE None None
J NaN 2540 None True None]
Expected output –
[A B C NaN -3810 TRUE None None
D E F G NaN -3810 None None None
H NaN -3810 TRUE None None
I NaN 2540 TRUE None None
J NaN 2540 None True None]
As stated above, if any of the X fields changes between consecutive rows, the rows shouldn't be concatenated.
Thanks in advance.
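For anyone who wants to reproduce the answers below, here is one way to build the sample frame. It assumes X1 holds real NaN floats and that X3–X5 hold the literal strings shown ('TRUE', 'None', 'True') rather than Python booleans/None; that detail is not stated in the question.

```python
import numpy as np
import pandas as pd

# Reconstruction of the question's sample data (X3-X5 assumed to be strings).
df = pd.DataFrame({
    'Y':  list('ABCDEFGHIJ'),
    'X1': [np.nan] * 10,
    'X2': [-3810] * 8 + [2540] * 2,
    'X3': ['TRUE'] * 3 + ['None'] * 4 + ['TRUE', 'TRUE', 'None'],
    'X4': ['None'] * 9 + ['True'],
    'X5': ['None'] * 10,
})
print(df.shape)  # (10, 6)
```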
Answers:
A little bit tricky: use shift to create the group key, then agg.
df.fillna('NaN', inplace=True)  # NaN never compares equal to NaN, so replace it with the string 'NaN' first
df.groupby((df.drop(columns='Y') != df.drop(columns='Y').shift()).any(axis=1).cumsum()) \
  .agg(lambda x: ','.join(x) if x.name == 'Y' else x.iloc[0])
Out[19]:
Y X1 X2 X3 X4 X5
1 A,B,C NaN -3810 TRUE None None
2 D,E,F,G NaN -3810 None None None
3 H NaN -3810 TRUE None None
4 I NaN 2540 TRUE None None
5 J NaN 2540 None True None
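A self-contained version of this shift-and-cumsum idea using the current pandas keyword API (columns=/axis= instead of the deprecated positional 1). The sample frame is an assumption reconstructed from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Y':  list('ABCDEFGHIJ'),
    'X1': [np.nan] * 10,
    'X2': [-3810] * 8 + [2540] * 2,
    'X3': ['TRUE'] * 3 + ['None'] * 4 + ['TRUE', 'TRUE', 'None'],
    'X4': ['None'] * 9 + ['True'],
    'X5': ['None'] * 10,
})

df = df.fillna('NaN')                       # NaN != NaN, so use a sentinel string
x = df.drop(columns='Y')
key = x.ne(x.shift()).any(axis=1).cumsum()  # bump the key whenever any X column changes
out = df.groupby(key).agg(
    lambda s: ','.join(s) if s.name == 'Y' else s.iloc[0]
)
print(out['Y'].tolist())  # ['A,B,C', 'D,E,F,G', 'H', 'I', 'J']
```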
itertools.groupby
Good to remember that itertools.groupby
handles consecutive grouping for us.
from itertools import groupby

Y = df.Y
X = df.filter(like='X').T                # transposed, so X[i] is row i's X values; same as df.drop(columns='Y').T
K = lambda i: tuple(X[i].fillna('NA'))   # group key: the X values, with NaN replaced so equal keys compare equal

tups = [
    (' '.join(Y.loc[V]), *X[V[0]])       # join the run's Y values, keep the first row's X values
    for _, [*V] in groupby(Y.index, key=K)
]

pd.DataFrame(tups, columns=df.columns)
Y X1 X2 X3 X4 X5
0 A B C NaN -3810 TRUE None None
1 D E F G NaN -3810 None None None
2 H NaN -3810 TRUE None None
3 I NaN 2540 TRUE None None
4 J NaN 2540 None True None
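A self-contained sketch of the same idea: group the row indices with itertools.groupby, keyed on the tuple of X values (NaN substituted so equal keys compare equal), then join Y within each run. The sample frame is reconstructed from the question:

```python
from itertools import groupby

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Y':  list('ABCDEFGHIJ'),
    'X1': [np.nan] * 10,
    'X2': [-3810] * 8 + [2540] * 2,
    'X3': ['TRUE'] * 3 + ['None'] * 4 + ['TRUE', 'TRUE', 'None'],
    'X4': ['None'] * 9 + ['True'],
    'X5': ['None'] * 10,
})

xs = df.drop(columns='Y')
keyed = xs.fillna('NA')                   # NaN != NaN, so substitute a comparable sentinel
key = lambda i: tuple(keyed.loc[i])

rows = []
for _, run in groupby(df.index, key=key): # runs of consecutive rows with identical X values
    idx = list(run)
    rows.append((' '.join(df['Y'].loc[idx]), *xs.loc[idx[0]]))

out = pd.DataFrame(rows, columns=df.columns)
print(out['Y'].tolist())  # ['A B C', 'D E F G', 'H', 'I', 'J']
```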
col1 = df1.iloc[:, 1:].astype(str).agg('*'.join, axis=1)  # serialise each row's X values (astype(str) turns NaN into 'nan')
col2 = col1.ne(col1.shift()).cumsum()                     # new group id whenever the serialised key changes
df1.groupby(col2).agg({'Y': ' '.join, 'X1': 'first', 'X2': 'first', 'X3': 'first', 'X4': 'first', 'X5': 'first'})
out:
Y X1 X2 X3 X4 X5
0 A B C NaN -3810 TRUE None None
1 D E F G NaN -3810 None None None
2 H NaN -3810 TRUE None None
3 I NaN 2540 TRUE None None
4 J NaN 2540 None True None
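The same string-key variant end-to-end, under the assumed sample frame from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Y':  list('ABCDEFGHIJ'),
    'X1': [np.nan] * 10,
    'X2': [-3810] * 8 + [2540] * 2,
    'X3': ['TRUE'] * 3 + ['None'] * 4 + ['TRUE', 'TRUE', 'None'],
    'X4': ['None'] * 9 + ['True'],
    'X5': ['None'] * 10,
})

# Serialise each row's X values into one string; astype(str) maps NaN to
# 'nan', so rows with identical X values get identical keys.
col1 = df1.iloc[:, 1:].astype(str).agg('*'.join, axis=1)
col2 = col1.ne(col1.shift()).cumsum()   # increment the id at every key change
out = df1.groupby(col2).agg({'Y': ' '.join, **{c: 'first' for c in df1.columns[1:]}})
print(out['Y'].tolist())  # ['A B C', 'D E F G', 'H', 'I', 'J']
```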