Python/Pandas: How to self-join a pandas dataframe on rows with the same index
Question:
I have a dataframe that looks like the one below:
merge_id | identifier | Location | Value |
---|---|---|---|
1 | A1 | DEL | 50 |
1 | B2 | HYD | 60 |
2 | C1 | BEN | 80 |
2 | D2 | HYD | 10 |
I want the output dataframe to look like the one below:
merge_id | identifier | Location | Value | m_identifier | m_Location | m_Value |
---|---|---|---|---|---|---|
1 | A1 | DEL | 50 | B2 | HYD | 60 |
2 | C1 | BEN | 80 | D2 | HYD | 10 |
Can you please suggest how I can do that?
Answers:
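Here is one way to go about it:
# Right side of the merge: blank out rows whose identifier ends in '1',
# so only the other row of each merge_id group can match.
df2 = df.merge(df.mask(df['identifier'].str.endswith('1')),
               on='merge_id',
               how='left',
               suffixes=(None, '_m'))
# Blank out rows that were paired with themselves, then drop them.
df2 = df2.mask(df2['identifier'].eq(df2['identifier_m']))
df2.dropna()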
merge_id identifier Location Value identifier_m Location_m Value_m
0 1.0 A1 DEL 50.0 B2 HYD 60.0
2 2.0 C1 BEN 80.0 D2 HYD 10.0
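The approach above depends on the first identifier of each group ending in '1', and the mask step turns merge_id and Value into floats, as the output shows. A minimal sketch of a variant that pairs rows by their position within each merge_id group instead, assuming each group has at most two rows; the names `pos`, `first` and `second` are illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "merge_id": [1, 1, 2, 2],
    "identifier": ["A1", "B2", "C1", "D2"],
    "Location": ["DEL", "HYD", "BEN", "HYD"],
    "Value": [50, 60, 80, 10],
})

# Position of each row inside its merge_id group: 0 for the first row,
# 1 for the second.
pos = df.groupby("merge_id").cumcount()

first = df[pos.eq(0)]   # rows that stay on the left
second = df[pos.eq(1)]  # rows that get the "_m" suffix

# Self-join the two halves on merge_id; None keeps the left names as-is.
out = first.merge(second, on="merge_id", how="left", suffixes=(None, "_m"))
print(out)
```

Because no rows are blanked out along the way, no dropna step is needed. Any rows beyond the second one in a group would be ignored by this sketch.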
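This looks like a pivot with a few tweaks:
# c numbers the rows within each merge_id group (0, 1, ...).
df2 = (df.assign(c=df.groupby('merge_id').cumcount())
         .pivot(index='merge_id', columns='c')
         # group the columns by position instead of by original column
         .sort_index(level=1, sort_remaining=False, axis=1)
      )
# x is an (original column, position) pair; position 0 keeps its name,
# any other position gets the "m_" prefix.
df2.columns = df2.columns.map(lambda x: f'{"m_" if x[1] else ""}{x[0]}')
print(df2.reset_index())
Output: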
merge_id identifier Location Value m_identifier m_Location m_Value
0 1 A1 DEL 50 B2 HYD 60
1 2 C1 BEN 80 D2 HYD 10
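To make the pivot chain above easier to follow, here is a step-by-step sketch of the same idea. It is not the answerer's exact code: the column order is set with an explicit reindex rather than sort_index, and the names `c`, `wide`, `value_cols`, `positions` and `order` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "merge_id": [1, 1, 2, 2],
    "identifier": ["A1", "B2", "C1", "D2"],
    "Location": ["DEL", "HYD", "BEN", "HYD"],
    "Value": [50, 60, 80, 10],
})

# Step 1: position of each row inside its merge_id group.
c = df.groupby("merge_id").cumcount()

# Step 2: pivot so that every position gets its own set of columns.
# The columns become a MultiIndex of (original column, position) pairs.
wide = df.assign(c=c).pivot(index="merge_id", columns="c")

# Step 3: order the columns position by position, keeping the original
# column order inside each position.
value_cols = [col for col in df.columns if col != "merge_id"]
positions = sorted(c.unique())
order = pd.MultiIndex.from_tuples(
    [(col, p) for p in positions for col in value_cols]
)
wide = wide.reindex(columns=order)

# Step 4: flatten the pairs; position 0 keeps its name, others get "m_".
wide.columns = [f"m_{col}" if p else col for col, p in wide.columns]
out = wide.reset_index()
print(out)
```

With more than two rows per group, every position after the first would receive the same "m_" prefix and the column names would collide, so the naming rule would need to include the position in that case.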
Another possible solution:
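grouped = df.groupby('merge_id')
# First and last row of each merge_id group; the last row gets the "m_" prefix.
df1 = df.loc[grouped.head(1).index]
df2 = df.loc[grouped.tail(1).index].add_prefix('m_')
# Join the two halves on the group key and drop the duplicated key column.
out = (df1.merge(df2, left_on='merge_id', right_on='m_merge_id')
          .drop('m_merge_id', axis=1))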
Output:
merge_id identifier Location Value m_identifier m_Location m_Value
0 1 A1 DEL 50 B2 HYD 60
1 2 C1 BEN 80 D2 HYD 10
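Since the title asks about joining on the index, the head/tail split above can also be written as an index join. This is a minimal sketch under the same assumption of two rows per merge_id; the names `first` and `last` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "merge_id": [1, 1, 2, 2],
    "identifier": ["A1", "B2", "C1", "D2"],
    "Location": ["DEL", "HYD", "BEN", "HYD"],
    "Value": [50, 60, 80, 10],
})

grouped = df.groupby("merge_id")

# Move merge_id into the index so both halves can be joined on it.
first = grouped.head(1).set_index("merge_id")
# add_prefix renames the columns only; the index name stays "merge_id".
last = grouped.tail(1).set_index("merge_id").add_prefix("m_")

# DataFrame.join aligns on the index, so no key column has to be dropped.
out = first.join(last).reset_index()
print(out)
```

For a group with a single row, head(1) and tail(1) return the same row, so the m_ columns would simply repeat it.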