Groupby and concatenate unique values by separator in a Pandas DataFrame
Question:
I have the following pandas DataFrame:
org_id org_name location_id loc_status city country
0 100023310 advance GmbH LOC-100052061 ACTIVE Planegg Germany
1 100023310 advance GmbH LOC-100032442 ACTIVE Planegg Germany
2 100023310 advance GmbH LOC-100042003 INACTIVE Planegg Germany
3 100004261 Beacon Limited LOC-100005615 ACTIVE Tunbridge Wells United Kingdom
4 100004261 Beacon Limited LOC-100000912 ACTIVE Crowborough United Kingdom
I would like to group the rows by the columns org_id and org_name and, for each of the other columns, concatenate the unique values using '|' as the separator.
I am using the following lines of code:
gr_columns = [x for x in df.columns if x not in ['location_id', 'loc_status','city', 'country']]
df.groupby(gr_columns).agg(lambda col: '|'.join(col))
However, the final DataFrame is missing some of the columns (city and country). I get the following output:
org_id org_name location_id loc_status
1 100023310 advance GmbH LOC-100052061|LOC-100032442|LOC-100042003 ACTIVE|INACTIVE
2 100004261 Beacon Limited LOC-100005615 ACTIVE
I also get the following warning:
FutureWarning: Dropping invalid columns in DataFrameGroupBy.agg is deprecated. In a future version, a TypeError will be raised. Before calling .agg, select only columns which should be valid for the function.
df.groupby(gr_columns).agg(lambda col: ','.join(col))
The expected output is:
org_id org_name location_id loc_status city country
1 100023310 advance GmbH LOC-100052061|LOC-100032442|LOC-100042003 ACTIVE|INACTIVE Planegg Germany
2 100004261 Beacon Limited LOC-100005615 ACTIVE Tunbridge Wells|Crowborough United Kingdom
Any help is highly appreciated.
Answers:
Update
In fact, it seems you want to join everything with unique values:
join_unique = lambda x: '|'.join(x.unique())
out = df.groupby(['org_id', 'org_name'], as_index=False).agg(join_unique)
print(out)
# Output with pd.set_option('display.max_columns', None)
org_id org_name location_id
0 100004261 Beacon Limited LOC-100005615|LOC-100000912
1 100023310 advance GmbH LOC-100052061|LOC-100032442|LOC-100042003
loc_status city country
0 ACTIVE Tunbridge Wells|Crowborough United Kingdom
1 ACTIVE|INACTIVE Planegg Germany
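For anyone who wants to run this end to end, here is a minimal, self-contained sketch. The DataFrame is rebuilt by hand from the rows shown in the question, so the dtypes are an assumption (org_id as int, everything else as plain strings), and join_unique is written as a named function instead of a lambda purely for readability:

```python
import pandas as pd

# Sample data retyped from the question (dtypes are assumed).
df = pd.DataFrame({
    "org_id": [100023310, 100023310, 100023310, 100004261, 100004261],
    "org_name": ["advance GmbH"] * 3 + ["Beacon Limited"] * 2,
    "location_id": ["LOC-100052061", "LOC-100032442", "LOC-100042003",
                    "LOC-100005615", "LOC-100000912"],
    "loc_status": ["ACTIVE", "ACTIVE", "INACTIVE", "ACTIVE", "ACTIVE"],
    "city": ["Planegg", "Planegg", "Planegg", "Tunbridge Wells", "Crowborough"],
    "country": ["Germany"] * 3 + ["United Kingdom"] * 2,
})


def join_unique(col: pd.Series) -> str:
    """Join the distinct values of one group with '|', in first-seen order."""
    return "|".join(col.unique())


# Every column that is not a grouping key is aggregated with join_unique.
out = df.groupby(["org_id", "org_name"], as_index=False).agg(join_unique)
print(out.to_string(index=False))
```

Note that groupby sorts by the group keys by default, so Beacon Limited (the smaller org_id) comes first; pass sort=False to groupby if the original row order matters.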
Old answer
You can use groupby with agg:
>>> (df.groupby(['org_id', 'org_name'], as_index=False)
.agg({'location_id': '|'.join, 'city': 'first', 'country': 'first'}))
org_id org_name location_id city country
0 100004261 Beacon Limited LOC-100005615|LOC-100000912 Tunbridge Wells United Kingdom
1 100023310 advance GmbH LOC-100052061|LOC-100032442|LOC-100042003 Planegg Germany
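If different columns need different treatment, as in the dict above, named aggregation is another option. The following is only a sketch: the output column names (location_ids, statuses, n_locations) are made up for illustration, and the sample data is a cut-down version of the question's rows:

```python
import pandas as pd

# Cut-down sample based on the question's data.
df = pd.DataFrame({
    "org_id": [100023310, 100023310, 100023310, 100004261, 100004261],
    "org_name": ["advance GmbH"] * 3 + ["Beacon Limited"] * 2,
    "location_id": ["LOC-100052061", "LOC-100032442", "LOC-100042003",
                    "LOC-100005615", "LOC-100000912"],
    "loc_status": ["ACTIVE", "ACTIVE", "INACTIVE", "ACTIVE", "ACTIVE"],
})

out = (
    df.groupby(["org_id", "org_name"], as_index=False)
      .agg(
          # Keep every location id, duplicates included.
          location_ids=("location_id", "|".join),
          # Collapse repeated statuses before joining.
          statuses=("loc_status", lambda s: "|".join(s.unique())),
          # Count the distinct locations per organisation.
          n_locations=("location_id", "nunique"),
      )
)
print(out)
```

Each keyword becomes an output column, and each tuple is (source column, aggregation), so one source column can feed several outputs.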
I think you are looking for:
df.groupby(['org_id', 'org_name'], as_index=False).agg(lambda x: '|'.join(x.unique()))
org_id org_name location_id
0 100004261 Beacon Limited LOC-100005615|LOC-100000912
1 100023310 advance GmbH LOC-100052061|LOC-100032442|LOC-100042003
loc_status city country
0 ACTIVE Tunbridge Wells|Crowborough United Kingdom
1 ACTIVE|INACTIVE Planegg Germany
Here’s a way to do what your question asks:
print(df.groupby(['org_id', 'org_name']).agg(lambda col: '|'.join(col.drop_duplicates())).reset_index())
Output:
org_id org_name location_id loc_status city country
0 100004261 Beacon Limited LOC-100005615|LOC-100000912 ACTIVE Tunbridge Wells|Crowborough United Kingdom
1 100023310 advance GmbH LOC-100052061|LOC-100032442|LOC-100042003 ACTIVE|INACTIVE Planegg Germany
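One more note on the FutureWarning from the question: it usually means the lambda raised a TypeError on some columns, and str.join does exactly that when a value is not a string (NaN, for example). That is a guess here, because the rows shown contain no missing values. If it does apply to the real data, a defensive variant could look like the sketch below; the toy data, the code column, and the name join_unique_safe are all made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data: one missing city and one integer column (both hypothetical).
df = pd.DataFrame({
    "org_id": [1, 1, 2],
    "city": ["Planegg", np.nan, "Crowborough"],
    "code": [7, 7, 9],
})


def join_unique_safe(col: pd.Series, sep: str = "|") -> str:
    """Drop missing values, cast the rest to str, then join the distinct ones."""
    return sep.join(col.dropna().astype(str).unique())


out = df.groupby("org_id", as_index=False).agg(join_unique_safe)
print(out)
```

A group whose values are all missing ends up as an empty string with this approach, which may or may not be what you want.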