Reformat Excel data-frame
Question:
I managed to get my Python script working to scrape data from a website using Playwright.
The scraped data is in a format we cannot use at the moment. Here is an example of the initial extract:
Name | Group 1 | Group 2 | Group 3 | Group 4 | Group 5 |
---|---|---|---|---|---|
Joe Black | A | U | |||
Joe Blue | A | A | |||
Joe Green | U | A | |||
Joe Red | A | U |
An A in the table above means the user is an admin of that group. I need to turn this into a table with the groups in the first column and, in the second column, the names of each group's admins. So basically I need to get it to this:
Groups | Admins |
---|---|
Group 1 | Joe Blue,Joe Red |
Group 2 | Joe Red |
Group 3 | Joe Blue |
Group 4 | Joe Blue |
Group 5 | Joe Green |
I am trying to use Pandas but am completely lost on how to get the format right. Could anyone offer some advice, or point me to a similar problem I can work from?
Answers:
You can reshape with melt, keep only the "A" cells (a plain dropna would also keep the "U" rows), then aggregate with groupby.agg:

    out = (df.melt('Name', var_name='Group')
             .loc[lambda d: d['value'].eq('A')]
             .groupby('Group')['Name'].agg(', '.join).reset_index(name='Admins')
          )
Variant with stack, again keeping only the "A" cells:

    (df.set_index('Name').rename_axis(index='Admins', columns='Group')
       .stack().loc[lambda s: s.eq('A')].reset_index()
       .groupby('Group', as_index=False)['Admins'].agg(', '.join)
    )
Output:

         Group             Admins
    0  Group 1          Joe Black
    1  Group 2  Joe Blue, Joe Red
    2  Group 3           Joe Blue
    3  Group 5          Joe Green
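To make the melt approach concrete, here is a minimal end-to-end sketch. The sample frame is an assumption (cell positions are chosen so that the result matches the output above), and the filter on "A" is written out explicitly so that "U" cells are not counted. The last step is not part of the answer above; it is one possible way to keep groups that have no admins.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; NaN stands for an empty cell in the scrape.
df = pd.DataFrame({
    "Name": ["Joe Black", "Joe Blue", "Joe Green", "Joe Red"],
    "Group 1": ["A", np.nan, np.nan, np.nan],
    "Group 2": ["U", "A", np.nan, "A"],
    "Group 3": [np.nan, "A", "U", "U"],
    "Group 4": [np.nan, np.nan, np.nan, np.nan],
    "Group 5": [np.nan, np.nan, "A", np.nan],
})

# Wide -> long: one row per (Name, Group) pair, cell content in "value".
long = df.melt("Name", var_name="Group")

# Keep only admin cells; NaN and "U" both fail the comparison.
admins = long[long["value"].eq("A")]

# One row per group, admin names joined into a single string.
out = (
    admins.groupby("Group")["Name"]
    .agg(", ".join)
    .reset_index(name="Admins")
)
print(out)

# Optional: re-add groups without any admin as empty strings.
all_groups = df.columns.drop("Name")
full = (
    out.set_index("Group")
    .reindex(all_groups, fill_value="")
    .rename_axis("Group")
    .reset_index()
)
print(full)
```

Group 4 is missing from `out` because none of its cells survive the filter; `full` lists it with an empty Admins string.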
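If Name is the index and you unstack the frame, you get a Series with a MultiIndex of (group, name). You can then group by the first level and join the names whose value is "A":

    import numpy as np

    def getAdmins(x):
        # x holds the cells of one group, indexed by (group, name)
        sel = x[x == "A"]
        return ",".join(sel.index.get_level_values(1)) if not sel.empty else np.nan

    # set_index('Name') puts the names into level 1 of the unstacked index
    df_new = df.set_index('Name').unstack().groupby(level=0).agg(getAdmins)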
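To see what the unstacked Series looks like, here is a small sketch. The sample frame is an assumption, and it uses Name as the index so that the second index level holds the names. The helper `get_admins` is a hypothetical variant of the function above that looks the level up by name.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the names as the index.
df = pd.DataFrame(
    {"Group 1": ["A", np.nan, "A"], "Group 2": ["U", "A", np.nan]},
    index=pd.Index(["Joe Black", "Joe Blue", "Joe Red"], name="Name"),
)

# unstack() moves the columns into the index:
# level 0 = group (former column label), level 1 = Name.
s = df.unstack()
print(s)

def get_admins(x):
    # x is the slice of the Series belonging to one group.
    sel = x[x == "A"]
    return ",".join(sel.index.get_level_values("Name")) if len(sel) else np.nan

result = s.groupby(level=0).agg(get_admins)
print(result)
```

Each group's slice keeps the full two-level index, which is why the names can be read back from the "Name" level inside the aggregation function.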
Should you need to be robust against empty strings/NAs:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'Name': ['Joe Red', 'Joe Blue', 'Joe Green'],
        'Group 1': ['A', pd.NA, ''],
        'Group 2': ['', 'A', 'A'],
        'Group 3': ['', np.nan, 'A'],
    })
    df_t = df.set_index('Name').T.replace({
        'A': True,
        'U': False,
        '': False,
        pd.NA: False,
        np.nan: False,
    })
    df_t.apply(
        lambda x: df_t.columns[x].str.cat(sep=','), axis=1
    ).reset_index(name='Admins').rename(columns={'index': 'Groups'})
Output:

        Groups              Admins
    0  Group 1             Joe Red
    1  Group 2  Joe Blue,Joe Green
    2  Group 3           Joe Green
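If the scraped cells can be messier still, another option is to test each cell with a small predicate instead of listing every value to replace. This is only a sketch under assumptions: `is_admin` is a hypothetical helper, and treating padded or lowercase "a" as admin is a rule I made up for illustration, so adjust it to whatever the site actually emits.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with padded/lowercase markers and mixed missing values.
df = pd.DataFrame({
    "Name": ["Joe Red", "Joe Blue", "Joe Green"],
    "Group 1": ["A", pd.NA, ""],
    "Group 2": [" a ", "A", "U"],
    "Group 3": ["", np.nan, "A"],
})

def is_admin(cell):
    # Assumed rule: only strings that reduce to "A" count as admin;
    # NaN, pd.NA, "", "U" and anything else do not.
    return isinstance(cell, str) and cell.strip().upper() == "A"

wide = df.set_index("Name")

# Boolean frame: rows are names, columns are groups.
mask = wide.apply(lambda col: col.map(is_admin)).astype(bool)

# For every group, join the names whose mask entry is True.
out = pd.DataFrame({
    "Groups": mask.columns,
    "Admins": [",".join(mask.index[mask[g].to_numpy()]) for g in mask.columns],
})
print(out)
```

A group with no admins ends up with an empty string in Admins rather than being dropped.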