Create a list of single-entry dictionaries where each group by a given column contributes a value from a 2nd column for all but 1st row which is key
Question:
I have a pandas dataframe that looks like this:
header1 | header2 |
---|---|
First | row1 |
Second | row2 |
Third | row1 |
Fourth | row2 |
Fifth | row1 |
I want to create a list of dictionaries where, for all rows with matching value in the header2 column (except the first such row), a dictionary is added to the list using the first row’s header1 column value as the lone dict key, and every other row’s header1 column value as the lone dict value.
Expected output:
[{"First":"Third"},{"Second":"Fourth"}, {"First":"Fifth"}]
or even
{"First":"Third","Second":"Fourth"} (This output doesn’t handle multiple matches in header2)
Ideally the solution isn’t going to be computationally intensive as I am able to accomplish this with nested for loops already.
Edit based on something brought up in comments: if several rows share the same header2 value, assume the header1 value of the first such row is the key and the header1 value of each later row is the value. For example: [{"First":"Third"},{"Second":"Fourth"}, {"First":"Fifth"}]. In other words, the header1 value in the first matching row will be the repeating key, with one single-entry dict added to the result list for each subsequent matching row.
Thank you
Answers:
Here’s a way to do what your question asks:
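out = []
# For each header2 group, pair the first header1 value (key) with every later one (value).
df.groupby('header2')['header1'].apply(lambda x: out.extend([{x.iloc[0]:x.iloc[i]} for i in range(1, len(x))]) if len(x) > 1 else None)
# Map each header1 value to its original row position (assumes header1 values are unique).
idxByHeader1 = df.reset_index(drop=False).set_index('header1')['index']
# Restore the original row order, using the position of each dict's value.
out = sorted(out, key=lambda x: idxByHeader1[list(x.values())[0]])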
Output:
[{'First': 'Third'}, {'Second': 'Fourth'}, {'First': 'Fifth'}]
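As an alternative to collecting results through a side effect inside apply, here is a minimal sketch that builds the same list from two vectorized steps. It is not the approach used above: it swaps in `transform("first")` to broadcast each group's first header1 value to every row, and `duplicated()` to select every row except the first of its group. The frame is rebuilt from the question's table, and the sketch assumes neither column contains missing values.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "header1": ["First", "Second", "Third", "Fourth", "Fifth"],
    "header2": ["row1", "row2", "row1", "row2", "row1"],
})

# For every row, the header1 value of the first row in its header2 group.
# Assumption: header1 has no missing values ("first" skips nulls).
first = df.groupby("header2")["header1"].transform("first")

# True for every row whose header2 value has already appeared earlier.
mask = df["header2"].duplicated()

# One single-entry dict per later row, in original row order.
out = [{key: value} for key, value in zip(first[mask], df.loc[mask, "header1"])]
print(out)
# [{'First': 'Third'}, {'Second': 'Fourth'}, {'First': 'Fifth'}]
```

Because the rows are never regrouped, the result keeps the dataframe's row order without a separate sorting step.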
UPDATE:
Here is a slightly more robust answer. Assuming values in the header1 column can be duplicated across different header2 values, this updated answer will ensure that the dictionaries in the result list preserve the order found in the original dataframe.
out = []
df.assign(dup=df.apply(tuple, axis=1)).groupby('header2')['dup'].apply(
    lambda x: out.extend([{x.iloc[0][0]:x.iloc[i]}
                          for i in range(1, len(x))]) if len(x) > 1 else None)
idx = df.reset_index(drop=False).set_index(['header1','header2'])['index']
out = sorted(out, key=lambda x: idx[list(x.values())[0]])
out = [{key:val[0]} for item in out for key, val in item.items()]
print(out)
Sample Input: (note the duplication of Fifth, for key Second and again for key First):
  header1 header2
0   First    row1
1  Second    row2
2   Third    row1
3   Fifth    row2
4   Fifth    row1
Output: (note that for the two dicts with Fifth as value, the dict with Second as key appears before the dict with First as key, which is identical to the sequencing in the original dataframe, since the first Fifth encountered had header2 value matching Second):
[{'First': 'Third'}, {'Second': 'Fifth'}, {'First': 'Fifth'}]
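For comparison, the same ordering can be had from a single pass over the rows, with no grouping or sorting. This is only a sketch, not part of the answer above: `pair_with_first` is a hypothetical helper name, and it takes the two columns as plain sequences (for a dataframe, `df['header1']` and `df['header2']` could be passed directly).

```python
def pair_with_first(values, groups):
    """Pair each group's first value (key) with every later value in that group."""
    first_seen = {}  # group label -> value from the first row of that group
    out = []
    for value, group in zip(values, groups):
        if group in first_seen:
            # Later row of a known group: emit one single-entry dict.
            out.append({first_seen[group]: value})
        else:
            # First row of this group: remember its value as the key.
            first_seen[group] = value
    return out


# The sample input with the duplicated "Fifth" value.
header1 = ["First", "Second", "Third", "Fifth", "Fifth"]
header2 = ["row1", "row2", "row1", "row2", "row1"]

result = pair_with_first(header1, header2)
print(result)
# [{'First': 'Third'}, {'Second': 'Fifth'}, {'First': 'Fifth'}]
```

Duplicate header1 values need no special handling here, because the result is appended in row order and nothing is ever looked up by header1 value.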