Python: in dataframe, how to concatenate a column across rows (col. being list of lists)?
Question:
I have a Dataframe as shown below as dataframe1:
dataframe1
ID
PATH
ABC
[[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach]]
ABC
[[apple, plum, peach], [apple, pear, peach]]
BCD
[[blueberry, plum, peach], [pear, apple, peach]]
BCD
[[plum, apple, peach], [banana, raspberry, peach]]
I would like to concatenate the value of column ‘PATH’ by column ‘ID’ and have the following results:
dataframe2 (ideal output, the output is the list of lists)
ID
PATH
ABC
[[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach], [apple, plum, peach], [apple, pear, peach]]
BCD
[[blueberry, plum, peach], [pear, apple, peach], [plum, apple, peach], [banana, raspberry, peach]]
I used the following code:
df2 = df1.groupby('id')['PATH'].apply(list)
but got the results with [[[ ]]] as shown which is list of list of lists…and is not what I want.
dataframe3 (wrong output)
ID
PATH
ABC
[[[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach], [apple, plum, peach], [apple, pear, peach]]]
BCD
[[[blueberry, plum, peach], [pear, apple, peach], [plum, apple, peach], [banana, raspberry, peach]]]
How can I get the results like in dataframe2?
Answers:
If im interpreting this properly you might want to look into the ‘merge’ function.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
Try:
df = df.groupby("ID")["PATH"].apply(lambda x: list(itertools.chain.from_iterable(x))).reset_index()
or
df = df.groupby("ID")["PATH"].apply(lambda x: [z for y in x for z in y]).reset_index()
Full working example:
data=[
["ABC", [["orange", "apple", "kiwi", "peach"], ["strawberry", "orange", "kiwi", "peach"]]],
["ABC", [["apple", "plum", "peach"], ["apple", "pear", "peach"]]],
["BCD", [["blueberry", "plum", "peach"], ["pear", "apple", "peach"]]],
["BCD", [["plum", "apple", "peach"], ["banana", "raspberry", "peach"]]],
]
columns = ["ID", "PATH"]
df = pd.DataFrame(data=data, columns=columns)
df = df.groupby("ID")["PATH"].apply(lambda x: list(itertools.chain.from_iterable(x))).reset_index()
#OR
df = df.groupby("ID")["PATH"].apply(lambda x: [z for y in x for z in y]).reset_index()
[Out]:
ID PATH
ABC [[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach], [apple, plum, peach], [apple, pear, peach]]
BCD [[blueberry, plum, peach], [pear, apple, peach], [plum, apple, peach], [banana, raspberry, peach]]
I have a Dataframe as shown below as dataframe1:
dataframe1
ID | PATH |
---|---|
ABC | [[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach]] |
ABC | [[apple, plum, peach], [apple, pear, peach]] |
BCD | [[blueberry, plum, peach], [pear, apple, peach]] |
BCD | [[plum, apple, peach], [banana, raspberry, peach]] |
I would like to concatenate the value of column ‘PATH’ by column ‘ID’ and have the following results:
dataframe2 (ideal output, the output is the list of lists)
ID | PATH |
---|---|
ABC | [[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach], [apple, plum, peach], [apple, pear, peach]] |
BCD | [[blueberry, plum, peach], [pear, apple, peach], [plum, apple, peach], [banana, raspberry, peach]] |
I used the following code:
df2 = df1.groupby('id')['PATH'].apply(list)
but got the results with [[[ ]]] as shown which is list of list of lists…and is not what I want.
dataframe3 (wrong output)
ID | PATH |
---|---|
ABC | [[[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach], [apple, plum, peach], [apple, pear, peach]]] |
BCD | [[[blueberry, plum, peach], [pear, apple, peach], [plum, apple, peach], [banana, raspberry, peach]]] |
How can I get the results like in dataframe2?
If im interpreting this properly you might want to look into the ‘merge’ function.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
Try:
df = df.groupby("ID")["PATH"].apply(lambda x: list(itertools.chain.from_iterable(x))).reset_index()
or
df = df.groupby("ID")["PATH"].apply(lambda x: [z for y in x for z in y]).reset_index()
Full working example:
data=[
["ABC", [["orange", "apple", "kiwi", "peach"], ["strawberry", "orange", "kiwi", "peach"]]],
["ABC", [["apple", "plum", "peach"], ["apple", "pear", "peach"]]],
["BCD", [["blueberry", "plum", "peach"], ["pear", "apple", "peach"]]],
["BCD", [["plum", "apple", "peach"], ["banana", "raspberry", "peach"]]],
]
columns = ["ID", "PATH"]
df = pd.DataFrame(data=data, columns=columns)
df = df.groupby("ID")["PATH"].apply(lambda x: list(itertools.chain.from_iterable(x))).reset_index()
#OR
df = df.groupby("ID")["PATH"].apply(lambda x: [z for y in x for z in y]).reset_index()
[Out]:
ID PATH
ABC [[orange, apple, kiwi, peach], [strawberry, orange, kiwi, peach], [apple, plum, peach], [apple, pear, peach]]
BCD [[blueberry, plum, peach], [pear, apple, peach], [plum, apple, peach], [banana, raspberry, peach]]