Extract each item from a column of lists and then pick the top items
Question:
I have the following DataFrame:
| tag | list |
| -------- | ----------------------------------------------------|
| icecream | [['A',0.9],['B',0.6],['C',0.5],['D',0.3],['E',0.1]] |
| potato | [['U',0.8],['V',0.7],['W',0.4],['X',0.3],['Y',0.2]] |
The column list holds a list of lists, where each inner list contains an item and a value between 0 and 1. Each list is sorted in descending order of this value.
I want to extract each item and get the top 3 items from its list, excluding the item itself. The resulting DataFrame should be:
| item | top_3 |
| ---- | --------------------------------|
| A | [['B',0.6],['C',0.5],['D',0.3]] |
| B | [['A',0.9],['C',0.5],['D',0.3]] |
| C | [['A',0.9],['B',0.6],['D',0.3]] |
| D | [['A',0.9],['B',0.6],['C',0.5]] |
| E | [['A',0.9],['B',0.6],['C',0.5]] |
| U | [['V',0.7],['W',0.4],['X',0.3]] |
| V | [['U',0.8],['W',0.4],['X',0.3]] |
| W | [['U',0.8],['V',0.7],['X',0.3]] |
| X | [['U',0.8],['V',0.7],['W',0.4]] |
| Y | [['U',0.8],['V',0.7],['W',0.4]] |
I am able to extract the values, but I am stuck at the part where I want to ignore the item itself while building top_3. This is what I have done:
import pandas as pd

data = [['icecream', [['A', 0.9],['B', 0.6],['C',0.5],['D',0.3],['E',0.1]]],
        ['potato', [['U', 0.8],['V', 0.7],['W',0.4],['X',0.3],['Y',0.2]]]]
df = pd.DataFrame(data, columns=['tag', 'list'])
df

# Map each item to the tag it belongs to
temp = {}
for idx, row in df.iterrows():
    for item in row["list"]:
        temp[item[0]] = row["tag"]

# Map each tag to its full list
top_items = {}
for idx, row in df.iterrows():
    top_items[row["tag"]] = row["list"]

similar = []
for item, category in temp.items():
    top_3 = top_items.get(category)
    sample = top_3[:3]  # always the first three entries, item itself included
    similar.append([item, sample])

df = pd.DataFrame(similar)
df.columns = ["item", "top_3"]
My result:
| item | top_3 |
| ---- | --------------------------------|
| A | [['A',0.9],['B',0.6],['C',0.5]] |
| B | [['A',0.9],['B',0.6],['C',0.5]] |
| C | [['A',0.9],['B',0.6],['C',0.5]] |
| D | [['A',0.9],['B',0.6],['C',0.5]] |
| E | [['A',0.9],['B',0.6],['C',0.5]] |
| U | [['U',0.8],['V',0.7],['W',0.4]] |
| V | [['U',0.8],['V',0.7],['W',0.4]] |
| W | [['U',0.8],['V',0.7],['W',0.4]] |
| X | [['U',0.8],['V',0.7],['W',0.4]] |
| Y | [['U',0.8],['V',0.7],['W',0.4]] |
As you can see, top_3 is wrong for A, B, C, U, V and W, because it always takes the first 3 entries and does not exclude the item itself.
I tried adding filters but could not get them to work.
If there is a better way to extract the data than what I did, please let me know how to optimize it.
Answers:
This part is missing a condition: you just take the first 3 entries, ignoring that the item itself must be skipped when it is among them.
for item, category in temp.items():
    top_3 = top_items.get(category)
    sample = top_3[:3]
    similar.append([item, sample])
The solution is to remove the item from top_3 first, and then take the "sample":
for item, category in temp.items():
    top_3 = top_items.get(category)
    # Drop the entry whose name matches the current item
    top_3_without_item = [x for x in top_3 if x[0] != item]
    sample = top_3_without_item[:3]
    similar.append([item, sample])
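For reference, here is a minimal self-contained sketch of the same filter-then-slice idea, written without the intermediate dictionaries. The helper name `top_n_excluding` is made up for illustration and is not part of the question's code; like the loop above, it assumes each list is already sorted in descending order and that item names are unique within a list.

```python
import pandas as pd

data = [
    ['icecream', [['A', 0.9], ['B', 0.6], ['C', 0.5], ['D', 0.3], ['E', 0.1]]],
    ['potato', [['U', 0.8], ['V', 0.7], ['W', 0.4], ['X', 0.3], ['Y', 0.2]]],
]
df = pd.DataFrame(data, columns=['tag', 'list'])


def top_n_excluding(pairs, item, n=3):
    """Return the first n [name, score] pairs whose name is not `item`.

    Hypothetical helper; relies on `pairs` being sorted by score already.
    """
    return [pair for pair in pairs if pair[0] != item][:n]


# One output row per item, taken list by list in the original order.
rows = [
    [item, top_n_excluding(pairs, item)]
    for pairs in df['list']
    for item, _score in pairs
]
result = pd.DataFrame(rows, columns=['item', 'top_3'])
print(result)
```

Because the exclusion happens before the slice, items inside the top 3 pick up the fourth entry, while items outside it keep the plain top 3.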
As a starting point, you can explode your list column and then merge the result with itself on tag. Next, remove the rows where the two list columns are equal, and finally group and keep the top 3 values:
df1 = df.explode('list')
out = (df1.merge(df1, on='tag').query('list_x != list_y')
          .sort_values('list_y', key=lambda x: x.str[1], ascending=False)
          .assign(item=lambda x: x.pop('list_x').str[0])
          .groupby(['tag', 'item'])['list_y'].apply(lambda x: x.head(3).tolist())
          .rename('top_3').reset_index())
Output:
>>> out
tag item top_3
0 icecream A [[B, 0.6], [C, 0.5], [D, 0.3]]
1 icecream B [[A, 0.9], [C, 0.5], [D, 0.3]]
2 icecream C [[A, 0.9], [B, 0.6], [D, 0.3]]
3 icecream D [[A, 0.9], [B, 0.6], [C, 0.5]]
4 icecream E [[A, 0.9], [B, 0.6], [C, 0.5]]
5 potato U [[V, 0.7], [W, 0.4], [X, 0.3]]
6 potato V [[U, 0.8], [W, 0.4], [X, 0.3]]
7 potato W [[U, 0.8], [V, 0.7], [X, 0.3]]
8 potato X [[U, 0.8], [V, 0.7], [W, 0.4]]
9 potato Y [[U, 0.8], [V, 0.7], [W, 0.4]]
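To make the self-merge step easier to follow, here is a small sketch on a reduced three-item frame (the shortened data and the `item`, `other` and `candidates` names are illustrative assumptions). It compares the extracted item names rather than the list cells, which is a slight variation on the `query` call above.

```python
import pandas as pd

df = pd.DataFrame(
    [['icecream', [['A', 0.9], ['B', 0.6], ['C', 0.5]]]],
    columns=['tag', 'list'],
)

# One row per [item, score] pair.
exploded = df.explode('list')

# Merging on 'tag' pairs every entry with every entry of the same tag,
# producing the columns list_x and list_y.
pairs = exploded.merge(exploded, on='tag')

# Pull out the item names so self-pairs can be dropped by string comparison.
pairs['item'] = pairs['list_x'].str[0]
pairs['other'] = pairs['list_y'].str[0]
candidates = pairs[pairs['item'] != pairs['other']]

print(len(exploded), len(pairs), len(candidates))  # prints: 3 9 6
```

The merge grows quadratically with the length of each list, so this approach is probably best suited to short lists like the ones in the question.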
You can repeat each row once per element in its list using pandas.DataFrame.reindex, then group the rows using pandas.DataFrame.groupby and iterate through the groups:
df = df.reindex(df.index.repeat(df.list.apply(len)))
similar = pd.DataFrame(columns = ['item', 'top3'])
for group_name, df_group in df.groupby('tag')['list']:
    # Each group holds the same list repeated; index selects the current item
    for index, rows in enumerate(df_group):
        similar.loc[similar.shape[0]] = ([rows[index][0], (rows[:index] + rows[index + 1:])[:3]])
This gives you the expected output:
item top3
0 A [[B, 0.6], [C, 0.5], [D, 0.3]]
1 B [[A, 0.9], [C, 0.5], [D, 0.3]]
2 C [[A, 0.9], [B, 0.6], [D, 0.3]]
3 D [[A, 0.9], [B, 0.6], [C, 0.5]]
4 E [[A, 0.9], [B, 0.6], [C, 0.5]]
5 U [[V, 0.7], [W, 0.4], [X, 0.3]]
6 V [[U, 0.8], [W, 0.4], [X, 0.3]]
7 W [[U, 0.8], [V, 0.7], [X, 0.3]]
8 X [[U, 0.8], [V, 0.7], [W, 0.4]]
9 Y [[U, 0.8], [V, 0.7], [W, 0.4]]
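The row-repetition step may be easier to see in isolation. The sketch below uses shortened lists of different lengths (an assumption for illustration) and `df['list']` instead of attribute access, which behaves the same way here.

```python
import pandas as pd

df = pd.DataFrame(
    [['icecream', [['A', 0.9], ['B', 0.6]]],
     ['potato', [['U', 0.8], ['V', 0.7], ['W', 0.4]]]],
    columns=['tag', 'list'],
)

# Number of entries in each row's list: 2 and 3.
lengths = df['list'].apply(len)

# Repeat each index label as many times as its list is long: 0, 0, 1, 1, 1.
repeated_index = df.index.repeat(lengths)

# Reindexing on the repeated labels duplicates the corresponding rows.
repeated = df.reindex(repeated_index)

print(repeated['tag'].tolist())
# ['icecream', 'icecream', 'potato', 'potato', 'potato']
```

Each repeated row still carries the full list, which is why the loop above can use the position within the group to decide which entry to leave out.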
Alternatively, you can do the same without explicitly looping over the groups:
df = df.reindex(df.index.repeat(df.list.apply(len)))
temp = df.groupby('tag')['list'].apply(lambda x : [([rows[index][0], (rows[:index] + rows[index + 1:])[:3]]) for index, rows in enumerate(x)])
# Each exploded element is an [item, top3] pair
df['item'] = temp.explode().str[0].values
df['top3'] = temp.explode().str[1].values
This gives the same item and top3 values, added as new columns on df.