Extract each item from a column of lists and then pick the top items

Question:

I have the following DataFrame:

| tag      | list                                                |
| -------- | ----------------------------------------------------|
| icecream | [['A',0.9],['B',0.6],['C',0.5],['D',0.3],['E',0.1]] |
| potato   | [['U',0.8],['V',0.7],['W',0.4],['X',0.3],['Y',0.2]] |

The column list is a list of lists, each inner list holding an item and a value between 0 and 1. The inner lists are arranged in descending order of this value.

I want to extract each item and, for each one, get the top 3 items from the same list, excluding the item itself. The resulting DataFrame should be:

| item | top_3                           |
| ---- | --------------------------------|
| A    | [['B',0.6],['C',0.5],['D',0.3]] |
| B    | [['A',0.9],['C',0.5],['D',0.3]] |
| C    | [['A',0.9],['B',0.6],['D',0.3]] |
| D    | [['A',0.9],['B',0.6],['C',0.5]] |
| E    | [['A',0.9],['B',0.6],['C',0.5]] |
| U    | [['V',0.7],['W',0.4],['X',0.3]] |
| V    | [['U',0.8],['W',0.4],['X',0.3]] |
| W    | [['U',0.8],['V',0.7],['X',0.3]] |
| X    | [['U',0.8],['V',0.7],['W',0.4]] |
| Y    | [['U',0.8],['V',0.7],['W',0.4]] |

I am able to extract the items, but I am stuck at the part where the item itself has to be excluded when building top_3. This is what I have done:

import pandas as pd

data = [['icecream', [['A', 0.9],['B', 0.6],['C',0.5],['D',0.3],['E',0.1]]],
        ['potato', [['U', 0.8],['V', 0.7],['W',0.4],['X',0.3],['Y',0.2]]]]

df = pd.DataFrame(data, columns=['tag', 'list'])
df

--

temp = {}
for idx, row in df.iterrows():
    for item in row["list"]:
        temp[item[0]] = row["tag"]

top_items = {}
for idx, row in df.iterrows():
    top_items[row["tag"]] = row["list"]

similar = []
for item, category in temp.items():
    top_3 = top_items.get(category)
    sample = top_3[:3]
    similar.append([item, sample])

df = pd.DataFrame(similar)
df.columns = ["item", "top_3"]

My result:

| item | top_3                           |
| ---- | --------------------------------|
| A    | [['A',0.9],['B',0.6],['C',0.5]] |
| B    | [['A',0.9],['B',0.6],['C',0.5]] |
| C    | [['A',0.9],['B',0.6],['C',0.5]] |
| D    | [['A',0.9],['B',0.6],['C',0.5]] |
| E    | [['A',0.9],['B',0.6],['C',0.5]] |
| U    | [['U',0.8],['V',0.7],['W',0.4]] |
| V    | [['U',0.8],['V',0.7],['W',0.4]] |
| W    | [['U',0.8],['V',0.7],['W',0.4]] |
| X    | [['U',0.8],['V',0.7],['W',0.4]] |
| Y    | [['U',0.8],['V',0.7],['W',0.4]] |

As you can see, top_3 is wrong for A, B, C, U, V and W: the code always takes the first three entries, so the item itself ends up in its own top_3.

I tried adding filters to exclude the item, but could not get them to work.

If there is a better way to extract the data than what I did, please let me know.

Asked By: trojan horse

||

Answers:

In this part you are missing a condition: you take the first 3 items without checking whether the item itself is among them.

for item, category in temp.items():
    top_3 = top_items.get(category)
    sample = top_3[:3]
    similar.append([item, sample])

The solution is to remove the item from top_3 first and then take the sample:

for item, category in temp.items():
    top_3 = top_items.get(category)
    top_3_without_item = [x for x in top_3 if x[0] != item]
    sample = top_3_without_item[:3]
    similar.append([item, sample])
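
The same filter can be sketched without the two intermediate dictionaries. This is a minimal sketch, assuming each list is already sorted by value in descending order as the question states; the helper name `top_n_excluding_self` is hypothetical.

```python
def top_n_excluding_self(pairs, n=3):
    """For each [item, score] pair, return [item, top-n other pairs].

    Assumes `pairs` is already sorted by score in descending order.
    """
    rows = []
    for item, _score in pairs:
        # Keep every pair except the one belonging to the current item.
        others = [p for p in pairs if p[0] != item]
        rows.append([item, others[:n]])
    return rows


icecream = [['A', 0.9], ['B', 0.6], ['C', 0.5], ['D', 0.3], ['E', 0.1]]
rows = top_n_excluding_self(icecream)
```

Calling the helper once per row of the original frame and concatenating the results gives rows that can be passed to pd.DataFrame(rows, columns=['item', 'top_3']).
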
Answered By: AlvaroSch

As a starting point, you can explode your list column and then merge the exploded frame with itself. Next, remove the rows where the two list columns are equal, and finally group and keep the top 3 values:

df1 = df.explode('list')

out = (df1.merge(df1, on='tag').query('list_x != list_y')
          .sort_values('list_y', key=lambda x: x.str[1], ascending=False)
          .assign(item=lambda x: x.pop('list_x').str[0])
          .groupby(['tag', 'item'])['list_y'].apply(lambda x: x.head(3).tolist())
          .rename('top_3').reset_index())

Output:

>>> out
        tag item                           top_3
0  icecream    A  [[B, 0.6], [C, 0.5], [D, 0.3]]
1  icecream    B  [[A, 0.9], [C, 0.5], [D, 0.3]]
2  icecream    C  [[A, 0.9], [B, 0.6], [D, 0.3]]
3  icecream    D  [[A, 0.9], [B, 0.6], [C, 0.5]]
4  icecream    E  [[A, 0.9], [B, 0.6], [C, 0.5]]
5    potato    U  [[V, 0.7], [W, 0.4], [X, 0.3]]
6    potato    V  [[U, 0.8], [W, 0.4], [X, 0.3]]
7    potato    W  [[U, 0.8], [V, 0.7], [X, 0.3]]
8    potato    X  [[U, 0.8], [V, 0.7], [W, 0.4]]
9    potato    Y  [[U, 0.8], [V, 0.7], [W, 0.4]]
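
To see why the self-merge works, it can help to look at the row counts at each stage. The sketch below is a variant of this approach rather than the exact code above: it assumes the question's sample data, compares item names instead of whole lists, and the names `exploded`, `all_pairs`, `pairs`, `item_other` and `score_other` are illustrative.

```python
import pandas as pd

data = [
    ['icecream', [['A', 0.9], ['B', 0.6], ['C', 0.5], ['D', 0.3], ['E', 0.1]]],
    ['potato', [['U', 0.8], ['V', 0.7], ['W', 0.4], ['X', 0.3], ['Y', 0.2]]],
]
df = pd.DataFrame(data, columns=['tag', 'list'])

# One row per [item, score] pair, with the two parts split into columns.
exploded = df.explode('list', ignore_index=True)
exploded['item'] = exploded['list'].str[0]
exploded['score'] = exploded['list'].str[1].astype(float)

# Self-merge on tag: every item is paired with every item of the same tag.
all_pairs = exploded.merge(exploded, on='tag', suffixes=('', '_other'))

# Drop the rows where an item is paired with itself.
pairs = all_pairs[all_pairs['item'] != all_pairs['item_other']]

# Highest-scoring partners first, then keep three per item.
result = (pairs.sort_values(['tag', 'item', 'score_other'],
                            ascending=[True, True, False])
               .groupby(['tag', 'item'])['list_other']
               .apply(lambda s: s.head(3).tolist())
               .rename('top_3')
               .reset_index())

print(len(exploded), len(all_pairs), len(pairs))  # 10 50 40
```

Each tag has 5 items, so the self-merge yields 5 x 5 = 25 pairs per tag, and removing the 5 self-pairs leaves 20 per tag.
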
Answered By: Corralien

You can repeat each row once per element in its list using pandas.DataFrame.reindex, then group the rows with pandas.DataFrame.groupby and iterate through the groups:

df = df.reindex(df.index.repeat(df.list.apply(len)))

similar = pd.DataFrame(columns = ['item', 'top3'])
for group_name, df_group in df.groupby('tag')['list']:
    for index, rows in enumerate(df_group):
        similar.loc[similar.shape[0]] = ([rows[index][0], (rows[:index] + rows[index + 1:])[:3]])

This gives you the expected output:

  item                            top3
0    A  [[B, 0.6], [C, 0.5], [D, 0.3]]
1    B  [[A, 0.9], [C, 0.5], [D, 0.3]]
2    C  [[A, 0.9], [B, 0.6], [D, 0.3]]
3    D  [[A, 0.9], [B, 0.6], [C, 0.5]]
4    E  [[A, 0.9], [B, 0.6], [C, 0.5]]
5    U  [[V, 0.7], [W, 0.4], [X, 0.3]]
6    V  [[U, 0.8], [W, 0.4], [X, 0.3]]
7    W  [[U, 0.8], [V, 0.7], [X, 0.3]]
8    X  [[U, 0.8], [V, 0.7], [W, 0.4]]
9    Y  [[U, 0.8], [V, 0.7], [W, 0.4]]
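
The key step in the loop is the slice that skips one position. As a minimal plain-Python sketch of that step alone (the helper name `others_by_position` is hypothetical, and the list is assumed to be sorted in descending order of value):

```python
def others_by_position(rows, index, n=3):
    """Return the first n entries of rows, skipping the one at `index`."""
    # rows[:index] is everything before the entry, rows[index + 1:] everything after.
    return (rows[:index] + rows[index + 1:])[:n]


potato = [['U', 0.8], ['V', 0.7], ['W', 0.4], ['X', 0.3], ['Y', 0.2]]
first = others_by_position(potato, 0)  # skips U
last = others_by_position(potato, 4)   # skips Y
```

Because the exclusion is by position rather than by name, only the entry at that index is dropped, even if another entry in the list happened to carry the same item name.
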

Alternatively, starting again from the original df, you can avoid the explicit loop over the groups:

df = df.reindex(df.index.repeat(df.list.apply(len)))
temp = df.groupby('tag')['list'].apply(lambda x : [([rows[index][0], (rows[:index] + rows[index + 1:])[:3]]) for index, rows in enumerate(x)])
df['item'] = temp.explode().str[0].values
df['top3'] = temp.explode().str[1].values

This gives you the same output.


Answered By: Himanshu Poddar