Extract each item from a column of lists and then pick the top items

Question

I have the following DataFrame:

| tag      | list                                                |
| -------- | ----------------------------------------------------|
| icecream | [['A',0.9],['B',0.6],['C',0.5],['D',0.3],['E',0.1]] |
| potato   | [['U',0.8],['V',0.7],['W',0.4],['X',0.3],['Y',0.2]] |

The column list holds a list of lists; each inner list has an item and a value between 0 and 1. The inner lists are sorted in descending order of this value.

I want to extract each item and get the top 3 items from the same list, excluding the item itself. The resulting DataFrame should be:

| item | top_3                           |
| ---- | --------------------------------|
| A    | [['B',0.6],['C',0.5],['D',0.3]] |
| B    | [['A',0.9],['C',0.5],['D',0.3]] |
| C    | [['A',0.9],['B',0.6],['D',0.3]] |
| D    | [['A',0.9],['B',0.6],['C',0.5]] |
| E    | [['A',0.9],['B',0.6],['C',0.5]] |
| U    | [['V',0.7],['W',0.4],['X',0.3]] |
| V    | [['U',0.8],['W',0.4],['X',0.3]] |
| W    | [['U',0.8],['V',0.7],['X',0.3]] |
| X    | [['U',0.8],['V',0.7],['W',0.4]] |
| Y    | [['U',0.8],['V',0.7],['W',0.4]] |

I am able to extract the values, but I am stuck at the part where I need to ignore the item itself while building top_3. This is what I have done:

import pandas as pd

data = [['icecream', [['A', 0.9],['B', 0.6],['C',0.5],['D',0.3],['E',0.1]]],
        ['potato', [['U', 0.8],['V', 0.7],['W',0.4],['X',0.3],['Y',0.2]]]]

df = pd.DataFrame(data, columns=['tag', 'list'])
df

--

temp = {}
for idx, row in df.iterrows():
    for item in row["list"]:
        temp[item[0]] = row["tag"]

top_items = {}
for idx, row in df.iterrows():
    top_items[row["tag"]] = row["list"]

similar = []
for item, category in temp.items():
    top_3 = top_items.get(category)
    sample = top_3[:3]
    similar.append([item, sample])

df = pd.DataFrame(similar)
df.columns = ["item", "top_3"]

My result:

| item | top_3                           |
| ---- | --------------------------------|
| A    | [['A',0.9],['B',0.6],['C',0.5]] |
| B    | [['A',0.9],['B',0.6],['C',0.5]] |
| C    | [['A',0.9],['B',0.6],['C',0.5]] |
| D    | [['A',0.9],['B',0.6],['C',0.5]] |
| E    | [['A',0.9],['B',0.6],['C',0.5]] |
| U    | [['U',0.8],['V',0.7],['W',0.4]] |
| V    | [['U',0.8],['V',0.7],['W',0.4]] |
| W    | [['U',0.8],['V',0.7],['W',0.4]] |
| X    | [['U',0.8],['V',0.7],['W',0.4]] |
| Y    | [['U',0.8],['V',0.7],['W',0.4]] |

As you can see, top_3 is wrong for A, B, C, U, V and W: the code always takes the first 3 entries and does not exclude the item itself.

I tried adding filters but could not get them to work.

If there are better ways to extract the data than what I did, please let me know how to optimize it.

Asked By: trojan horse

Answer 1

This part is missing a condition: it takes the first 3 entries without checking whether the item itself is among them:

for item, category in temp.items():
    top_3 = top_items.get(category)
    sample = top_3[:3]
    similar.append([item, sample])

The solution is to remove the item from the list first and then take the "sample":

for item, category in temp.items():
    top_3 = top_items.get(category)
    top_3_without_item = [x for x in top_3 if x[0] != item]
    sample = top_3_without_item[:3]
    similar.append([item, sample])
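
For reference, the same filter-then-slice idea can be sketched as a single pass over the list column, without the intermediate temp and top_items dictionaries. This is only a sketch under assumptions: the helper name top_n_excluding_self and its n parameter are made up for illustration, and the data is the sample from the question. Because it never keys a dictionary by item, it also does not depend on item names being unique across tags.

```python
import pandas as pd

data = [
    ['icecream', [['A', 0.9], ['B', 0.6], ['C', 0.5], ['D', 0.3], ['E', 0.1]]],
    ['potato', [['U', 0.8], ['V', 0.7], ['W', 0.4], ['X', 0.3], ['Y', 0.2]]],
]
df = pd.DataFrame(data, columns=['tag', 'list'])


def top_n_excluding_self(pairs, n=3):
    """Hypothetical helper: for each [item, score] pair, return
    [item, first n pairs of the same list whose item differs]."""
    rows = []
    for item, _score in pairs:
        # Drop the item itself first, then slice; the list is already sorted.
        others = [p for p in pairs if p[0] != item]
        rows.append([item, others[:n]])
    return rows


# Flatten the per-row results into one list of [item, top_3] records.
similar = [row for pairs in df['list'] for row in top_n_excluding_self(pairs)]
result = pd.DataFrame(similar, columns=['item', 'top_3'])
print(result)
```

The slice relies on each list already being in descending order, as stated in the question; unsorted input would need a sort before slicing.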

Answered By: AlvaroSch

Answer 2

As a starting point, you can explode your list column and then merge the result with itself. Next, remove the rows where the two list columns are equal, and finally group and keep the top 3 values:

out = df.explode('list')

out = (out.merge(out, on='tag').query('list_x != list_y')
          .sort_values('list_y', key=lambda x: x.str[1], ascending=False)
          .assign(item=lambda x: x.pop('list_x').str[0])
          .groupby(['tag', 'item'])['list_y'].apply(lambda x: x.head(3).tolist())
          .rename('top_3').reset_index())

Output:

>>> out
        tag item                           top_3
0  icecream    A  [[B, 0.6], [C, 0.5], [D, 0.3]]
1  icecream    B  [[A, 0.9], [C, 0.5], [D, 0.3]]
2  icecream    C  [[A, 0.9], [B, 0.6], [D, 0.3]]
3  icecream    D  [[A, 0.9], [B, 0.6], [C, 0.5]]
4  icecream    E  [[A, 0.9], [B, 0.6], [C, 0.5]]
5    potato    U  [[V, 0.7], [W, 0.4], [X, 0.3]]
6    potato    V  [[U, 0.8], [W, 0.4], [X, 0.3]]
7    potato    W  [[U, 0.8], [V, 0.7], [X, 0.3]]
8    potato    X  [[U, 0.8], [V, 0.7], [W, 0.4]]
9    potato    Y  [[U, 0.8], [V, 0.7], [W, 0.4]]
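
To make the intermediate steps of the explode-and-self-merge approach easier to follow, here is a small sketch of just the first two stages on the sample data. The variable names exploded and pairs are illustrative, and the self-pairs are dropped by comparing item names with a boolean mask, which is a swapped-in alternative to comparing whole lists inside query.

```python
import pandas as pd

data = [
    ['icecream', [['A', 0.9], ['B', 0.6], ['C', 0.5], ['D', 0.3], ['E', 0.1]]],
    ['potato', [['U', 0.8], ['V', 0.7], ['W', 0.4], ['X', 0.3], ['Y', 0.2]]],
]
df = pd.DataFrame(data, columns=['tag', 'list'])

# One row per [item, score] pair: 2 tags x 5 pairs = 10 rows.
exploded = df.explode('list')

# Self-merge on tag: every pair is combined with every pair of the same tag,
# giving 5 x 5 = 25 rows per tag (columns list_x and list_y).
pairs = exploded.merge(exploded, on='tag')

# Drop the rows where an item is paired with itself, comparing item names.
pairs = pairs[pairs['list_x'].str[0] != pairs['list_y'].str[0]]

print(len(exploded), len(pairs))  # 10 40
```

Note that the self-merge grows quadratically with the length of each list, so for long lists a plain loop over each list may be cheaper.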

Answered By: Corralien

Answer 3

You can repeat each row once per element of its list using pandas.DataFrame.reindex, group the rows using pandas.DataFrame.groupby, and then iterate through the groups:

df = df.reindex(df.index.repeat(df.list.apply(len)))

similar = pd.DataFrame(columns = ['item', 'top3'])
for group_name, df_group in df.groupby('tag')['list']:
    for index, rows in enumerate(df_group):
        similar.loc[similar.shape[0]] = ([rows[index][0], (rows[:index] + rows[index + 1:])[:3]])

This gives you the expected output:

  item                            top3
0    A  [[B, 0.6], [C, 0.5], [D, 0.3]]
1    B  [[A, 0.9], [C, 0.5], [D, 0.3]]
2    C  [[A, 0.9], [B, 0.6], [D, 0.3]]
3    D  [[A, 0.9], [B, 0.6], [C, 0.5]]
4    E  [[A, 0.9], [B, 0.6], [C, 0.5]]
5    U  [[V, 0.7], [W, 0.4], [X, 0.3]]
6    V  [[U, 0.8], [W, 0.4], [X, 0.3]]
7    W  [[U, 0.8], [V, 0.7], [X, 0.3]]
8    X  [[U, 0.8], [V, 0.7], [W, 0.4]]
9    Y  [[U, 0.8], [V, 0.7], [W, 0.4]]

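Alternatively, you can do it without explicitly looping over the groups. Start again from the original df, because calling reindex on the already repeated frame fails on its duplicate index labels: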

df = df.reindex(df.index.repeat(df.list.apply(len)))
temp = df.groupby('tag')['list'].apply(lambda x : [([rows[index][0], (rows[:index] + rows[index + 1:])[:3]]) for index, rows in enumerate(x)])
df['item'] = temp.explode().str[0].values
df['top3'] = temp.explode().str[1].values

This gives you the same output.

Answered By: Himanshu Poddar
