Finding multiple supersets and subsets for values in a column with Python

Question:

I am trying to find supersets and subsets for the values in a column (here, the letter column) of an Excel file. The data looks like this:

id letter
1 A, B, D, E, F
2 B, C
3 B
4 D, B
5 B, D, A
6 X, Y, Z
7 X, Y
8 E, D
7 G
8 G

For example:

  • ‘B’, ‘D,B’, ‘E,D’, ‘B,D,A’ are subsets of ‘A,B,D,E,F’,
  • ‘B’ is a subset of ‘B,C’,
  • ‘X,Y’ is a subset of ‘X,Y,Z’,
  • ‘G’ is a subset of ‘G’.

and

  • ‘A,B,D,E,F’, ‘B,C’, ‘X,Y,Z’ and ‘G’ are supersets.

I would like to show and store that relation in two separate Excel files: the first one contains each superset followed by its subsets, and the second one contains only the supersets. First file:

id letter
1 A, B, D, E, F
5 B,D,A
8 E,D
4 D,B
3 B
2 B,C
3 B
6 X, Y, Z
7 X, Y
7 G
8 G

Second file:

id letter
1 A, B, D, E, F
2 B,C
6 X, Y, Z
7 G
Asked By: abcabc


Answers:

One possible solution is to use itertools.combinations and check, for every pair, whether all elements of one item are contained in the other.

To find the supersets, we take the letter column and convert it to a list of tuples. Then we create all possible pairs (combinations of two elements) from that list.
The line a, b = ... picks out the shorter element of each pair; a is always the shorter one (if both have the same length, a is the second element of the pair).
If every letter of a is in b and a is still in the list out, we remove it from the list because it is a subset of another element. At the end, out contains only the supersets of your data.
Then we only have to turn the tuples back into joined strings and filter df with that list to get your second file (here called df2).

You need to be careful about how you split the strings at the beginning and how you join them again at the end. If there are leading or trailing whitespaces in your data, you need to strip them; otherwise the filter at the end will not match the rows.
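
To illustrate the whitespace caveat, here is a minimal sketch. The helper name parse_letters and the sample strings are made up for this illustration and are not part of the original data. It shows that stripping makes differently formatted cells parse to the same tuple, but that joining with ', ' only reproduces cells that used exactly that separator:

```python
def parse_letters(cell):
    """Split a comma-separated cell and strip whitespace around each item."""
    return tuple(item.strip() for item in cell.split(','))

# Hypothetical cells with inconsistent spacing
raw = ['B, D, A', 'B,D,A', ' B, D, A ']
parsed = [parse_letters(cell) for cell in raw]
print(parsed)
# [('B', 'D', 'A'), ('B', 'D', 'A'), ('B', 'D', 'A')]

# Joining with ', ' gives 'B, D, A' every time, so only the first
# cell would still be found by an isin() filter on the raw column.
rejoined = [', '.join(t) for t in parsed]
matches = [r == cell for r, cell in zip(rejoined, raw)]
print(matches)
# [True, False, False]
```

If the column is not formatted consistently, one option is to normalize it first (for example by writing the re-joined strings back into the column) before filtering.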

EDIT
If you want to get rid of the duplicates at the end, just add .drop_duplicates(subset='letter') after the filter that creates df2. subset has to be specified here because the two rows with G have different values for id, so they would not otherwise be considered duplicates.

import itertools

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'letter': ['A, B, D, E, F','B, C','B','D, B','B, D, A','X, Y, Z','X, Y','E, D','G','G']})

lst = df['letter'].values.tolist()
lst = list(tuple(item.strip() for item in x.split(',')) for x in lst)
print(lst)
# [('A', 'B', 'D', 'E', 'F'), ('B', 'C'), ('B',), ('D', 'B'), ('B', 'D', 'A'), ('X', 'Y', 'Z'), ('X', 'Y'), ('E', 'D'), ('G',), ('G',)]

out = lst[:] #copy of lst

for tup1,tup2 in itertools.combinations(lst, 2):
    a, b = (tup1, tup2) if len(tup1) < len(tup2) else (tup2, tup1)
    # e.g for a,b : (('D','B'), ('B', 'D', 'A'))
    if all(elem in b for elem in a) and a in out:
        out.remove(a)

print(out)
# [('A', 'B', 'D', 'E', 'F'), ('B', 'C'), ('X', 'Y', 'Z'), ('G',)]

filt = list(map(', '.join, out))
df2 = df.loc[df['letter'].isin(filt), :].drop_duplicates(subset='letter')
print(df2)

Output:

   id         letter
0   1  A, B, D, E, F
1   2           B, C
5   6        X, Y, Z
8   9              G
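
As an alternative to the pairwise loop, the same filtering can be sketched with Python sets. This is not the approach used above, just a variant under the assumption that the order of letters inside a cell does not matter (so 'D, B' and 'B, D' count as the same set). The function name find_supersets is made up for this example:

```python
def find_supersets(cells):
    """Return the distinct letter sets that are not contained in a larger one."""
    # Parse every cell into a frozenset; dict.fromkeys removes duplicate
    # sets (such as the two 'G' rows) while keeping the original order.
    unique = list(dict.fromkeys(
        frozenset(item.strip() for item in cell.split(','))
        for cell in cells
    ))
    # "<" on sets means "proper subset", so a set never eliminates itself.
    return [s for s in unique if not any(s < other for other in unique)]

letters = ['A, B, D, E, F', 'B, C', 'B', 'D, B', 'B, D, A',
           'X, Y, Z', 'X, Y', 'E, D', 'G', 'G']
for s in find_supersets(letters):
    print(sorted(s))
# ['A', 'B', 'D', 'E', 'F']
# ['B', 'C']
# ['X', 'Y', 'Z']
# ['G']
```

Because the comparison works on sets rather than on joined strings, this variant does not depend on how the cells are spaced, but it returns sets, so mapping them back to rows of df needs an extra step.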

Additional Question
Get the ids of the sublists of each superset:

You can create a mapping from the rows of df, with id as the key and the list of letters as the value. Then loop through df2 and check, for each entry of the mapping, whether all of its elements are in the superset.

mapping = df.set_index('id')['letter'].str.split(', ').to_dict()
print(mapping)
{1: ['A', 'B', 'D', 'E', 'F'],
 2: ['B', 'C'],
 3: ['B'],
 4: ['D', 'B'],
 5: ['B', 'D', 'A'],
 6: ['X', 'Y', 'Z'],
 7: ['X', 'Y'],
 8: ['E', 'D'],
 9: ['G'],
 10: ['G']}

Create new column:

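# create helper function
def func(row):
    # split the superset string into its letters, so that membership is
    # tested per letter and not as a substring of the whole string
    letters = row.split(', ')
    sublists = []
    for key, value in mapping.items():
        if all(val in letters for val in value):
            sublists.append(key)
    return sublists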

# apply on each row of df2
df2['sublists'] = [func(row) for row in df2['letter']]
print(df2)
   id         letter         sublists
0   1  A, B, D, E, F  [1, 3, 4, 5, 8]
1   2           B, C           [2, 3]
5   6        X, Y, Z           [6, 7]
8   9              G          [9, 10]

Or as a one-liner, if you prefer:

df2['sublists'] = [[key for key, value in mapping.items() if all(val in row.split(', ') for val in value)] for row in df2['letter']]
df2
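
The question also asks for a first file in which every superset is followed by its subsets. The following is only a rough sketch of how those rows could be assembled in plain Python. The function name group_rows is made up for this example, the ids 9 and 10 for the two G rows follow the DataFrame above, and the order of the subsets under each superset (largest first) is an assumption:

```python
def group_rows(rows):
    """rows: list of (id, letter string). Returns each superset row
    followed by the rows whose letters are contained in it."""
    parsed = [(rid, text, frozenset(p.strip() for p in text.split(',')))
              for rid, text in rows]
    seen = set()   # letter sets already emitted as a superset
    out = []
    for rid, text, items in parsed:
        # Skip duplicates of an emitted superset and proper subsets of other rows
        if items in seen or any(items < other for _, _, other in parsed):
            continue
        seen.add(items)
        out.append((rid, text))
        # All other rows contained in this superset, largest first
        subs = [(r, t) for r, t, s in parsed if s <= items and r != rid]
        subs.sort(key=lambda pair: -len(pair[1].split(',')))
        out.extend(subs)
    return out

data = [(1, 'A, B, D, E, F'), (2, 'B, C'), (3, 'B'), (4, 'D, B'),
        (5, 'B, D, A'), (6, 'X, Y, Z'), (7, 'X, Y'), (8, 'E, D'),
        (9, 'G'), (10, 'G')]
first_file = group_rows(data)
print([rid for rid, _ in first_file])
# [1, 5, 4, 8, 3, 2, 3, 6, 7, 9, 10]
```

The resulting list can be turned into a DataFrame with pd.DataFrame(first_file, columns=['id', 'letter']) and stored with its to_excel method; df2 from above can be stored the same way as the second file.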
Answered By: Rabinzel