Finding multiple supersets and subsets for values in a column with python
Question:
I am trying to find supersets and subsets for values included in a column(here for letter column) from an excel file. The data looks like this:
id
letter
1
A, B, D, E, F
2
B, C
3
B
4
D, B
5
B, D, A
6
X, Y, Z
7
X, Y
8
E, D
7
G
8
G
For e.g.
- ‘B’, ‘D,B’, ‘E,D’, ‘B,D,A’ are subsets of ‘A,B,D,E,F’,
- ‘B’ is a subset of ‘B,C’,
- ‘X,Y’ is a subset of ‘X,Y,Z’,
- ‘G’ is a subset of ‘G’.
and
- ‘A,B,D,E,F’, ‘B,C’, ‘X,Y,Z’ and ‘G’ are supersets.
I would like to show and store that relation in the separate excel files, first one includes (subsets and their supersets) second one includes supersets, First file:
id
letter
1
A, B, D, E, F
5
B,D,A
8
E,D
4
D,B
3
B
2
B,C
3
B
6
X, Y, Z
7
X, Y
7
G
8
G
Second file:
id
letter
1
A, B, D, E, F
2
B,C
6
X, Y, Z
7
G
Answers:
One possible solution could be using itertools.combinations
and check in every combination if all elements of the one item is in the other.
To find the supersets we take the letter
column and convert it to a list of tuples. Then we create all possible combinations each with two elements of that column.
The line a,b = ...
is to find the shorter element in that specific combination. a
is always the shorter element.
If every letter of a
is in b
and a
is in list out
, then we remove it from the list because it is a subset of another element. At the end, out
only contains the supersets
of your data.
Then we only have to change the elements of the list to joined strings again and filter the df
with that list to get your 2nd file (here called df2
)
You need to be aware of how you split your strings in the beginning and also joining in the end. If there leading or trailing whitespaces in your data, you need to strip them, otherwise in the end the filter wouldn’t match the rows.
EDIT
If you want to get rid of the duplicates at the end, you just need to add .drop_duplicates(subset='letter')
at the end after filtering your df2. subset
needs to be defined here, since both rows with G
have a different value for id
, so it wouldn’t be considered as duplicate.
df = pd.DataFrame({
'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'letter': ['A, B, D, E, F','B, C','B','D, B','B, D, A','X, Y, Z','X, Y','E, D','G','G']})
lst = df['letter'].values.tolist()
lst = list(tuple(item.strip() for item in x.split(',')) for x in lst)
print(lst)
# [('A', 'B', 'D', 'E', 'F'), ('B', 'C'), ('B',), ('D', 'B'), ('B', 'D', 'A'), ('X', 'Y', 'Z'), ('X', 'Y'), ('E', 'D')]
out = lst[:] #copy of lst
for tup1,tup2 in itertools.combinations(lst, 2):
a, b = (tup1, tup2) if len(tup1) < len(tup2) else (tup2, tup1)
# e.g for a,b : (('D','B'), ('B', 'D', 'A'))
if all(elem in b for elem in a) and a in out:
out.remove(a)
print(out)
# [('A', 'B', 'D', 'E', 'F'), ('B', 'C'), ('X', 'Y', 'Z')]
filt = list(map(', '.join, out))
df2 = df.loc[df['letter'].isin(filt), :].drop_duplicates(subset='letter')
print(df2)
Output:
id letter
0 1 A, B, D, E, F
1 2 B, C
5 6 X, Y, Z
8 9 G
Additional Question
get id’s of sublists from superset:
You can create a mapping each row of df
with id
as key and the sublists as value. Then loop through df2
and check if all elements of the sublist are in the supersets.
mapping = df.set_index('id')['letter'].str.split(', ').to_dict()
print(mapping)
{1: ['A', 'B', 'D', 'E', 'F'],
2: ['B', 'C'],
3: ['B'],
4: ['D', 'B'],
5: ['B', 'D', 'A'],
6: ['X', 'Y', 'Z'],
7: ['X', 'Y'],
8: ['E', 'D'],
9: ['G'],
10: ['G']}
Create new column:
#create helper function
def func(row):
sublists = []
for key,value in mapping.items():
check = [val in row for val in value]
if all(check):
sublists.append(key)
return sublists
# apply on each row of df2
df2['sublists'] = [func(row) for row in df2['letter']]
print(df2)
id letter sublists
0 1 A, B, D, E, F [1, 3, 4, 5, 8]
1 2 B, C [2, 3]
5 6 X, Y, Z [6, 7]
8 9 G [9, 10]
or as oneliner if you like to:
df2['sublists'] = [[key for key,value in mapping.items() if all(val in row for val in value)] for row in df2['letter']]
df2
I am trying to find supersets and subsets for values included in a column(here for letter column) from an excel file. The data looks like this:
id | letter |
---|---|
1 | A, B, D, E, F |
2 | B, C |
3 | B |
4 | D, B |
5 | B, D, A |
6 | X, Y, Z |
7 | X, Y |
8 | E, D |
7 | G |
8 | G |
For e.g.
- ‘B’, ‘D,B’, ‘E,D’, ‘B,D,A’ are subsets of ‘A,B,D,E,F’,
- ‘B’ is a subset of ‘B,C’,
- ‘X,Y’ is a subset of ‘X,Y,Z’,
- ‘G’ is a subset of ‘G’.
and
- ‘A,B,D,E,F’, ‘B,C’, ‘X,Y,Z’ and ‘G’ are supersets.
I would like to show and store that relation in the separate excel files, first one includes (subsets and their supersets) second one includes supersets, First file:
id | letter |
---|---|
1 | A, B, D, E, F |
5 | B,D,A |
8 | E,D |
4 | D,B |
3 | B |
2 | B,C |
3 | B |
6 | X, Y, Z |
7 | X, Y |
7 | G |
8 | G |
Second file:
id | letter |
---|---|
1 | A, B, D, E, F |
2 | B,C |
6 | X, Y, Z |
7 | G |
One possible solution could be using itertools.combinations
and check in every combination if all elements of the one item is in the other.
To find the supersets we take the letter
column and convert it to a list of tuples. Then we create all possible combinations each with two elements of that column.
The line a,b = ...
is to find the shorter element in that specific combination. a
is always the shorter element.
If every letter of a
is in b
and a
is in list out
, then we remove it from the list because it is a subset of another element. At the end, out
only contains the supersets
of your data.
Then we only have to change the elements of the list to joined strings again and filter the df
with that list to get your 2nd file (here called df2
)
You need to be aware of how you split your strings in the beginning and also joining in the end. If there leading or trailing whitespaces in your data, you need to strip them, otherwise in the end the filter wouldn’t match the rows.
EDIT
If you want to get rid of the duplicates at the end, you just need to add .drop_duplicates(subset='letter')
at the end after filtering your df2. subset
needs to be defined here, since both rows with G
have a different value for id
, so it wouldn’t be considered as duplicate.
df = pd.DataFrame({
'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'letter': ['A, B, D, E, F','B, C','B','D, B','B, D, A','X, Y, Z','X, Y','E, D','G','G']})
lst = df['letter'].values.tolist()
lst = list(tuple(item.strip() for item in x.split(',')) for x in lst)
print(lst)
# [('A', 'B', 'D', 'E', 'F'), ('B', 'C'), ('B',), ('D', 'B'), ('B', 'D', 'A'), ('X', 'Y', 'Z'), ('X', 'Y'), ('E', 'D')]
out = lst[:] #copy of lst
for tup1,tup2 in itertools.combinations(lst, 2):
a, b = (tup1, tup2) if len(tup1) < len(tup2) else (tup2, tup1)
# e.g for a,b : (('D','B'), ('B', 'D', 'A'))
if all(elem in b for elem in a) and a in out:
out.remove(a)
print(out)
# [('A', 'B', 'D', 'E', 'F'), ('B', 'C'), ('X', 'Y', 'Z')]
filt = list(map(', '.join, out))
df2 = df.loc[df['letter'].isin(filt), :].drop_duplicates(subset='letter')
print(df2)
Output:
id letter
0 1 A, B, D, E, F
1 2 B, C
5 6 X, Y, Z
8 9 G
Additional Question
get id’s of sublists from superset:
You can create a mapping each row of df
with id
as key and the sublists as value. Then loop through df2
and check if all elements of the sublist are in the supersets.
mapping = df.set_index('id')['letter'].str.split(', ').to_dict()
print(mapping)
{1: ['A', 'B', 'D', 'E', 'F'],
2: ['B', 'C'],
3: ['B'],
4: ['D', 'B'],
5: ['B', 'D', 'A'],
6: ['X', 'Y', 'Z'],
7: ['X', 'Y'],
8: ['E', 'D'],
9: ['G'],
10: ['G']}
Create new column:
#create helper function
def func(row):
sublists = []
for key,value in mapping.items():
check = [val in row for val in value]
if all(check):
sublists.append(key)
return sublists
# apply on each row of df2
df2['sublists'] = [func(row) for row in df2['letter']]
print(df2)
id letter sublists
0 1 A, B, D, E, F [1, 3, 4, 5, 8]
1 2 B, C [2, 3]
5 6 X, Y, Z [6, 7]
8 9 G [9, 10]
or as oneliner if you like to:
df2['sublists'] = [[key for key,value in mapping.items() if all(val in row for val in value)] for row in df2['letter']]
df2