Python set.add() is triggering outside a conditional statement

Question:

I’m tokenizing some documents and want to find out which tokens are shared between tokenizations. For each tokenization, I loop through the set of all tokens across all tokenizations, called all_tokens, and check whether a given token exists in the current tokenization. If it does, I add the index i of that tokenization to a set stored under the token in the token_dict dictionary. However, set.add() somehow appears to be called outside the conditional, so every i ends up in the token’s entry in token_dict. I’ve set up a small toy version below that checks the final token’s entry in token_dict.

import numpy as np
np.random.seed(42)
all_tokens = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
tokenizations = [[all_tokens[np.random.randint(0, len(all_tokens))] for i in range(4)] for i in range(10)]
print(tokenizations)
token_dict = dict.fromkeys(all_tokens, set())
for i, tokenization in enumerate(tokenizations):
    token_set = set(tokenization)
    for token in all_tokens:
        #for each token in the total vocabulary, check if the unique value exists in the tokenization
        if token in token_set:
            token_dict[token].add(i)
        if(token=='g'):
            print(token in token_set)


print(token)
print(token_dict[token])

This results in the output:

[['g', 'd', 'e', 'g'], ['c', 'e', 'e', 'g'], ['b', 'c', 'g', 'c'], ['c', 'e', 'd', 'c'], ['f', 'e', 'b', 'd'], ['f', 'f', 'b', 'd'], ['e', 'a', 'd', 'b'], ['f', 'e', 'd', 'a'], ['a', 'c', 'c', 'g'], ['b', 'd', 'd', 'g']]
True
True
True
False
False
False
False
False
True
True
g
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

As you can see, g does not exist in the tokenizations at indices 3-7 and the conditional correctly identifies this as False, but the indices of those tokenizations are added to the entry of g in the token_dict.

If I change the token_dict values to a list and use list.append(), the output shows every index appended to the list multiple times:
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9]

I don’t show it here but this is true for all tokens in all_tokens. Why is this happening?

Asked By: DLS


Answers:

import numpy as np
np.random.seed(42)

all_tokens = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

tokenizations = [[all_tokens[np.random.randint(0, len(all_tokens))] for i in range(4)] for i in range(10)]
token_dict = {token: set() for token in all_tokens}
for i, tokenization in enumerate(tokenizations):
    token_set = set(tokenization)
    for token in all_tokens:
        #for each token in the total vocabulary, check if the unique value exists in the tokenization
        if token in token_set:
            token_dict[token].add(i)
        if(token=='g'):
            print(token in token_set)


print(token)
print(token_dict[token])

Try this; it should work.

Answered By: Shoaib Baloch

set.add is only called inside the conditional. The problem is: there is only one set, and all entries in token_dict point to the exact same set instance.

The culprit is this line:

token_dict = dict.fromkeys(all_tokens, set())

From the Python docs on dict.fromkeys():

All of the values refer to just a single instance, so it generally doesn’t make sense for value to be a mutable object such as an empty list. To get distinct values, use a dict comprehension instead.
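The shared-instance behavior is easy to demonstrate with a minimal example:

```python
# Every value created by dict.fromkeys() with a mutable default
# refers to the same object, so mutating one "entry" mutates all.
d = dict.fromkeys(['a', 'b', 'c'], set())
d['a'].add(1)

print(d['b'])            # {1} -- 'b' sees the element added via 'a'
print(d['a'] is d['b'])  # True -- literally the same set instance
```

This is exactly what happens in the question: every index i that passes the conditional for any token lands in the single set that all keys share.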

Following their advice, you should use a dict comprehension instead:

token_dict = {token: set() for token in all_tokens}

This creates a new set instance for each item in token_dict.
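With the comprehension, mutating one entry no longer affects the others, which you can verify directly:

```python
# The dict comprehension evaluates set() once per key,
# producing an independent set for each entry.
token_dict = {token: set() for token in ['a', 'b', 'c']}
token_dict['a'].add(0)

print(token_dict['b'])                     # set() -- unaffected
print(token_dict['a'] is token_dict['b'])  # False -- distinct instances
```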

Answered By: danzel