Python set.add() is triggering outside a conditional statement
Question:
I’m tokenizing some documents and I want to find out which tokens are shared between tokenizations. To do this, for each tokenization I loop through the set of all tokens across all tokenizations, called all_tokens, and check whether a given token exists in the current tokenization. If it does, I add the index of the tokenization, i, to the set corresponding to that token in the token_dict dictionary. However, set.add() is somehow being called outside the conditional, resulting in every i being added to the token’s entry in token_dict. I’ve set up a small toy version here where I check the final token’s entry in token_dict.
import numpy as np

np.random.seed(42)
all_tokens = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
tokenizations = [[all_tokens[np.random.randint(0, len(all_tokens))] for i in range(4)] for i in range(10)]
print(tokenizations)
token_dict = dict.fromkeys(all_tokens, set())
for i, tokenization in enumerate(tokenizations):
    token_set = set(tokenization)
    for token in all_tokens:
        # for each token in the total vocabulary, check if it exists in the tokenization
        if token in token_set:
            token_dict[token].add(i)
        if token == 'g':
            print(token in token_set)
print(token)
print(token_dict[token])
This results in the output:
[['g', 'd', 'e', 'g'], ['c', 'e', 'e', 'g'], ['b', 'c', 'g', 'c'], ['c', 'e', 'd', 'c'], ['f', 'e', 'b', 'd'], ['f', 'f', 'b', 'd'], ['e', 'a', 'd', 'b'], ['f', 'e', 'd', 'a'], ['a', 'c', 'c', 'g'], ['b', 'd', 'd', 'g']]
True
True
True
False
False
False
False
False
True
True
g
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
As you can see, g does not exist in the tokenizations at indices 3-7 and the conditional correctly identifies this as False, yet the indices of those tokenizations are still added to the entry for g in token_dict.
If I change the token_dict values to lists and use list.append(), the output shows every index appended multiple times, once for each distinct token in that tokenization (3 or 4 times here):
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9]
I don’t show it here, but the same thing happens for every token in all_tokens. Why is this happening?
Answers:
import numpy as np

np.random.seed(42)
all_tokens = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
tokenizations = [[all_tokens[np.random.randint(0, len(all_tokens))] for i in range(4)] for i in range(10)]
# Use a dict comprehension so each token gets its own set;
# dict.fromkeys(all_tokens, set()) would share one set across all keys.
token_dict = {token: set() for token in all_tokens}
for i, tokenization in enumerate(tokenizations):
    token_set = set(tokenization)
    for token in all_tokens:
        # for each token in the total vocabulary, check if it exists in the tokenization
        if token in token_set:
            token_dict[token].add(i)
        if token == 'g':
            print(token in token_set)
print(token)
print(token_dict[token])
Try this; replacing dict.fromkeys with a dict comprehension gives each token its own set, which fixes the problem.
set.add is only called inside the conditional. The problem is that there is only one set: all entries in token_dict point to the exact same set instance.
The culprit is this line:
token_dict = dict.fromkeys(all_tokens, set())
From the Python docs:
All of the values refer to just a single instance, so it generally doesn’t make sense for value to be a mutable object such as an empty list. To get distinct values, use a dict comprehension instead.
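You can see the sharing directly with a quick check (a minimal sketch of the same setup):

```python
all_tokens = ['a', 'b', 'c']
token_dict = dict.fromkeys(all_tokens, set())

# Every value is the very same set object.
print(token_dict['a'] is token_dict['b'])  # True

# Mutating it through one key is visible through all keys.
token_dict['a'].add(0)
print(token_dict['c'])  # {0}
```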
Following their advice, use a dict comprehension instead:
token_dict = {token: set() for token in all_tokens}
This creates a new set instance for each item in token_dict.
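As an alternative to the comprehension, collections.defaultdict(set) creates each per-token set lazily on first access, so you don’t need the vocabulary up front. A sketch using the first two tokenizations from the question:

```python
from collections import defaultdict

tokenizations = [['g', 'd', 'e', 'g'], ['c', 'e', 'e', 'g']]

token_dict = defaultdict(set)  # a fresh, independent set per key, created on first use
for i, tokenization in enumerate(tokenizations):
    for token in set(tokenization):
        token_dict[token].add(i)

print(token_dict['g'])  # {0, 1}
print(token_dict['d'])  # {0}
```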