Random seed for order of elements in Python's Set to List conversion
Question:
I was executing some code in a Jupyter notebook and noticed that each time I ran it, the output was different despite not explicitly putting randomness in my program.
I narrowed it down to a line that removes all repeated elements from a list.
l = list(set(l))
I noticed two things:
- If I re-run the same code in the same Jupyter kernel, I always get the same output for l, but
- If I open up another notebook, I get a different output.
Is there some kind of hidden random seed that is used for the set -> list conversion for a given kernel? How does it work under the hood, and what would I do if I wanted deterministic output from the above code?
Answers:
A set functions almost the same as a dict, with the hash of your object as the key. The default __hash__ of most objects (in CPython) is derived from their id, which in turn comes from their address in memory. A new kernel means the objects live at different addresses, so they have different ids, different hashes, and the set's iterator yields them in a different order.
This is implementation-dependent, so you cannot rely on it; all I can say is that CPython currently works this way. The one thing you can rely on is that a set is not (usefully) ordered.
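One way to get deterministic output from the line in the question, assuming the elements are mutually comparable, is to sort after deduplicating (the sample list here is mine, not the asker's):

```python
l = [3, 1, 2, 3, 1]  # sample data, not from the question

# Deduplicate, then sort so the result no longer depends on
# set iteration order (requires mutually comparable elements).
l = sorted(set(l))
print(l)  # [1, 2, 3]
```

This imposes sorted order rather than preserving the original order; the helpers below are the option when the original order matters.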
If you need ordering, keep both the list and the set. If you want to remove repeats while preserving order, something like this will work:
def could_add(s, x):
    if x in s:
        return False
    else:
        s.add(x)
        return True

seen = set()
[x for x in l if could_add(seen, x)]
(Though I fully agree with Barmar’s comment — if order matters, they should be sortable.)
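As a quick check, the helper above (repeated here so the snippet runs on its own, with a sample list of my choosing) keeps the first occurrence of each element:

```python
def could_add(s, x):
    # Add x to the set s; report whether it was new.
    if x in s:
        return False
    s.add(x)
    return True

l = [3, 1, 2, 3, 1]  # sample data, not from the question
seen = set()
result = [x for x in l if could_add(seen, x)]
print(result)  # [3, 1, 2]
```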
You can use OrderedDict instead of set to remove all repeated elements from a list while keeping its order. In Python >= 3.6, a plain dict also keeps insertion order (an implementation detail in CPython 3.6, a language guarantee from 3.7), so it behaves the same as OrderedDict here.
# python < 3.6
from collections import OrderedDict
res = list(OrderedDict.fromkeys(yourlist))
# python >= 3.6
res = list(dict.fromkeys(yourlist))
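Both variants keep the first occurrence of each element and drop later repeats. For example (sample list is mine):

```python
yourlist = [3, 1, 2, 3, 1]  # sample data, not from the question

# dict.fromkeys builds one key per distinct element, in first-seen
# order; converting back to a list gives an order-preserving dedupe.
res = list(dict.fromkeys(yourlist))
print(res)  # [3, 1, 2]
```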