python: Dictionary counter items. How to optimize performance
Question:
I need to keep track of occurrences of items in parsed files in a structure as follows:
data = {'tomate': {'John': 2, 'mike': 2},
        'pasta': {'mike': 1},
        'wine': {'mike': 1, 'alex': 2}}
The dictionary starts out empty (data = {}) and is populated as items are parsed: a missing item or user is added with a count of 1, and an existing user's count is incremented by 1.
This is my code:
def modify(data, key, client):
    try:
        # if both keys are there
        data[key][client] = data[key][client] + 1
        #print(f"{key} is there. {client} is there.")
    except:
        try:
            data[key][client] = 1
            #print(f"{client} not there, but {key}")
        except:
            data[key] = {}
            data[key][client] = 1
            #print(f"{client} and {key} added")
    return data
It works:
key = "wine"
client = "alex"
modify(data, key, client)
giving:
{'tomate': {'John': 2, 'mike': 2},
 'pasta': {'mike': 1},
 'wine': {'mike': 1, 'alex': 3}}
My question is whether try/except is the wrong approach here, i.e. not pythonic: I am relying on exceptions for the code to work properly, which feels a little weird and might make the whole thing much slower.
Is there another way to keep a counter like this that might be faster?
Performance is very important, since I have to count hundreds of millions of times.
EDIT 1:
Evaluating the speed of a proposed solution I have this comparison that strikes me since my solution is faster (and I would not expect that):
EDIT 2:
This question has been closed despite very well elaborated answers and despite being aimed at finding a proper way to solve a problem. It is really annoying that people just close questions like this. How can someone say that this question is opinion-based if we are talking about computing times?
Answers:
You can rewrite the function using if-else or using dict.setdefault and get rid of the exceptions:
def modify(data, key, client):
    data.setdefault(key, {}).setdefault(client, 0)
    data[key][client] += 1

data = {}
modify(data, "tomate", "John")
modify(data, "tomate", "John")
modify(data, "tomate", "mike")
modify(data, "tomate", "mike")
modify(data, "pasta", "mike")
modify(data, "wike", "mike")
modify(data, "wike", "alex")
modify(data, "wike", "alex")
print(data)
Prints:
{
"tomate": {"John": 2, "mike": 2},
"pasta": {"mike": 1},
"wike": {"mike": 1, "alex": 2},
}
Using if-else:
def modify(data, key, client):
    if key not in data:
        data[key] = {}
    if client in data[key]:
        data[key][client] += 1
    else:
        data[key][client] = 1
EDIT: One-liner version:
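def modify(data, key, client):
    # Needs Python 3.8+ (walrus operator). The right-hand side is evaluated
    # first, so x is already bound when x[client] is assigned.
    x[client] = (x := data.setdefault(key, {})).get(client, 0) + 1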
One step toward a more pythonic solution is to catch only the exception you expect (instead of any exception):
try:
    ...
except KeyError:
    ...
Second, you can check whether the element is in the dictionary instead of trying:
if key not in data:
    data[key] = {}
if client not in data[key]:
    data[key][client] = 0
data[key][client] += 1
You can also use defaultdict, which does these checks for you. In my experience and experiments all these solutions are very close in terms of execution time, so choose whichever looks best to you.
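For illustration, here is a minimal sketch of the defaultdict idea, assuming a nested defaultdict(int) for the per-client counts (that nesting is my choice for the example, not the only way to do it):

```python
from collections import defaultdict

def modify(data, key, client):
    # Missing keys are created on first access: the outer level builds a new
    # inner defaultdict, the inner level starts every client at int() == 0.
    data[key][client] += 1

data = defaultdict(lambda: defaultdict(int))
modify(data, "tomate", "John")
modify(data, "tomate", "John")
modify(data, "pasta", "mike")

# Convert back to plain dicts, e.g. for printing or serializing.
plain = {key: dict(clients) for key, clients in data.items()}
print(plain)  # {'tomate': {'John': 2}, 'pasta': {'mike': 1}}
```

The function body shrinks to a single statement because all the "is it there yet?" logic moves into the container.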
With the try/except version, the lookup of the index is done only once (most of the time). It serves both to check the presence of the index and to alter the associated value. That is why your version is the fastest so far. The last one-liner from Andrej is also very fast, especially with a lot of new indexes (because then your version spends some time in except, whereas Andrej’s one-liner doesn’t really care).
One notable, seemingly minor but important improvement to your version could be:
def modify(data, key, client):
    try:
        # if both keys are there
        data[key][client] += 1
        #print(f"{key} is there. {client} is there.")
    except:
        try:
            data[key][client] = 1
            #print(f"{client} not there, but {key}")
        except:
            data[key] = {}
            data[key][client] = 1
            #print(f"{client} and {key} added")
    return data
Yes, just that += changes quite a lot, especially with a lot of "updates" (I mean, not new indexes). That is because x = x + 1 means that Python needs to figure out what x is twice, whereas x += 1 finds the l-value only once. Usually that is roughly the same, except when finding the place itself is a non-negligible part of the cost, as it is here with indexing: data[key] is looked up twice in the first form and once in the second.
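If you want to see that difference rather than take my word for it, here is a small sketch that counts lookups on the outer dictionary. CountingDict is a helper I made up for this demonstration; it is not part of the benchmarked code.

```python
class CountingDict(dict):
    """dict subclass that counts how often __getitem__ is called on it."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.lookups = 0

    def __getitem__(self, key):
        self.lookups += 1
        return super().__getitem__(key)

outer = CountingDict({"wine": {"alex": 2}})

# Plain form: outer["wine"] is evaluated once on each side of the "=".
outer["wine"]["alex"] = outer["wine"]["alex"] + 1
plain_lookups = outer.lookups      # 2

outer.lookups = 0
# Augmented form: outer["wine"] is evaluated once and reused for the store.
outer["wine"]["alex"] += 1
augmented_lookups = outer.lookups  # 1
```

The inner dictionary is still read and then written in both forms; what += saves is the second lookup of data[key].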
Some benchmarking on big sets of generated data
Experiment 1: data with a lot of new indexes.
Version | time |
---|---|
Yours | 5.11 |
Mine (that is yours with +=) | 4.44 |
Andrej’s 2-liner | 6.21 |
Andrej/Yevhen’s if-else | 5.69 |
Get based | 5.99 |
Andrej’s one-liner | 5.08 |
Experiment 2: data with lots of collisions (that is, existing indexes, so values other than 1).
Version | time |
---|---|
Yours | 5.43 |
My += | 4.67 |
Andrej’s 2-liner | 6.73 |
if/else | 6.23 |
.get | 6.51 |
Andrej’s 1-liner | 5.60 |
So, in both cases, your version, upgraded with my += suggestion, is the fastest. With a lot of new indexes, Andrej’s one-liner slightly beats your version without += (again, because your version then goes through more except clauses for the new indexes). With a lot of existing indexes, the two are roughly the same.
The += improvement makes your version the fastest in both cases.
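If you want to repeat this kind of measurement on your own machine, a rough harness could look like the sketch below. This is not the script behind the tables above: the random integer keys, the sizes, the seed and the two variants compared are all assumptions made for the example, and the timings will differ from machine to machine.

```python
import random
import timeit

def modify_try(data, key, client):
    # try/except variant with += and a specific exception
    try:
        data[key][client] += 1
    except KeyError:
        try:
            data[key][client] = 1
        except KeyError:
            data[key] = {client: 1}

def modify_setdefault(data, key, client):
    # setdefault/get variant
    inner = data.setdefault(key, {})
    inner[client] = inner.get(client, 0) + 1

def make_pairs(n, n_keys, n_clients, seed=0):
    # Fewer distinct keys/clients means more collisions (existing indexes).
    rng = random.Random(seed)
    return [(rng.randrange(n_keys), rng.randrange(n_clients)) for _ in range(n)]

def run(func, pairs):
    data = {}
    for key, client in pairs:
        func(data, key, client)
    return data

pairs = make_pairs(100_000, 50, 50)
for func in (modify_try, modify_setdefault):
    seconds = timeit.timeit(lambda: run(func, pairs), number=3)
    print(f"{func.__name__}: {seconds:.3f} s")
```

Raising n_keys and n_clients moves the workload toward the "lots of new indexes" case, lowering them moves it toward the "lots of collisions" case.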
The bare except is dangerous: there are other reasons for exceptions to be thrown (user interrupt, out of memory, …). Catch the specific exception, but beware that both data[key] and data[key][client] can throw, so you’ll want to identify which level needs inserting.
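One possible way to keep the two levels apart, as a sketch (the local name inner is mine), is to give each lookup its own try block:

```python
def modify(data, key, client):
    # Level 1: only data[key] can raise here, so a KeyError means the
    # item itself is missing.
    try:
        inner = data[key]
    except KeyError:
        inner = data[key] = {}
    # Level 2: only inner[client] can raise here, so a KeyError means the
    # client is missing for this item.
    try:
        inner[client] += 1
    except KeyError:
        inner[client] = 1
    return data

data = {}
modify(data, "wine", "mike")
modify(data, "wine", "alex")
modify(data, "wine", "alex")
print(data)  # {'wine': {'mike': 1, 'alex': 2}}
```

Anything that is not a KeyError (KeyboardInterrupt, MemoryError, a TypeError from an unhashable key, ...) now propagates instead of being silently turned into an insert.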
Consider using a defaultdict to insert any needed key, and a Counter to track how many orders each client has:
from collections import Counter, defaultdict

data = defaultdict(Counter,
                   {'tomate': Counter({'John': 2, 'mike': 2}),
                    'pasta': Counter({'mike': 1}),
                    'wine': Counter({'mike': 1, 'alex': 2})})

def modify(data, key, client):
    data[key][client] += 1
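As a small usage sketch, here is the same structure starting out empty, as in the question. The sample calls are made up for illustration. One thing to keep in mind: merely reading data[key] on a defaultdict inserts that key, so use in (or .get) when you only want to look.

```python
from collections import Counter, defaultdict

def modify(data, key, client):
    data[key][client] += 1

data = defaultdict(Counter)
for key, client in [("wine", "alex"), ("wine", "mike"), ("wine", "alex")]:
    modify(data, key, client)

top_wine = data["wine"].most_common(1)  # [('alex', 2)]
missing_client = data["wine"]["john"]   # 0; a Counter does not insert on read
has_pasta = "pasta" in data             # False; "in" does not insert either

# Plain nested dicts again, e.g. for serializing.
plain = {key: dict(counts) for key, counts in data.items()}
```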