How to build a dict with regex that sums values instead of overwriting them

Question:

I’m new to Python. I have a log file with content like this:

[14:43:28]Toyota Camry/BH1488XO/service:complex/employee:Oleg/price:550
[15:56:15]Nissan Almera/BE0348CH/service:outside+interior/employee:Serega/price:450
[15:59:44]VW Amarok /BH138E/service:complex/employee:Oleg/price:700
[16:00:48]BMW X7/BH1155HH/service:2-phase complex+plastic /employee:Sasha/price:1400
[16:02:38]Jeep Renegade/BE6782IK/service:wash/employee:Serega/price:300
[16:03:19]MB C300/BT4500BT/service:complex/employee:Sasha/price:550
[16:04:19]MB C200/BT4400HT/service:complex/employee:Sasha/price:1000

I need to build a dict that has each employee as a key and the sum of that employee's prices as the value, like {"Oleg": 1250}

I used this code to make a list of employees:


    import re

    with open("17082022.log", "r") as file:
        text = file.read()
    emp_list = set(re.findall(r'employee:(.*)/', text))

and this to make a list of prices:


    output_pluses = re.findall(r"(?<=price:)[+-]?\d+", text)

Asked By: Palma

||

Answers:

You can use re.findall with capturing groups to get the employee name and price in one step. Then build the dictionary, adding each price to the employee's running total:

import re

log = """
[14:43:28]Toyota Camry/BH1488XO/service:complex/employee:Oleg/price:550
[15:56:15]Nissan Almera/BE0348CH/service:outside+interior/employee:Serega/price:450
[15:59:44]VW Amarok /BH138E/service:complex/employee:Oleg/price:700
[16:00:48]BMW X7/BH1155HH/service:2-phase complex+plastic /employee:Sasha/price:1400
[16:02:38]Jeep Renegade/BE6782IK/service:wash/employee:Serega/price:300
[16:03:19]MB C300/BT4500BT/service:complex/employee:Sasha/price:550
[16:04:19]MB C200/BT4400HT/service:complex/employee:Sasha/price:1000"""

out = {}
for employee, price in re.findall(r"employee:([^/]+)/price:(\d+)", log):
    out[employee] = out.get(employee, 0) + int(price)

print(out)

Prints:

{'Oleg': 1250, 'Serega': 750, 'Sasha': 2950}
Answered By: Andrej Kesely

Another option is to use the str.split() method. The advantage is that you do not need to import the re module or know how to design regular expressions:

log = """
[14:43:28]Toyota Camry/BH1488XO/service:complex/employee:Oleg/price:550
[15:56:15]Nissan Almera/BE0348CH/service:outside+interior/employee:Serega/price:450
[15:59:44]VW Amarok /BH138E/service:complex/employee:Oleg/price:700
[16:00:48]BMW X7/BH1155HH/service:2-phase complex+plastic /employee:Sasha/price:1400
[16:02:38]Jeep Renegade/BE6782IK/service:wash/employee:Serega/price:300
[16:03:19]MB C300/BT4500BT/service:complex/employee:Sasha/price:550
[16:04:19]MB C200/BT4400HT/service:complex/employee:Sasha/price:1000"""

dct = {}
for line in log.strip().split('\n'):  # strip() drops the empty first line of the triple-quoted string
    employee, price = line.split('/employee:')[1].split('/price:')
    dct[employee] = dct.get(employee, 0) + int(price)
print(dct) # gives {'Oleg': 1250, 'Serega': 750, 'Sasha': 2950}

The ‘trick’ with the short dct.get(employee, 0) code is that if the employee isn’t in the dictionary yet, the value 0 is returned as the running total. This is equivalent to (dct[employee] if employee in dct else 0), which is itself a shortened version of an if-statement spanning multiple lines.
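To make that equivalence concrete, here is a minimal sketch that builds the same totals twice, once with dict.get() and once with the spelled-out if/else. The records list and the names short and long_form are made up for illustration:

```python
# Illustrative (employee, price) pairs; not read from the real log file.
records = [("Oleg", 550), ("Serega", 450), ("Oleg", 700)]

# Short form: dict.get() supplies 0 for an employee not seen yet.
short = {}
for employee, price in records:
    short[employee] = short.get(employee, 0) + price

# Long form: the same logic written as an explicit if/else.
long_form = {}
for employee, price in records:
    if employee in long_form:
        long_form[employee] += price
    else:
        long_form[employee] = price

print(short)      # {'Oleg': 1250, 'Serega': 450}
print(long_form)  # {'Oleg': 1250, 'Serega': 450}
```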

Another advantage of the .split() approach over a regular expression search is that it will most likely raise an error if lines in the log file have an unexpected format or content, whereas the regular expression search will just deliver a (wrong) result without complaint.
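A small sketch of that difference, assuming a hypothetical two-line log in which the second line has lost its /price: field. The regex version quietly leaves that line out of the total, while the .split() version stops with a ValueError:

```python
import re

# Hypothetical log: the second line is missing its "/price:" field.
log = """[14:43:28]Toyota Camry/BH1488XO/service:complex/employee:Oleg/price:550
[15:59:44]VW Amarok /BH138E/service:complex/employee:Oleg"""

# Regex approach: the malformed line does not match and is skipped.
regex_totals = {}
for employee, price in re.findall(r"employee:([^/]+)/price:(\d+)", log):
    regex_totals[employee] = regex_totals.get(employee, 0) + int(price)
print(regex_totals)  # {'Oleg': 550}

# split() approach: unpacking the malformed line raises ValueError.
split_totals = {}
split_error = None
try:
    for line in log.splitlines():
        employee, price = line.split('/employee:')[1].split('/price:')
        split_totals[employee] = split_totals.get(employee, 0) + int(price)
except ValueError as exc:
    split_error = exc
    print("malformed line:", exc)
```

Whether a hard stop or a silent skip is preferable depends on how much you trust the log format.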

For extremely large log files the regular expression search approach runs about 10% faster, but for small log files the time required to load the re module makes it much slower than the .split() approach.
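Timings like these depend on the machine and on the data, so it is worth measuring on your own logs. A rough sketch using the standard timeit module follows. The synthetic 10,000-line log and the function names with_regex and with_split are assumptions for illustration, and the measurement covers only the parsing loops, not the cost of importing re:

```python
import re
import timeit

# Synthetic data: one sample line repeated, not a real log file.
line = "[14:43:28]Toyota Camry/BH1488XO/service:complex/employee:Oleg/price:550"
log = "\n".join([line] * 10_000)

def with_regex(text):
    """Sum prices per employee using re.findall."""
    totals = {}
    for employee, price in re.findall(r"employee:([^/]+)/price:(\d+)", text):
        totals[employee] = totals.get(employee, 0) + int(price)
    return totals

def with_split(text):
    """Sum prices per employee using str.split on each line."""
    totals = {}
    for row in text.splitlines():
        employee, price = row.split('/employee:')[1].split('/price:')
        totals[employee] = totals.get(employee, 0) + int(price)
    return totals

# Run each approach 20 times and report the total elapsed time.
regex_time = timeit.timeit(lambda: with_regex(log), number=20)
split_time = timeit.timeit(lambda: with_split(log), number=20)
print(f"regex: {regex_time:.3f}s  split: {split_time:.3f}s")
```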

Answered By: Claudio