Parsing a log file and ignoring text between two targets

Question

This question is a follow-up to my previous question here: Parsing text and JSON from a log file and keeping them together

I have a log file, your_file.txt, with the following structure, and I would like to extract the timestamp, run, user, and JSON:

A whole bunch of irrelevant text
2022-12-15 12:45:06 garbage, run: 1, user: james json:
[{"value": 30, "error": 8}]

Another stack user was helpful enough to provide this abridged code to extract the relevant pieces:

import re

pat = re.compile(
    r'(?ms)^([^,\n]+),\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$'
)

with open('your_file.txt', 'r') as f_in:
    print(pat.findall(f_in.read()))

This returns the following value, which is then processed further:

[('2022-12-15 12:45:06 garbage', '1', 'james', '[{"value": 30, "error": 8}]')]

How can I amend the regular expression so that the word "garbage" after the timestamp is ignored and not included in the output of pat.findall?

Asked By: DJC

Answer 1

You can match the date and time explicitly first, and then consume the rest of the substring up to the comma without capturing it:

(?ms)^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*,\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$

See the regex demo.

The ([^,\n]+) is replaced with (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*, which matches:

(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) – Group 1: four digits, then two occurrences of - followed by two digits, a space, two digits, then two occurrences of : followed by two digits
[^,\n]* – zero or more characters other than a comma or a newline, matched outside the group so they are not captured
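
As a quick check, the pattern above can be run against the sample from the question. This is only a minimal sketch: the inline `sample` string stands in for the contents of your_file.txt, and the `records` list with its `json.loads` step is a hypothetical illustration of the "processed further" stage, not something taken from the original code.

```python
import json
import re

# Same pattern as above, split across two raw strings for readability.
pat = re.compile(
    r'(?ms)^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*,'
    r'\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$'
)

# Stand-in for f_in.read(); copied from the log excerpt in the question.
sample = '''A whole bunch of irrelevant text
2022-12-15 12:45:06 garbage, run: 1, user: james json:
[{"value": 30, "error": 8}]
'''

rows = pat.findall(sample)
print(rows)

# Hypothetical follow-up step: decode the JSON payload of each match.
records = [
    {"timestamp": ts, "run": run, "user": user, "data": json.loads(payload)}
    for ts, run, user, payload in rows
]
print(records)
```

Because the " garbage" text is consumed by [^,\n]* outside the first capturing group, the first element of each tuple holds only the timestamp.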

Answered By: Wiktor Stribiżew
