Parsing a log file and ignoring text between two targets
Question:
This question is a follow-up to my previous question here: Parsing text and JSON from a log file and keeping them together
I have a log file, your_file.txt, with the following structure, and I would like to extract the timestamp, run, user, and JSON:
A whole bunch of irrelevant text
2022-12-15 12:45:06 garbage, run: 1, user: james json:
[{"value": 30, "error": 8}]
Another stack user was helpful enough to provide this abridged code to extract the relevant pieces:
import re
pat = re.compile(
    r'(?ms)^([^,\n]+),\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$'
)
with open('your_file.txt', 'r') as f_in:
    print(pat.findall(f_in.read()))
This returns the following value, which is then processed further:
[('2022-12-15 12:45:06 garbage', '1', 'james', '[{"value": 30, "error": 8}]')]
How can I amend the regex so that the word "garbage" after the timestamp is ignored and not included in the output of pat.findall?
Answers:
You can use a date-time pattern to match the timestamp first and then consume the rest of the substring before the comma:
(?ms)^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*,\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$
See the regex demo.
The ([^,\n]+) part is replaced with (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*, which matches:
(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) – Group 1: four digits, then two occurrences of a hyphen followed by two digits, a space, two digits, and then two occurrences of a colon followed by two digits
[^,\n]* – zero or more characters other than a comma and a newline, consumed without being captured
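As a quick check, the amended pattern can be run against the sample from the question. This is only a minimal sketch: the log text is inlined as a string (the variable name sample is an assumption for illustration) rather than read from your_file.txt, and it assumes every timestamp has the YYYY-MM-DD HH:MM:SS shape shown in the question.

```python
import re

# Sample text copied from the question; in practice it would be read from your_file.txt.
sample = (
    "A whole bunch of irrelevant text\n"
    "2022-12-15 12:45:06 garbage, run: 1, user: james json:\n"
    '[{"value": 30, "error": 8}]\n'
)

# Group 1 captures only the timestamp; [^,\n]* then skips anything
# (such as " garbage") up to the first comma without capturing it.
pat = re.compile(
    r'(?ms)^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*,'
    r'\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$'
)

matches = pat.findall(sample)
print(matches)
# [('2022-12-15 12:45:06', '1', 'james', '[{"value": 30, "error": 8}]')]
```

The pattern is split across two raw string literals purely for readability; Python concatenates them into the same regex as the one-line version above.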