Parsing text and JSON from a log file and keeping them together
Question:
I have a .log file containing both text strings and json. For example:
A whole bunch of irrelevant text
2022-12-15 12:45:06, run: 1, user: james json:
[{"value": "30", "error": "8"}]
2022-12-15 12:47:36, run: 2, user: kelly json:
[{"value": "15", "error": "3"}]
More irrelevant text
My goal is to extract the json but keep it paired with the text that comes before it so that the two can be tied together. The keyword that indicates the start of a new section is run. However, as shown in the example above, I need to extract the timestamp from the same line where run appears. The character that indicates the end of a section is ].
My goal is to parse this text into a pandas dataframe like the following:
timestamp            run  user   value  error
2022-12-15 12:45:06  1    james  30     8
2022-12-15 12:47:36  2    kelly  15     3
Answers:
To extract the JSON data and the timestamp from the text file, you can use a regular expression to search for the pattern that marks the start of a new section. In this case, that pattern is the timestamp followed by the keyword "run", with the JSON on the next line. In Python this would look like:
import re
import pandas as pd

with open("file.log", "r") as f:
    text = f.read()

# Header line (timestamp, run, user), then the JSON list on the following line
pattern = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}), run: (\d+), user: (\w+) json:\s*"
    r'\[\{"value": "(\d+)", "error": "(\d+)"\}\]'
)
data = []
for timestamp, run, user, value, error in pattern.findall(text):
    data.append((timestamp, int(run), user, int(value), int(error)))
# Tuples => DataFrame
df = pd.DataFrame(data, columns=["timestamp", "run", "user", "value", "error"])
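If you would rather not spell out the JSON keys inside the regular expression, one option is to match only the header line and let the json module find the end of the payload. The following is a minimal sketch, not part of the original answer: the names HEADER and parse_log are made up for illustration, and it assumes every payload is a JSON list of objects with "value" and "error" keys holding integer strings. It uses only the standard library.

```python
import json
import re

# Header line only: timestamp, run number, user name, then "json:".
# The trailing \s* consumes the line break so the match ends at the "[".
HEADER = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}), "
    r"run: (?P<run>\d+), user: (?P<user>\w+) json:\s*",
    re.MULTILINE,
)


def parse_log(text):
    """Return one dict per JSON object, paired with its header fields."""
    decoder = json.JSONDecoder()
    rows = []
    for m in HEADER.finditer(text):
        try:
            # raw_decode parses one JSON value starting at m.end() and
            # stops at its closing bracket, even if it spans several lines.
            payload, _ = decoder.raw_decode(text, m.end())
        except json.JSONDecodeError:
            continue  # header without a valid JSON payload: skip it
        for item in payload:  # assumption: payload is a list of objects
            rows.append(
                {
                    "timestamp": m["timestamp"],
                    "run": int(m["run"]),
                    "user": m["user"],
                    "value": int(item["value"]),
                    "error": int(item["error"]),
                }
            )
    return rows


sample = """A whole bunch of irrelevant text
2022-12-15 12:45:06, run: 1, user: james json:
[{"value": "30", "error": "8"}]
2022-12-15 12:47:36, run: 2, user: kelly json:
[{"value": "15", "error": "3"}]
More irrelevant text
"""

rows = parse_log(sample)
```

Since rows is a list of dicts, it can be passed straight to pd.DataFrame(rows) to get the table from the question.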
Try:
import json
import re
import pandas as pd

# (?m): ^ and $ match at every line; (?s): . also matches newlines
pat = re.compile(
    r"(?ms)^([^,\n]+),\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$"
)
all_data = []
with open("your_file.txt", "r") as f_in:
    for timestamp, run, user, json_line in pat.findall(f_in.read()):
        # The JSON line is a list holding one object
        json_line = json.loads(json_line)
        all_data.append(
            {
                "timestamp": timestamp,
                "run": run,
                "user": user,
                "value": json_line[0]["value"],
                "error": json_line[0]["error"],
            }
        )
df = pd.DataFrame(all_data)
print(df)
Prints:
             timestamp  run   user  value  error
0  2022-12-15 12:45:06    1  james     30      8
1  2022-12-15 12:47:36    2  kelly     15      3
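To see what the pattern in this answer captures before any JSON decoding happens, here is a small standard-library sketch that runs it on the sample from the question (the sample variable is only for illustration). It writes the pattern with its backslash escapes in full. Note that every captured group is a plain string, including the run number and the JSON text.

```python
import re

# (?m) lets ^ and $ match at each line; (?s) lets . match newlines.
pat = re.compile(
    r"(?ms)^([^,\n]+),\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$"
)

sample = """A whole bunch of irrelevant text
2022-12-15 12:45:06, run: 1, user: james json:
[{"value": "30", "error": "8"}]
2022-12-15 12:47:36, run: 2, user: kelly json:
[{"value": "15", "error": "3"}]
More irrelevant text
"""

# One 4-tuple per record: (timestamp, run, user, raw JSON line)
matches = pat.findall(sample)
for timestamp, run, user, json_line in matches:
    print(timestamp, run, user, json_line, sep=" | ")
```

Lines without the "run:" and "user:" fields never match, so the surrounding irrelevant text is skipped. If numeric columns are needed, the strings still have to be converted after json.loads, for example with int().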
Sometimes people find regular expressions hard to follow or maintain, so I wrote pyparsing to make code that parses this kind of data easier to read. (This code uses the jsonValue parser from the jsonParser.py example in the pyparsing examples directory.) This may also be easier to modify in the future, if your data format changes.
sample = """
uninteresting text
2022-12-15 12:45:06, run: 1, user: james json:
[{"value": "30", "error": "8"}]
2022-12-15 12:47:36, run: 2, user: kelly json:
[{"value": "15", "error": "3"}]
uninteresting text
"""
import pyparsing as pp
# jsonParser.py is from the pyparsing examples directory; this assumes a copy
# sits next to this script (a relative ".jsonParser" import only works inside a package)
from jsonParser import jsonValue

# Building blocks for one log record
timestamp = pp.common.iso8601_datetime()
integer = pp.common.integer()
user = pp.Word(pp.alphas, pp.alphanums + "_")
COMMA = pp.Suppress(",")
record = (timestamp("timestamp") + COMMA
          + "run:" + integer("run") + COMMA
          + "user:" + user("user")
          + "json:" + jsonValue("data"))

def annotate_record(tokens):
    # Flatten the named results and the first JSON object into one dict
    return {
        "timestamp": tokens.timestamp,
        "run": tokens.run,
        "user": tokens.user,
        "value": tokens.data[0]["value"],
        "error": tokens.data[0]["error"]
    }

record.add_parse_action(annotate_record)
# search_string skips the uninteresting text and yields each matching record
for match in record.search_string(sample):
    print(match.dump())
Prints:
[{'timestamp': '2022-12-15 12:45:06', 'run': 1, 'user': 'james', 'value': '30', 'error': '8'}]
[{'timestamp': '2022-12-15 12:47:36', 'run': 2, 'user': 'kelly', 'value': '15', 'error': '3'}]
If you just want a pretty tabular output, littletable is much lighter weight than pandas (and can also do other simple tabular functions, such as CSV import/export):
import littletable as lt
data_table = lt.Table().insert_many(rec[0] for rec in record.search_string(sample))
data_table.present()
littletable uses the rich package for table presentation.
Timestamp             Run   User    Value   Error
───────────────────────────────────────────────────
2022-12-15 12:45:06     1   james   30      8
2022-12-15 12:47:36     2   kelly   15      3