How to split up git log output into a list of commits in python?

Question:

Given git log output like this:

commit 19e0f017ac832238f5a800dd3ea7a5966b3c1343 (HEAD -> master, origin/master, origin/HEAD)
Author: Slim Shady
Date:   Sun Sep 18 19:53:42 2022 -0700

    ci: remove debugging line github action script

    commit body

commit ef82c672d21d70c43f0454b0b4d6fa22ef4ad0a9 (fix_release_action)
Author: Slim Shady
Date:   Sun Sep 18 19:41:20 2022 -0700

    feat: read and write IDs

commit 8ee8fcbebcab76a2fbf0ee096a0d216e51fe2874
Author: Slim Shady
Date:   Sun Sep 18 17:41:03 2022 -0700

    feat: new hook to allow custom tags

I’d like to turn that into a Python list, with each element containing a single commit (including hash, author, body, etc.).

I’ve tried using re.split(r"commit \w{40}", git_log), but it doesn’t keep the hash in the output.

Asked By: Jacob Pavlock


Answers:

You need to put the split pattern in a capture group so that the matched text is kept in the output:

>>> import re
# filter(None, ...) removes empty strings
>>> res = filter(None, re.split(r'(commit \w{40})', inp))
# Join items in groups of two to recombine each commit line with the rest of its body
>>> output = ["".join(item) for item in zip(*[res] * 2)]
>>> output
[
    'commit 19e0f017ac832238f5a800dd3ea7a5966b3c1343 (HEAD -> master, origin/master, origin/HEAD)\nAuthor: Slim Shady\nDate:   Sun Sep 18 19:53:42 2022 -0700\n\n    ci: remove debugging line github action script\n\n    commit body\n\n',
    'commit ef82c672d21d70c43f0454b0b4d6fa22ef4ad0a9 (fix_release_action)\nAuthor: Slim Shady\nDate:   Sun Sep 18 19:41:20 2022 -0700\n\n    feat: read and write IDs\n\n',
    'commit 8ee8fcbebcab76a2fbf0ee096a0d216e51fe2874\nAuthor: Slim Shady\nDate:   Sun Sep 18 17:41:03 2022 -0700\n\n    feat: new hook to allow custom tags'
]
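
The zip(*[res] * 2) idiom above is compact but easy to misread. As a rough sketch of the same pairing step, the list returned by re.split can be sliced explicitly. The variable names (git_log, parts, headers, rests, commits) are illustrative only, and the sample text is the log from the question with the decorations trimmed.

```python
import re

# Sample input: the log from the question (ref decorations omitted for brevity).
git_log = (
    "commit 19e0f017ac832238f5a800dd3ea7a5966b3c1343\n"
    "Author: Slim Shady\n"
    "Date:   Sun Sep 18 19:53:42 2022 -0700\n"
    "\n"
    "    ci: remove debugging line github action script\n"
    "\n"
    "    commit body\n"
    "\n"
    "commit ef82c672d21d70c43f0454b0b4d6fa22ef4ad0a9\n"
    "Author: Slim Shady\n"
    "Date:   Sun Sep 18 19:41:20 2022 -0700\n"
    "\n"
    "    feat: read and write IDs\n"
    "\n"
    "commit 8ee8fcbebcab76a2fbf0ee096a0d216e51fe2874\n"
    "Author: Slim Shady\n"
    "Date:   Sun Sep 18 17:41:03 2022 -0700\n"
    "\n"
    "    feat: new hook to allow custom tags\n"
)

# With a capture group, re.split keeps the matched text in the result:
# [text before first match, match 1, text after match 1, match 2, ...]
parts = re.split(r"(commit \w{40})", git_log)

headers = parts[1::2]  # the "commit <hash>" matches
rests = parts[2::2]    # everything between one match and the next

# Glue each header back onto the text that followed it.
commits = [header + rest for header, rest in zip(headers, rests)]
```

Here parts[0] is the (empty) text before the first match, which is why the slices start at index 1 and 2 instead of relying on filter(None, ...).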

But if you do have control over the git log output, you could format it differently and parse it without regex:

git log --pretty=format:'"%H"%x09"%an"%x09"%ad"%x09"%B"' > output.csv

Then:

>>> import csv
>>> with open("output.csv") as f:
...     items = list(csv.reader(f, delimiter='\t'))
...
>>> items[0]
["19e0f017ac832238f5a800dd3ea7a5966b3c1343", "Slim Shady", "Sun Sep 18 19:53:42 2022 -0700", "ci: remove debugging line github action script"]
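
Because %B is the full commit message, the last field can span several lines. The sketch below suggests how csv.reader copes with that, since quoted fields may contain newlines. The sample string is hand-written to approximate what the format string above would produce (it is not captured git output), and it assumes commit messages contain no double quotes or tabs, which would confuse this simple quoting scheme.

```python
import csv
import io

# Hand-written approximation of two records from the tab-separated format.
sample = (
    '"19e0f017ac832238f5a800dd3ea7a5966b3c1343"\t"Slim Shady"\t'
    '"Sun Sep 18 19:53:42 2022 -0700"\t'
    '"ci: remove debugging line github action script\n\ncommit body\n"\n'
    '"ef82c672d21d70c43f0454b0b4d6fa22ef4ad0a9"\t"Slim Shady"\t'
    '"Sun Sep 18 19:41:20 2022 -0700"\t'
    '"feat: read and write IDs\n"\n'
)

# io.StringIO stands in for the opened output.csv file.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

for commit_hash, author, date, message in rows:
    subject = message.splitlines()[0]  # first line of the commit message
    print(commit_hash[:8], author, subject)
```

When reading a real file instead, opening it with newline="" (as the csv module documentation recommends) keeps embedded newlines intact.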

Another option is to use a library such as GitPython (https://gitpython.readthedocs.io/en/stable/), which gives you the commits as Python objects that are easy to work with.

Answered By: Ashwini Chaudhary

You could also use a positive lookahead to split your data.

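import re

with open('git_log.txt', 'r') as f:
    data = f.read()
# The lookahead matches the empty position just before each "commit <hash>",
# so nothing is consumed and every chunk keeps its commit line.
res = list(filter(None, re.split(r"(?=commit \w{40})", data)))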

Output:

[
    'commit 19e0f017ac832238f5a800dd3ea7a5966b3c1343 (HEAD -> master, origin/master, origin/HEAD)\nAuthor: Slim Shady\nDate:   Sun Sep 18 19:53:42 2022 -0700\n\n    ci: remove debugging line github action script\n\n    commit body\n\n',
    'commit ef82c672d21d70c43f0454b0b4d6fa22ef4ad0a9 (fix_release_action)\nAuthor: Slim Shady\nDate:   Sun Sep 18 19:41:20 2022 -0700\n\n    feat: read and write IDs\n\n',
    'commit 8ee8fcbebcab76a2fbf0ee096a0d216e51fe2874\nAuthor: Slim Shady\nDate:   Sun Sep 18 17:41:03 2022 -0700\n\n    feat: new hook to allow custom tags'
]
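
One caveat with an unanchored lookahead: a commit message can itself contain "commit <hash>", for example the "This reverts commit ..." line that git revert writes. A possible refinement, sketched below, is to anchor the pattern to the start of a line with re.MULTILINE, since message lines in the default git log output are indented. The names COMMIT_START and split_commits and the sample log are made up for illustration, and splitting on an empty match needs Python 3.7 or later.

```python
import re

# Zero-width match at the start of any line that begins with "commit <40 hex digits>".
COMMIT_START = re.compile(r"^(?=commit [0-9a-f]{40}\b)", re.MULTILINE)


def split_commits(log_text):
    """Split default-format git log text into one string per commit."""
    return [chunk for chunk in COMMIT_START.split(log_text) if chunk]


# Hypothetical sample: the first commit's message mentions another commit hash.
sample = (
    "commit 19e0f017ac832238f5a800dd3ea7a5966b3c1343\n"
    "Author: Slim Shady\n"
    "Date:   Sun Sep 18 19:53:42 2022 -0700\n"
    "\n"
    '    Revert "feat: read and write IDs"\n'
    "\n"
    "    This reverts commit ef82c672d21d70c43f0454b0b4d6fa22ef4ad0a9.\n"
    "\n"
    "commit ef82c672d21d70c43f0454b0b4d6fa22ef4ad0a9\n"
    "Author: Slim Shady\n"
    "Date:   Sun Sep 18 19:41:20 2022 -0700\n"
    "\n"
    "    feat: read and write IDs\n"
)

anchored = split_commits(sample)

# For comparison: the unanchored pattern also splits inside the revert message.
unanchored = [c for c in re.split(r"(?=commit \w{40})", sample) if c]

print(len(anchored), len(unanchored))
```

On this sample the anchored version yields two commits, while the unanchored one cuts the first commit in half at the revert line.
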
Answered By: Rabinzel