How to split up git log output into a list of commits in Python?
Question:
Given git log output like this:
commit 19e0f017ac832238f5a800dd3ea7a5966b3c1343 (HEAD -> master, origin/master, origin/HEAD)
Author: Slim Shady
Date: Sun Sep 18 19:53:42 2022 -0700

    ci: remove debugging line github action script

    commit body

commit ef82c672d21d70c43f0454b0b4d6fa22ef4ad0a9 (fix_release_action)
Author: Slim Shady
Date: Sun Sep 18 19:41:20 2022 -0700

    feat: read and write IDs

commit 8ee8fcbebcab76a2fbf0ee096a0d216e51fe2874
Author: Slim Shady
Date: Sun Sep 18 17:41:03 2022 -0700

    feat: new hook to allow custom tags
I’d like to turn that into a list in Python, with each element containing a single commit (including hash, author, body, etc.).
I’ve tried using re.split(r"commit \w{40}", git_log), but it doesn’t keep the hash in the output.
Answers:
You need to put the split pattern in a capture group so that the matched text is kept in the output:
>>> import re
# filter(None, ...) removes empty strings
>>> res = filter(None, re.split(r'(commit \w{40})', inp))
# Join the items in groups of two to recombine each commit line with the rest of its text
>>> output = ["".join(item) for item in zip(*[res] * 2)]
>>> output
[
'commit 19e0f017ac832238f5a800dd3ea7a5966b3c1343 (HEAD -> master, origin/master, origin/HEAD)\nAuthor: Slim Shady\nDate: Sun Sep 18 19:53:42 2022 -0700\n\n    ci: remove debugging line github action script\n\n    commit body\n\n',
'commit ef82c672d21d70c43f0454b0b4d6fa22ef4ad0a9 (fix_release_action)\nAuthor: Slim Shady\nDate: Sun Sep 18 19:41:20 2022 -0700\n\n    feat: read and write IDs\n\n',
'commit 8ee8fcbebcab76a2fbf0ee096a0d216e51fe2874\nAuthor: Slim Shady\nDate: Sun Sep 18 17:41:03 2022 -0700\n\n    feat: new hook to allow custom tags'
]
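For reference, the capture-group split can also be written so that the pairing does not depend on filtering out empty strings. The following is only a sketch: the name split_commits and the SAMPLE_LOG string (retyped from the question) are illustrative assumptions, not part of the original answer. It relies on the fact that re.split with a single capture group returns the text before the first match, then alternating matches and the text that follows each one.

```python
import re

# Sample input retyped from the question (illustrative only).
SAMPLE_LOG = (
    "commit 19e0f017ac832238f5a800dd3ea7a5966b3c1343 (HEAD -> master, origin/master, origin/HEAD)\n"
    "Author: Slim Shady\n"
    "Date: Sun Sep 18 19:53:42 2022 -0700\n"
    "\n"
    "    ci: remove debugging line github action script\n"
    "\n"
    "    commit body\n"
    "\n"
    "commit ef82c672d21d70c43f0454b0b4d6fa22ef4ad0a9 (fix_release_action)\n"
    "Author: Slim Shady\n"
    "Date: Sun Sep 18 19:41:20 2022 -0700\n"
    "\n"
    "    feat: read and write IDs\n"
    "\n"
    "commit 8ee8fcbebcab76a2fbf0ee096a0d216e51fe2874\n"
    "Author: Slim Shady\n"
    "Date: Sun Sep 18 17:41:03 2022 -0700\n"
    "\n"
    "    feat: new hook to allow custom tags\n"
)


def split_commits(log_text):
    """Split git log text into one string per commit (hypothetical helper)."""
    # With one capture group, re.split returns:
    # [text before first match, match 1, text after match 1, match 2, ...]
    parts = re.split(r"(commit \w{40})", log_text)
    headers = parts[1::2]  # the "commit <hash>" matches
    bodies = parts[2::2]   # everything following each match
    return [header + body for header, body in zip(headers, bodies)]


commits = split_commits(SAMPLE_LOG)
```

Slicing by position keeps each hash attached to its own text even if one of the pieces happens to be an empty string, which the filter(None, ...) approach would silently drop.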
But if you control how git log is invoked, you could format its output differently and parse it without regex:
git log --pretty=format:'"%H"%x09"%an"%x09"%ad"%x09"%B"' > output.csv
Then:
>>> import csv
>>> with open("output.csv", newline="") as f:
...     items = list(csv.reader(f, delimiter='\t'))
...
>>> items[0]
["19e0f017ac832238f5a800dd3ea7a5966b3c1343", "Slim Shady", "Sun Sep 18 19:53:42 2022 -0700", "ci: remove debugging line github action script"]
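To illustrate how the csv module copes with multi-line commit messages in this format, here is a small sketch that parses an in-memory string instead of a file. SAMPLE_TSV is hand-written sample data imitating the format above, and FIELDS and parse_rows are hypothetical names chosen for this example.

```python
import csv
import io

# Hand-written sample imitating the tab-separated, double-quoted format
# (not real output captured from git).
SAMPLE_TSV = (
    '"19e0f017ac832238f5a800dd3ea7a5966b3c1343"\t"Slim Shady"\t'
    '"Sun Sep 18 19:53:42 2022 -0700"\t'
    '"ci: remove debugging line github action script\n\ncommit body"\n'
    '"ef82c672d21d70c43f0454b0b4d6fa22ef4ad0a9"\t"Slim Shady"\t'
    '"Sun Sep 18 19:41:20 2022 -0700"\t"feat: read and write IDs"\n'
)

# Column order matches %H, %an, %ad, %B in the format string.
FIELDS = ("hash", "author", "date", "message")


def parse_rows(text):
    """Parse tab-separated, quoted rows into dicts (hypothetical helper)."""
    # newline="" leaves line endings untouched so the csv module can
    # handle newlines embedded in quoted fields itself.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    return [dict(zip(FIELDS, row)) for row in reader]


commits = parse_rows(SAMPLE_TSV)
```

Because each field is wrapped in double quotes, a message spanning several lines stays in one field. One caveat worth checking on your own data: git does not escape double quotes inside a message, so a commit message that contains a " character may not parse cleanly with this quoting.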
Another option is to use a library such as GitPython (https://gitpython.readthedocs.io/en/stable/), which gives you access to commits as Python objects.
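You could also split on a positive lookahead, which matches at the start of each commit without consuming any text (splitting on an empty match requires Python 3.7 or later).
import re

with open('git_log.txt', 'r') as f:
    data = f.read()
res = list(filter(None, re.split(r"(?=commit \w{40})", data)))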
Output:
[
'commit 19e0f017ac832238f5a800dd3ea7a5966b3c1343 (HEAD -> master, origin/master, origin/HEAD)\nAuthor: Slim Shady\nDate: Sun Sep 18 19:53:42 2022 -0700\n\n    ci: remove debugging line github action script\n\n    commit body\n\n',
'commit ef82c672d21d70c43f0454b0b4d6fa22ef4ad0a9 (fix_release_action)\nAuthor: Slim Shady\nDate: Sun Sep 18 19:41:20 2022 -0700\n\n    feat: read and write IDs\n\n',
'commit 8ee8fcbebcab76a2fbf0ee096a0d216e51fe2874\nAuthor: Slim Shady\nDate: Sun Sep 18 17:41:03 2022 -0700\n\n    feat: new hook to allow custom tags'
]
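One thing to watch with either regex: a commit message that itself mentions "commit <hash>" would also trigger a split. A possible variation, sketched below, anchors the lookahead to the start of a line and restricts the hash to hex digits; since git log indents message lines, such mentions no longer match. The names COMMIT_START, split_log and log_sample are assumptions for this example, and the revert message in the sample is made up to show the effect.

```python
import re

# Zero-width match at the start of any line that begins a commit header.
COMMIT_START = re.compile(r"^(?=commit [0-9a-f]{40}\b)", re.MULTILINE)

# Made-up sample: the first message mentions another commit's hash.
log_sample = (
    "commit ef82c672d21d70c43f0454b0b4d6fa22ef4ad0a9\n"
    "Author: Slim Shady\n"
    "Date: Sun Sep 18 19:41:20 2022 -0700\n"
    "\n"
    "    revert: undo commit 8ee8fcbebcab76a2fbf0ee096a0d216e51fe2874\n"
    "\n"
    "commit 8ee8fcbebcab76a2fbf0ee096a0d216e51fe2874\n"
    "Author: Slim Shady\n"
    "Date: Sun Sep 18 17:41:03 2022 -0700\n"
    "\n"
    "    feat: new hook to allow custom tags\n"
)


def split_log(log_text):
    """Split on line-anchored commit headers (hypothetical helper)."""
    # Splitting on an empty match requires Python 3.7+; the leading
    # empty string produced by the match at position 0 is dropped.
    return [chunk for chunk in COMMIT_START.split(log_text) if chunk]


anchored = split_log(log_sample)
# For comparison: the unanchored pattern also splits inside the message.
unanchored = [c for c in re.split(r"(?=commit \w{40})", log_sample) if c]
```

On this sample the anchored version yields two commits, while the unanchored pattern cuts the first commit in two at the hash mentioned in its message.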