Parsing BibTeX citation format with Python
Question:
What is the best way in Python to parse these results? I have tried regex but can’t get it to work. I am looking for a dictionary with title, author, etc. as keys.
@article{perry2000epidemiological,
title={An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},
author={Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},
journal={Journal of public health},
volume={22},
number={3},
pages={427--434},
year={2000},
publisher={Oxford University Press}
}
Answers:
You can use regex:
import re
s = """
@article{perry2000epidemiological,
title={An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},
author={Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},
journal={Journal of public health},
volume={22},
number={3},
pages={427--434},
year={2000},
publisher={Oxford University Press}
}
"""
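# \s (whitespace) has to be part of the value character class so that multi-word values match
results = re.findall(r'(?<=@article{)[a-zA-Z0-9]+|(?<=={)[a-zA-Z0-9:\s,]+|[a-zA-Z]+(?==)|@[a-zA-Z0-9]+', s)
# results alternates key, value; purely numeric values are converted to int.
# Note: "-" is not in the value character class, so pages is captured as 427 rather than 427--434.
final_results = {results[i][1:] if results[i].startswith('@') else results[i]: int(results[i+1]) if results[i+1].isdigit() else results[i+1] for i in range(0, len(results), 2)}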
Output:
{'publisher': 'Oxford University Press', 'author': 'Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others', 'journal': 'Journal of public health', 'title': 'An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study', 'number': 3, 'volume': 22, 'year': 2000, 'article': 'perry2000epidemiological', 'pages': 427}
You might be looking for re.split:
import re
article_dict = {}
with open('inp.txt') as f:
    # skip the first line (@article{...) and the last line (closing brace)
    for line in f.readlines()[1:-1]:
        info = re.split(r'=', line.strip())
        article_dict[info[0]] = info[1]
This gives the dictionary below. I’m assuming you will still need to get rid of the braces and the trailing commas, which is a simple matter of replacing or slicing.
{'title': '{An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},',
'author': '{Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},',
'journal': '{Journal of public health},',
'volume': '{22},',
'number': '{3},',
'pages': '{427--434},',
'year': '{2000},',
'publisher': '{Oxford University Press}'}
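The cleanup mentioned above could look something like this. It is only a minimal sketch: it assumes every value has the shape {value} with an optional trailing comma, and clean_field is a hypothetical helper name, not part of any library.

```python
def clean_field(raw):
    """Strip a trailing comma and one pair of outer braces from a raw field value."""
    value = raw.strip().rstrip(',').strip()
    # Remove only the outermost pair of braces, if present
    if len(value) >= 2 and value[0] == '{' and value[-1] == '}':
        value = value[1:-1]
    return value

# A few raw values in the shape produced by the re.split approach
raw_dict = {
    'volume': '{22},',
    'pages': '{427--434},',
    'publisher': '{Oxford University Press}',
}
cleaned = {key.strip(): clean_field(value) for key, value in raw_dict.items()}
print(cleaned)
```

Slicing off only the outer pair keeps any braces that appear inside the value itself.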
This looks like a citation format. You could parse it like this:
>>> import re
>>> kv = re.compile(r'\b(?P<key>\w+)={(?P<value>[^}]+)}')
>>> citation = """
... @article{perry2000epidemiological,
... title={An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},
... author={Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},
... journal={Journal of public health},
... volume={22},
... number={3},
... pages={427--434},
... year={2000},
... publisher={Oxford University Press}
... }
... """
>>> dict(kv.findall(citation))
{'author': 'Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others',
'journal': 'Journal of public health',
'number': '3',
'pages': '427--434',
'publisher': 'Oxford University Press',
'title': 'An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study',
'volume': '22',
'year': '2000'}
The regex uses two named capturing groups (mainly to make clear what’s what).
- “key” is one or more Unicode word characters (\w+), with a word boundary (\b) on the left and a literal equals sign to its right;
- “value” is whatever sits between two curly brackets. [^}] is convenient here as long as you don’t expect “nested” curly brackets. In other words, a value is one or more characters other than a closing curly bracket, enclosed in curly brackets.
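If nested curly brackets do occur (for example title={A study of {BibTeX} parsing}), one option is to count brace depth instead of relying on [^}]. The following is a rough sketch rather than a full BibTeX parser: parse_fields and FIELD_START are hypothetical names, and it assumes every value is wrapped in braces and that escaped braces such as \{ do not appear.

```python
import re

# Start of a field: a key, an equals sign, and the opening brace of the value
FIELD_START = re.compile(r'\b(\w+)\s*=\s*\{')

def parse_fields(entry):
    """Return {key: value} for each key={...} field, keeping nested braces intact."""
    fields = {}
    pos = 0
    while True:
        match = FIELD_START.search(entry, pos)
        if match is None:
            break
        depth = 1
        i = match.end()  # first character after the opening brace
        # Walk forward until the brace that closes this value
        while i < len(entry) and depth > 0:
            if entry[i] == '{':
                depth += 1
            elif entry[i] == '}':
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: stop rather than guess
        fields[match.group(1)] = entry[match.end():i - 1]
        pos = i  # continue searching after the closing brace
    return fields

entry = """@article{demo2000,
title={A study of {BibTeX} parsing},
year={2000}
}"""
print(parse_fields(entry))
```

Because the search resumes after each closing brace, text that looks like key={...} inside a value is not mistaken for a separate field.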
You might be looking for a BibTeX-parser: https://bibtexparser.readthedocs.io/en/master/
Source: https://bibtexparser.readthedocs.io/en/master/tutorial.html#step-0-vocabulary
Input/Create bibtex file:
bibtex = """@ARTICLE{Cesar2013,
author = {Jean César},
title = {An amazing title},
year = {2013},
month = jan,
volume = {12},
pages = {12--23},
journal = {Nice Journal},
abstract = {This is an abstract. This line should be long enough to test
multilines...},
comments = {A comment},
keywords = {keyword1, keyword2}
}
"""
with open('bibtex.bib', 'w') as bibfile:
    bibfile.write(bibtex)
Parse it:
import bibtexparser

with open('bibtex.bib') as bibtex_file:
    bib_database = bibtexparser.load(bibtex_file)

print(bib_database.entries)
Output:
[{'journal': 'Nice Journal',
'comments': 'A comment',
'pages': '12--23',
'month': 'jan',
'abstract': 'This is an abstract. This line should be long enough to test\nmultilines...',
'title': 'An amazing title',
'year': '2013',
'volume': '12',
'ID': 'Cesar2013',
'author': 'Jean César',
'keyword': 'keyword1, keyword2',
'ENTRYTYPE': 'article'}]
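Since the result is a list of plain dictionaries, it can be re-keyed by citation ID with ordinary Python. A small sketch, using an abridged copy of the output above as a literal so that it runs without the library installed:

```python
# Abridged copy of the entries list shown above
entries = [
    {'ID': 'Cesar2013', 'ENTRYTYPE': 'article',
     'title': 'An amazing title', 'year': '2013'},
]

# Index the entries by their citation key for direct lookup
by_id = {entry['ID']: entry for entry in entries}
print(by_id['Cesar2013']['title'])
```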
Since I had some problems with the other solutions (and I didn’t want to install new libraries), here is my attempt.
Note that this method assumes that all bibliography records are in the format:
@record_type{ record_id,
key1 = {value1},
key2 = {value2},
key3 = ...
}
This is typically the case for all fields except month, where the braces are often missing; I added a special case for it.
import re

# load bib file
with open('bib.bib', 'r') as bibfile:
    content = bibfile.read()

bib_lookup = {}
# split at @
for s in content.split("@"):
    # Note: add other record types if necessary
    for match_word in ['article', 'techreport', 'misc', 'book']:
        if match_word in s:
            # get record id from first line after "@" ending with ","
            article_id = re.findall(match_word + '{(.*?),', s)
            if article_id:
                # fix month formatting
                if "month" in s:
                    # match only a month value that has no curly braces around it
                    m = re.findall(r',\n\s*month = ([^{},]+),', s)
                    # replace only when curly braces are missing around month
                    if m:
                        s = s.replace(f"month = {m[0]},", f"month = {{{m[0]}}},")
                # regex for keys (each key starts on a new line after a comma)
                results1 = [r.strip() for r in re.findall(r',\n\s*(.*?)=', s)]
                # regex for values (a value without a trailing comma is not matched)
                results2 = [r.strip() for r in re.findall('{(.*?)},', s)]
                res = dict(zip(results1, results2))
                bib_lookup[article_id[0]] = res
            else:
                print("Warning: unable to parse record")
                print(s)
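The month edge case described above can also be isolated into a single substitution. This is only a sketch of that one step: BARE_MONTH and brace_bare_month are hypothetical names, and it assumes the unbraced month is a plain word such as jan.

```python
import re

# Matches `month = jan` followed by a comma or closing brace,
# but not `month = {jan}` (the value must start with a letter)
BARE_MONTH = re.compile(r'(\bmonth\s*=\s*)([A-Za-z]+)(?=\s*[,}])')

def brace_bare_month(record):
    """Wrap an unbraced month value in curly braces; leave braced values untouched."""
    return BARE_MONTH.sub(r'\1{\2}', record)

record = "@article{Cesar2013,\n  year = {2013},\n  month = jan,\n  volume = {12}\n}"
print(brace_bare_month(record))
```

Running the substitution before extracting keys and values means every field then follows the key = {value} shape the rest of the code expects.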