Parsing BibTeX citation format with Python

Question:

What is the best way in Python to parse these results? I have tried regex but can’t get it to work. I am looking for a dictionary with title, author, etc. as keys.

@article{perry2000epidemiological,
  title={An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},
  author={Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},
  journal={Journal of public health},
  volume={22},
  number={3},
  pages={427--434},
  year={2000},
  publisher={Oxford University Press}
}
Asked By: gmoorevt

Answers:

You can use regex:

import re

s = """
  @article{perry2000epidemiological,
  title={An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},
  author={Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},
  journal={Journal of public health},
  volume={22},
  number={3},
  pages={427--434},
  year={2000},
  publisher={Oxford University Press}
}
"""
# Note: '-' is not in the value character class, so 'pages' is cut off at 427.
results = re.findall(r'(?<=@article{)[a-zA-Z0-9]+|(?<=={)[a-zA-Z0-9:\s,]+|[a-zA-Z]+(?==)|@[a-zA-Z0-9]+', s)
final_results = {results[i][1:] if results[i].startswith('@') else results[i]:int(results[i+1]) if results[i+1].isdigit() else results[i+1] for i in range(0, len(results), 2)}

Output:

{'publisher': 'Oxford University Press', 'author': 'Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others', 'journal': 'Journal of public health', 'title': 'An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study', 'number': 3, 'volume': 22, 'year': 2000, 'article': 'perry2000epidemiological', 'pages': 427}
Answered By: Ajax1234

You might be looking for re.split:

import re
article_dict = {}
with open('inp.txt') as f:
    for line in f.readlines()[1:-1]:
        info = re.split(r'=', line.strip(), maxsplit=1)
        article_dict[info[0]] = info[1]

I’m assuming you will need to get rid of the braces and commas at the end, which is just a simple task of replacing or slicing.

{'title': '{An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},',
 'author': '{Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},', 
 'journal': '{Journal of public health},', 
 'volume': '{22},', 
 'number': '{3},', 
 'pages': '{427--434},', 
 'year': '{2000},', 
 'publisher': '{Oxford University Press}'}
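
The cleanup mentioned above could look like the following minimal sketch. It assumes each field sits on its own line in the form key={value}, and the helper name clean_field is hypothetical, not part of the original answer.

```python
def clean_field(line):
    """Split one 'key={value},' line into a (key, value) pair."""
    key, _, raw = line.strip().partition('=')
    # Drop the trailing comma first, then one surrounding pair of braces.
    value = raw.strip().rstrip(',').strip()
    if value.startswith('{') and value.endswith('}'):
        value = value[1:-1]
    return key.strip(), value


lines = [
    '  journal={Journal of public health},',
    '  pages={427--434},',
    '  publisher={Oxford University Press}',
]
article_dict = dict(clean_field(line) for line in lines)
print(article_dict)
```

Using partition rather than split keeps any later '=' characters inside the value intact.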
Answered By: adapap

This looks like a citation format. You could parse it like this:

>>> import re

>>> kv = re.compile(r'\b(?P<key>\w+)={(?P<value>[^}]+)}')

>>> citation = """
... @article{perry2000epidemiological,
...   title={An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence
...  Study},
...   author={Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and
...  Smith, Nigel and Clarke, Michael and Jagger, Carol and others},
...   journal={Journal of public health},
...   volume={22},
...   number={3},
...   pages={427--434},
...   year={2000},
...   publisher={Oxford University Press}
... }
... """

>>> dict(kv.findall(citation))
{'author': 'Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others',
 'journal': 'Journal of public health',
 'number': '3',
 'pages': '427--434',
 'publisher': 'Oxford University Press',
 'title': 'An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study',
 'volume': '22',
 'year': '2000'}

The regex uses two named capturing groups (mainly just to visually denote what’s what).

  • “key” is one or more Unicode word characters, with a word boundary on the left and a literal equals sign to its right;
  • “value” is something inside two curly brackets. You can use [^}] conveniently as long as you don’t expect to have “nested” curly brackets. In other words, the values are just one or more of any characters that aren’t curly brackets, inside of curly brackets.
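
If values can contain nested braces (for example title={The {MRC} Incontinence Study}), [^}]+ stops at the first closing brace. One way around that is to count brace depth by hand. The sketch below is illustrative only: parse_fields and FIELD_START are hypothetical names, and it assumes every value is wrapped in braces.

```python
import re

# A field starts with a word followed by '=' and an opening brace.
FIELD_START = re.compile(r'\b(\w+)\s*=\s*\{')


def parse_fields(entry):
    """Return {key: value} for brace-delimited fields, allowing nested braces."""
    fields = {}
    pos = 0
    while True:
        match = FIELD_START.search(entry, pos)
        if match is None:
            break
        depth = 1
        i = match.end()
        # Walk forward until the brace opened by the field is closed again.
        while i < len(entry) and depth > 0:
            if entry[i] == '{':
                depth += 1
            elif entry[i] == '}':
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: stop rather than guess
        fields[match.group(1)] = entry[match.end():i - 1]
        pos = i
    return fields


sample = """@article{key2000,
  title={The {MRC} Incontinence Study},
  year={2000}
}"""
print(parse_fields(sample))
```

The inner braces are kept in the value as written; stripping them is a separate decision, since BibTeX uses them to protect capitalization.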
Answered By: Brad Solomon

You might be looking for a BibTeX-parser: https://bibtexparser.readthedocs.io/en/master/

Source: https://bibtexparser.readthedocs.io/en/master/tutorial.html#step-0-vocabulary

Input/Create bibtex file:

bibtex = """@ARTICLE{Cesar2013,
  author = {Jean César},
  title = {An amazing title},
  year = {2013},
  month = jan,
  volume = {12},
  pages = {12--23},
  journal = {Nice Journal},
  abstract = {This is an abstract. This line should be long enough to test
     multilines...},
  comments = {A comment},
  keywords = {keyword1, keyword2}
}
"""

with open('bibtex.bib', 'w') as bibfile:
    bibfile.write(bibtex)

Parse it:

import bibtexparser

with open('bibtex.bib') as bibtex_file:
    bib_database = bibtexparser.load(bibtex_file)

print(bib_database.entries)

Output:

[{'journal': 'Nice Journal',
  'comments': 'A comment',
  'pages': '12--23',
  'month': 'jan',
  'abstract': 'This is an abstract. This line should be long enough to test\nmultilines...',
  'title': 'An amazing title',
  'year': '2013',
  'volume': '12',
  'ID': 'Cesar2013',
  'author': 'Jean César',
  'keyword': 'keyword1, keyword2',
  'ENTRYTYPE': 'article'}]
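
To get from that list to the kind of lookup the question asks for, the entries can be re-keyed by their citation ID. This is a plain-Python sketch that does not call bibtexparser itself: the entries literal is a shortened, hand-copied stand-in for bib_database.entries, and by_id is a hypothetical name.

```python
# Stand-in for bib_database.entries (shortened copy of the output above).
entries = [
    {
        'ID': 'Cesar2013',
        'ENTRYTYPE': 'article',
        'author': 'Jean César',
        'title': 'An amazing title',
        'year': '2013',
    }
]

# Re-key the list by citation ID so a record can be looked up directly.
by_id = {entry['ID']: entry for entry in entries}

# BibTeX separates multiple authors with the word "and".
authors = by_id['Cesar2013']['author'].split(' and ')

print(by_id['Cesar2013']['title'])
print(authors)
```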
Answered By: Patrick Artner

Since I had some problems with the other solutions (and I didn’t want to install new libraries), here is my attempt.

Note that this method assumes that all bibliography records are in the format:

@record_type{ record_id,
 key1 = {value1},
 key2 = {value2},
 key3 = ...
}

This is typically the case for all fields with the exception of the month field, where braces are often missing and for which I added a special case.

import re
# load bib file
with open('bib.bib','r') as bibfile:
    content = bibfile.read() 

bib_lookup = {}
# split at @
for s in content.split("@"):
    # Note: add other record types if necessary
    for match_word in ['article','techreport','misc','book']:
        if match_word in s:
            # get record id from first line after "@" ending with ","
            article_id = re.findall(match_word+'{(.*?),', s)
            if article_id:
                # fix month formatting 
                if "month" in s:
                    m = re.findall(r',\n  month = (.*?),', s)
                    # replace only when curly braces are missing around month
                    if m and not m[0].startswith('{'):
                        s = s.replace(f"month = {m[0]},", f"month = {{{m[0]}}},")

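                # regex for keys (each field on its own line, indented by at least two spaces)
                results1 = [r.strip() for r in re.findall(r',\n  (.*?)=', s)]
                # regex for values (only values followed by "}," are captured,
                # so a final field without a trailing comma is skipped)
                results2 = [r.strip() for r in re.findall(r'{(.*?)},', s)]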
                res = dict(zip(results1,results2))            
                bib_lookup[article_id[0]] = res
            else:
                print("Warning: unable to parse record")
                print(s)

Answered By: gibbone