Python regex to extract most recent digit preceding a keyword

Question:

I have a list of references in text as shown below, where the text in bold is what I want to extract using re.findall().

’10. T. BESLEY, POLITICAL SELECTION. J. ECON. PERSPECT. 19, 43–60 (2005). 11. J. D. FEARON, CAMBRIDGE STUDIES IN THE THEORY OF DEMOCRACY, IN DEMOCRACY, ACCOUNTABILITY, AND REPRESENTATION, A. PRZEWORSKI, B. MANIN, S. C. STOKES, EDS. (CAMBRIDGE UNIV. PRESS, 1999), PP. 55–97. 12. B. B. DE MESQUITA, A. SMITH, THE DICTATOR’S HANDBOOK: WHY BAD BEHAVIOR IS ALMOST ALWAYS GOOD POLITICS (HACHETTE UK, 2011). 13. S. WONG, S. E. GUGGENHEIM, “COMMUNITY-DRIVEN DEVELOPMENT: MYTHS AND REALITIES” (WPS8435, THE WORLD BANK, 2018), PP. 1–36. 14. A. BEATH, F. CHRISTIA, R. ENIKOLOPOV, DIRECT DEMOCRACY AND RESOURCE ALLOCATION: EXPERIMENTAL EVIDENCE FROM AFGHANISTAN. J. DEV. ECON. 124, 199–213 (2017). 15. B. A. OLKEN, DIRECT DEMOCRACY AND LOCAL PUBLIC GOODS: EVIDENCE FROM A FIELD EXPERIMENT IN INDONESIA. AM. POLIT. SCI. REV. 104, 243–267 (2010). 16. A. BLAKE, M. J. GILLIGAN, INTERNATIONAL INTERVENTIONS TO BUILD SOCIAL CAPITAL: EVIDENCE FROM A FIELD EXPERIMENT IN SUDAN. AM. POLIT. SCI. REV. 109, 427–449 (2015)’

Essentially, I would like to grab the reference number (here, 16) followed by the citation in interest up to the citation’s published year (here, 2015). Because I have the first author’s last name in a list, I can use ‘BLAKE’ as a keyword, but everything else needs to be matched using regex.

So far I’ve tried:

re.findall(r'\d+?.*?BLAKE.*?\d{4}', refer, re.DOTALL)

But this grabs everything above, since \d+ matches ’10.’, not ’16.’. I thought .*? would minimize the string match between the digit and BLAKE, but it does not. An alternative option is to give a range instead of .*, like re.findall(r'\d+?.{0,5}BLAKE.*?\d{4}', refer, re.DOTALL), but I’m doing this for many other texts and I cannot know in advance how much text there will be between the reference number and the first author’s last name.

Is there a way to grab the most recent digit (here, 16) preceding a keyword (BLAKE) here? Or a way to minimize the search between digit and a keyword?

Asked By: Les D


Answers:

You can use this regex to match the digits in front of the author name that you are looking for:

\d+(?=\.\s*(?:(?:[A-Z]\.\s*)+[A-Z]+,\s*)*(?:[A-Z]\.\s*)+BLAKE)

It looks for

  • digits (\d+) that are followed by
  • a full stop and optional whitespace (\.\s*)
  • 0 or more author names, each of the form Initial. (Initial.)* Name, ((?:(?:[A-Z]\.\s*)+[A-Z]+,\s*)*)
  • 1 or more initials before the matching author name ((?:[A-Z]\.\s*)+BLAKE)
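
The breakdown above can also be written with re.VERBOSE, so that each part of the lookahead carries its own comment. This is a minimal sketch, assuming the backslashes shown in the bullets; the helper name number_before and the shortened sample (references 15 and 16 from the question) are illustrative only.

```python
import re

# Shortened sample: references 15 and 16 copied from the question.
refer = (
    "15. B. A. OLKEN, DIRECT DEMOCRACY AND LOCAL PUBLIC GOODS: EVIDENCE FROM A "
    "FIELD EXPERIMENT IN INDONESIA. AM. POLIT. SCI. REV. 104, 243–267 (2010). "
    "16. A. BLAKE, M. J. GILLIGAN, INTERNATIONAL INTERVENTIONS TO BUILD SOCIAL "
    "CAPITAL: EVIDENCE FROM A FIELD EXPERIMENT IN SUDAN. AM. POLIT. SCI. REV. "
    "109, 427–449 (2015)"
)

def number_before(name, text):
    """Hypothetical helper: reference numbers whose author list contains `name`."""
    pattern = re.compile(
        rf"""
        \d+                                  # the reference number
        (?=                                  # only if it is followed by:
            \.\s*                            #   a full stop and optional whitespace
            (?:(?:[A-Z]\.\s*)+[A-Z]+,\s*)*   #   zero or more earlier authors
            (?:[A-Z]\.\s*)+                  #   the initials of the wanted author
            {re.escape(name)}                #   the wanted surname, taken literally
        )
        """,
        re.VERBOSE,
    )
    return pattern.findall(text)

print(number_before("BLAKE", refer))     # first author of reference 16
print(number_before("GILLIGAN", refer))  # second author of reference 16
print(number_before("OLKEN", refer))     # first author of reference 15
```

Because the lookahead only accepts initials and surnames between the number and the keyword, a word that merely appears in a title (for example SUDAN) yields no match.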

Regex demo on regex101

In python we can parameterise the regex (using an f-string) to make it easy to change the name we are searching for:

import re

# refer holds the reference text from the question
names = ['BLAKE', 'CHRISTIA', 'GUGGENHEIM']
for name in names:
    print(re.findall(fr'\d+(?=\.\s*(?:(?:[A-Z]\.\s*)+[A-Z]+,\s*)*(?:[A-Z]\.\s*)+{name})', refer))

Output:

['16']
['14']
['13']
Answered By: Nick

If you’re guaranteed not to have any other digits in between the reference number and the "keyword" you’re searching for, the below should do the trick:

re.findall(r'\d+?[A-Z.\s,]+BLAKE.*?\d{4}', text, re.DOTALL)

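For an explanation of why this works, the expression [A-Z.\s,]+ is a character class that will match any upper-case letter, the literal ., whitespace, and a comma. Since it cannot match a digit, the match cannot run from an earlier reference number across a later one.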
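
To see the effect of the character class, here is a small sketch run on references 15 and 16 from the question. The variable names are illustrative, and the second pattern (with a capturing group around the number) is an assumed variant, not part of the original answer.

```python
import re

# Shortened sample: references 15 and 16 copied from the question.
text = (
    "15. B. A. OLKEN, DIRECT DEMOCRACY AND LOCAL PUBLIC GOODS: EVIDENCE FROM A "
    "FIELD EXPERIMENT IN INDONESIA. AM. POLIT. SCI. REV. 104, 243–267 (2010). "
    "16. A. BLAKE, M. J. GILLIGAN, INTERNATIONAL INTERVENTIONS TO BUILD SOCIAL "
    "CAPITAL: EVIDENCE FROM A FIELD EXPERIMENT IN SUDAN. AM. POLIT. SCI. REV. "
    "109, 427–449 (2015)"
)

# Whole reference: number, then only letters/dots/whitespace/commas up to BLAKE,
# then lazily on to the first four-digit run (the year).
full = re.findall(r'\d+?[A-Z.\s,]+BLAKE.*?\d{4}', text, re.DOTALL)
print(full)

# Assumed variant: capture only the number.
numbers = re.findall(r'(\d+)[A-Z.\s,]+BLAKE', text)
print(numbers)
```

A match cannot start at 15, because the colon in that reference's title and the digits of its volume number are not in the class, so BLAKE is never reached from there.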

UPDATE: I just now reread your question, and you said you wanted to extract the number only, not the entire reference. For that, Nick’s answer suffices. I’ll keep my answer here, though, in case it helps answer any other questions…

Answered By: fireshadow52

This should work:

re.findall(r'(\d+)\D+BLAKE.*?\d{4}', refer)
['16']

It simply says get "a number followed by a bunch of non-numbers followed by BLAKE". Because \d+ is in a capturing group, findall returns just the number for you.

If you’d like to build a single expression for all the authors (rather than loop findall), you can dynamically build the regular expression to include all the author names and call findall once:

re.findall(r'(\d+)\D+(?:CHRISTIA|BLAKE).*?\d{4}', refer)
['14', '16']

You can add as many authors in the parentheses as you like, separated by a pipe |.

This will return just the list of ids. If you want to pair them with the author, remove the ?:

re.findall(r'(\d+)\D+(CHRISTIA|BLAKE).*?\d{4}', refer)
[('14', 'CHRISTIA'), ('16', 'BLAKE')]
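
One possible way to build that expression from a list of names is sketched below. The helper build_pattern and its capture_name flag are hypothetical, and the sample is references 14 to 16 from the question.

```python
import re

# Shortened sample: references 14 to 16 copied from the question.
refer = (
    "14. A. BEATH, F. CHRISTIA, R. ENIKOLOPOV, DIRECT DEMOCRACY AND RESOURCE "
    "ALLOCATION: EXPERIMENTAL EVIDENCE FROM AFGHANISTAN. J. DEV. ECON. 124, "
    "199–213 (2017). 15. B. A. OLKEN, DIRECT DEMOCRACY AND LOCAL PUBLIC GOODS: "
    "EVIDENCE FROM A FIELD EXPERIMENT IN INDONESIA. AM. POLIT. SCI. REV. 104, "
    "243–267 (2010). 16. A. BLAKE, M. J. GILLIGAN, INTERNATIONAL INTERVENTIONS "
    "TO BUILD SOCIAL CAPITAL: EVIDENCE FROM A FIELD EXPERIMENT IN SUDAN. "
    "AM. POLIT. SCI. REV. 109, 427–449 (2015)"
)

def build_pattern(names, capture_name=False):
    """Hypothetical helper: one pattern covering every name in `names`."""
    # re.escape guards against surnames that contain regex metacharacters.
    alternation = "|".join(re.escape(name) for name in names)
    group = f"({alternation})" if capture_name else f"(?:{alternation})"
    # {{4}} yields a literal {4}: braces must be doubled inside an f-string.
    return re.compile(rf"(\d+)\D+{group}.*?\d{{4}}")

names = ["CHRISTIA", "BLAKE"]
ids_only = build_pattern(names).findall(refer)
id_name_pairs = build_pattern(names, capture_name=True).findall(refer)
print(ids_only)
print(id_name_pairs)
```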

The ?: makes it a non-capturing group, which tells findall to ignore what’s in it when returning a match. Otherwise it will return anything that’s in parentheses. From the findall() documentation:

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups.
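
The three cases in that passage can be checked side by side. A minimal sketch on reference 16 from the question; the variable names are illustrative.

```python
import re

# Reference 16 copied from the question.
refer = (
    "16. A. BLAKE, M. J. GILLIGAN, INTERNATIONAL INTERVENTIONS TO BUILD SOCIAL "
    "CAPITAL: EVIDENCE FROM A FIELD EXPERIMENT IN SUDAN. AM. POLIT. SCI. REV. "
    "109, 427–449 (2015)"
)

# No groups: the whole match is returned.
no_group = re.findall(r'\d+\D+BLAKE', refer)

# Exactly one capturing group: only that group is returned.
one_group = re.findall(r'(\d+)\D+BLAKE', refer)

# A non-capturing group does not count as a group for this purpose.
non_capturing = re.findall(r'(\d+)\D+(?:BLAKE)', refer)

# Two capturing groups: a list of tuples is returned.
two_groups = re.findall(r'(\d+)\D+(BLAKE)', refer)

print(no_group, one_group, non_capturing, two_groups)
```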

Answered By: Steven

Split by reference number then search for keyword(s)

COMMENT: I originally thought the author’s last name was an example of a possible keyword, not the specific keyword that would always be used. The following solution allows searching for any one or multiple keywords in a reference.

The solution

import re

reference_text = "10. T. BESLEY, POLITICAL SELECTION. J. ECON. PERSPECT. 19, 43–60 (2005). 11. J. D. FEARON, CAMBRIDGE STUDIES IN THE THEORY OF DEMOCRACY, IN DEMOCRACY, ACCOUNTABILITY, AND REPRESENTATION, A. PRZEWORSKI, B. MANIN, S. C. STOKES, EDS. (CAMBRIDGE UNIV. PRESS, 1999), PP. 55–97. 12. B. B. DE MESQUITA, A. SMITH, THE DICTATOR’S HANDBOOK: WHY BAD BEHAVIOR IS ALMOST ALWAYS GOOD POLITICS (HACHETTE UK, 2011). 13. S. WONG, S. E. GUGGENHEIM, “COMMUNITY-DRIVEN DEVELOPMENT: MYTHS AND REALITIES” (WPS8435, THE WORLD BANK, 2018), PP. 1–36. 14. A. BEATH, F. CHRISTIA, R. ENIKOLOPOV, DIRECT DEMOCRACY AND RESOURCE ALLOCATION: EXPERIMENTAL EVIDENCE FROM AFGHANISTAN. J. DEV. ECON. 124, 199–213 (2017). 15. B. A. OLKEN, DIRECT DEMOCRACY AND LOCAL PUBLIC GOODS: EVIDENCE FROM A FIELD EXPERIMENT IN INDONESIA. AM. POLIT. SCI. REV. 104, 243–267 (2010). 16. A. BLAKE, M. J. GILLIGAN, INTERNATIONAL INTERVENTIONS TO BUILD SOCIAL CAPITAL: EVIDENCE FROM A FIELD EXPERIMENT IN SUDAN. AM. POLIT. SCI. REV. 109, 427–449 (2015)."

reference_number_regex = re.compile(r"(?:^| )(\d+)\.")

references_split_raw = reference_number_regex.split(reference_text)

# discard the first element: it is an empty string because the text starts with a delimiter
flat_references = references_split_raw[1:]

def pair(iterable):
    i = iter(iterable)
    while (p1 := next(i, None)) is not None and (p2 := next(i,None)) is not None:
        yield (p1, p2)

for reference_number, reference in pair(flat_references):
    if "BLAKE" in reference:
        print(reference_number)

REGEX SPLIT

Match the reference number using a regex and split reference_text.

references_split_raw = ['', '10', 'T. ... (2005)', '11', 'J. D. FEARON ...',...]

Pop off the empty string on the front giving flat_references.

PAIR GENERATOR

pair is used to take the flat list and pair up each reference number with its reference.
pair(...) takes an iterable and yields its elements two at a time as tuples.

pair([1, 'a', 2, 'b']) => (1, 'a'), (2, 'b')
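
The same pairing can be sketched more compactly by zipping an iterator with itself. This is an alternative to the generator above, not the answer's own code, and the name pair_with_zip is made up for illustration.

```python
def pair_with_zip(iterable):
    """Yield consecutive, non-overlapping pairs from `iterable`."""
    it = iter(iterable)
    # zip pulls from the same iterator twice per tuple, so elements are
    # consumed two at a time; a trailing unpaired element is dropped.
    return zip(it, it)

pairs = list(pair_with_zip([1, 'a', 2, 'b']))
print(pairs)
```

Unlike the walrus-based generator, this version does not stop early when an element happens to be None.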

KEYWORD SEARCH

Perform any type of keyword search on the reference and print the matching reference number.
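
As one possible shape for a multi-keyword search, the sketch below stores the split result in a dict and matches either any or all of the given keywords. The helper find_numbers and its require_all flag are hypothetical, and the sample is references 14 to 16 from the question.

```python
import re

# Shortened sample: references 14 to 16 copied from the question.
reference_text = (
    "14. A. BEATH, F. CHRISTIA, R. ENIKOLOPOV, DIRECT DEMOCRACY AND RESOURCE "
    "ALLOCATION: EXPERIMENTAL EVIDENCE FROM AFGHANISTAN. J. DEV. ECON. 124, "
    "199–213 (2017). 15. B. A. OLKEN, DIRECT DEMOCRACY AND LOCAL PUBLIC GOODS: "
    "EVIDENCE FROM A FIELD EXPERIMENT IN INDONESIA. AM. POLIT. SCI. REV. 104, "
    "243–267 (2010). 16. A. BLAKE, M. J. GILLIGAN, INTERNATIONAL INTERVENTIONS "
    "TO BUILD SOCIAL CAPITAL: EVIDENCE FROM A FIELD EXPERIMENT IN SUDAN. "
    "AM. POLIT. SCI. REV. 109, 427–449 (2015)."
)

# Split on the reference numbers; drop the leading empty string.
parts = re.split(r"(?:^| )(\d+)\.", reference_text)[1:]
# Even positions hold the numbers, odd positions the reference bodies.
references = dict(zip(parts[0::2], parts[1::2]))

def find_numbers(keywords, require_all=False):
    """Hypothetical helper: numbers of references containing the keywords."""
    check = all if require_all else any
    return [
        number
        for number, body in references.items()
        if check(keyword in body for keyword in keywords)
    ]

either = find_numbers(["BLAKE", "CHRISTIA"])
both = find_numbers(["DIRECT DEMOCRACY", "INDONESIA"], require_all=True)
print(either)
print(both)
```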

Answered By: SargeATM