Python text parsing to split list into chunks including preceding delimiters

Question:

What I Have

After OCR’ing some public deposition PDFs that are in Q&A form, I have raw text like the following:

text = """\na\n\nQ So I do first want to bring up exhibit No. 46, which is in the binder 
in front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...
\n\nIs that correct?\n\nA This is correct.\n\nQ Okay."""

…which I want to split into the separate questions and answers. Each Question or Answer starts with '\nQ ', '\nA ', '\nQ_' or '\nA_' (i.e. it matches the regex "\n[QA]_?\s")

What I’ve Done So Far

I can get a list of all questions and answers with the following code:

import re

pattern = r"\n[QA]_?\s"
q_a_list = re.split(pattern, text)
print(q_a_list)

which yields q_a_list:

['\na\n', 
'So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n', 
'This is correct.\n', 
'Okay.']

What I Want

This is close to what I want, but has the following problems:

  • It’s not always clear if a statement is a Question or an Answer, and
  • Sometimes, such as in this particular example, the first item in the list may be neither a Question nor Answer, but just random text before the first Q delimiter.

I would like a modified version of my q_a_list above that addresses the two bulleted problems by linking each text chunk to the delimiter that preceded it. Something like:

[{'0': '\na\n', 
  '\nQ': 'So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n',
  '\nA': 'This is correct.\n',
  '\nQ': 'Okay.'}]

or

[{'\nQ': 'So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n',
  '\nA': 'This is correct.\n',
  '\nQ': 'Okay.'}]

or maybe even just a list with delimiters pre-pended:

['\nQ: So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n',
'\nA: This is correct.\n',
'\nQ: Okay.'
]
Asked By: Max Power

||

Answers:

This is probably not the most elegant answer, but it seems to work. I won’t accept this for the next few days in case someone posts a better answer:

# this gets me the location (index start & end) of each occurrence of my regex pattern 
delims = list(re.finditer(pattern, text))

# now let's iterate through each pair of delimiter and next-delimiter locations
q_a_list = []

for delim, next_delim in zip(delims[:-1], delims[1:]):

    # pull "Q" or "A" out of the current delimiter
    prefix = text[delim.span()[0]:delim.span()[1]].strip()

    # The actual question or answer text spans from the end of this 
    # delimiter to the start of the next delimiter
    text_chunk = text[delim.span()[1]:next_delim.span()[0]]

    q_a_list.append(f"{prefix}: {text_chunk}")

# q_a_list is missing the final prefix and text_chunk, because
# they have no next_delim, so the zip() above doesn't get to it
final_delim = delims[-1]

final_prefix = text[final_delim.span()[0]: final_delim.span()[1]].strip()
final_text_chunk = text[final_delim.span()[1]:]

q_a_list.append(f"{final_prefix}: {final_text_chunk}")

Now the result:

>>> print(q_a_list)
['Q: So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n', 
'A: This is correct.\n', 
'Q: Okay.']
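
A more compact variant of the loop above, offered as a sketch rather than a tested drop-in: when the pattern contains a capturing group, re.split keeps the captured text in its result, so each marker can be paired with the chunk that follows it. The shortened sample text and the names preamble and q_a_pairs are illustrative assumptions, not part of the original code.

```python
import re

# Shortened, hypothetical sample in the same shape as the OCR text.
text = (
    "\na\n\nQ So I do first want to bring up exhibit No. 46.\n\n"
    "Is that correct?\n\nA This is correct.\n\nQ Okay."
)

# The capturing group around [QA] makes re.split keep the marker letter.
pattern = re.compile(r"\n([QA])_?\s")

parts = pattern.split(text)

# parts[0] is whatever precedes the first marker; after that the list
# alternates marker, chunk, marker, chunk, ...
preamble, rest = parts[0], parts[1:]

# A list of tuples is used because a dict cannot hold repeated 'Q'/'A' keys.
q_a_pairs = list(zip(rest[0::2], rest[1::2]))

for marker, chunk in q_a_pairs:
    print(f"{marker}: {chunk!r}")
```

Keeping the preamble separate makes it easy to discard or inspect the text that comes before the first Q.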
Answered By: Max Power

I’m not entirely sure I understand the question, but I hope this might be helpful:

Try:

questions = []
answers = []
for item in text.split('\n\n'):
    # startswith() takes a tuple of prefixes; ('Q ' or 'Q_') evaluates to just 'Q '
    if item.startswith(('Q ', 'Q_')):
        questions.append(item)
    else:
        answers.append(item)

print(f'questions: {questions}')
print(f'answers: {answers}')

output:

questions: ['Q So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.', 'Q Okay.']
answers: ['\na', 'And that is a letter [to] Alston\n& Bird...', '\nIs that correct?', 'A This is correct.']
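
If the continuation paragraphs (such as 'And that is a letter...') should stay with the question or answer they belong to, the same blank-line split could be extended along these lines. This is only a sketch, assuming every question or answer begins at the start of a paragraph; the shortened sample text and the entries name are made up for illustration.

```python
# Shortened, hypothetical sample in the same shape as the OCR text.
text = (
    "\na\n\nQ First question.\n\nAnd a follow-up paragraph.\n\n"
    "A This is correct.\n\nQ Okay."
)

entries = []  # (speaker, text) tuples in document order

for para in text.split("\n\n"):
    if para.startswith(("Q ", "Q_", "A ", "A_")):
        # New question or answer: record the speaker, drop the marker.
        entries.append((para[0], para[2:].lstrip()))
    elif entries:
        # Continuation paragraph: attach it to the most recent entry.
        speaker, body = entries[-1]
        entries[-1] = (speaker, body + "\n\n" + para)
    # Paragraphs before the first marker are skipped as preamble.

print(entries)
```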
Answered By: Jamie Dormaar

Take a look at Pawpaw, a framework specifically designed for building lexical parsers that segment text and collect the results into searchable trees. You can use it to easily develop a parser for your problem as follows:

Code:

import regex
from pawpaw import Ito, arborform, visualization

# INPUT
text = """\na\n\nQ So I do first want to bring up exhibit No. 46, which is in the binder 
in front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...
\n\nIs that correct?\n\nA This is correct.\n\nQ Okay."""

# BUILD PARSER
itor_split = arborform.Split(regex.compile(r'\n+(?=Q_? )', regex.DOTALL), desc='Q/A tuple')

itor_filt = arborform.Filter(lambda i: i.str_startswith('Q'))  # toss "random text" stuff
con = arborform.Connectors.Delegate(itor_filt)
itor_split.connections.append(con)

# Assumes only one answer per question
itor_qa_split = arborform.Split(regex.compile(r'\n+(?=A_? )', regex.DOTALL), limit=1)
con = arborform.Connectors.Children.Add(itor_qa_split)
itor_filt.connections.append(con)

itor_extract = arborform.Extract(
    regex.compile(r'([QA])_? (?<QorA>.+)', regex.DOTALL),
    desc_func=lambda ito, match, group: match.group(1))
con = arborform.Connectors.Children.Add(itor_extract)
itor_qa_split.connections.append(con)

# OUTPUT TREE
root = Ito(text)
tree_vis = visualization.pepo.Tree()
for i in itor_split(root):
    print(tree_vis.dumps(i))
print()

# OUTPUT TUPLE
for i, tup in enumerate(itor_split(root)):
    print(f'{tup:%desc} {i:,}:')
    for qa in tup.children:
        print(f'\t{qa:%desc% : %substr!r}')
    print()

Output:

(4, 176) 'Q/A tuple' : 'Q So I do first want…\nA This is correct.'
├──(4, 156) 'Q/A tuple' : 'Q So I do first want…\n\nIs that correct?'
│  └──(6, 156) 'Q' : 'So I do first want t…\n\nIs that correct?'
└──(158, 176) 'Q/A tuple' : 'A This is correct.'
   └──(160, 176) 'A' : 'This is correct.'

(178, 185) 'Q/A tuple' : 'Q Okay.'
└──(178, 185) 'Q/A tuple' : 'Q Okay.'
   └──(180, 185) 'Q' : 'Okay.'


Q/A tuple 0:
    Q/A tuple: 'Q So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?'
    Q/A tuple: 'A This is correct.'

Q/A tuple 1:
    Q/A tuple: 'Q Okay.'

Lastly, there are some ambiguities in this problem description. For example, can the "random text" include more than one of any possible character? If so, then it could potentially match a known delimiter such as '\nQ ', '\nA_ ', etc., and additional logic would be needed to root this out.

Also, what if an answer contains a sentence starting with ‘A’? For example:

...\nQ What are two famous sayings?\nA The early bird
gets the worm.\nA bird in the hand is worth two in
the bush.\n..

The code I provided above will handle this case. However, if it is possible to have multiple answers for a given question, the parser would need to be adjusted.
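
For the plain re approaches, one possible heuristic for the "sentence starting with A" case is sketched below. It rests on a loud assumption that is not stated anywhere in the question: speakers strictly alternate, so a marker that repeats the previous speaker is treated as ordinary text. The names sample, marker and entries are illustrative only.

```python
import re

sample = (
    "\nQ What are two famous sayings?\nA The early bird\n"
    "gets the worm.\nA bird in the hand is worth two in\nthe bush.\n"
)

marker = re.compile(r"\n([QA])_?\s")

entries = []         # [speaker, text] pairs in document order
last_speaker = None
pos = 0              # where the current entry's text starts

for m in marker.finditer(sample):
    speaker = m.group(1)
    if speaker == last_speaker:
        # ASSUMPTION: the same speaker twice in a row is a false positive,
        # so this match stays inside the current chunk.
        continue
    if entries:
        # Close the previous entry at the start of this marker.
        entries[-1][1] = sample[pos:m.start()]
    entries.append([speaker, ""])
    last_speaker = speaker
    pos = m.end()

if entries:
    # The final entry runs to the end of the text.
    entries[-1][1] = sample[pos:]

for speaker, chunk in entries:
    print(f"{speaker}: {chunk!r}")
```

If a transcript can contain two questions or two answers in a row, this heuristic would merge them, so it only suits strictly alternating Q&A.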

Answered By: rlayers