pyparsing syntax tree from named value list

Question:

I’d like to parse tag/value descriptions using the delimiters ":", "," and "•".

For example, the input would be:

Name:Test•Title: Test•Keywords: A,B,C

The expected result is the name/value dict:

{
"name": "Test",
"title": "Test",
"keywords": "A,B,C"
}

Ideally the keywords in "A,B,C" would already be split into a list. (This is a minor detail, since Python’s built-in string split method will happily do this.)

Also applying a mapping

keys={
  "Name": "name",
  "Title": "title",
  "Keywords": "keywords",
}

as a mapping between names and dict keys would be helpful but could be a separate step.

I tried the code below (https://trinket.io/python3/8dbbc783c7):

# pyparsing named values
# Wolfgang Fahl
# 2023-01-28 for Stackoverflow question
import pyparsing as pp
notes_text="Name:Test•Title: Test•Keywords: A,B,C"
keys={
  "Name": "name",
  "Titel": "title",
  "Keywords": "keywords",
}
keywords=list(keys.keys())
runDelim="•"
name_values_grammar=pp.delimited_list(
  pp.oneOf(keywords,as_keyword=True).setResultsName("key",list_all_matches=True)
  +":"+pp.Suppress(pp.Optional(pp.White()))
  +pp.delimited_list(
    pp.OneOrMore(pp.Word(pp.printables+" ", exclude_chars=",:"))
        ,delim=",")("value")
    ,delim=runDelim).setResultsName("tag", list_all_matches=True)
results=name_values_grammar.parseString(notes_text)
print(results.dump())

and variations of it, but I am not even close to the expected result. Currently the dump shows:

['Name', ':', 'Test']
 - key: 'Name'
 - tag: [['Name', ':', 'Test']]
  [0]:
    ['Name', ':', 'Test']
 - value: ['Test']

It seems I don’t know how to define the grammar and work on the parse result in a way that yields the needed dict.

The main questions for me are:

  • Should I use parse actions?
  • How is the naming of part results done?
  • How is the navigation of the resulting tree done?
  • How is it possible to get the list back from delimitedList?
  • What does list_all_matches=True achieve – its behavior seems strange

I searched for answers to the above questions here on Stack Overflow and couldn’t find a consistent picture of what to do.

PyParsing seems to be a great tool, but I find it very unintuitive. Fortunately there are lots of answers here, so I hope to learn how to get this example working.

Trying it myself, I took a stepwise approach:

First I checked the delimitedList behavior, see https://trinket.io/python3/25e60884eb

# Try out pyparsing delimitedList
# WF 2023-01-28
from pyparsing import printables, OneOrMore, Word, delimitedList

notes_text="A,B,C"

comma_separated_values=delimitedList(Word(printables+" ", exclude_chars=",:"),delim=",")("clist")

grammar = comma_separated_values
result=grammar.parseString(notes_text)
print(f"result:{result}")
print(f"dump:{result.dump()}")
print(f"asDict:{result.asDict()}")
print(f"asList:{result.asList()}")

which returns

result:['A', 'B', 'C']
dump:['A', 'B', 'C']
- clist: ['A', 'B', 'C']
asDict:{'clist': ['A', 'B', 'C']}
asList:['A', 'B', 'C']

This looks promising. The key success factor seems to be naming the list "clist"; the default behavior looks fine.

https://trinket.io/python3/bc2517e25a
shows in more detail where the problem is.

# Try out pyparsing delimitedList
# see https://stackoverflow.com/q/75266188/1497139
# WF 2023-01-28
from pyparsing import printables, oneOf, OneOrMore,Optional, ParseResults, Suppress,White, Word, delimitedList

def show_result(title:str,result:ParseResults):
  """
  show pyparsing result details
  
  Args:
     result(ParseResults)
  """
  print(f"result for {title}:")
  print(f"  result:{result}")
  print(f"  dump:{result.dump()}")
  print(f"  asDict:{result.asDict()}")
  print(f"  asList:{result.asList()}")
  # asXML is deprecated and doesn't work any more
  # print(f"asXML:{result.asXML()}")

notes_text="Name:Test•Title: Test•Keywords: A,B,C"
comma_text="A,B,C"

keys={
  "Name": "name",
  "Titel": "title",
  "Keywords": "keywords",
}
keywords=list(keys.keys())
runDelim="•"

comma_separated_values=delimitedList(Word(printables+" ", exclude_chars=",:"),delim=",")("clist")

cresult=comma_separated_values.parseString(comma_text)
show_result("comma separated values",cresult)

grammar=delimitedList(
   oneOf(keywords,as_keyword=True)
  +Suppress(":"+Optional(White()))
  +comma_separated_values
  ,delim=runDelim
)("namevalues")

nresult=grammar.parseString(notes_text)
show_result("name value list",nresult)

#ogrammar=OneOrMore(
#   oneOf(keywords,as_keyword=True)
#  +Suppress(":"+Optional(White()))
#  +comma_separated_values
#)
#oresult=grammar.parseString(notes_text)
#show_result("name value list with OneOf",nresult)

output:

result for comma separated values:
  result:['A', 'B', 'C']
  dump:['A', 'B', 'C']
- clist: ['A', 'B', 'C']
  asDict:{'clist': ['A', 'B', 'C']}
  asList:['A', 'B', 'C']
result for name value list:
  result:['Name', 'Test']
  dump:['Name', 'Test']
- clist: ['Test']
- namevalues: ['Name', 'Test']
  asDict:{'clist': ['Test'], 'namevalues': ['Name', 'Test']}
  asList:['Name', 'Test']

While the first result makes sense to me, the second is unintuitive. I’d expected a nested result: a dict with a dict of lists.

What causes this unintuitive behavior and how can it be mitigated?

Asked By: Wolfgang Fahl


Answers:

For the time being I am using a simple workaround, see https://trinket.io/python3/7ccaa91f7e

# Try out parsing name value list
# WF 2023-01-28
import json
notes_text="Name:Test•Title: Test•Keywords: A,B,C"

keys={
  "Name": "name",
  "Title": "title",
  "Keywords": "keywords",
}
result={}
key_values=notes_text.split("•")
for key_value in key_values:
  key,value=key_value.split(":")
  value=value.strip()
  result[keys[key]]=value # could do another split here if need be
  
print(json.dumps(result,indent=2))

output:

{
  "name": "Test",
  "title": "Test",
  "keywords": "A,B,C"
}
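
The comment in the loop above hints at a further split. A minimal sketch of that idea follows, using only the standard library. The helper name parse_notes, the lowercase fallback for unknown tags, and the rule "split only when a comma is present" are assumptions made for illustration, not part of the original workaround.

```python
import json

# Mapping from the question; the lowercase fallback used below for tags
# that are not in this mapping is an assumption of this sketch.
keys = {"Name": "name", "Title": "title", "Keywords": "keywords"}

def parse_notes(text: str, delim: str = "•") -> dict:
    """Hypothetical helper: turn a tag/value string into a dict."""
    result = {}
    for key_value in text.split(delim):
        # partition splits at the first ":" only, so a value may contain colons
        key, sep, value = key_value.partition(":")
        if not sep:
            continue  # skip fragments that contain no colon at all
        key = key.strip()
        value = value.strip()
        name = keys.get(key, key.lower())
        if "," in value:
            # comma-separated values become a list of stripped items
            result[name] = [item.strip() for item in value.split(",")]
        else:
            result[name] = value
    return result

parsed = parse_notes("Name:Test•Title: Test•Keywords: A,B,C")
print(json.dumps(parsed, indent=2))
```

Using partition instead of split(":") avoids the unpacking error that a value containing a second colon would otherwise cause.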
Answered By: Wolfgang Fahl

The issues with the grammar are that you are wrapping OneOrMore inside delimited_list when you only want the outer delimited_list, and that you aren’t telling the parser how your data needs to be structured to give the names meaning.

You also don’t need the whitespace suppression as it is automatic.

Passing parse_all=True to the parse_string function will help you see where not everything is being consumed.
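
To illustrate the effect of parse_all, here is a small sketch with a deliberately reduced grammar (the names key, value and pairs are made up for this example). The keyword list misspells one tag as "Titel", much like the keys mapping in the question does, which may be one reason the original parse stopped after the first pair.

```python
import pyparsing as pp

# Reduced grammar for illustration only: key ":" value, pairs separated by "•".
# "Titel" is misspelled on purpose so that the second pair cannot match.
key = pp.one_of(["Name", "Titel"], as_keyword=True)
value = pp.Word(pp.alphanums)
pairs = pp.delimited_list(pp.Group(key + pp.Suppress(":") + value), delim="•")

text = "Name:Test•Title: Test"

# Without parse_all the parser stops quietly after the last pair it can match.
partial = pairs.parse_string(text).as_list()
print(partial)

# With parse_all=True the unconsumed remainder raises a ParseException.
error = None
try:
    pairs.parse_string(text, parse_all=True)
except pp.ParseException as exc:
    error = exc
print(f"parse_all reported: {error}")
```

The exception message points at the position where matching stopped, which is usually the quickest hint for finding the part of the grammar that does not fit the input.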

name_values_grammar = pp.delimited_list(
        pp.Group(
                pp.oneOf(keywords,as_keyword=True).setResultsName("key",list_all_matches=True)
                + pp.Suppress(pp.Literal(':'))
                + pp.delimited_list(
                    pp.Word(pp.printables, exclude_chars=':,').setResultsName('value', list_all_matches=True)
                    , delim=',')
            )
            , delim='•'
        ).setResultsName('tag', list_all_matches=True)

Should I use parse actions? As you can see, you don’t technically need to, but you end up with a data structure that might be less efficient for what you want. If the grammar gets more complicated, I think using some parse actions would make sense. Take a look below for examples of mapping the key names (only if they are found) and of cleaning up list parsing for a more complicated grammar.

How is the naming of part results done? By default in a ParseResults object, the last part that is labelled with a name will be returned when you ask for that name. Asking for all matches to be returned using list_all_matches will only work usefully for some simple structures, but it does work. See below for examples.
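
The difference between the default naming and list_all_matches can be seen in a tiny sketch (the name "w" and the input "x,y,z" are arbitrary choices for this illustration):

```python
import pyparsing as pp

word = pp.Word(pp.alphas)

# Default naming: each new match replaces the previous one under the same name.
last_only = pp.delimited_list(word("w")).parse_string("x,y,z")

# list_all_matches=True: the matches accumulate under the same name.
collected = pp.delimited_list(
    word.set_results_name("w", list_all_matches=True)
).parse_string("x,y,z")

print(last_only.as_list())       # all tokens are still in the token list
print(last_only["w"])            # only the last named match is kept
print(collected["w"].as_list())  # every named match is kept
```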

How is the navigation of the resulting tree done? By default, everything gets flattened. You can use pyparsing.Group to tell the parser not to flatten its contents into the parent list (and therefore retain useful structure and part names).

How is it possible to get the list back from delimitedList? If you don’t wrap the delimited_list result in another list, the flattening will remove the structure. Again, parse actions or a Group around the internal structure come to the rescue.
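
The two points about Group can be sketched together: an outer Group per key/value pair keeps the names from colliding, and an inner Group around the comma-separated items keeps them together as a list. The expression names and the shortened input are assumptions for this illustration.

```python
import pyparsing as pp

key = pp.Word(pp.alphas)("key")
item = pp.Word(pp.alphanums)

# Inner Group: the comma-separated items stay together as one sub-list.
values = pp.Group(pp.delimited_list(item, delim=","))("values")

# Outer Group: one sub-result per pair, each with its own "key" and "values".
pair = pp.Group(key + pp.Suppress(":") + values)
grammar = pp.delimited_list(pair, delim="•")

result = grammar.parse_string("Name:Test•Keywords: A,B,C", parse_all=True)

# Walk the tree: every element of result is one grouped pair.
as_dict = {group["key"]: group["values"].as_list() for group in result}
print(as_dict)
```

In this sketch every value is a list, even a single one such as "Test"; whether single values should be unwrapped is a design decision left open here.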

What does list_all_matches=True achieve, given that its behavior seems strange? It seems strange because of the grammar structure. Consider the different outputs of:

import pyparsing as pp

print(
    pp.delimited_list(
            pp.Word(pp.printables, exclude_chars=',').setResultsName('word', list_all_matches=True)
        ).parse_string('x,y,z').dump()
    )

print(
    pp.delimited_list(
                pp.Word(pp.printables, exclude_chars=':,').setResultsName('key', list_all_matches=True)
                + pp.Suppress(pp.Literal(':'))
                + pp.Word(pp.printables, exclude_chars=':,').setResultsName('value', list_all_matches=True)
        )
        .parse_string('x:a,y:b,z:c').dump()
    )

print(
    pp.delimited_list(
        pp.Group(
                pp.Word(pp.printables, exclude_chars=':,').setResultsName('key', list_all_matches=True)
                + pp.Suppress(pp.Literal(':'))
                + pp.Word(pp.printables, exclude_chars=':,').setResultsName('value', list_all_matches=True)
            )
        ).setResultsName('tag', list_all_matches=True)
        .parse_string('x:a,y:b,z:c').dump()
    )

The first one makes sense, giving you a list of all the tokens you would expect. The third one also makes sense, since you have a structure you can walk. But with the second one you end up with two lists that (in a more complicated grammar) are not necessarily going to be easy to match up.
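
When the grouped form is used, pyparsing’s Dict class can turn the first token of each group into a results name, which may be enough for the simple input from the question. The following is only a sketch under that assumption: it handles neither quoted items nor values containing ":", and the key mapping is applied as a separate step afterwards.

```python
import pyparsing as pp

keys = {"Name": "name", "Title": "title", "Keywords": "keywords"}

key = pp.Word(pp.alphas)
item = pp.Word(pp.alphanums)
value = pp.delimited_list(item, delim=",")
pair = pp.Group(key + pp.Suppress(":") + value)

# Dict uses the first token of every group as the name for the rest of it.
grammar = pp.Dict(pp.delimited_list(pair, delim="•"))

result = grammar.parse_string("Name:Test•Title: Test•Keywords: A,B,C", parse_all=True)
raw = result.as_dict()

# Separate mapping step; tags missing from keys are kept unchanged.
mapped = {keys.get(k, k): v for k, v in raw.items()}
print(mapped)
```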

Here’s a different way of building the grammar so that it supports quoting strings with delimiters in them so they don’t become lists, and keywords that aren’t in your mapping. It’s harder to do this without parse actions.

import pyparsing as pp
import json

test_string = "Name:Test•Title: Test•Extra: '1,2,3'•Keywords: A,B,C,'D,E',F"

keys={
  "Name": "name",
  "Title": "title",
  "Keywords": "keywords",
}

g_key = pp.Word(pp.alphas)
g_item = pp.Word(pp.printables, excludeChars="•,'") | pp.QuotedString(quote_char="'")
g_value = pp.delimited_list(g_item, delim=',')
l_key_value_sep = pp.Suppress(pp.Literal(':'))
g_key_value = g_key + l_key_value_sep + g_value
g_grammar = pp.delimited_list(g_key_value, delim='•')

g_key.add_parse_action(lambda x: keys[x[0]] if x[0] in keys else x)
g_value.add_parse_action(lambda x: [x] if len(x) > 1 else x)
g_key_value.add_parse_action(lambda x: (x[0], x[1].as_list()) if isinstance(x[1],pp.ParseResults) else (x[0], x[1]))

key_values = dict()
for k,v in g_grammar.parse_string(test_string, parse_all=True):
    key_values[k] = v

print(json.dumps(key_values, indent=2))
Answered By: ricardkelly

Another approach using regular expressions would be:

import re
import typing

def _extractByKeyword(keyword: str, string: str) -> typing.Union[str, None]:
    """
    Extract the value for the given key from the given string.
    Designed for simple key value strings without further formatting,
    e.g.
        Title: Hello World
        Goal: extraction
    For keyword="Goal" the string "extraction" would be returned.

    Args:
        keyword: extract the value associated to this keyword
        string: string to extract from

    Returns:
        str: value associated to given keyword
        None: keyword not found in given string
    """
    if string is None or keyword is None:
        return None
    # https://stackoverflow.com/a/2788151/1497139
    # value is a run of whitespace, word characters, commas, underscores and hyphens;
    # it ends before the next "<word>:", a newline or the end of the string
    pattern = rf"({keyword}:(?P<value>[\s\w,_-]*))(\s+\w+:|\n|$)"
    match = re.search(pattern, string)
    value = None
    if match is not None:
        value = match.group('value')
        if isinstance(value, str):
            value = value.strip()
    return value

keys={
  "Name": "name",
  "Title": "title",
  "Keywords": "keywords",
}

notes_text="Name:Test Title: Test Keywords: A,B,C"

lod = {v: _extractByKeyword(k, notes_text) for k,v in keys.items()}

The extraction function was tested with:

import typing
from dataclasses import dataclass
from unittest import TestCase

class TestExtraction(TestCase):

    def test_extractByKeyword(self):
        """
        tests the keyword extraction
        """
        @dataclass
        class TestParam:
            expected: typing.Union[str, None]
            keyword: typing.Union[str, None]
            string: typing.Union[str, None]

        testParams = [
            TestParam("test", "Goal", "Title:Title\nGoal:test\nLabel:title"),
            TestParam("test", "Goal", "Title:Title\nGoal:test Label:title"),
            TestParam("test", "Goal", "Title:Title\nGoal:test"),
            TestParam("test with spaces", "Goal", "Title:Title\nGoal:test with spaces\nLabel:title"),
            TestParam("test with spaces", "Goal", "Title:Title\nGoal:test with spaces Label:title"),
            TestParam("test with spaces", "Goal", "Title:Title\nGoal:test with spaces"),
            TestParam("SQL-DML", "Goal", "Title:Title\nGoal:SQL-DML"),
            TestParam("SQL_DML", "Goal", "Title:Title\nGoal:SQL_DML"),
            TestParam(None, None, "Title:TitlenGoal:test"),
            TestParam(None, "Label", None),
            TestParam(None, None, None),
        ]
        for testParam in testParams:
            with self.subTest(testParam=testParam):
                actual = _extractByKeyword(testParam.keyword, testParam.string)
                self.assertEqual(testParam.expected, actual)
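
As a variant of the regular-expression approach, all pairs could be collected in a single pass with re.finditer instead of one search per keyword. This is only a sketch: the helper name extract_all is made up, and it assumes that every key is a single word directly followed by ":" and that values do not themselves contain such a "word:" sequence after whitespace.

```python
import re

# key: a single word before ":"; value: lazily everything up to the next
# "<whitespace><word>:" or the end of the string
PAIR_PATTERN = re.compile(r"(?P<key>\w+):\s*(?P<value>.*?)(?=\s+\w+:|$)", re.DOTALL)

def extract_all(text: str) -> dict:
    """Hypothetical helper: collect every 'Key: value' pair of the text."""
    return {
        match.group("key"): match.group("value").strip()
        for match in PAIR_PATTERN.finditer(text)
    }

print(extract_all("Name:Test Title: Test Keywords: A,B,C"))
```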
Answered By: tholzheim