Parsing only some lines with pyparsing

Question:

I’m trying to parse a file, or rather only some portions of it. The file contains information about the hardware in a server, and each line starts with a keyword denoting the type of hardware. For example:

pci24 u2480-L0
fcs1 g4045-L1
pci25 h6045-L0
en192 v7024-L3
pci26 h6045-L1

The example above doesn’t show a real file, but it’s simple and enough to demonstrate the need. I only want to parse the lines starting with "pci" and skip the others. I wrote a grammar for lines starting with "pci":

grammar_pci = Group ( Word( "pci" + nums ) + Word( alphanums + "-" ) )

I’ve also written a grammar for lines not starting with "pci":

grammar_non_pci = Suppress( Regex( r"(?!pci)" ) )

And then I built a grammar that combines the two:

grammar = ( grammar_pci | grammar_non_pci )

Then I read the file and pass it to parseString:

with open("foo.txt","r") as f:
  data = grammar.parseString(f.read())
print(data)

But nothing is printed as output. What am I missing? How can I parse the data while skipping the lines that don’t start with a specific keyword?

Thanks.

Asked By: Jai

||

Answers:

Read the file one line at a time, and if a line starts with pci, add it to the list data; otherwise, discard it:

data = []

with open("foo.txt", "r") as f:
    for line in f:
        if line.startswith('pci'):
            data.append(line)

print(data)

If you still need to do further parsing with your Grammar, you can now parse the list data, knowing that each item does indeed start with pci.
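
As a rough sketch of that second step, the filtered lines can be fed to the question's grammar_pci one at a time. The io.StringIO object is only a stand-in for open("foo.txt") so the example runs on its own, and the names pci_lines and parsed are made up for illustration.

```python
import io
from pyparsing import Group, Word, alphanums, nums

# Stand-in for open("foo.txt", "r"); contents are the sample lines from the question.
sample = io.StringIO(
    "pci24 u2480-L0\n"
    "fcs1 g4045-L1\n"
    "pci25 h6045-L0\n"
    "en192 v7024-L3\n"
    "pci26 h6045-L1\n"
)

# The grammar exactly as written in the question.
grammar_pci = Group(Word("pci" + nums) + Word(alphanums + "-"))

# Step 1: keep only the lines that start with "pci".
pci_lines = [line for line in sample if line.startswith("pci")]

# Step 2: parse each kept line; [0] unwraps the single Group in the result.
parsed = [grammar_pci.parseString(line)[0].asList() for line in pci_lines]

print(parsed)
```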

Answered By: Danielle M.

You are off to a good start, but you are missing a few steps, mostly having to do with filling in gaps and repetition.

First, look at your expression for grammar_non_pci:

grammar_non_pci = Suppress( Regex( r"(?!pci)" ) )

This correctly detects a line that does not start with "pci", but it doesn’t actually parse the line’s content.

The easiest way to add this is to add a ".*" to the regex, so that it will parse not only the "not starting with pci" lookahead, but also the rest of the line.

grammar_non_pci = Suppress( Regex( r"(?!pci).*" ) )
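
The difference between the two patterns can be seen with the standard re module alone, without pyparsing. This is only an illustration of the regex behaviour; the variable names are made up.

```python
import re

lookahead_only = re.compile(r"(?!pci)")    # asserts "not pci here", consumes nothing
lookahead_rest = re.compile(r"(?!pci).*")  # same assertion, then consumes the rest of the line

line = "fcs1 g4045-L1"

# The bare lookahead succeeds but matches an empty string.
empty_match = lookahead_only.match(line).group()
print(repr(empty_match))  # ''

# With ".*" the whole line is consumed.
full_match = lookahead_rest.match(line).group()
print(repr(full_match))  # 'fcs1 g4045-L1'

# A line starting with "pci" is rejected by the lookahead.
pci_match = lookahead_rest.match("pci24 u2480-L0")
print(pci_match)  # None
```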

Second, your grammar just processes a single instance of an input line.

grammar = ( grammar_pci | grammar_non_pci )

grammar needs to repeat so that it handles every line:

grammar = OneOrMore( grammar_pci | grammar_non_pci, stopOn=StringEnd())

[EDIT: since you are up to pyparsing 3.0.9, this can also be written as follows]
grammar = (grammar_pci | grammar_non_pci)[1, ...: StringEnd()]

Since grammar_non_pci could actually match on an empty string, it could repeat forever at the end of the file – that’s why the stopOn argument is needed.

With these changes, your sample text should parse correctly.
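
Putting the two changes together, a self-contained sketch might look like this. The sample text is inlined in place of reading foo.txt, and grammar_pci is still the question's original definition at this point.

```python
from pyparsing import (
    Group, OneOrMore, Regex, StringEnd, Suppress, Word, alphanums, nums,
)

# Sample lines from the question, inlined instead of reading foo.txt.
text = """\
pci24 u2480-L0
fcs1 g4045-L1
pci25 h6045-L0
en192 v7024-L3
pci26 h6045-L1
"""

# Original pci grammar from the question.
grammar_pci = Group(Word("pci" + nums) + Word(alphanums + "-"))

# Non-pci lines: negative lookahead plus ".*" to consume the rest of the line.
grammar_non_pci = Suppress(Regex(r"(?!pci).*"))

# Repeat over all lines; stop at end of input so the empty match cannot loop.
grammar = OneOrMore(grammar_pci | grammar_non_pci, stopOn=StringEnd())

data = grammar.parseString(text)
print(data.asList())
```

Only the "pci" lines appear in the result, each as its own group, because the other lines are matched and then suppressed.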

But there is one issue that you’ll need to clean up, and that is the definition of the "pci"-prefixed word in grammar_pci.

grammar_pci = Group ( Word( "pci" + nums ) + Word( alphanums + "-" ) )

Pyparsing’s Word class takes 1 or 2 strings of characters, and uses them as a set of the valid characters for the initial word character and the body word characters. "pci" + nums gives the string "pci0123456789", and will match any word group using any of those characters. So it will match not only "pci00" but also "cip123", "cci123", "p0c0i", or "12345".
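
A small check makes the character-set behaviour visible. The helper function matches is hypothetical, written just for this demonstration, and "fcs1" is included as a string that should be rejected.

```python
from pyparsing import ParseException, Word, nums

# Word treats its argument as a set of allowed characters, not a prefix.
loose = Word("pci" + nums)

def matches(expr, s):
    """Return True if expr parses the whole string s (illustrative helper)."""
    try:
        expr.parseString(s, parseAll=True)
        return True
    except ParseException:
        return False

candidates = ["pci00", "cip123", "cci123", "p0c0i", "12345", "fcs1"]
results = {s: matches(loose, s) for s in candidates}

for s, ok in results.items():
    print(s, ok)
```

Every candidate except "fcs1" is accepted, since "f" and "s" are the only characters outside the set.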

To resolve this, use "pci" + Word(nums) wrapped in Combine to represent only word groups that start with "pci":

grammar_pci = Group ( Combine("pci" + Word( nums )) + Word( alphanums + "-" ) )

Since you seem comfortable using Regex items, you could also write this as

grammar_pci = Group ( Regex(r"pci\d+") + Word( alphanums + "-" ) )
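
Both stricter forms can be checked side by side. In this sketch the Regex pattern is written as pci\d+ (the literal prefix followed by one or more digits), and the helper matches plus the names strict_combine and strict_regex are made up for the example.

```python
from pyparsing import Combine, ParseException, Regex, Word, nums

# Literal "pci" immediately followed by digits, joined into one token.
strict_combine = Combine("pci" + Word(nums))

# The same idea expressed as a regular expression.
strict_regex = Regex(r"pci\d+")

def matches(expr, s):
    """Return True if expr parses the whole string s (illustrative helper)."""
    try:
        expr.parseString(s, parseAll=True)
        return True
    except ParseException:
        return False

candidates = ["pci24", "cip123", "12345", "pci"]
combine_results = {s: matches(strict_combine, s) for s in candidates}
regex_results = {s: matches(strict_regex, s) for s in candidates}

print(combine_results)
print(regex_results)
```

Either form accepts "pci24" and rejects the scrambled or digit-only strings that the plain Word version would let through.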

These changes should get you moving forward on your parser.

Answered By: PaulMcG