How to parse strings to look like sys.argv

Question:

I would like to parse a string like this:

-o 1  --long "Some long string"  

into this:

["-o", "1", "--long", 'Some long string']

or similar.

This is different from both getopt and optparse, which start with input that has already been split into sys.argv form (like the output I have above). Is there a standard way to do this? Basically, this is “splitting” while keeping quoted strings together.

My best function so far:

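import csv
import io

def split_quote(string, quotechar='"'):
    '''
    Split a string on spaces, keeping quoted substrings together.

    >>> split_quote('--blah "Some argument" here')
    ['--blah', 'Some argument', 'here']

    >>> split_quote("--blah 'Some argument' here", quotechar="'")
    ['--blah', 'Some argument', 'here']
    '''
    # Use the documented io.StringIO rather than relying on csv's internal import.
    s = io.StringIO(string)
    reader = csv.reader(s, delimiter=" ", quotechar=quotechar)
    # The input is a single line, so only the first row is of interest.
    return list(reader)[0]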
Asked By: Gregg Lind
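
One caveat with the csv-based approach, shown here as a small sketch rather than a definitive diagnosis: with a single space as the delimiter, each space separates two fields, so a run of two spaces (as in the example string above) yields an empty field. The variable names below are illustrative only.

```python
import csv
import io

# Note the two spaces after "1", as in the example string from the question.
line = '-o 1  --long "Some long string"'

# Each space is treated as a field separator, so the doubled space
# produces an empty string between '1' and '--long'.
fields = next(csv.reader(io.StringIO(line), delimiter=" ", quotechar='"'))
print(fields)  # ['-o', '1', '', '--long', 'Some long string']
```

Filtering out the empty strings afterwards would also discard arguments that were deliberately passed as "", so this is only a partial workaround.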


Answers:

I believe you want the shlex module.

>>> import shlex
>>> shlex.split('-o 1 --long "Some long string"')
['-o', '1', '--long', 'Some long string']
Answered By: Jacob Gabrielson
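
As a minimal sketch of how the result of shlex.split can be handed to an option parser in place of sys.argv[1:], the example below feeds it to getopt. The option names ("-o", "--long") come from the question; the trailing "extra" argument and the variable names are assumptions added for illustration.

```python
import getopt
import shlex

# Two spaces after "1" on purpose: shlex treats runs of whitespace as one separator.
command_line = '-o 1  --long "Some long string" extra'

# Split the way a POSIX shell would, keeping the quoted string together.
argv = shlex.split(command_line)
print(argv)  # ['-o', '1', '--long', 'Some long string', 'extra']

# Parse the list exactly as if it were sys.argv[1:].
# "o:" means -o takes a value; "long=" means --long takes a value.
opts, rest = getopt.getopt(argv, "o:", ["long="])
print(opts)  # [('-o', '1'), ('--long', 'Some long string')]
print(rest)  # ['extra']
```

The same list could be passed to optparse or argparse instead, since both accept an explicit argument list.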

Before I was aware of shlex.split, I made the following:

import sys

_WORD_DIVIDERS = set((' ', '\t', '\r', '\n'))

# Maps the character following a backslash to the character it stands for.
_QUOTE_CHARS_DICT = {
    '\\':   '\\',
    ' ':    ' ',
    '"':    '"',
    'r':    '\r',
    'n':    '\n',
    't':    '\t',
}

def _raise_type_error():
    raise TypeError("Bytes must be decoded to Unicode first")

def parse_to_argv_gen(instring):
    is_in_quotes = False
    instring_iter = iter(instring)
    join_string = instring[0:0]

    c_list = []
    c = ' '
    while True:
        # Skip whitespace
        try:
            while True:
                if not isinstance(c, str) and sys.version_info[0] >= 3:
                    _raise_type_error()
                if c not in _WORD_DIVIDERS:
                    break
                c = next(instring_iter)
        except StopIteration:
            break
        # Read word
        try:
            while True:
                if not isinstance(c, str) and sys.version_info[0] >= 3:
                    _raise_type_error()
                if not is_in_quotes and c in _WORD_DIVIDERS:
                    break
                if c == '"':
                    is_in_quotes = not is_in_quotes
                    c = None
                elif c == '\\':
                    c = next(instring_iter)
                    c = _QUOTE_CHARS_DICT.get(c)
                if c is not None:
                    c_list.append(c)
                c = next(instring_iter)
            yield join_string.join(c_list)
            c_list = []
        except StopIteration:
            yield join_string.join(c_list)
            break

def parse_to_argv(instring):
    return list(parse_to_argv_gen(instring))

This works with Python 2.x and 3.x. On Python 2.x, it works directly with byte strings and Unicode strings. On Python 3.x, it only accepts [Unicode] strings, not bytes objects.

This doesn’t behave exactly the same as shell argv splitting: it also lets you write CR, LF and TAB characters as the escapes \r, \n and \t, converting them to real CR, LF and TAB characters (shlex.split doesn’t do that). So writing my own function was useful for my needs. I guess shlex.split is better if you just want plain shell-style argv splitting. I’m sharing this code in case it’s useful as a baseline for doing something slightly different.

Answered By: Craig McQueen
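
To illustrate the difference in escape handling mentioned above, here is a small sketch of what shlex.split (in its default POSIX mode) does with \n and \t. The input string is made up for this example.

```python
import shlex

# A raw string, so the backslashes reach shlex unchanged.
text = r'--msg "line1\nline2" tab\there'

tokens = shlex.split(text)
print(tokens)

# Inside double quotes, a backslash before "n" is kept literally:
# the second token is line1, backslash, n, line2 (no real newline).
# Outside quotes, the backslash is dropped and the next character is
# kept as-is, so the third token is 'tabthere' (no real TAB).
```

Neither token ends up containing a real control character, which is the behaviour the custom parser above was written to change.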