Converting key=value pairs back into Python dicts

Question:

There’s a logfile with text in the form of space-separated key=value pairs, and each line was originally serialized from data in a Python dict, something like:

' '.join([f'{k}={v!r}' for k,v in d.items()])
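For concreteness, here is that serialization applied to one of the example dicts used below:

d = {'s': '1234', 'n': 1234}
line = ' '.join([f'{k}={v!r}' for k, v in d.items()])
print(line)  # s='1234' n=1234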

The keys are always just strings. The values could be anything that ast.literal_eval can successfully parse, no more no less.

How to process this logfile and turn the lines back into Python dicts? Example:

>>> to_dict("key='hello world'")
{'key': 'hello world'}

>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}

>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234}

>>> to_dict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}

Here is some extra context about the data:

  • Keys are valid names
  • Input lines are well-formed (e.g. no dangling brackets)
  • The data is trusted (unsafe functions such as eval, exec, yaml.load are OK to use)
  • Order is not important. Performance is not important. Correctness is important.

Edit: As requested in the comments, here is an MCVE and an example of code that didn’t work correctly:

>>> def to_dict(s):
...     s = s.replace(' ', ', ')
...     return eval(f"dict({s})")
... 
... 
>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}  # OK
>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234}  # OK
>>> to_dict("key='hello world'")
{'key': 'hello, world'}  # Incorrect, the value was corrupted
Asked By: wim


Answers:

Regex replacement functions to the rescue

I’m not rewriting an ast-like parser for you, but one trick that works pretty well is to use regular expressions to find the quoted strings and replace them with “variables” (I’ve chosen __token(number)__), a bit like you’re obfuscating some code.

Make a note of the strings you’re replacing (that takes care of the spaces inside them), replace the remaining spaces with commas (only after word characters, never after symbols like :, which is what lets the last test pass), then substitute the original strings back.

import re,itertools

def to_dict(s):
    rep_dict = {}
    cnt = itertools.count()
    def rep_func(m):
        rval = "__token{}__".format(next(cnt))
        rep_dict[rval] = m.group(0)
        return rval

    # replaces single/double quoted strings by token variable-like idents
    # going on a limb to support escaped quotes in the string and double escapes at the end of the string
    s = re.sub(r"(['\"]).*?([^\\]|\\\\)\1",rep_func,s)
    # replaces spaces that follow a letter/digit/underscore by comma
    s = re.sub(r"(\w)\s+",r"\1,",s)
    #print("debug",s)   # uncomment to see temp string
    # put back the original strings
    s = re.sub(r"__token\d+__",lambda m : rep_dict[m.group(0)],s)

    return eval("dict({s})".format(s=s))

print(to_dict("k1='v1' k2='v2'"))
print(to_dict("s='1234' n=1234"))
print(to_dict(r"key='hello world'"))
print(to_dict('key="hello world"'))
print(to_dict("""k4='k5="hello"' k5={'k6': ['potato']}"""))
# extreme string test
print(to_dict(r"key='hello \'world\\'"))

prints:

{'k2': 'v2', 'k1': 'v1'}
{'n': 1234, 's': '1234'}
{'key': 'hello world'}
{'key': 'hello world'}
{'k5': {'k6': ['potato']}, 'k4': 'k5="hello"'}
{'key': "hello 'world\\"}

The key is to extract the strings (quoted/double-quoted) using a non-greedy regex and replace them with non-strings (as if they were string variables rather than literals) in the expression. The regex has been tuned so it can accept escaped quotes and a double escape at the end of a string (custom solution).
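To see the string-extraction pattern work on its own, here is a small check (not from the original answer) against an input with an escaped quote and a trailing double escape:

import re

pat = r"(['\"]).*?([^\\]|\\\\)\1"
# the escaped quote does not terminate the match; the double escape at the end does
print(re.sub(pat, "<STR>", r"key='hello \'world\\' n=1"))  # key=<STR> n=1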

The replacement function is an inner function so it can make use of the nonlocal dictionary & counter and track the replaced text, so it can be restored once the spaces have been taken care of.

When replacing the spaces with commas, you have to be careful not to do it after a colon (last test); all things considered, it is only safe to replace a space that follows an alphanumeric/underscore character (hence the \w protection in the replacement regex for the comma).

If we uncomment the debug print code just before the original strings are put back that prints:

debug k1=__token0__,k2=__token1__
debug s=__token0__,n=1234
debug key=__token0__
debug k4=__token0__,k5={__token1__: [__token2__]}
debug key=__token0__

The strings have been pwned, and the replacement of spaces has worked properly. With some more effort, it should probably be possible to quote the keys and replace k1= by "k1": so that ast.literal_eval can be used instead of eval (eval is riskier, but that’s not an issue here since the data is trusted).
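As a rough sketch of that key-quoting idea (hypothetical, not part of the answer’s tested code), applied to one of the token-substituted debug strings shown above:

import re

intermediate = "k4=__token0__,k5={__token1__: [__token2__]}"
# quote each key, turning k= into 'k':, so that wrapping the result in braces
# yields a dict literal
quoted = "{" + re.sub(r"(\w+)=", r"'\1': ", intermediate) + "}"
print(quoted)  # {'k4': __token0__,'k5': {__token1__: [__token2__]}}
# once the __token__ placeholders are substituted back,
# ast.literal_eval(quoted) could stand in for the final eval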

I’m sure some super-complex expressions can break my code (I’ve even heard that there are very few JSON parsers able to parse 100% of valid JSON files), but for the tests you submitted it’ll work (of course, if some funny guy tries to put __tokenxx__ idents in the original strings, that’ll fail; maybe they could be replaced by some otherwise invalid-as-variable placeholders). I built an Ada lexer using this technique some time ago to handle spaces in strings, and it worked pretty well.

Answered By: Jean-François Fabre

You can find all the occurrences of = characters, and then find the maximum runs of characters which give a valid ast.literal_eval result. Those characters can then be parsed for the value, associated with a key found by a string slice between the last successful parse and the index of the current =:

import ast, typing

def is_valid(_str: str) -> bool:
    # can ast.literal_eval parse this candidate substring?
    try:
        ast.literal_eval(_str)
    except Exception:
        return False
    else:
        return True

def parse_line(_d: str) -> typing.Generator[typing.Tuple, None, None]:
    _eq, last = [i for i, a in enumerate(_d) if a == '='], 0
    for _loc in _eq:
        if _loc >= last:
            _key = _d[last:_loc]
            _inner, _running = _loc + 1, _loc + 2
            while True:
                try:
                    ast.literal_eval(_d[_inner:_running])
                except Exception:
                    # not a valid literal yet; widen the window by one character
                    _running += 1
                else:
                    # stretch to the longest run that still parses
                    _max = max(i for i in range(len(_d[_inner:]))
                               if is_valid(_d[_inner:_running + i]))
                    yield (_key, ast.literal_eval(_d[_inner:_running + _max]))
                    last = _running + _max
                    break

def to_dict(_d: str) -> dict:
    return dict(parse_line(_d))

for example in ["key='hello world'",
                "k1='v1' k2='v2'",
                "s='1234' n=1234",
                """k4='k5="hello"' k5={'k6': ['potato']}""",
                "val=['100', 100, 300]",
                "val=[{'t':{32:45}, 'stuff':100, 'extra':[]}, 100, 300]"]:
    print(to_dict(example))

Output:

{'key': 'hello world'}
{'k1': 'v1', 'k2': 'v2'}
{'s': '1234', 'n': 1234}
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
{'val': ['100', 100, 300]}
{'val': [{'t': {32: 45}, 'stuff': 100, 'extra': []}, 100, 300]}

Disclaimer:

This solution is not as elegant as @Jean-FrançoisFabre’s, and I am not sure if it can parse 100% of what is passed to to_dict, but it may give you inspiration for your own version.

Answered By: Ajax1234

Your input can’t be conveniently parsed by something like ast.literal_eval, but it can be tokenized as a series of Python tokens. This makes things a bit easier than they might otherwise be.

The only place = tokens can appear in your input is as key-value separators; at least for now, ast.literal_eval doesn’t accept anything with = tokens in it. We can use the = tokens to determine where the key-value pairs start and end, and most of the rest of the work can be handled by ast.literal_eval. Using the tokenize module also avoids problems with = or backslash escapes in string literals.

import ast
import io
import tokenize

def todict(logstring):
    # tokenize.tokenize wants an argument that acts like the readline method of a binary
    # file-like object, so we have to do some work to give it that.
    input_as_file = io.BytesIO(logstring.encode('utf8'))
    tokens = list(tokenize.tokenize(input_as_file.readline))

    eqsign_locations = [i for i, token in enumerate(tokens) if token[1] == '=']

    names = [tokens[i-1][1] for i in eqsign_locations]

    # Values are harder than keys.
    val_starts = [i+1 for i in eqsign_locations]
    val_ends = [i-1 for i in eqsign_locations[1:]] + [len(tokens)]

    # tokenize.untokenize likes to add extra whitespace that ast.literal_eval
    # doesn't like. Removing the row/column information from the token records
    # seems to prevent extra leading whitespace, but the documentation doesn't
    # make enough promises for me to be comfortable with that, so we call
    # strip() as well.
    val_strings = [tokenize.untokenize(tok[:2] for tok in tokens[start:end]).strip()
                   for start, end in zip(val_starts, val_ends)]
    vals = [ast.literal_eval(val_string) for val_string in val_strings]

    return dict(zip(names, vals))

This behaves correctly on your example inputs, as well as on an example with backslashes:

>>> todict("key='hello world'")
{'key': 'hello world'}
>>> todict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}
>>> todict("s='1234' n=1234")
{'s': '1234', 'n': 1234}
>>> todict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
>>> s=input()
a='=' b='"\'' c=3
>>> todict(s)
{'a': '=', 'b': '"\'', 'c': 3}

Incidentally, we probably could look for token type NAME instead of = tokens, but that’ll break if they ever add set() support to literal_eval. Looking for = could also break in the future, but it doesn’t seem as likely to break as looking for NAME tokens.
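For illustration, here is a hedged sketch of that NAME-based variant (an untested alternative, not the answer’s code; it keys on NAME tokens that are immediately followed by an = token):

import ast
import io
import tokenize

def todict_by_name(logstring):
    toks = list(tokenize.tokenize(io.BytesIO(logstring.encode('utf8')).readline))
    # a key is a NAME token immediately followed by an '=' token
    key_locs = [i for i, tok in enumerate(toks)
                if tok.type == tokenize.NAME
                and i + 1 < len(toks) and toks[i + 1].string == '=']
    names = [toks[i].string for i in key_locs]
    # each value runs from just past the '=' up to the next key's NAME token
    starts = [i + 2 for i in key_locs]
    ends = key_locs[1:] + [len(toks)]
    vals = [ast.literal_eval(tokenize.untokenize(t[:2] for t in toks[s:e]).strip())
            for s, e in zip(starts, ends)]
    return dict(zip(names, vals))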

Answered By: user2357112

Provide two helper functions.

  • popstr: split the thing that looks like a string from the start of the string.
    If it starts with a single or double quote mark, I’ll look for the next one and split at that point.

    def popstr(s):
        i = s[1:].find(s[0]) + 2
        return s[:i], s[i:]
    
  • poptrt: split the thing that is surrounded by brackets ('[]', '()', '{}') from the start of the string.
    If it starts with a bracket, I’ll increment a counter for every instance of the starting character and decrement it for every instance of its complement. When the counter reaches zero, I split.

    def poptrt(s):
        d = {'{': '}', '[': ']', '(': ')'}
        b = s[0]
        c = lambda x: {b: 1, d[b]: -1}.get(x, 0)
        parts = []
        t, i = 1, 1
        while t > 0 and s:
            if i > len(s) - 1:
                break
            elif s[i] in '\'"':
                _s, s_, s = s[:i], *map(str.strip, popstr(s[i:]))
                parts.extend([_s, s_])
                i = 0
            else:
                t += c(s[i])
                i += 1
        if t == 0:
            return ''.join(parts + [s[:i]]), s[i:]
        else:
            raise ValueError('Your string has unbalanced brackets.')


Chew through the string until there is no more string to chew:

def to_dict(log):
    d = {}
    while log:
        k, log = map(str.strip, log.split('=', 1))
        if log.startswith(('"', "'")):
            v, log = map(str.strip, popstr(log))
        elif log.startswith((*'{[(',)):
            v, log = map(str.strip, poptrt(log))
        else:
            v, *log = map(str.strip, log.split(None, 1))
            log = ' '.join(log)
        d[k] = ast.literal_eval(v)
    return d

All tests passed

assert to_dict("key='hello world'") == {'key': 'hello world'}
assert to_dict("k1='v1' k2='v2'") == {'k1': 'v1', 'k2': 'v2'}
assert to_dict("s='1234' n=1234") == {'s': '1234', 'n': 1234}
assert to_dict("""k4='k5="hello"' k5={'k6': ['potato']}""") == {'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}

Deficiencies

  • Did not account for backslashes
  • Did not account for nested goofy formatting

All Together

import ast

def popstr(s):
    i = s[1:].find(s[0]) + 2
    return s[:i], s[i:]

def poptrt(s):
    d = {'{': '}', '[': ']', '(': ')'}
    b = s[0]
    c = lambda x: {b: 1, d[b]: -1}.get(x, 0)
    parts = []
    t, i = 1, 1
    while t > 0 and s:
        if i > len(s) - 1:
            break
        elif s[i] in '\'"':
            _s, s_, s = s[:i], *map(str.strip, popstr(s[i:]))
            parts.extend([_s, s_])
            i = 0
        else:
            t += c(s[i])
            i += 1
    if t == 0:
        return ''.join(parts + [s[:i]]), s[i:]
    else:
        raise ValueError('Your string has unbalanced brackets.')

def to_dict(log):
    d = {}
    while log:
        k, log = map(str.strip, log.split('=', 1))
        if log.startswith(('"', "'")):
            v, log = map(str.strip, popstr(log))
        elif log.startswith((*'{[(',)):
            v, log = map(str.strip, poptrt(log))
        else:
            v, *log = map(str.strip, log.split(None, 1))
            log = ' '.join(log)
        d[k] = ast.literal_eval(v)
    return d
Answered By: piRSquared

I had a similar problem: converting a 'key1="value1" key2="value2" ...' string into a dict. I split the string on spaces and create a list of ['key="value"'] pairs. Then I cycle through the list, split each pair on '=', and add the pair to a dict.

Code:

str_attr = 'name="Attr1" type="Attr2" use="Attr3"'

list_attr = str_attr.split(' ')
dict_attr = {}
for item in list_attr:
    list_item = item.split('=')
    dict_attr.update({list_item[0] : list_item[1]})
    
print(dict_attr)

result:

{'name': '"Attr1"', 'type': '"Attr2"', 'use': '"Attr3"'}

Limitations:

  • Keys and values must not contain spaces (' ') or equals signs ('=').
  • As the result above shows, the values keep their surrounding quote characters (see the sketch below for one way to strip them).
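If the retained quotes are unwanted (the question’s expected output has them stripped), a small tweak to this answer is to pass each value through ast.literal_eval:

import ast

str_attr = 'name="Attr1" type="Attr2" use="Attr3"'
dict_attr = {}
for item in str_attr.split(' '):
    key, _, value = item.partition('=')
    # ast.literal_eval turns the token '"Attr1"' into the plain string 'Attr1'
    dict_attr[key] = ast.literal_eval(value)

print(dict_attr)  # {'name': 'Attr1', 'type': 'Attr2', 'use': 'Attr3'}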
Answered By: Petr Bashkatov