Converting Python dictionary to YAML file with lists as multiline strings

Question:

I’m trying to convert a Python dictionary of the following form:

{
    "version": "3.1",
    "nlu": [
        {
            "intent": "greet",
            "examples": ["hi", "hello", "howdy"]
        },
        {
            "intent": "goodbye",
            "examples": ["goodbye", "bye", "see you later"]
        }
    ]
 }

to a YAML file of the following form (note the pipe preceding the value associated with each examples key):

version: "3.1"
nlu:
- intent: greet
  examples: |
    - hi
    - hello
    - howdy
- intent: goodbye
  examples: |
    - goodbye
    - bye
    - see you later

Except for needing the pipes (because of Rasa’s training data format specs), I’m familiar with how to accomplish this task using yaml.dump().

What’s the most straightforward way to obtain the format I’m after?

EDIT: Converting the value of each examples key to a string first yields a YAML file which is not at all reader-friendly, especially given that I have many intents comprising many hundreds of total example utterances.

version: '3.1'
nlu:
- intent: greet
  examples: "  - hi\n  - hello\n  - howdy\n"
- intent: goodbye
  examples: "  - goodbye\n  - bye\n  - see you later\n"

I understand that this multi-line format is what the pipe symbol accomplishes, but I’d like to convert it to something more palatable.
Is that possible?

Asked By: vonbecker

Answers:

You are asking for the examples value to be represented in your YAML output as a multiline string using the block quote operator (|).

In your Python data, examples is a list of strings, not a multiline string:

{
    "intent": "greet",
    "examples": ["hi", "hello", "howdy"]
},

Of course a Python list will be represented as a YAML list.

If you want it rendered as a block literal value, you need to transform the Python value into a multi-line string ("examples": "- hi\n- hello\n- howdy"), and then you need to configure the yaml module to output strings using the block quote operator.

Something like this:

import yaml

data = {
    "version": "3.1",
    "nlu": [
        {
            "intent": "greet",
            "examples": ["hi", "hello", "howdy"]
        },
        {
            "intent": "goodbye",
            "examples": ["goodbye", "bye", "see you later"]
        }
    ]
 }

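def quoted_presenter(dumper, data):
    # Strings containing a newline are emitted in literal block style (|);
    # every other string is emitted double-quoted.
    if '\n' in data:
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    else:
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')

# Register the presenter for every str value (mapping keys included).
yaml.add_representer(str, quoted_presenter)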


for item in data['nlu']:
    item['examples'] = yaml.safe_dump(item['examples'])

print(yaml.dump(data))

This will output:

"nlu":
- "examples": |-
    - hi
    - hello
    - howdy
  "intent": "greet"
- "examples": |-
    - goodbye
    - bye
    - see you later
  "intent": "goodbye"
"version": "3.1"

Yes, this quotes everything (keys as well as values), but that’s about the limits of our granularity using the yaml module. Without the custom representer, we would get instead:

nlu:
- examples: '- hi

    - hello

    - howdy'
  intent: greet
- examples: '- goodbye

    - bye

    - see you later'
  intent: goodbye
version: '3.1'

That’s semantically identical (both forms load to the same string); just with different formatting.
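
A quick way to check that claim is to load both renderings and compare the resulting Python values. The sketch below assumes PyYAML is installed; the two YAML snippets are shortened, hand-written versions of the outputs above, and the variable names are only for illustration.

```python
import yaml

# Literal block style, as produced with the custom representer.
literal_doc = (
    "examples: |-\n"
    "  - hi\n"
    "  - hello\n"
    "  - howdy\n"
)

# Single-quoted style, as produced without it. Inside a quoted scalar,
# a blank line stands for a newline in the value.
quoted_doc = (
    "examples: '- hi\n"
    "\n"
    "  - hello\n"
    "\n"
    "  - howdy'\n"
)

literal_value = yaml.safe_load(literal_doc)["examples"]
quoted_value = yaml.safe_load(quoted_doc)["examples"]

# Compare what the two documents load to.
print(literal_value == quoted_value)
print(repr(literal_value))
```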

It’s possible that ruamel.yaml provides more control over the output format.
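
With PyYAML itself, one way to get finer granularity is to mark only the strings that should use block style, by wrapping them in a small str subclass and registering a representer for that subclass alone. This is a sketch, not part of the code above: LiteralStr and literal_presenter are names made up for illustration, and sort_keys=False assumes PyYAML 5.1 or later.

```python
import yaml

class LiteralStr(str):
    """Hypothetical marker type: values of this type are dumped with |."""

def literal_presenter(dumper, data):
    # Only LiteralStr values are routed here; plain str keeps PyYAML's default style.
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')

yaml.add_representer(LiteralStr, literal_presenter)

data = {
    "version": "3.1",
    "nlu": [
        {"intent": "greet", "examples": ["hi", "hello", "howdy"]},
        {"intent": "goodbye", "examples": ["goodbye", "bye", "see you later"]},
    ],
}

for item in data['nlu']:
    # Render the list as YAML text, then mark that text as a literal string.
    as_text = yaml.safe_dump(item['examples'], default_flow_style=False)
    item['examples'] = LiteralStr(as_text)

# sort_keys=False keeps the insertion order of the dictionary keys.
text = yaml.dump(data, sort_keys=False)
print(text)
```

Keys and short values such as intent are not wrapped in LiteralStr, so they are left in PyYAML's default style rather than being force-quoted.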

Answered By: larsks

Neither my ruamel.yaml nor PyYAML gives easy access to context when dumping a scalar. Without
such context you can only render strings differently based on their content; you cannot
determine whether a list/sequence is the value for a particular key and dump it in a different way than some other value.

As @larsks already indicated, you need to transform the Python list values into a string. I suggest, however,
doing that before dumping, with a recursive function, so that you do have the necessary context. In
this case it is possible to do that in place, which is usually the easier option to implement.
If that is unacceptable (i.e. you need to continue using the data structure unmodified after dumping), you
can either first make a copy.deepcopy() of your data, or modify transform_value to create
that copy and return it (recursively).
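
The second option could look roughly like the following. It is only a sketch: transformed_copy is a hypothetical name, it mirrors the transform_value function from the code in this answer but builds new dicts and lists instead of assigning in place, and the join-based as_lines function merely stands in for literalize_list so that the snippet runs without ruamel.yaml.

```python
def transformed_copy(d, key, transformation):
    """Return a copy of d in which every value stored under key is transformed."""
    if isinstance(d, dict):
        return {
            k: transformation(v) if k == key else transformed_copy(v, key, transformation)
            for k, v in d.items()
        }
    if isinstance(d, list):
        return [transformed_copy(elem, key, transformation) for elem in d]
    # Anything else (str, int, ...) is returned as is.
    return d

def as_lines(values):
    # Stand-in for literalize_list: one "- item" line per list entry.
    return "".join(f"- {value}\n" for value in values)

data = {
    "version": "3.1",
    "nlu": [
        {"intent": "greet", "examples": ["hi", "hello", "howdy"]},
        {"intent": "goodbye", "examples": ["goodbye", "bye", "see you later"]},
    ],
}

result = transformed_copy(data, "examples", as_lines)

print(type(data["nlu"][0]["examples"]))    # the original still holds the list
print(type(result["nlu"][0]["examples"]))  # the copy holds the joined string
```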

ruamel.yaml can round-trip your requested output (specifically preserving the literal scalar as is).
If you inspect the type of the value for the key examples after loading, you will see that it is not a plain string,
but a ruamel.yaml.scalarstring.LiteralScalarString instance. That instance behaves like a string
in Python, but dumps as a literal scalar.

import sys, io
import ruamel.yaml

data = {
    "version": "3.1",
    "nlu": [
        {
            "intent": "greet",
            "examples": ["hi", "hello", "howdy"]
        },
        {
            "intent": "goodbye",
            "examples": ["goodbye", "bye", "see you later"]
        }
    ]
 }

yaml = ruamel.yaml.YAML()

def literalize_list(v):
    assert isinstance(v, list)
    buf = io.StringIO()
    yaml.dump(v, buf)
    return ruamel.yaml.scalarstring.LiteralScalarString(buf.getvalue())

def transform_value(d, key, transformation):
    """recursively walk over data structure to find key and apply transformation on the value"""
    if isinstance(d, dict):
        for k, v in d.items():
            if k == key:
                d[k] = transformation(v)
            else:
                transform_value(v, key, transformation)
    elif isinstance(d, list):
        for elem in d:
            transform_value(elem, key, transformation)
    

transform_value(data, 'examples', literalize_list)

yaml.dump(data, sys.stdout)

which gives:

version: '3.1'
nlu:
- intent: greet
  examples: |
    - hi
    - hello
    - howdy
- intent: goodbye
  examples: |
    - goodbye
    - bye
    - see you later

The string value 3.1 needs to be quoted, in order not to be loaded as a float. By default this is dumped
as a single quoted scalar (which are easier/quicker to parse in YAML than double quoted scalars).
If you want it dumped with double quotes you can do:

data['version'] = ruamel.yaml.scalarstring.DoubleQuotedScalarString(data['version'])

Answered By: Anthon