Combining Dumper class with string representer to get exact required YAML output

Question:

I’m using PyYAML 6.0 with Python 3.9.

In order, I am trying to…

  1. Create a YAML list
  2. Embed this list as a multi-line string in another YAML object
  3. Replace this YAML object in an existing document
  4. Write the document back, in a format that will pass YAML 1.2 linting

I have the process working, apart from the YAML 1.2 requirement, with the following code:

import yaml

def str_presenter(dumper, data):
    """configures yaml for dumping multiline strings
    Ref: https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data"""
    if data.count('\n') > 0:  # check for multiline string
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    return dumper.represent_scalar('tag:yaml.org,2002:str', data)

yaml.add_representer(str, str_presenter)
yaml.representer.SafeRepresenter.add_representer(
    str, str_presenter) 

class DoYamlStuff:
    @staticmethod
    def post_renderers(images):
        return yaml.dump([
            {
                "op": "replace",
                "path": "/spec/postRenderers",
                "value": [
                    {
                        "kustomize": {
                            "images": images
                        }
                    }
                ]
            }])

    @classmethod
    def images_patch(cls, chart, images, ecr_url):
        return {
            "target": {
                "kind": "HelmRelease",
                "name": chart,
                "namespace": chart
            },
            "patch": cls.post_renderers([x.patch(ecr_url) for x in images])
        }

This produces something like this:

- patch: |
    - op: replace
      path: /spec/postRenderers
      value:
      - kustomize:
          images:
          - name: nginx:latest
            newName: 12345678910.dkr.ecr.eu-west-1.amazonaws.com/nginx
            newTag: latest
  target:
    kind: HelmRelease
    name: nginx
    namespace: nginx

As you can see, that’s mostly working. Valid YAML, does what it needs to, etc.

Unfortunately… it doesn’t indent the list item by 2 spaces, so the YAML linter in our repository’s pre-commit then adjusts everything. Makes the repo messy, and causes PRs to regularly include changes that aren’t relevant.

I then set out to implement this PrettyDumper class from StackOverflow. This reversed the effects – my indentation is now right, but my scalars aren’t working at all:

  - patch: "- op: replace\n  path: /spec/postRenderers\n  value:\n    - kustomize:\n
              images:\n          - name: nginx:latest\n           
       newName: 793961818876.dkr.ecr.eu-west-1.amazonaws.com/nginx\n        
          newTag: latest\n"
    target:
      kind: HelmRelease
      name: nginx
      namespace: nginx

I have tried to merge the str_presenter function with the PrettyDumper class, but the scalars still don’t work:

import yaml.emitter
import yaml.serializer
import yaml.representer
import yaml.resolver


class IndentingEmitter(yaml.emitter.Emitter):
    def increase_indent(self, flow=False, indentless=False):
        """Ensure that lists items are always indented."""
        return super().increase_indent(
            flow=False,
            indentless=False,
        )


class PrettyDumper(
    IndentingEmitter,
    yaml.serializer.Serializer,
    yaml.representer.Representer,
    yaml.resolver.Resolver,
):
    def __init__(
        self,
        stream,
        default_style=None,
        default_flow_style=False,
        canonical=None,
        indent=None,
        width=None,
        allow_unicode=None,
        line_break=None,
        encoding=None,
        explicit_start=None,
        explicit_end=None,
        version=None,
        tags=None,
        sort_keys=True,
    ):
        IndentingEmitter.__init__(
            self,
            stream,
            canonical=canonical,
            indent=indent,
            width=width,
            allow_unicode=allow_unicode,
            line_break=line_break,
        )
        yaml.serializer.Serializer.__init__(
            self,
            encoding=encoding,
            explicit_start=explicit_start,
            explicit_end=explicit_end,
            version=version,
            tags=tags,
        )
        yaml.representer.Representer.__init__(
            self,
            default_style=default_style,
            default_flow_style=default_flow_style,
            sort_keys=sort_keys,
        )
        yaml.resolver.Resolver.__init__(self)
        
        yaml.add_representer(str, self.str_presenter)
        yaml.representer.SafeRepresenter.add_representer(
            str, self.str_presenter) 

    def str_presenter(self, data):
        print(data)
        """configures yaml for dumping multiline strings
        Ref: https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data"""
        if data.count('\n') > 0:  # check for multiline string
            return self.represent_scalar('tag:yaml.org,2002:str', data, style='|')
        return self.represent_scalar('tag:yaml.org,2002:str', data)

If I could merge these two approaches into the PrettyDumper class, I think it would do everything I require. Can anyone point me in the right direction?
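For reference, a minimal sketch of how the two pieces might be merged while staying on PyYAML. It assumes the scalars fall back to double quotes because `yaml.add_representer` registers on `yaml.Dumper` by default, and `SafeRepresenter.add_representer` on `SafeRepresenter`, so a class assembled from `yaml.representer.Representer` likely never sees the string presenter. The name `IndentedDumper` and the sample values are illustrative only. This addresses indentation and literal style; it does not make PyYAML emit YAML 1.2.

```python
import yaml


class IndentedDumper(yaml.Dumper):
    """Dumper that indents block sequences nested under mapping keys."""

    def increase_indent(self, flow=False, indentless=False):
        # Ignore the emitter's request for an "indentless" sequence.
        return super().increase_indent(flow, False)


def str_presenter(dumper, data):
    """Emit multi-line strings in literal block style (|)."""
    if "\n" in data:
        return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="|")
    return dumper.represent_scalar("tag:yaml.org,2002:str", data)


# Register on the subclass itself rather than on yaml.Dumper or SafeRepresenter.
IndentedDumper.add_representer(str, str_presenter)

# Sample values reused from the question, for illustration only.
images = [{"name": "nginx:latest",
           "newName": "12345678910.dkr.ecr.eu-west-1.amazonaws.com/nginx",
           "newTag": "latest"}]

# Inner document: dumped to a string first, with indented list items.
inner = yaml.dump(
    [{"op": "replace", "path": "/spec/postRenderers",
      "value": [{"kustomize": {"images": images}}]}],
    Dumper=IndentedDumper, sort_keys=False)

# Outer document: the multi-line string is emitted as a literal block.
document = yaml.dump(
    [{"patch": inner,
      "target": {"kind": "HelmRelease", "name": "nginx", "namespace": "nginx"}}],
    Dumper=IndentedDumper, sort_keys=False)

print(document)
```

Both `yaml.dump` calls have to be given `Dumper=IndentedDumper`; otherwise the default `yaml.Dumper` is used and neither the indentation override nor the string presenter applies.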

Asked By: turbonerd

Answers:

If you need to pass your output through YAML 1.2 linting, you should not use PyYAML as it only supports (a subset of) YAML 1.1.
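As a small illustration of the YAML 1.1 behaviour (a sketch, not an exhaustive comparison): PyYAML resolves plain scalars with YAML 1.1 rules, so `yes` loads as a boolean and `010` as an octal integer. Under the YAML 1.2 core schema these would instead be the string `yes` and the decimal integer 10. The keys `enabled` and `mode` are made up for the example.

```python
import yaml

# PyYAML applies YAML 1.1 resolution rules to plain scalars.
doc = yaml.safe_load("enabled: yes\nmode: 010\n")

print(doc)  # {'enabled': True, 'mode': 8}
```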

ruamel.yaml can handle more, e.g. using a sequence as a mapping key, something that PyYAML cannot handle at all, although it is valid YAML 1.1. Apart from that, it supports, and defaults to, YAML 1.2 loading/dumping (disclaimer: I am the author of that package).

Over the years ruamel.yaml's round-trip mode, which was originally built to preserve comments, has been extended and now handles superfluous quotes, anchor/alias name preservation, different string scalar formats, integer and float formats, etc. You can use its underlying technology to easily get what you want, without mucking with representers:

import sys
import io
import ruamel.yaml

images = [
   dict(name='nginx:latest', newName='12345678910.dkr.ecr.eu-west-1.amazonaws.com/nginx', newTag='latest'),
]
chart = 'nginx'

def data_as_literal_scalar(d):
    """dump a data structure d and make it a literal scalar string for further dumping"""
    yaml = ruamel.yaml.YAML()
    yaml.indent(sequence=4, offset=2)  # this indents even the root sequence by 2 extra positions
    buf = io.StringIO()
    yaml.dump(d, buf)
    v = ''.join([x[2:] for x in buf.getvalue().splitlines(True)])  # strip extra positions
    return ruamel.yaml.scalarstring.LiteralScalarString(v)

data = [dict(
    patch=data_as_literal_scalar([{
        "op": "replace",
        "path": "/spec/postRenderers",
        "value": [
            {
                "kustomize": {
                    "images": images
                }
            }
        ]
    }]),
    target={
        "kind": "HelmRelease",
        "name": chart,
        "namespace": chart
    },
)]

yaml = ruamel.yaml.YAML()
yaml.dump(data, sys.stdout)

which gives:

- patch: |
    - op: replace
      path: /spec/postRenderers
      value:
        - kustomize:
            images:
              - name: nginx:latest
                newName: 12345678910.dkr.ecr.eu-west-1.amazonaws.com/nginx
                newTag: latest
  target:
    kind: HelmRelease
    name: nginx
    namespace: nginx
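
The `x[2:]` slice in `data_as_literal_scalar` can be looked at in isolation. The sketch below uses a hard-coded, hypothetical sample of dumper output in which the root sequence carries two extra leading positions, and removes them line by line while keeping the line endings.

```python
# Hypothetical sample of what a dumper configured with two extra
# leading positions might emit for a root-level sequence.
dumped = (
    "  - op: replace\n"
    "    path: /spec/postRenderers\n"
    "    value:\n"
    "      - kustomize:\n"
    "          images: []\n"
)

# splitlines(True) keeps each line ending, so joining restores the text
# with the first two characters of every line removed.
stripped = "".join(line[2:] for line in dumped.splitlines(True))

print(stripped)
```

One caveat: a line shorter than two characters, such as an empty line, would lose its line break with this slice, so the approach assumes every dumped line starts with at least two spaces.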
Answered By: Anthon