How to serialise and deserialise complex POCO data structures in Python to/from JSON

Question:

We have been researching this for hours with no luck. There are many ways to serialise and deserialise objects in Python, but we need a simple, standard one that respects type annotations. For example:

from typing import List, NamedTuple

class Address(object):
    city:str
    postcode:str

class Person(NamedTuple):
    name:str
    addresses:List[Address]

My ask is extremely simple: I am looking for a standard way to convert to and from JSON without having to write serialisation/deserialisation code for every class. For example:

json = '{ "name": "John", "addresses": [{ "postcode": "EC2 2FA", "city": "London" }, { "city": "Paris", "postcode": "545887", "extra_attribute": "" }]}'

I need a way to:

p= magic(json, Person) # or something similar
print(type(p)) # should print Person
for a in p.addresses:
    print(type(a)) # prints Address
    print(a.city) # should print London then Paris
json2 = unmagic(p)
print(json2 == json) # prints true (probably there will be difference in spacing, but just to clarify the idea)

I have worked in programming for 15 years and have been using Python for a year, yet even after extensive research I am still not sure what the best way is to simply serialise/deserialise a structure of POCO objects. I feel dumb.

Edit

Options explored so far have one or more of the following limitations:

  • Depend on the order of elements within the JSON / class definition instead of names of the attributes (the previous example would fail because city and postcode are mixed up).
  • Fail if there are extra details in the JSON (the previous example would fail because there is an extra_attribute).
  • Return dictionary instead of a typed object, or SimpleNamespace, and not an object of the intended type.
  • Require writing serialisation/deserialization code for each and every different class, which is extremely error-prone.
Asked By: Saw


Answers:

You can solve this with dataclasses and the dacite library. Here’s my example:

from dataclasses import dataclass, asdict
from typing import List
from dacite import from_dict


@dataclass
class Address:
    city: str
    postcode: str


@dataclass
class Person:
    name: str
    addresses: List[Address]

So if you want to serialize a Person instance, you can do:

address1 = Address("London", "EC2 2FA")
address2 = Address("Paris", "545887")

person = Person(name='John', addresses=[address1, address2])
json = asdict(person)
print(json)

Which will print your person information as:

{'name': 'John', 'addresses': [{'city': 'London', 'postcode': 'EC2 2FA'}, {'city': 'Paris', 'postcode': '545887'}]}

Although a native way was preferred, there is no easy way to meet all the requirements with the standard library alone. Assuming that you don’t want to drop any requirement, the simplest solution I found is the dacite library. Its main entry point is from_dict(data_class, data), which takes care of creating nested dataclasses and ignoring extra keys in the input, among other things. Note that data must be a dict (for example the result of json.loads), not a JSON string.

person2 = from_dict(Person, json)

This complies with all your requirements:

from json import loads

json = loads('{ "name": "John", "addresses": [{ "postcode": "EC2 2FA", "city": "London" }, { "city": "Paris", "postcode": "545887", "extra_attribute": "" }]}')
p = from_dict(Person, json)
print(type(p)) # should print Person
for a in p.addresses:
    print(type(a)) # prints Address
    print(a.city) # should print London then Paris
json2 = asdict(p)
print(json)
print(json2)

Results in:

<class '__main__.Person'>
<class '__main__.Address'>
London
<class '__main__.Address'>
Paris
{'name': 'John', 'addresses': [
    {'postcode': 'EC2 2FA', 'city': 'London'},
    {'city': 'Paris', 'postcode': '545887', 'extra_attribute': ''}
]}
{'name': 'John', 'addresses': [
    {'city': 'London', 'postcode': 'EC2 2FA'}, 
    {'city': 'Paris', 'postcode': '545887'}
]}

Warning: json will not be equal to json2 in this case, because from_dict ignores extra_attribute and asdict(p) therefore cannot reproduce it. The different key order does not matter, since Python compares dicts without regard to order. Objects created from json2 will still have values equal to the objects created from json.
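
To make the distinction between JSON text and dicts concrete, here is a minimal standard-library sketch of the full round trip. The build helper is a hypothetical name written only for illustration (it is not how dacite is implemented), and it assumes every field is a dataclass, a List of dataclasses, or a plain JSON value.

```python
import json
from dataclasses import asdict, dataclass, fields, is_dataclass
from typing import List, get_args, get_origin, get_type_hints


@dataclass
class Address:
    city: str
    postcode: str


@dataclass
class Person:
    name: str
    addresses: List[Address]


def build(tp, value):
    """Hypothetical helper: rebuild `tp` from parsed JSON, matching keys by name."""
    if is_dataclass(tp):
        hints = get_type_hints(tp)
        # Only declared fields are read, so unknown keys are silently ignored.
        kwargs = {f.name: build(hints[f.name], value[f.name]) for f in fields(tp)}
        return tp(**kwargs)
    if get_origin(tp) is list:
        (item_type,) = get_args(tp)
        return [build(item_type, item) for item in value]
    return value  # plain JSON values are returned unchanged, not validated


raw = '{"name": "John", "addresses": [{"postcode": "EC2 2FA", "city": "London"}, {"city": "Paris", "postcode": "545887", "extra_attribute": ""}]}'

data = json.loads(raw)        # JSON text -> dict
p = build(Person, data)       # dict -> typed objects
text = json.dumps(asdict(p))  # typed objects -> dict -> JSON text

print(type(p).__name__, [type(a).__name__ for a in p.addresses])
print(text)

# Key order is irrelevant to dict equality; only the dropped key differs.
print(asdict(p) == data)  # False
del data["addresses"][1]["extra_attribute"]
print(asdict(p) == data)  # True
```

The sketch does no type validation and raises KeyError on missing keys, which is part of what a library such as dacite adds on top.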

First:

pip install dacite

Second: create dto.py

import logging

from typing import Optional, List, cast
from dataclasses import dataclass
from dacite import from_dict

logging.basicConfig(
    filename='response.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)-8s [%(filename)s:%(lineno)d:%(process)s] %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
)

SENTINEL = cast(None, object())


@dataclass
class Address:
    city: Optional[str] = SENTINEL
    postcode: Optional[str] = SENTINEL

    def asdict(self):
        return {k: v for k, v in self.__dict__.items() if v is not SENTINEL}


@dataclass
class Person:
    name: Optional[str] = SENTINEL
    addresses: Optional[List[Address]] = SENTINEL

    def asdict(self):
        return {k: v for k, v in self.__dict__.items() if v is not SENTINEL}


if __name__ == '__main__':
    SAMPLE = {
        "name": "John",
        "addresses": [
            {
                "postcode": "EC2 2FA",
                "city": "London"
            },
            {
                "city": "Paris",
                "postcode": "545887",
                "extra_attribute": ""
            }
        ]
    }

    try:
        targetClass = (
            Address
        )

        INFORMATION = from_dict(
            data_class=Person,
            data=SAMPLE
        )

        # TODO: Should be omitted (just for your questions).
        logging.info(
            " -- type(INFORMATION): " + str(type(INFORMATION))
        )

        # TODO: Should be omitted (just for your questions).
        for a in INFORMATION.addresses:
            logging.info(
                " -- type(a): " + str(type(a))
            )

            logging.info(
                " -- a.city: " + str(a.city)
            )

        INFORMATION = INFORMATION.asdict()
        for key, value in INFORMATION.items():
            if isinstance(value, targetClass):
                INFORMATION.update({key: value.asdict()})

            if isinstance(value, list) and value and isinstance(value[0], targetClass):
                INFORMATION.update({key: [v.asdict() for v in value]})
    except Exception as e:
        logging.error(
            'Error: {}'.format(e)
        )
    finally:
        # TODO: Should be omitted (just for your questions).
        logging.info(
            " -- json: " + str(SAMPLE)
        )

        # TODO: Should be omitted (just for your questions).
        logging.info(
            " -- json2: " + str(INFORMATION)
        )

        # TODO: Should be omitted (just for your questions).
        logging.info(
            " -- json2 == json: " + str(INFORMATION == SAMPLE)
        )

Third: see response.log

2021-03-11 12:49:08 INFO     [dto.py:66:42426]  -- type(INFORMATION): <class '__main__.Person'>
2021-03-11 12:49:08 INFO     [dto.py:72:42426]  -- type(a): <class '__main__.Address'>
2021-03-11 12:49:08 INFO     [dto.py:76:42426]  -- a.city: London
2021-03-11 12:49:08 INFO     [dto.py:72:42426]  -- type(a): <class '__main__.Address'>
2021-03-11 12:49:08 INFO     [dto.py:76:42426]  -- a.city: Paris
2021-03-11 12:49:08 INFO     [dto.py:92:42426]  -- json: {'name': 'John', 'addresses': [{'postcode': 'EC2 2FA', 'city': 'London'}, {'city': 'Paris', 'postcode': '545887', 'extra_attribute': ''}]}
2021-03-11 12:49:08 INFO     [dto.py:96:42426]  -- json2: {'name': 'John', 'addresses': [{'city': 'London', 'postcode': 'EC2 2FA'}, {'city': 'Paris', 'postcode': '545887'}]}
2021-03-11 12:49:08 INFO     [dto.py:100:42426]  -- json2 == json: False
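
The per-class asdict methods and the manual loop over targetClass above can usually be replaced by the standard library's recursive dataclasses.asdict. The sketch below is one possible simplification, not the code from this answer: it assumes None (rather than the SENTINEL object) marks an unset field, and drop_unset is a hypothetical helper name.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Address:
    city: Optional[str] = None
    postcode: Optional[str] = None


@dataclass
class Person:
    name: Optional[str] = None
    addresses: Optional[List[Address]] = None


def drop_unset(items):
    """Hypothetical dict_factory: keep only the fields that were actually set."""
    return {key: value for key, value in items if value is not None}


person = Person(name="John", addresses=[Address(city="London"), Address(postcode="545887")])

# asdict() walks nested dataclasses and lists, calling dict_factory at each level.
result = asdict(person, dict_factory=drop_unset)
print(result)
```

None is used here because asdict deep-copies leaf values, so an identity test against a sentinel object would not survive the conversion.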

I generally use the Marshmallow project to handle JSON serialisation, deserialisation, and validation. When combined with marshmallow-dataclass or, when using SQLAlchemy database models, marshmallow-sqlalchemy, you can produce Marshmallow schemas straight from existing object definitions. You work with instances of the model themselves, so dataclass-defined class instances or SQLAlchemy ORM model instances.

Marshmallow schemas also let you define what happens with extra values in the JSON document; you can ignore these, or throw an exception for them, and vary this per model (models can be nested as needed). You can also restrict a schema to a subset of its fields.

Your small sample model, using marshmallow-dataclass, could be defined as:

import marshmallow
from marshmallow_dataclass import dataclass
from typing import List

class BaseSchema(marshmallow.Schema):
    class Meta:
        unknown = marshmallow.EXCLUDE

@dataclass(base_schema=BaseSchema)
class Address:
    city: str
    postcode: str

@dataclass(base_schema=BaseSchema)
class Person:
    name: str
    addresses: List[Address]

and apart from pip install marshmallow-dataclass before attempting to run the above, that’s it. This example uses an explicit base schema to set the unknown configuration to EXCLUDE, which means: ignore extra attributes in the JSON when loading.

To either deserialize from JSON data, or to serialise to JSON, create an instance of the schema; each dataclass class has a Schema attribute referencing the corresponding (generated) Marshmallow schema object:

>>> schema = Person.Schema()
>>> json = '{ "name": "John", "addresses": [{ "postcode": "EC2 2FA", "city": "London" }, { "city": "Paris", "postcode": "545887", "extra_attribute": "" }]}'
>>> p = schema.loads(json)
>>> p
Person(name='John', addresses=[Address(city='London', postcode='EC2 2FA'), Address(city='Paris', postcode='545887')])
>>> print(type(p)) # should print Person
<class '__main__.Person'>
>>> for a in p.addresses:
...     print(type(a)) # prints Address
...     print(a.city) # should print London then Paris
...
<class '__main__.Address'>
London
<class '__main__.Address'>
Paris
>>> schema.dumps(p)
'{"name": "John", "addresses": [{"postcode": "EC2 2FA", "city": "London"}, {"postcode": "545887", "city": "Paris"}]}'

The Schema.loads() and Schema.dumps() methods accept and produce JSON strings. You can also work with plain Python dictionaries and lists (the types that would be serialisable to JSON using the standard library json module), via Schema.load() and Schema.dump().

For more complex setups you may need to configure the exact validation rules for fields, or exclude some fields from serialisation. You do this with the standard dataclasses.field() function, passing in Marshmallow field options via the metadata argument. marshmallow-dataclass can work out what exact Marshmallow field type to use, but you can always override this. And you can use the NewType() class to define reusable definitions for this; SomeType = NewType("SomeType", python_type, field=MarshmallowField, **field_args) lets you mark dataclass fields as field_name: SomeType in your project.
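
As a rough illustration of the metadata mechanism, using only the standard library: dataclasses.field() stores an arbitrary mapping on the field, and a library can read it back through dataclasses.fields(). The keys used here (required, description) are assumptions chosen for the example; which keys marshmallow-dataclass honours should be checked against its documentation.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Address:
    city: str = field(metadata={"required": True})
    postcode: str = field(default="", metadata={"description": "Postal code"})


# The metadata is a read-only mapping attached to each Field object.
for f in fields(Address):
    print(f.name, dict(f.metadata))
```

The dataclass itself ignores this mapping entirely, which is what makes it a convenient place for third-party libraries to keep per-field options.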

Marshmallow is, at least for me, the Swiss Army Knife project of serialisation and deserialisation, and there are lots of resources that integrate with Marshmallow. E.g. I’m looking at building several RESTful APIs for a customer at the moment, and I’ll definitely be using Flask-Smorest to define the API endpoints and generate OpenAPI documentation at the same time. And all I have to do is create the SQLAlchemy models for this, really.

Here is an example Flask RESTful API based on your Person & Address schema, with the classes defined as SQLAlchemy models:

# pip install Flask flask-marshmallow flask-smorest flask-sqlalchemy marshmallow-sqlalchemy 

import marshmallow
from flask import Flask
from flask.views import MethodView
from flask_marshmallow import Marshmallow
from flask_smorest import Api, Blueprint, abort
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['API_TITLE'] = 'ContactBook'
app.config['API_VERSION'] = 'v1'
app.config['OPENAPI_VERSION'] = '3.0.3'
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///:memory:'
api = Api(app)
db = SQLAlchemy(app)
ma = Marshmallow(app)

class Address(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    city = db.Column(db.String)
    postcode = db.Column(db.String)
    person_id = db.Column(db.Integer, db.ForeignKey('person.id'), nullable=False)

class Person(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String)
    addresses = db.relationship('Address', backref='person', lazy=True)

# create tables in the (in-memory, temporary) database
db.create_all()

class BaseSQLAlchemyAutoSchema(ma.SQLAlchemyAutoSchema):
    def update(self, instance, **data):
        for fname in self.fields:
            if fname not in data:
                continue
            setattr(instance, fname, data.get(fname))

class AddressSchema(BaseSQLAlchemyAutoSchema):
    class Meta:
        table = Address.__table__

class PersonSchema(BaseSQLAlchemyAutoSchema):
    class Meta:
        table = Person.__table__

    addresses = ma.List(ma.Nested(AddressSchema(unknown=marshmallow.EXCLUDE)))

class PersonQueryArgsSchema(ma.Schema):
    name = ma.String()
    city = ma.String()

blp = Blueprint(
    "people", "people", url_prefix="/people", description="Operations on people"
)

@blp.route("/")
class People(MethodView):
    @blp.arguments(PersonQueryArgsSchema, location="query")
    @blp.response(200, PersonSchema(many=True))
    def get(self, args):
        """List people"""
        query = Person.query
        if args.get("name"):
            query = query.filter(Person.name == args["name"])
        if args.get("city"):
            query = query.filter(Person.addresses.any(Address.city == args["city"]))
        return query

    @blp.arguments(PersonSchema(unknown=marshmallow.EXCLUDE))
    @blp.response(201, PersonSchema)
    def post(self, new_person):
        """Add a new person"""
        addresses = new_person.pop("addresses", ())
        person = Person(**new_person)
        for address in addresses:
            person.addresses.append(Address(**address))
        db.session.add(person)
        db.session.commit()
        return person

@blp.route("/<person_id>")
class PersonById(MethodView):
    @blp.response(200, PersonSchema)
    def get(self, person_id):
        """Get person by ID"""
        return Person.query.get_or_404(person_id)

    @blp.arguments(PersonSchema(unknown=marshmallow.EXCLUDE, exclude=('addresses',)))
    @blp.response(200, PersonSchema)
    def put(self, updated_person_data, person_id):
        """Update existing person"""
        person = Person.query.get_or_404(person_id)
        PersonSchema().update(person, **updated_person_data)
        db.session.commit()
        return person

    @blp.response(204)
    def delete(self, person_id):
        """Delete person"""
        db.session.delete(Person.query.get_or_404(person_id))
        db.session.commit()

api.register_blueprint(blp)

Voila, a full-featured REST API that lets us list, create, update and delete Person entries.

Answered By: Martijn Pieters

You can use the builtin dataclasses module, along with a preferred (de)serialization library such as the dataclass-wizard, in order to achieve the desired results.

First, start off by defining the class model or schema, using the @dataclass decorator:

from __future__ import annotations  # can be removed in PY 3.9+

from dataclasses import dataclass


@dataclass
class Address:
    city: str
    postcode: str


@dataclass
class Person:
    name: str
    addresses: list[Address]

Then, install any desired (third-party) libraries:

pip install dacite dataclass-wizard

Adding a quick test, in Python code:

from dataclass_wizard import fromdict, asdict

json_dict = {
    "name": "John",
    "addresses": [{"postcode": "EC2 2FA", "city": "London"},
                  {"city": "Paris", "postcode": "545887", "extra_attribute": ""}]
}

p = fromdict(Person, json_dict)  # or something similar
print(type(p))  # should print Person
for a in p.addresses:
    print(type(a))  # prints Address
    print(a.city)  # should print London then Paris
json_dict2 = asdict(p)

# removes extra data, since that throws off comparison
json_dict['addresses'][-1].pop('extra_attribute')

print(json_dict2 == json_dict)  # prints true

Output:

<class '__main__.Person'>
<class '__main__.Address'>
London
<class '__main__.Address'>
Paris
True

Measuring Performance

Here’s a quick test using the timeit module to compare (de)serialization times against the dacite library and the builtin dataclasses module. A fun fact: dataclass-wizard's serialization is also faster than the builtin asdict helper function (roughly half the time in the results below) 🙂

from timeit import timeit

import dacite
import dataclasses
import dataclass_wizard

json_dict = {"name": "John",
             "addresses": [{"postcode": "EC2 2FA", "city": "London"},
                           {"city": "Paris", "postcode": "545887", "extra_attribute": ""}]}

n = 10_000
print(f'dataclass_wizard.fromdict:   {timeit("dataclass_wizard.fromdict(Person, json_dict)", globals=globals(), number=n):.3f}')
print(f'dacite.from_dict:            {timeit("dacite.from_dict(Person, json_dict)", globals=globals(), number=n):.3f}')

p1 = dataclass_wizard.fromdict(Person, json_dict)  # or something similar
p2 = dacite.from_dict(Person, json_dict)  # or something similar

assert p1 == p2

print(f'dataclass_wizard.asdict:  {timeit("dataclass_wizard.asdict(p2)", globals=globals(), number=n):.3f}')
print(f'dataclasses.asdict:       {timeit("dataclasses.asdict(p2)", globals=globals(), number=n):.3f}')

Results on my M1 Mac:

dataclass_wizard.fromdict:   0.025
dacite.from_dict:            0.972
dataclass_wizard.asdict:  0.028
dataclasses.asdict:       0.054
Answered By: rv.kvetch