In Python, how to create multiple dataclass instances with a different object instance in each field?

Question:

I’m trying to write a parser and I’m missing something about how dataclasses work.
I’m trying to be as generic as possible and to put the logic in the parent class, but every child ends up with the same values.
I’m confused about what the dataclass decorator does with class variables and instance variables.
I should probably not use self.__dict__ in my __post_init__.

How would you get unique instances while keeping the same idea?

from dataclasses import dataclass

class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None

@dataclass
class RecordParser():
    line: str
    def __post_init__(self):
        for k, var in self.__dict__.items():
            if isinstance(var, VarSlice):
                self.__dict__[k].value = self.line[var.slice]

@dataclass
class HeaderRecord(RecordParser):
    sender : VarSlice = VarSlice(3, 8)


k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")
print(k.sender.value)
print(kk.sender.value)

Result:

45678
45678

Expected result:

defgh
45678

I tried changing VarSlice to a dataclass too but it changed nothing.

Asked By: Antoine GERVAIL

||

Answers:

This behavior occurs because, when you write:

sender: VarSlice = VarSlice(3, 8)

The default value here is one specific instance, VarSlice(3, 8), which is created once when the class is defined and is then shared between all HeaderRecord instances.

This can be confirmed by printing the id of the VarSlice object inside __post_init__. If the id is the same every time an instance of a RecordParser subclass is constructed, the object is shared:

if isinstance(var, VarSlice):
    print(id(var))
    ...

This is very likely not what you want.
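
The sharing can also be checked directly with an identity test. The following is a minimal sketch; for brevity it declares the field on a single flattened HeaderRecord class instead of going through the RecordParser parent:

```python
from dataclasses import dataclass


class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None


@dataclass
class HeaderRecord:
    line: str
    # evaluated once, when the class body runs
    sender: VarSlice = VarSlice(3, 8)


k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")

# both instances, and the class attribute, point at the same object
print(k.sender is kk.sender)            # True
print(k.sender is HeaderRecord.sender)  # True

# so a write through one instance is visible through the other
k.sender.value = "changed"
print(kk.sender.value)                  # changed
```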

The desired behavior is most likely to create a new VarSlice(3, 8) instance each time a new HeaderRecord object is instantiated.

To resolve the issue, I would suggest using default_factory instead of default, as this is the recommended (and documented) approach for fields with mutable default values.

i.e.,

sender: VarSlice = field(default_factory=lambda: VarSlice(3, 8))

instead of:

sender: VarSlice = VarSlice(3, 8)

The latter is technically equivalent to:

sender: VarSlice = field(default=VarSlice(3, 8))
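
For comparison, dataclasses refuses the built-in mutable types (list, dict, set) as plain defaults and points to default_factory in the error message, whereas an instance of an ordinary class such as VarSlice can pass that check silently. The sketch below uses two hypothetical classes, Broken and Fixed, to show the error and the per-instance behavior of default_factory:

```python
from dataclasses import dataclass, field

error = None
try:
    @dataclass
    class Broken:
        items: list = []  # rejected when the class is created
except ValueError as exc:
    error = str(exc)
    print(error)


@dataclass
class Fixed:
    # the factory is called once per instance
    items: list = field(default_factory=list)


a, b = Fixed(), Fixed()
a.items.append(1)
print(a.items)  # [1]
print(b.items)  # []
```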

Full code with example:

from dataclasses import dataclass, field


class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None


@dataclass
class RecordParser:
    line: str

    def __post_init__(self):
        for var in self.__dict__.values():
            if isinstance(var, VarSlice):
                var.value = self.line[var.slice]


@dataclass
class HeaderRecord(RecordParser):
    sender: VarSlice = field(default_factory=lambda: VarSlice(3, 8))


k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")
print(k.sender.value)
print(kk.sender.value)

Now prints:

defgh
45678

Improving Performance

Although this is clearly not a bottleneck, there is some room for improvement when creating many instances of a RecordParser subclass.

Reasons that performance could be (slightly) impacted:

  • A for loop runs on each instantiation to find the dataclass fields of type VarSlice; this loop could be avoided.
  • The __dict__ attribute of the instance is accessed each time, which can also be avoided. Note that using dataclasses.fields() instead is actually worse, as its result is not cached on a per-class basis.
  • An isinstance check is run on each dataclass field, each time a subclass is instantiated.
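
To get a rough idea of the cost of the two lookup strategies mentioned above, something like the following timeit sketch could be used. Sample is a hypothetical dataclass introduced only for the measurement, and the absolute numbers will depend on the machine and the Python version:

```python
import timeit
from dataclasses import dataclass, field, fields


@dataclass
class Sample:
    a: int = 0
    b: str = ""
    c: list = field(default_factory=list)


obj = Sample()


def via_dict():
    # read the values straight from the instance dictionary
    return list(obj.__dict__.values())


def via_fields():
    # fields() builds a new tuple of Field objects on every call
    return [getattr(obj, f.name) for f in fields(obj)]


t_dict = timeit.timeit(via_dict, number=50_000)
t_fields = timeit.timeit(via_fields, number=50_000)
print(f"__dict__: {t_dict:.4f}s  fields(): {t_fields:.4f}s")
```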

To resolve this, I would suggest statically generating a __post_init__() method for the subclass via dataclasses._create_fn() (or copying its logic to avoid depending on an "internal" function, which may change or disappear between Python versions), and setting it on the subclass before the @dataclass decorator runs for that subclass.

An easy way to do this is to use the __init_subclass__() hook, which runs when a class is subclassed, as shown below.

# to test when annotations are forward-declared (i.e. as strings)
# from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field, _create_fn


class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None


@dataclass
class RecordParser:
    line: str

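    def __init_subclass__(cls, **kwargs):
        # cooperate with any other `__init_subclass__` in the class hierarchy
        super().__init_subclass__(**kwargs)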
        # list containing the (dynamically-generated) body lines of `__post_init__()`
        post_init_lines = deque()
        # loop over class annotations (this is a greatly "simplified"
        # version of how the `dataclasses` module does it)
        for name, tp in cls.__annotations__.items():
            if tp is VarSlice or (isinstance(tp, str) and tp == VarSlice.__name__):
                post_init_lines.append(f'var = self.{name}')
                post_init_lines.append('var.value = line[var.slice]')
        # if there are no dataclass fields of type `VarSlice`, we are done
        if post_init_lines:
            post_init_lines.appendleft('line = self.line')
            cls.__post_init__ = _create_fn('__post_init__', ('self', ), post_init_lines)


@dataclass
class HeaderRecord(RecordParser):
    sender: VarSlice = field(default_factory=lambda: VarSlice(3, 8))


k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")
print(k.sender.value)
print(kk.sender.value)
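
If depending on the private _create_fn helper is a concern, the same idea can be sketched with a plain exec call. This is only a sketch under a few assumptions: make_post_init is a hypothetical helper name (not part of dataclasses), and only annotations visible on the subclass at definition time are inspected.

```python
from dataclasses import dataclass, field


class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None


def make_post_init(names):
    """Build a `__post_init__` that fills each named VarSlice field (hypothetical helper)."""
    lines = ["def __post_init__(self):", "    line = self.line"]
    for name in names:
        lines.append(f"    var = self.{name}")
        lines.append("    var.value = line[var.slice]")
    namespace = {}
    # the generated `def` statement is stored in `namespace`
    exec("\n".join(lines), {}, namespace)
    return namespace["__post_init__"]


@dataclass
class RecordParser:
    line: str

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        annotations = getattr(cls, "__annotations__", {})
        # accept both the class itself and its name as a string annotation
        names = [name for name, tp in annotations.items()
                 if tp is VarSlice or tp == VarSlice.__name__]
        if names:
            cls.__post_init__ = make_post_init(names)


@dataclass
class HeaderRecord(RecordParser):
    sender: VarSlice = field(default_factory=lambda: VarSlice(3, 8))


k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")
print(k.sender.value)   # defgh
print(kk.sender.value)  # 45678
```

The generated method contains no loop and no isinstance check, so the per-instance work is limited to one slice per VarSlice field.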
Answered By: rv.kvetch