Functional Python design with dataclasses, pandas and inheritance

Question:

I’ve written a Python application in a ‘broadly’ functional way, using frozen dataclasses as the inputs and outputs of functions. These dataclasses typically hold a DataFrame and perhaps one other attribute, for example:

@dataclass(frozen=True)
class TimeSeries:
    log: pd.DataFrame
    sourceName: str

I now have more possible data objects, which follow an ‘is-a’ inheritance structure. So perhaps a TimeSeries has a DataFrame with only the columns Time and A, an ExtendedTimeSeries has one with these columns plus a B column, and so on. I now have 4 different TimeSeries which, in an OO paradigm, would fall into a hierarchy.

What is the best structure for this?

I could use (OO style) composition rather than inheritance, and have the ExtendedTimeSeries data structure contain a TimeSeries object and a standalone Temperature series, but that doesn’t seem efficient (the two have to be merged before any DataFrame operation) or safe (the rows could be mismatched).

Without the DataFrames this compositional approach would seem to work ok. Any good design tips?

I could have a series of dataclasses inheriting from each other, but they would have exactly the same variables (in the example above log and sourceName), and I’m not sure that is possible/sensible.
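
For illustration, a minimal sketch of what such an inheritance chain could look like. Frozen dataclasses can inherit from each other without adding fields. The `required_columns` attribute, the validation in `__post_init__` and the `mean_a` function are assumptions made for this sketch, not part of the application described above.

```python
from dataclasses import dataclass
from typing import ClassVar, Tuple

import pandas as pd


@dataclass(frozen=True)
class TimeSeries:
    log: pd.DataFrame
    sourceName: str

    # Hypothetical column contract. ClassVar keeps it out of the
    # generated __init__ and out of the dataclass field list.
    required_columns: ClassVar[Tuple[str, ...]] = ("Time", "A")

    def __post_init__(self):
        # Reject a DataFrame that lacks any column this class promises.
        missing = [c for c in self.required_columns if c not in self.log.columns]
        if missing:
            raise ValueError(f"{type(self).__name__} is missing columns: {missing}")


@dataclass(frozen=True)
class ExtendedTimeSeries(TimeSeries):
    # Same fields as the parent; only the column contract is widened.
    required_columns: ClassVar[Tuple[str, ...]] = TimeSeries.required_columns + ("B",)


def mean_a(ts: TimeSeries) -> float:
    # Works for TimeSeries and any subclass, since all of them carry column A.
    return float(ts.log["A"].mean())


ext = ExtendedTimeSeries(
    pd.DataFrame({"Time": [0, 1], "A": [1.0, 3.0], "B": [5, 6]}), "sensor"
)
print(mean_a(ext))
```

With this layout the subclasses differ only in the columns they guarantee, so a function annotated with the base type can accept any of them.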

Asked By: Olivia Sprogget


Answers:

In this scenario I would discriminate the cases with a src_type attribute that identifies the type of data. The src_type can be determined automatically in a __post_init__ method (using object.__setattr__ to get around the frozen status) and then used in the functional evaluation.

from enum import Enum
from dataclasses import dataclass, field

import pandas as pd


# predefined source types for easier discrimination
class SrcType(Enum):
    STANDARD = 0
    EXTENDED = 1


@dataclass(frozen=True)
class TimeSeries:
    log: pd.DataFrame
    src_name: str
    # derived in __post_init__, so it is excluded from the generated __init__
    src_type: SrcType = field(init=False)

    def __post_init__(self):
        # criteria for various source types
        if 'B' in self.log.columns:
            src_type = SrcType.EXTENDED
        else:
            src_type = SrcType.STANDARD
        # bypassing the frozen attribute
        object.__setattr__(self, 'src_type', src_type)


series = TimeSeries(pd.DataFrame(), "my_src")
print(series.src_type)  # <- SrcType.STANDARD
series = TimeSeries(pd.DataFrame({'B': [0]}), "my_src")
print(series.src_type)  # <- SrcType.EXTENDED
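
As a rough sketch of how src_type might then be used in the functional evaluation: the `summarise` function and the sample data below are hypothetical, and the class is restated so the block runs on its own. `field(init=False)` is used here so that callers cannot pass src_type themselves.

```python
from dataclasses import dataclass, field
from enum import Enum

import pandas as pd


class SrcType(Enum):
    STANDARD = 0
    EXTENDED = 1


@dataclass(frozen=True)
class TimeSeries:
    log: pd.DataFrame
    src_name: str
    src_type: SrcType = field(init=False)

    def __post_init__(self):
        kind = SrcType.EXTENDED if "B" in self.log.columns else SrcType.STANDARD
        # bypassing the frozen attribute
        object.__setattr__(self, "src_type", kind)


def summarise(ts: TimeSeries) -> dict:
    """Hypothetical evaluation step: column means, branching on src_type."""
    result = {"A": float(ts.log["A"].mean())}
    if ts.src_type is SrcType.EXTENDED:
        # Only extended series are expected to carry column B.
        result["B"] = float(ts.log["B"].mean())
    return result


standard = TimeSeries(pd.DataFrame({"Time": [0, 1], "A": [1.0, 3.0]}), "my_src")
extended = TimeSeries(
    pd.DataFrame({"Time": [0, 1], "A": [1.0, 3.0], "B": [10.0, 20.0]}), "my_src"
)
print(summarise(standard))
print(summarise(extended))
```

The functions stay pure: they read src_type and the DataFrame and return a new value, without modifying the frozen input.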
Answered By: Christian Karcher