Convert non-tabular, comma-separated data to pydantic

Question:

I have a "special" csv file in the following format:

A;ItemText;1;2
B;1;1.23,99
B;2;9.52,100
C;false

I would like to convert this data into pydantic models.

Currently I subclassed the FieldInfo class:

from typing import Any

from pydantic.fields import FieldInfo


class CSVFieldInfo(FieldInfo):

    def __init__(self, **kwargs: Any):
        self.position = kwargs.pop("position")

        if not isinstance(self.position, int):
            raise ValueError("Position should be an integer, got {}".format(type(self.position)))

        super().__init__(**kwargs)


def CSVField(position: int):
    return CSVFieldInfo(position=position)

Furthermore, I also subclassed the BaseModel:

from pydantic import BaseModel


class CSVBaseModel(BaseModel):

    @classmethod
    def from_string(cls, string: str, sep: str = ";"):

        # no duplicate position definitions
        positions = [x.field_info.position for x in cls.__fields__.values()]
        if len(set(positions)) != len(positions):
            raise ValueError("At least one position is defined twice")

        # here I am stuck on how to populate the model correctly (including nested models)

The model layout is then the following:

class CSVTypeA(CSVBaseModel):
    record_type: Literal["A"] = CSVField(position=0)
    record_text: str = CSVField(position=1)
    num: int = CSVField(position=2)

class CSVFile(CSVBaseModel):
    a: CSVTypeA

csv_string = """A;ItemText;1;2
B;1;1.23,99
B;2;9.52,100
C;false"""

CSVFile.from_string(csv_string)

How can I populate the pydantic model "CSVFile", automatically assigning the right CSV-Line to the correct model (discriminating by field "record_type")?

Asked By: Karl


Answers:

The main problem is that there is no way to know which type a record is compatible with without actually looking at the first field of that record.

Since you have no key-value pairs (your records are basically just lists of unnamed fields), we must determine the correct field names after checking which record type we are dealing with.
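To illustrate that idea in isolation first, here is a minimal, Pydantic-free sketch of the dispatch: look up the field names for a record via its first field, then zip the values with those names. The name tables below are hypothetical and just mirror two of the sample record types.

```python
# Hypothetical field-name tables, keyed by the discriminator (first field).
FIELD_NAMES = {
    "A": ("record_type", "record_text", "num", "another_num"),
    "C": ("record_type", "spam"),
}


def record_to_dict(line: str, sep: str = ";") -> dict:
    """Turn one raw CSV record into a dict by dispatching on its first field."""
    fields = line.strip().split(sep)
    try:
        names = FIELD_NAMES[fields[0]]
    except KeyError:
        raise ValueError(f"{fields[0]} is not a valid record type")
    return dict(zip(names, fields))


print(record_to_dict("A;ItemText;1;2"))
# {'record_type': 'A', 'record_text': 'ItemText', 'num': '1', 'another_num': '2'}
```

The full solution below does essentially this, except that the name tables come from the Pydantic models themselves and the resulting dict is then validated against the matching model.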

This means we need to re-implement some of the magic behind Pydantic’s discriminated unions. Fortunately, we can take advantage of the fact that a ModelField saves a dictionary mapping each discriminator key to its sub-field in its sub_fields_mapping attribute.

So we can still utilize some of the built-in machinery provided by Pydantic and define our discriminated union properly.

But first we need to define some (exemplary) record types:

record_types.py

from typing import Literal, Union

from pydantic import BaseModel


class CSVLine(BaseModel):
    record_type: str

    def some_method(self) -> None:
        print(self)


class CSVTypeA(CSVLine):
    record_type: Literal["A"]
    record_text: str
    num: int
    another_num: int


class CSVTypeB(CSVLine):
    record_type: Literal["B"]
    num_foo: int
    num_floaty: float
    num_bar: int


class CSVTypeC(CSVLine):
    record_type: Literal["C"]
    spam: bool


CSVType = Union[CSVTypeA, CSVTypeB, CSVTypeC]
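Once such a union is annotated with a discriminator, Pydantic can already validate plain dicts against it, picking the right model by the record_type value. A short self-contained check (re-declaring two of the types for brevity, using the public parse_obj_as helper):

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, parse_obj_as


class CSVTypeA(BaseModel):
    record_type: Literal["A"]
    record_text: str


class CSVTypeC(BaseModel):
    record_type: Literal["C"]
    spam: bool


# The discriminator tells Pydantic which union member to validate against.
CSVType = Annotated[Union[CSVTypeA, CSVTypeC], Field(discriminator="record_type")]

record = parse_obj_as(CSVType, {"record_type": "C", "spam": "false"})
print(record)  # record_type='C' spam=False
```

All the custom validator below has to do, then, is turn each raw line into such a dict; the discriminated-union machinery handles the rest.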

Next we define the model to represent the actual CSV file as a custom root type, which will be a list of our discriminated union of record types.

We’ll also define our own class attribute __csv_separator__ to hold the string that we will split records with.

To make dealing with an instance of this model easier and more intuitive, we’ll also define/override a few custom methods for item access, string representation and so on.

Lastly, we’ll need to implement the entire magic of parsing an individual line of the CSV file into an instance of the appropriate record type in a custom validator.

csv_model.py

from collections.abc import Iterator
from typing import Annotated, Any, ClassVar

from pydantic import BaseModel, Field, validator
from pydantic.fields import ModelField

from .record_types import CSVType


class CSVFile(BaseModel):
    __csv_separator__: ClassVar[str] = ";"
    __root__: list[Annotated[CSVType, Field(discriminator="record_type")]]

    def __iter__(self) -> Iterator[CSVType]:  # type: ignore[override]
        yield from self.__root__

    def __getitem__(self, item: int) -> CSVType:
        return self.__root__[item]

    def __str__(self) -> str:
        return str(self.__root__)

    def __repr__(self) -> str:
        return repr(self.__root__)

    @validator("__root__", pre=True, each_item=True)
    def dict_from_string(cls, v: Any, field: ModelField) -> Any:
        if not isinstance(v, str):
            return v  # let default Pydantic validation take over
        record_fields = v.strip().split(cls.__csv_separator__)
        discriminator_key = record_fields[0]
        assert field.sub_fields_mapping is not None
        try:  # Determine the model to validate against
            type_ = field.sub_fields_mapping[discriminator_key].type_
        except KeyError:
            raise ValueError(f"{discriminator_key} is not a valid key")
        assert issubclass(type_, BaseModel)
        field_names = type_.__fields__.keys()
        return dict(zip(field_names, record_fields))

That should be all we need.

To create an instance of CSVFile we just need any iterable of strings (the lines of a CSV file). As with all custom root types, we can initialize it either by calling the __init__ method with the __root__ keyword argument or by passing our iterable of strings to the parse_obj method.

Demo

csv_string = """
A;ItemText;1;2
B;1;1.23;99
B;2;9.52;100
C;false
""".strip()

obj = CSVFile.parse_obj(csv_string.split("\n"))
print(obj[0])
obj[3].some_method()
print(obj.json(indent=4))

Output:

record_type='A' record_text='ItemText' num=1 another_num=2
record_type='C' spam=False
[
    {
        "record_type": "A",
        "record_text": "ItemText",
        "num": 1,
        "another_num": 2
    },
    {
        "record_type": "B",
        "num_foo": 1,
        "num_floaty": 1.23,
        "num_bar": 99
    },
    {
        "record_type": "B",
        "num_foo": 2,
        "num_floaty": 9.52,
        "num_bar": 100
    },
    {
        "record_type": "C",
        "spam": false
    }
]

Side note: The reason we need the class variable __csv_separator__ is that the validator is a class method and needs to know which separator to use. You could of course write a separate method (as you attempted in your original post) that takes the separator as an argument, temporarily mutates the class variable, and calls parse_obj. But I think it is easier to just change the separator globally, or selectively in a subclass.
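The subclass approach works because a validator is a class method: cls is the class it is called on, so a subclass's class variable takes precedence automatically. A stripped-down, Pydantic-free sketch of that mechanism (the class names here are hypothetical):

```python
class LineSplitter:
    # default separator, analogous to __csv_separator__ on CSVFile
    __csv_separator__ = ";"

    @classmethod
    def split(cls, line: str) -> "list[str]":
        # cls is whatever class this is called on, so a subclass's
        # separator override is picked up without any extra plumbing
        return line.split(cls.__csv_separator__)


class PipeLineSplitter(LineSplitter):
    __csv_separator__ = "|"


print(LineSplitter.split("A;ItemText"))      # ['A', 'ItemText']
print(PipeLineSplitter.split("A|ItemText"))  # ['A', 'ItemText']
```

Subclassing CSVFile with a different __csv_separator__ would work the same way, since the validator reads the separator through cls.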

Also, I don’t see a reason to explicitly specify the position, as long as the fields of the record type model (and their definition order) match the actual fields of the CSV records.

Answered By: Daniil Fajnberg