Convert non-tabular, comma-separated data to pydantic
Question:
I have a "special" csv file in the following format:
A;ItemText;1;2
B;1;1.23,99
B;2;9.52,100
C;false
I would like to convert this data into pydantic models.
Currently I subclassed the FieldInfo class:
class CSVFieldInfo(FieldInfo):
    def __init__(self, **kwargs: Any):
        self.position = kwargs["position"]
        if not isinstance(self.position, int):
            raise ValueError("Position should be integer, got {}".format(type(self.position)))
        super().__init__()

def CSVField(position: int):
    return CSVFieldInfo(position=position)
Furthermore, I also subclassed the BaseModel:
class CSVBaseModel(BaseModel):
    @classmethod
    def from_string(cls, string: str, sep: str = ";"):
        # no double definitions
        positions = [x.field_info.position for x in cls.__fields__.values()]
        if not len(set(positions)) == len(positions):
            raise ValueError("At least one position is defined twice")
        # here i am stuck on how to populate the model correctly (including nested models)
The model layout is then the following:
class CSVTypeA(CSVBaseModel):
    record_type: Literal["A"] = CSVField(position=0)
    record_text: str = CSVField(position=1)
    num: int = CSVField(position=2)

class CSVFile(CSVBaseModel):
    a: CSVTypeA
csv_string = """A;ItemText;1;2
B;1;1.23,99
B;2;9.52,100
C;false"""
CSVFile.from_string(csv_string)
How can I populate the pydantic model "CSVFile", automatically assigning the right CSV-Line to the correct model (discriminating by field "record_type")?
Answers:
The main problem is that there is no way to know which type a record is compatible with without actually looking at the first field of that record.
Since you have no key-value pairs, and your records are essentially just lists of unnamed fields, we must determine the correct field names after checking which record type we are dealing with.
This means we need to re-implement some of the magic behind Pydantic’s discriminated unions. Fortunately, we can take advantage of the fact that a ModelField saves a dictionary mapping each discriminator key to the corresponding sub-field in its sub_fields_mapping attribute.
So we can still utilize some of the built-in machinery provided by Pydantic and define our discriminated union properly.
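The core idea can be sketched without Pydantic at all: look at the first field, pick the matching list of field names, and zip the record values into a dict. This is a minimal stand-alone sketch, where FIELD_NAMES is a hypothetical stand-in for the information sub_fields_mapping provides:

```python
# Hypothetical stand-in for the discriminator -> field-names mapping that
# Pydantic derives from the Literal["A"]/["B"]/["C"] annotations.
FIELD_NAMES = {
    "A": ["record_type", "record_text", "num", "another_num"],
    "B": ["record_type", "num_foo", "num_floaty", "num_bar"],
    "C": ["record_type", "spam"],
}

def dict_from_record(line: str, sep: str = ";") -> dict:
    """Split a raw record and name its fields based on the first value."""
    values = line.strip().split(sep)
    try:
        names = FIELD_NAMES[values[0]]
    except KeyError:
        raise ValueError(f"{values[0]} is not a valid key")
    return dict(zip(names, values))

print(dict_from_record("C;false"))  # {'record_type': 'C', 'spam': 'false'}
```

The resulting dict still holds raw strings; in the full solution below, Pydantic's field validation is what coerces them into int, float, bool, and so on.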
But first we need to define some (exemplary) record types:
record_types.py
from typing import Literal, Union

from pydantic import BaseModel

class CSVLine(BaseModel):
    record_type: str

    def some_method(self) -> None:
        print(self)

class CSVTypeA(CSVLine):
    record_type: Literal["A"]
    record_text: str
    num: int
    another_num: int

class CSVTypeB(CSVLine):
    record_type: Literal["B"]
    num_foo: int
    num_floaty: float
    num_bar: int

class CSVTypeC(CSVLine):
    record_type: Literal["C"]
    spam: bool

CSVType = Union[CSVTypeA, CSVTypeB, CSVTypeC]
Next we define the model to represent the actual CSV file as a custom root type, which will be a list of our discriminated union of record types. We’ll also define our own class attribute __csv_separator__ to hold the string that we will split records with.
To make dealing with an instance of this model easier and more intuitive, we’ll also define/override a few custom methods for item access, string representation and so on.
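Those overrides are plain delegation to the underlying list; stripped of Pydantic, the pattern looks like this (a minimal sketch, with Wrapper as a hypothetical stand-in for the root model):

```python
from collections.abc import Iterator

class Wrapper:
    """Hypothetical stand-in: delegates sequence behavior to an inner list."""

    def __init__(self, items: list) -> None:
        self._items = items

    def __iter__(self) -> Iterator:
        # Iterating the wrapper iterates the inner list
        yield from self._items

    def __getitem__(self, index: int):
        # Indexing the wrapper indexes the inner list
        return self._items[index]

    def __repr__(self) -> str:
        return repr(self._items)

w = Wrapper(["a", "b"])
print(w[0], list(w))  # a ['a', 'b']
```

In CSVFile below, self.__root__ plays the role of self._items.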
Lastly, we’ll need to implement the actual magic of parsing each line of a CSV file into an instance of the appropriate record type in a custom validator.
csv_model.py
from collections.abc import Iterator
from typing import Annotated, Any, ClassVar

from pydantic import BaseModel, Field, validator
from pydantic.fields import ModelField

from .record_types import CSVType

class CSVFile(BaseModel):
    __csv_separator__: ClassVar[str] = ";"

    __root__: list[Annotated[CSVType, Field(discriminator="record_type")]]

    def __iter__(self) -> Iterator[CSVType]:  # type: ignore[override]
        yield from self.__root__

    def __getitem__(self, item: int) -> CSVType:
        return self.__root__[item]

    def __str__(self) -> str:
        return str(self.__root__)

    def __repr__(self) -> str:
        return repr(self.__root__)

    @validator("__root__", pre=True, each_item=True)
    def dict_from_string(cls, v: Any, field: ModelField) -> Any:
        if not isinstance(v, str):
            return v  # let default Pydantic validation take over
        record_fields = v.strip().split(cls.__csv_separator__)
        discriminator_key = record_fields[0]
        assert field.sub_fields_mapping is not None
        try:  # determine the model to validate against
            type_ = field.sub_fields_mapping[discriminator_key].type_
        except KeyError:
            raise ValueError(f"{discriminator_key} is not a valid key")
        assert issubclass(type_, BaseModel)
        field_names = type_.__fields__.keys()
        return dict(zip(field_names, record_fields))
That should be all we need.
To create an instance of CSVFile, we just need any iterable of strings (the lines of a CSV file). As with all custom root types, we can initialize it either by calling the __init__ method with the __root__ keyword argument or by passing our iterable of strings to the parse_obj method.
Demo
csv_string = """
A;ItemText;1;2
B;1;1.23;99
B;2;9.52;100
C;false
""".strip()

obj = CSVFile.parse_obj(csv_string.split("\n"))
print(obj[0])
obj[3].some_method()
print(obj.json(indent=4))
Output:
record_type='A' record_text='ItemText' num=1 another_num=2
record_type='C' spam=False
[
{
"record_type": "A",
"record_text": "ItemText",
"num": 1,
"another_num": 2
},
{
"record_type": "B",
"num_foo": 1,
"num_floaty": 1.23,
"num_bar": 99
},
{
"record_type": "B",
"num_foo": 2,
"num_floaty": 9.52,
"num_bar": 100
},
{
"record_type": "C",
"spam": false
}
]
Side note: The reason we need the class variable __csv_separator__ is that the validator is a class method, and it needs to know which separator to use. You could of course write a separate method (as you attempted in your original post) and pass the separator as an argument, then temporarily mutate the class variable and call parse_obj. But I think it may be easier to just change the separator globally or selectively in a subclass.
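Overriding the separator in a subclass works because cls inside a classmethod resolves the class attribute on the most derived class first. The lookup mechanics can be sketched without Pydantic (PipeSplitter here is a hypothetical example):

```python
class CSVSplitter:
    __csv_separator__ = ";"

    @classmethod
    def split_record(cls, line: str) -> list[str]:
        # cls looks up __csv_separator__ starting at the subclass
        return line.strip().split(cls.__csv_separator__)

class PipeSplitter(CSVSplitter):
    # Hypothetical subclass: only the separator changes
    __csv_separator__ = "|"

print(CSVSplitter.split_record("A;ItemText;1;2"))   # ['A', 'ItemText', '1', '2']
print(PipeSplitter.split_record("A|ItemText|1|2"))  # ['A', 'ItemText', '1', '2']
```

A CSVFile subclass overriding __csv_separator__ would behave the same way, since the validator splits via cls.__csv_separator__.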
Also, I don’t see a reason to explicitly specify the position, so long as the fields of the record type model (and their definition order) match the actual fields of the CSV records.