How to convert list of model objects to pandas dataframe?
Question:
I have an array of objects of this class:
class CancerDataEntity(Model):
    age = columns.Text(primary_key=True)
    gender = columns.Text(primary_key=True)
    cancer = columns.Text(primary_key=True)
    deaths = columns.Integer()
    ...
When printed, the array looks like this:
[CancerDataEntity(age=u'80-85+', gender=u'Female', cancer=u'All cancers (C00-97,B21)', deaths=15306), CancerDataEntity(...
I want to convert this to a data frame so I can work with it in a way that suits me better: aggregating, counting, summing and similar.
I would like the data frame to look something like this:
      age  gender  cancer  deaths
0  80-85+  Female     ...   15306
1     ...
Is there a way to achieve this using numpy/pandas easily, without manually processing the input array?
Answers:
Try:
import pandas
variables = list(array[0].keys())
dataframe = pandas.DataFrame([[getattr(i, j) for j in variables] for i in array], columns=variables)
Code that leads to the desired result:
variables = arr[0].keys()
df = pd.DataFrame([[getattr(i, j) for j in variables] for i in arr], columns=variables)
Thanks to @Serbitar for pointing me in the right direction.
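The same getattr pattern works for any object whose field names are known up front. Below is a minimal sketch, assuming a hypothetical plain class `Entity` that stands in for the model from the question (so the `keys()` call is replaced by an explicit `fields` list) and a few made-up rows. It ends with the kind of aggregation the question asks about:

```python
import pandas as pd

# Hypothetical stand-in for the model class in the question.
class Entity:
    def __init__(self, age, gender, cancer, deaths):
        self.age = age
        self.gender = gender
        self.cancer = cancer
        self.deaths = deaths

# Made-up sample rows, for illustration only.
arr = [
    Entity("80-85+", "Female", "All cancers", 10),
    Entity("80-85+", "Male", "All cancers", 20),
    Entity("75-79", "Female", "All cancers", 5),
]

# Explicit list of attribute names; each becomes one column.
fields = ["age", "gender", "cancer", "deaths"]
df = pd.DataFrame([[getattr(obj, f) for f in fields] for obj in arr], columns=fields)

# Once the data is in a DataFrame, aggregation is a one-liner.
deaths_by_gender = df.groupby("gender")["deaths"].sum()
```

Listing the fields explicitly also fixes the column order and lets you leave out attributes you do not need.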
A much cleaner way to do this is to define a to_dict method on your class and then use pandas.DataFrame.from_records:
class Signal(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def to_dict(self):
        return {
            'x': self.x,
            'y': self.y,
        }
For example:
In [87]: signals = [Signal(3, 9), Signal(4, 16)]
In [88]: pandas.DataFrame.from_records([s.to_dict() for s in signals])
Out[88]:
   x   y
0  3   9
1  4  16
Just use:
DataFrame([o.__dict__ for o in my_objs])
Full example:
import pandas as pd
# define some class
class SomeThing:
    def __init__(self, x, y):
        self.x, self.y = x, y

# make an array of the class objects
things = [SomeThing(1, 2), SomeThing(3, 4), SomeThing(4, 5)]
# fill dataframe with one row per object, one attribute per column
df = pd.DataFrame([t.__dict__ for t in things])
print(df)
This prints:
   x  y
0  1  2
1  3  4
2  4  5
I would like to emphasize Jim Hunziker’s comment:
pandas.DataFrame([vars(s) for s in signals])
It is far easier to write, less error-prone, and you don’t have to change the to_dict() function every time you add a new attribute.
If you want the freedom to choose which attributes to keep, the columns parameter can be used:
pandas.DataFrame([vars(s) for s in signals], columns=['x', 'y'])
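The downside is that it won’t work well for complex attributes (a nested object, for example, ends up as a single object-valued column rather than being expanded), though that should rarely be the case.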
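To make the trade-off concrete, here is a small sketch of what vars() does and does not pick up. The `Point` class, the `origin` attribute and the `total` property are hypothetical additions for illustration and are not part of the original answer:

```python
import pandas as pd

# Hypothetical nested type, for illustration only.
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class Signal:
    def __init__(self, x, y, origin):
        self.x = x
        self.y = y
        self.origin = origin  # nested object, stored as-is

    @property
    def total(self):
        # Computed on access; not stored in the instance __dict__.
        return self.x + self.y

signals = [Signal(3, 9, Point(0, 0)), Signal(4, 16, Point(1, 1))]

# vars() returns the instance __dict__: x, y and origin, but not the property.
df_all = pd.DataFrame([vars(s) for s in signals])

# The columns parameter keeps only the simple attributes.
df_xy = pd.DataFrame([vars(s) for s in signals], columns=["x", "y"])
```

Here `df_all` has an `origin` column holding `Point` objects, and `total` is missing because properties live on the class rather than on the instance.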
For anyone working with Python 3.7+ dataclasses, this can be done very elegantly using the standard library’s asdict; based on OregonTrail’s example:
from dataclasses import dataclass, asdict
import pandas

@dataclass
class Signal:
    x: float
    y: float

signals = [Signal(3, 9), Signal(4, 16)]
pandas.DataFrame.from_records([asdict(s) for s in signals])
This yields the correct DataFrame without the need for any custom methods, dunder methods, barebones vars or getattr:
   x   y
0  3   9
1  4  16
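One thing worth knowing about asdict is that it recurses into nested dataclasses, so a nested field comes back as a nested dict. The sketch below assumes a hypothetical `Origin` dataclass and uses pandas.json_normalize to flatten the result. That flattening step is an addition for illustration and is not part of the original answer:

```python
from dataclasses import dataclass, asdict

import pandas as pd

# Hypothetical nested dataclass, for illustration only.
@dataclass
class Origin:
    lat: float
    lon: float

@dataclass
class Signal:
    x: float
    y: float
    origin: Origin

signals = [Signal(3, 9, Origin(1.0, 2.0)), Signal(4, 16, Origin(3.0, 4.0))]

# asdict converts nested dataclasses to nested dicts.
records = [asdict(s) for s in signals]

# json_normalize flattens nested dicts into dotted column names.
flat = pd.json_normalize(records)
```

With this input, `flat` has the columns `x`, `y`, `origin.lat` and `origin.lon`, one row per signal.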