prevent pandas.combine from converting dtypes
Question:
Undesired behavior: pandas.combine turns ints to floats.
Description:
My DataFrame contains a list of filenames (index) and some metadata about each:
            pags  rating  tms  glk
name
file1  original0       1    1    1
file2  original1       2    2    2
file3  original2       3    3    3
file4  original3       4    4    4
file5  original4       5    5    5
Sometimes I need to update some of the columns for some of the files, leaving all other cells unchanged.
Furthermore, the update can contain new files that I need to add as new rows (probably with some N/As).
The update comes in the form of another DataFrame upd:
       pags  rating
name
file4  new0      11
file5  new1      12
file6  new2      13
file7  new3      14
Here, I want to change pags and rating for files 4 and 5, and append new rows for files 6 and 7.
I found I can do this with DataFrame.combine:
df = df.combine(upd, lambda old,new: new.fillna(old), overwrite=False)[df.columns]
            pags  rating  tms  glk
name
file1  original0     1.0  1.0  1.0
file2  original1     2.0  2.0  2.0
file3  original2     3.0  3.0  3.0
file4       new0    11.0  4.0  4.0
file5       new1    12.0  5.0  5.0
file6       new2    13.0  NaN  NaN
file7       new3    14.0  NaN  NaN
The only problem is that all integer columns were converted to floats.
How do I keep the original dtypes?
I strongly want to avoid a manual .astype() for every column.
Code to create this example:
import pandas as pd

df = pd.DataFrame({
    'name': ['file1', 'file2', 'file3', 'file4', 'file5'],
    'pags': ["original" + str(i) for i in range(5)],
    'rating': [1, 2, 3, 4, 5],
    'tms': [1, 2, 3, 4, 5],
    'glk': [1, 2, 3, 4, 5],
}).set_index('name')
upd = pd.DataFrame({
    'name': ['file4', 'file5', 'file6', 'file7'],
    'pags': ["new" + str(i) for i in range(4)],
    'rating': [11, 12, 13, 14],
}).set_index('name')
# Take the new value where present, otherwise keep the old one.
df = df.combine(upd, lambda old, new: new.fillna(old), overwrite=False)[df.columns]
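To see which columns are affected, the dtypes can be compared before and after the combine. This is only a minimal sketch on the example frames; the names before, combined and changed are illustrative and not part of my real code.

```python
import pandas as pd

# Same example frames as in the question.
df = pd.DataFrame({
    'name': ['file1', 'file2', 'file3', 'file4', 'file5'],
    'pags': ["original" + str(i) for i in range(5)],
    'rating': [1, 2, 3, 4, 5],
    'tms': [1, 2, 3, 4, 5],
    'glk': [1, 2, 3, 4, 5],
}).set_index('name')
upd = pd.DataFrame({
    'name': ['file4', 'file5', 'file6', 'file7'],
    'pags': ["new" + str(i) for i in range(4)],
    'rating': [11, 12, 13, 14],
}).set_index('name')

before = df.dtypes  # per-column dtypes prior to the update
combined = df.combine(upd, lambda old, new: new.fillna(old), overwrite=False)[df.columns]

# Collect every column whose dtype differs after the combine,
# as a (dtype before, dtype after) pair of strings.
changed = {
    col: (str(before[col]), str(combined[col].dtype))
    for col in df.columns
    if before[col] != combined[col].dtype
}
print(changed)
```

Columns such as tms and glk have to change because the new rows for file6 and file7 contain NaN, which a plain NumPy integer column cannot hold.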
Answers:
df.astype() can apply all dtypes at once when given a column-to-dtype mapping such as df.dtypes, so what ended up working in my case was:
self.df = ... # read from disk
upd = ... # get updates
original_dtypes = self.df.dtypes
self.df = self.df.combine(upd, lambda old,new: new.fillna(old), overwrite=False)[self.df.columns]
self.df = self.df.apply(...) # fill in the missing data
self.df = self.df.astype(original_dtypes)
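For reference, here is a self-contained sketch of the same idea on the example frames from the question. Two things in it are assumptions that are not in my real code: filling the missing tms and glk cells with 0 (just a placeholder for whatever the fill-in step really does), and the second variant, which swaps in pandas' nullable 'Int64' dtype in case the missing cells should stay missing.

```python
import pandas as pd

# Example frames from the question.
df = pd.DataFrame({
    'name': ['file1', 'file2', 'file3', 'file4', 'file5'],
    'pags': ["original" + str(i) for i in range(5)],
    'rating': [1, 2, 3, 4, 5],
    'tms': [1, 2, 3, 4, 5],
    'glk': [1, 2, 3, 4, 5],
}).set_index('name')
upd = pd.DataFrame({
    'name': ['file4', 'file5', 'file6', 'file7'],
    'pags': ["new" + str(i) for i in range(4)],
    'rating': [11, 12, 13, 14],
}).set_index('name')

original_dtypes = df.dtypes  # remember the dtypes before the update

combined = df.combine(upd, lambda old, new: new.fillna(old), overwrite=False)[df.columns]

# Variant A: fill the gaps, then restore every dtype in one call.
# Assumption: 0 is an acceptable placeholder for the new rows.
filled = combined.fillna({'tms': 0, 'glk': 0}).astype(original_dtypes)

# Variant B: keep the gaps and use the nullable integer dtype instead.
# Only columns that were integer-typed originally are converted.
int_cols = [c for c in df.columns
            if pd.api.types.is_integer_dtype(original_dtypes[c])]
nullable = combined.astype({c: 'Int64' for c in int_cols})

print(filled.dtypes)
print(nullable.dtypes)
```

Variant A only works once no NaN is left in the integer columns, since casting NaN to a plain integer dtype raises an error. Variant B avoids that requirement but changes the dtype from int64 to Int64, which may or may not be acceptable downstream.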