Python pandas: output dataframe to csv with integers
Question:
I have a pandas.DataFrame that I wish to export to a CSV file. However, pandas seems to write some of the values as float instead of int. I couldn't find how to change this behavior.
Building a data frame:
df = pandas.DataFrame(columns=['a','b','c','d'], index=['x','y','z'], dtype=int)
x = pandas.Series([10,10,10], index=['a','b','d'], dtype=int)
y = pandas.Series([1,5,2,3], index=['a','b','c','d'], dtype=int)
z = pandas.Series([1,2,3,4], index=['a','b','c','d'], dtype=int)
df.loc['x']=x; df.loc['y']=y; df.loc['z']=z
View it:
>>> df
a b c d
x 10 10 NaN 10
y 1 5 2 3
z 1 2 3 4
Export it:
>>> df.to_csv('test.csv', sep='\t', na_rep='0', dtype=int)
>>> for l in open('test.csv'): print l.strip('\n')
a b c d
x 10.0 10.0 0 10.0
y 1 5 2 3
z 1 2 3 4
Why do the tens have a dot zero?
Sure, I could just stick this function into my pipeline to reconvert the whole CSV file, but it seems unnecessary:
def lines_as_integer(path):
    handle = open(path)
    yield handle.next()
    for line in handle:
        line = line.split()
        label = line[0]
        values = map(float, line[1:])
        values = map(int, values)
        yield label + '\t' + '\t'.join(map(str, values)) + '\n'

handle = open(path_table_int, 'w')
handle.writelines(lines_as_integer(path_table_float))
handle.close()
Answers:
This is a “gotcha” in pandas (Support for integer NA): integer columns that contain NaN are converted to floats. This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be “numeric”. One possibility is to use dtype=object arrays instead.
The problem is that you are assigning by rows, but dtypes are grouped by columns, so things get cast to object dtype. That is not a good thing: you lose all efficiency. So one way is to convert, which will coerce to float/int dtype as needed.
As we answered in another question, if you construct the frame all at once (or construct it column by column), this step will not be needed.
In [23]: def convert(x):
   ....:     try:
   ....:         return x.astype(int)
   ....:     except:
   ....:         return x
   ....:
In [24]: df.apply(convert)
Out[24]:
a b c d
x 10 10 NaN 10
y 1 5 2 3
z 1 2 3 4
In [25]: df.apply(convert).dtypes
Out[25]:
a int64
b int64
c float64
d int64
dtype: object
In [26]: df.apply(convert).to_csv('test.csv')
In [27]: !cat test.csv
,a,b,c,d
x,10,10,,10
y,1,5,2.0,3
z,1,2,3.0,4
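To illustrate the last point above, here is a minimal sketch of building the same frame column by column (values taken from the question). Only the column that actually contains a missing value gets upcast to float; the others keep their integer dtype with no conversion step:

```python
import pandas as pd

# Build each column as its own Series; pandas tracks dtype per column,
# so only the column with a genuinely missing value becomes float.
cols = {
    'a': pd.Series([10, 1, 1], index=['x', 'y', 'z']),
    'b': pd.Series([10, 5, 2], index=['x', 'y', 'z']),
    'c': pd.Series([2, 3], index=['y', 'z']),   # no value for 'x' -> NaN
    'd': pd.Series([10, 3, 4], index=['x', 'y', 'z']),
}
df = pd.DataFrame(cols)
print(df.dtypes)  # a, b, d stay int64; only c is float64
```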
The answer I was looking for was a slight variation of what @Jeff proposed in his answer. The credit goes to him. This is what solved my problem in the end for reference:
import pandas
df = pandas.DataFrame(data, columns=['a','b','c','d'], index=['x','y','z'])
df = df.fillna(0)
df = df.astype(int)
df.to_csv('test.csv', sep='\t')
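For reference, a fully self-contained version of this approach on the question's frame (the undefined `data` variable above stands for whatever you build the frame from):

```python
import pandas as pd

# Rebuild the question's frame row by row.
df = pd.DataFrame(columns=['a', 'b', 'c', 'd'], index=['x', 'y', 'z'])
df.loc['x'] = pd.Series([10, 10, 10], index=['a', 'b', 'd'])  # 'c' -> NaN
df.loc['y'] = pd.Series([1, 5, 2, 3], index=['a', 'b', 'c', 'd'])
df.loc['z'] = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Replace NaN with 0, then force everything back to int before writing.
df = df.fillna(0).astype(int)
df.to_csv('test.csv', sep='\t')
print(open('test.csv').read())
```

All values come out as plain integers; note that this loses the distinction between a real 0 and a missing value.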
If you want to preserve the NaN info in the exported csv, then do the below.
P.S.: I'm concentrating on column 'c' in this case.
df['c'] = df['c'].fillna('')   # fill NaN with empty string
df['c'] = df['c'].astype(str)  # convert the column to string
>>> df
a b c d
x 10 10 10
y 1 5 2.0 3
z 1 2 3.0 4
df['c'] = df['c'].str.split('.')  # split the float value into a list on '.'
>>> df
a b c d
x 10 10 [''] 10
y 1 5 ['2','0'] 3
z 1 2 ['3','0'] 4
df['c'] = df['c'].str[0]  # select the 1st element from the list
>>> df
a b c d
x 10 10 10
y 1 5 2 3
z 1 2 3 4
Now, if you export the dataframe to csv, column 'c' will not have float values and the NaN info is preserved.
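On pandas 0.24+ the same result (integer values in the CSV, missing values left empty) takes a single step with the nullable Int64 extension dtype; a minimal sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 1, 1], 'b': [10, 5, 2],
                   'c': [None, 2, 3], 'd': [10, 3, 4]},
                  index=['x', 'y', 'z'])

# 'Int64' (capital I) is pandas' nullable integer dtype: it holds both
# integers and missing values, so the column is never upcast to float.
df['c'] = df['c'].astype('Int64')
df.to_csv('test_int64.csv', sep='\t')
print(open('test_int64.csv').read())
```

Missing entries are written as empty fields, and no string splitting is needed.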
You can use astype() to specify the data type for each column.
For example:
import pandas
df = pandas.DataFrame(data, columns=['a','b','c','d'], index=['x','y','z'])
df = df.astype({"a": int, "b": complex, "c" : float, "d" : int})
You can convert your DataFrame to a NumPy array as a workaround (note that np.int was removed from recent NumPy versions; use the built-in int instead):
import numpy as np
np.savetxt(savepath, np.array(df).astype(int), fmt='%i', delimiter=',', header='PassengerId,Survived', comments='')
Just convert it to strings before writing it out to csv (to_csv itself has no dtype parameter):
df.astype(str).to_csv('test.csv', sep='\t', na_rep='0')
The simplest solution is to use the float_format argument of to_csv():
df.to_csv('test.csv', sep='\t', na_rep=0, float_format='%.0f')
But this applies to all float columns. BTW: Using your code on pandas 1.1.5, all of my columns are float.
Output:
a b c d
x 10 10 0 10
y 1 5 2 3
z 1 2 3 4
Without float_format:
a b c d
x 10.0 10.0 0 10.0
y 1.0 5.0 2.0 3.0
z 1.0 2.0 3.0 4.0
Here is yet another solution:
df['IntColumnWithNAValues'].fillna(0, inplace=True)  # fill with a value that is out of your range
df['IntColumnWithNAValues'] = df['IntColumnWithNAValues'].astype(int)
df['IntColumnWithNAValues'].replace(0, '', inplace=True)
A .csv file doesn't differentiate between NA and '' (empty string), since it is a text file, so you get to keep your missing fields while converting the non-null values to int. You can do this for every column that you want; if you have lots of columns it might be tedious.
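A runnable sketch of this sentinel trick on a column like the question's 'c' (the sentinel value is an assumption; pick anything guaranteed not to occur in your real data):

```python
import pandas as pd

df = pd.DataFrame({'c': [None, 2, 3]}, index=['x', 'y', 'z'])

sentinel = -1  # must not collide with any real value in the column
df['c'] = df['c'].fillna(sentinel).astype(int).replace(sentinel, '')
df.to_csv('test_sentinel.csv')
print(open('test_sentinel.csv').read())
```

The missing field comes out empty in the CSV while the remaining values are written as plain integers.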