Python pandas: output dataframe to csv with integers

Question:

I have a pandas.DataFrame that I wish to export to a CSV file. However, pandas writes some of the values as float instead of int types, and I couldn't find how to change this behavior.

Building a data frame:

df = pandas.DataFrame(columns=['a','b','c','d'], index=['x','y','z'], dtype=int)
x = pandas.Series([10,10,10], index=['a','b','d'], dtype=int)
y = pandas.Series([1,5,2,3], index=['a','b','c','d'], dtype=int)
z = pandas.Series([1,2,3,4], index=['a','b','c','d'], dtype=int)
df.loc['x']=x; df.loc['y']=y; df.loc['z']=z

View it:

>>> df
    a   b    c   d
x  10  10  NaN  10
y   1   5    2   3
z   1   2    3   4

Export it:

>>> df.to_csv('test.csv', sep='\t', na_rep='0', dtype=int)
>>> for l in open('test.csv'): print l.strip('\n')
        a       b       c       d
x       10.0    10.0    0       10.0
y       1       5       2       3
z       1       2       3       4

Why do the tens have a dot zero ?

Sure, I could just stick this function into my pipeline to reconvert the whole CSV file, but it seems unnecessary:

def lines_as_integer(path):
    handle = open(path)
    yield handle.next()
    for line in handle:
        line = line.split()
        label = line[0]
        values = map(float, line[1:])
        values = map(int, values)
        yield label + '\t' + '\t'.join(map(str, values)) + '\n'
handle = open(path_table_int, 'w')
handle.writelines(lines_as_integer(path_table_float))
handle.close()
Asked By: xApple


Answers:

This is a “gotcha” in pandas (Support for integer NA), where integer columns with NaNs are converted to floats.

This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be “numeric”. One possibility is to use dtype=object arrays instead.
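Since pandas 0.24 there is also a nullable integer extension dtype, 'Int64' (capital I), which lets integers and missing values coexist without falling back to float; a minimal sketch, using a frame shaped like the question's:

```python
import pandas as pd

# 'Int64' is pandas' nullable integer dtype; pd.NA marks the missing value
df = pd.DataFrame(
    {'a': [10, 1, 1], 'b': [10, 5, 2], 'c': [pd.NA, 2, 3], 'd': [10, 3, 4]},
    index=['x', 'y', 'z'],
    dtype='Int64',
)

# to_csv now writes plain integers; the missing cell becomes na_rep
csv_text = df.to_csv(sep='\t', na_rep='0')
print(csv_text)
```

With this dtype the tens are written as `10`, not `10.0`, and the NaN cell takes whatever `na_rep` you choose.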

Answered By: Andy Hayden

The problem is that you are assigning by rows, but dtypes are grouped by columns, so everything gets cast to object dtype, which is not a good thing: you lose all efficiency. So one way is to convert, which will coerce to float/int dtype as needed.

As we answered in another question, if you construct the frame all at once (or construct it column by column) this step will not be needed.

In [23]: def convert(x):
   ....:     try:
   ....:         return x.astype(int)
   ....:     except:
   ....:         return x
   ....:     

In [24]: df.apply(convert)
Out[24]: 
    a   b   c   d
x  10  10 NaN  10
y   1   5   2   3
z   1   2   3   4

In [25]: df.apply(convert).dtypes
Out[25]: 
a      int64
b      int64
c    float64
d      int64
dtype: object

In [26]: df.apply(convert).to_csv('test.csv')

In [27]: !cat test.csv
,a,b,c,d
x,10,10,,10
y,1,5,2.0,3
z,1,2,3.0,4
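As a sketch of the construct-all-at-once suggestion above: building the frame from columns lets pandas infer a numeric dtype per column from the start, so no conversion step is needed (column 'c' still ends up float because of its NaN):

```python
import pandas as pd
import numpy as np

# Building from columns lets pandas infer one numeric dtype per column
df = pd.DataFrame(
    {'a': [10, 1, 1], 'b': [10, 5, 2], 'c': [np.nan, 2, 3], 'd': [10, 3, 4]},
    index=['x', 'y', 'z'],
)
print(df.dtypes)  # a, b, d are int64; c is float64 because of the NaN
```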
Answered By: Jeff

The answer I was looking for was a slight variation of what @Jeff proposed in his answer. The credit goes to him. This is what solved my problem in the end for reference:

import pandas
df = pandas.DataFrame(data, columns=['a','b','c','d'], index=['x','y','z'])
df = df.fillna(0)
df = df.astype(int)
df.to_csv('test.csv', sep='t')
Answered By: xApple

If you want to preserve the NaN info in the CSV you export, do the following.
P.S.: I'm concentrating on column 'c' in this case.

df['c'] = df['c'].fillna('')       # fill NaN with an empty string
df['c'] = df['c'].astype(str)      # convert the column to string
>>> df
    a   b    c     d
x  10  10         10
y   1   5    2.0   3
z   1   2    3.0   4

df['c'] = df['c'].str.split('.')   # split the float value into a list on '.'
>>> df
    a   b    c          d
x  10  10   ['']       10
y   1   5   ['2','0']   3
z   1   2   ['3','0']   4

df['c'] = df['c'].str[0]           # select the 1st element from the list
>>> df
    a   b    c   d
x  10  10       10
y   1   5    2   3
z   1   2    3   4

Now, if you export the dataframe to CSV, column 'c' will not have float values and the NaN info is preserved.
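The three steps above can also be collapsed into a single pass with map; a sketch, assuming the same frame as in the question:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'a': [10, 1, 1], 'b': [10, 5, 2], 'c': [np.nan, 2.0, 3.0], 'd': [10, 3, 4]},
    index=['x', 'y', 'z'],
)

# Render each float as an integer string; NaN becomes an empty string
df['c'] = df['c'].map(lambda v: '' if pd.isna(v) else str(int(v)))
print(df)
```

This avoids the intermediate list column entirely while producing the same CSV output.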

Answered By: Tad

You can use astype() to specify the data type for each column.

For example:

import pandas
df = pandas.DataFrame(data, columns=['a','b','c','d'], index=['x','y','z'])

df = df.astype({"a": int, "b": complex, "c" : float, "d" : int})
Answered By: appsdownload

You can convert your DataFrame to a NumPy array as a workaround:

 np.savetxt(savepath, np.array(df).astype(int), fmt='%i', delimiter=',', header='PassengerId,Survived', comments='')
Answered By: LearnDude

Just write it out as string to csv:

df.to_csv('test.csv', sep='\t', na_rep='0', dtype=str)
Answered By: Sam Wang

The simplest solution is to use float_format in DataFrame.to_csv():

df.to_csv('test.csv', sep='\t', na_rep=0, float_format='%.0f')

But this applies to all float columns. BTW: using your code on pandas 1.1.5, all of my columns come out as float.

Output:

    a   b   c   d
x   10  10  0   10
y   1   5   2   3
z   1   2   3   4

Without float_format:

    a   b   c   d
x   10.0    10.0    0    10.0
y    1.0     5.0    2.0   3.0
z    1.0     2.0    3.0   4.0
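One caveat worth noting: float_format='%.0f' also rounds columns that hold genuinely fractional values, so check that every float column really contains whole numbers before using it. A small sketch:

```python
import pandas as pd

# A column with a real fractional value
df = pd.DataFrame({'x': [1.0, 2.7]})
print(df.to_csv(float_format='%.0f', index=False))
# the 2.7 is silently rounded to 3 in the output
```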
Answered By: MERose

Here is yet another solution:

df['IntColumnWithNAValues'].fillna(0, inplace=True) #Fill with a value that is out of your range

df['IntColumnWithNAValues'] = df['IntColumnWithNAValues'].astype(int)

df['IntColumnWithNAValues'].replace(0, '', inplace=True)

CSV files don't differentiate between NA and '' (an empty string) since they are plain text, so you get to keep your missing fields while converting the non-null values to int.

You can do this for every column that you want; if you have lots of columns it might be a problem.
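For several such columns, the same trick can be looped; a sketch, where int_cols is a hypothetical list naming the affected columns and -1 stands in as a sentinel that must lie outside the real data's range:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 3.0]})

int_cols = ['a', 'b']  # hypothetical: the integer columns containing NAs
for col in int_cols:
    # sentinel -1 must never occur in the real data, or it would be blanked too
    df[col] = df[col].fillna(-1).astype(int).replace(-1, '')

print(df)
```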

Answered By: Arthur Querido