pandas read_csv and filter columns with usecols
Question:
I have a csv file which isn’t coming in correctly with pandas.read_csv when I filter the columns with usecols and use multiple indexes.
import pandas as pd

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

f = open('foo.csv', 'w')
f.write(csv)
f.close()

df1 = pd.read_csv('foo.csv',
                  header=0,
                  names=["dummy", "date", "loc", "x"],
                  index_col=["date", "loc"],
                  usecols=["dummy", "date", "loc", "x"],
                  parse_dates=["date"])
print df1

# Ignore the dummy columns
df2 = pd.read_csv('foo.csv',
                  index_col=["date", "loc"],
                  usecols=["date", "loc", "x"],   # <----------- Changed
                  parse_dates=["date"],
                  header=0,
                  names=["dummy", "date", "loc", "x"])
print df2
I expect that df1 and df2 should be the same except for the missing dummy column, but the columns come in mislabeled. Also, the date isn’t getting parsed as a date in df2.
In [118]: %run test.py
                dummy  x
date       loc
2009-01-01 a      bar  1
2009-01-02 a      bar  3
2009-01-03 a      bar  5
2009-01-01 b      bar  1
2009-01-02 b      bar  3
2009-01-03 b      bar  5
              date
date loc
a    1    20090101
     3    20090102
     5    20090103
b    1    20090101
     3    20090102
     5    20090103
Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step (see the sketch below), but I’m trying to understand what is going wrong. I’m using pandas 0.10.1.
edit: fixed bad header usage.
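For reference, a minimal sketch of that workaround (df3 is just an illustrative name; the unwanted column is simply dropped after reading):
df3 = pd.read_csv('foo.csv',
                  header=0,
                  index_col=["date", "loc"],
                  parse_dates=["date"])
del df3['dummy']  # drop the unwanted column after the read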
Answers:
This code achieves what you want, though it's weird and certainly buggy:
I observed that it works when:
a) you specify index_col relative to the number of columns you really use, so it's three columns in this example, not four (you drop dummy and start counting from then onwards)
b) the same goes for parse_dates
c) but not for usecols, for obvious reasons 😉
d) here I adapted the names to mirror this behaviour (see the comments in the code below)
import pandas as pd
from StringIO import StringIO

csv = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

df = pd.read_csv(StringIO(csv),
                 index_col=[0, 1],    # a) counted within the used columns: date, loc
                 usecols=[1, 2, 3],   # c) counted within the original file: date, loc, x
                 parse_dates=[0],     # b) again counted within the used columns
                 header=0,
                 names=["date", "loc", "", "x"])  # d) adapted to mirror this behaviour
print df
which prints
              x
date       loc
2009-01-01 a  1
2009-01-02 a  3
2009-01-03 a  5
2009-01-01 b  1
2009-01-02 b  3
2009-01-03 b  5
If your csv file contains extra data, columns can be deleted from the DataFrame after import.
import pandas as pd
from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
                 index_col=["date", "loc"],
                 usecols=["dummy", "date", "loc", "x"],
                 parse_dates=["date"],
                 header=0,
                 names=["dummy", "date", "loc", "x"])
del df['dummy']
Which gives us:
              x
date       loc
2009-01-01 a  1
2009-01-02 a  3
2009-01-03 a  5
2009-01-01 b  1
2009-01-02 b  3
2009-01-03 b  5
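Equivalently, if you prefer not to delete in place, DataFrame.drop returns a copy without the column (a minor variant, same result):
df = df.drop('dummy', axis=1)  # returns a new frame without the column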
The solution lies in understanding these two keyword arguments:
- names is only necessary when there is no header row in your file and you want to specify other arguments (such as usecols) using column names rather than integer indices (a sketch of that headerless case follows at the end of this answer).
- usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.
So because you have a header row, passing header=0 is sufficient, and additionally passing names appears to be confusing pd.read_csv. Removing names from the second call gives the desired output:
import pandas as pd
from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
                 header=0,
                 index_col=["date", "loc"],
                 usecols=["date", "loc", "x"],
                 parse_dates=["date"])
Which gives us:
              x
date       loc
2009-01-01 a  1
2009-01-02 a  3
2009-01-03 a  5
2009-01-01 b  1
2009-01-02 b  3
2009-01-03 b  5
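For contrast, here is a minimal sketch of the headerless case the first bullet describes, where names is genuinely needed (written for a recent pandas and Python 3, hence io.StringIO; under the 0.10.1 discussed in this thread the combination may still misbehave):
from io import StringIO
import pandas as pd

# No header row this time, so names supplies the labels and
# usecols/index_col can then refer to columns by those names.
csv_noheader = """bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5"""

df = pd.read_csv(StringIO(csv_noheader),
                 header=None,
                 names=["dummy", "date", "loc", "x"],
                 usecols=["date", "loc", "x"],
                 index_col=["date", "loc"],
                 parse_dates=["date"])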
You just have to add the index_col=False parameter, which tells pandas not to treat any column as the index:
df1 = pd.read_csv('foo.csv',
                  header=0,
                  index_col=False,
                  names=["dummy", "date", "loc", "x"],
                  usecols=["dummy", "date", "loc", "x"],
                  parse_dates=["date"])
print df1
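Note that with index_col=False the frame keeps a default integer index, so df1 has dummy, date, loc and x as ordinary columns. If you still want the MultiIndex from the question, it can be built afterwards (a sketch using set_index):
df1 = df1.set_index(["date", "loc"])  # build the MultiIndex after reading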
Did anyone solve this? I am getting the same problem, and none of these answers helped.
import pandas as pd

# Read the file and specify which column is the date
customer_calls = pd.read_excel("sales.xlsx",
                               usecols=['OrderDate', 'Units', 'Total'],
                               parse_dates=['OrderDate'],
                               header=0,
                               index_col=False,
                               names=["OrderDate", "Region", "Rep", "Item", "Units", "UnitCost", "Total", "Shipped"])

# Output with dates converted to YYYYMMDD00 strings
customer_calls["OrderDate"] = pd.to_datetime(customer_calls["OrderDate"]).dt.strftime("%Y%m%d" + "00")
customer_calls.to_excel("sales_date.xlsx", index=False, header=False)
print(customer_calls)
Starting from a table like this:
OrderDate   Region   Rep      Item     Units  UnitCost     Total  Shipped
15/01/2021  Central  Gill     Binder      46      8.99    413.54  TRUE
01/02/2021  Central  Smith    Binder      87     15.00  1,305.00  TRUE
07/03/2021  West     Sorvino  Binder      27     19.99    139.93  TRUE
10/04/2021  Central  Andrews  Pencil      66      1.99    131.34  FALSE
14/05/2021  Central  Gill     Pencil      53      1.29     68.37  FALSE
17/06/2021  Central  Tom      Desk        15    125.00    625.00  TRUE
04/07/2021  East     Jones    Pen Set     62      4.99    309.38  TRUE
07/08/2021  Central  Tom      Pen Set     42     23.95  1,005.90  TRUE
10/09/2021  Central  Gill     Pencil      47      1.29      9.03  TRUE
14/10/2021  West     Thomp    Binder      57     19.99  1,139.43  FALSE
17/11/2021  Central  Jardine  Binder      11      4.99     54.89  FALSE
04/12/2021  Central  Jardine  Binder      94     19.99  1,879.06  FALSE
I get some strange values in the used columns:
     OrderDate  Units   Total
0   2020010600     95  189.05
1   2020020900     36  179.64
2   2020031500     56  167.44
3   2020040100     60  299.40
4   2020050500     90  449.10
5   2020060800     60  539.40
6   2020071200     29   57.71
7   2020081500     35  174.65
8   2020090100     32  250.00
9   2020100500     28  251.72
10  2020110800     15  299.85
11  2020121200     67   86.43
Has anyone solved this already?
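If it helps: the advice from the names/usecols answer above applies here as well. sales.xlsx already has a header row, so dropping names and letting usecols select by the real header names should avoid the mislabeling. A sketch, untested against that file:
import pandas as pd

# header=0 alone: the real header names are used, no names remapping.
customer_calls = pd.read_excel("sales.xlsx",
                               header=0,
                               usecols=["OrderDate", "Units", "Total"],
                               parse_dates=["OrderDate"])
print(customer_calls)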