pandas read_csv and filter columns with usecols

Question:

I have a CSV file that pandas.read_csv does not read correctly when I filter the columns with usecols and use multiple index columns.

import pandas as pd
csv = r"""dummy,date,loc,x
   bar,20090101,a,1
   bar,20090102,a,3
   bar,20090103,a,5
   bar,20090101,b,1
   bar,20090102,b,3
   bar,20090103,b,5"""

# Write the sample data to disk; the with-block closes the file automatically
with open('foo.csv', 'w') as f:
    f.write(csv)

df1 = pd.read_csv('foo.csv',
        header=0,
        names=["dummy", "date", "loc", "x"], 
        index_col=["date", "loc"], 
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"])
print(df1)

# Ignore the dummy columns
df2 = pd.read_csv('foo.csv', 
        index_col=["date", "loc"], 
        usecols=["date", "loc", "x"], # <----------- Changed
        parse_dates=["date"],
        header=0,
        names=["dummy", "date", "loc", "x"])
print(df2)

I expect df1 and df2 to be the same except for the missing dummy column, but in df2 the columns come in mislabeled. Also, the date is not getting parsed as a date.

In [118]: %run test.py
               dummy  x
date       loc
2009-01-01 a     bar  1
2009-01-02 a     bar  3
2009-01-03 a     bar  5
2009-01-01 b     bar  1
2009-01-02 b     bar  3
2009-01-03 b     bar  5
              date
date loc
a    1    20090101
     3    20090102
     5    20090103
b    1    20090101
     3    20090102
     5    20090103

Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step, but I’m trying to understand what is going wrong. I’m using pandas 0.10.1.

edit: fixed bad header usage.

Asked By: chip

Answers:

This code achieves what you want, although the behaviour is odd and almost certainly a bug:

I observed that it works when:

a) you specify index_col relative to the columns you actually use, so three columns in this example, not four (drop dummy and start counting from there)

b) you do the same for parse_dates

c) you do not do this for usecols 😉 since usecols has to refer to the columns of the file itself

d) the names are adapted to mirror this behaviour

import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

csv = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

df = pd.read_csv(StringIO(csv),
        index_col=[0,1],
        usecols=[1,2,3], 
        parse_dates=[0],
        header=0,
        names=["date", "loc", "", "x"])

print(df)

which prints:

                x
date       loc   
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5
Answered By: tzelleke

If your CSV file contains extra columns, they can be deleted from the DataFrame after import:

import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
        index_col=["date", "loc"], 
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"],
        header=0,
        names=["dummy", "date", "loc", "x"])
del df['dummy']

Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5
Answered By: chip
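
For readers on a more recent pandas than the 0.10.1 used in the question, the same drop-after-import idea can be sketched with DataFrame.drop. This is only an illustration, not part of the answer above: the names csv_text, df_all and df_x are made up here, and it assumes Python 3 and a pandas version where drop accepts the columns keyword.

```python
import io

import pandas as pd

# Same sample data as in the question (header row included).
csv_text = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

# Read every column first, building the (date, loc) index on the way in.
df_all = pd.read_csv(
    io.StringIO(csv_text),
    index_col=["date", "loc"],
    parse_dates=["date"],
)

# drop returns a new DataFrame without the unwanted column; df_all is unchanged.
df_x = df_all.drop(columns=["dummy"])
print(df_x)
```

Unlike del df['dummy'], drop does not modify the original frame, which can be handy if the full data is still needed elsewhere.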

The solution lies in understanding these two keyword arguments:

  • names is only necessary when there is no header row in your file and you want to specify other arguments (such as usecols) using column names rather than integer indices.
  • usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.

Because your file has a header row, passing header=0 is sufficient; additionally passing names appears to confuse pd.read_csv.

Removing names from the second call gives the desired output:

import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
        header=0,
        index_col=["date", "loc"], 
        usecols=["date", "loc", "x"],
        parse_dates=["date"])

Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5
Answered By: Mack
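
To illustrate the first bullet of the answer above, here is a minimal sketch of the case where names is actually needed: a file with no header row. The header-less data and the variable names raw and df_no_header are assumptions made for this example (the question's file does have a header), and the code targets Python 3 with a recent pandas.

```python
import io

import pandas as pd

# Hypothetical header-less variant of the question's data:
# the "dummy,date,loc,x" line is missing from the file.
raw = """bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

df_no_header = pd.read_csv(
    io.StringIO(raw),
    header=None,                          # there is no header row to consume
    names=["dummy", "date", "loc", "x"],  # so the column labels are supplied here
    usecols=["date", "loc", "x"],         # filter using those supplied labels
    index_col=["date", "loc"],
    parse_dates=["date"],
)
print(df_no_header)
```

Here names describes all four columns in the file, and usecols then selects a subset of them by label, so the dummy column is never loaded into the DataFrame.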

You just have to add the index_col=False parameter:

df1 = pd.read_csv('foo.csv',
     header=0,
     index_col=False,
     names=["dummy", "date", "loc", "x"], 
     usecols=["dummy", "date", "loc", "x"],
     parse_dates=["date"])
print(df1)
Answered By: Auday Berro
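
Note that index_col=False leaves date and loc as ordinary columns. If the (date, loc) index from the question is still wanted, one possible follow-up is to call set_index after reading. A rough sketch, assuming Python 3 and a recent pandas; csv_text, df_flat and df_indexed are illustrative names, not something from the answer above:

```python
import io

import pandas as pd

# Same sample data as in the question (header row included).
csv_text = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

# Step 1: read only the wanted columns, with no index column at all.
df_flat = pd.read_csv(
    io.StringIO(csv_text),
    header=0,
    index_col=False,
    usecols=["date", "loc", "x"],
    parse_dates=["date"],
)

# Step 2: build the two-level index explicitly from the parsed columns.
df_indexed = df_flat.set_index(["date", "loc"])
print(df_indexed)
```

Splitting the read and the indexing into two steps keeps each read_csv argument referring to plain column names, which sidesteps the interaction between usecols and index_col described in the question.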

Did anyone solve this? I am getting the same problem, and none of these answers helped.

import pandas as pd

# Read the file and specify which column is the date
customer_calls = pd.read_excel("sales.xlsx",
                 
                 usecols=['OrderDate', 'Units', 'Total'],
                 parse_dates=['OrderDate'],
                 header=0,
                 index_col=False,
                 names=["OrderDate", "Region", "Rep", "Item", "Units", "UnitCost", "Total", "Shipped" ])



# Output with dates converted to YYYYMMDD followed by "00"
customer_calls["OrderDate"] = pd.to_datetime(customer_calls["OrderDate"]).dt.strftime("%Y%m%d" + "00")
customer_calls.to_excel("sales_date.xlsx", index=False, header=False)

print(customer_calls)

From a table like this:

OrderDate   Region   Rep      Item     Units  UnitCost  Total     Shipped
15/01/2021  Central  Gill     Binder   46     8.99      413.54    TRUE
01/02/2021  Central  Smith    Binder   87     15.00     1,305.00  TRUE
07/03/2021  West     Sorvino  Binder   27     19.99     139.93    TRUE
10/04/2021  Central  Andrews  Pencil   66     1.99      131.34    FALSE
14/05/2021  Central  Gill     Pencil   53     1.29      68.37     FALSE
17/06/2021  Central  Tom      Desk     15     125.00    625.00    TRUE
04/07/2021  East     Jones    Pen Set  62     4.99      309.38    TRUE
07/08/2021  Central  Tom      Pen Set  42     23.95     1,005.90  TRUE
10/09/2021  Central  Gill     Pencil   47     1.29      9.03      TRUE
14/10/2021  West     Thomp    Binder   57     19.99     1,139.43  FALSE
17/11/2021  Central  Jardine  Binder   11     4.99      54.89     FALSE
04/12/2021  Central  Jardine  Binder   94     19.99     1,879.06  FALSE

I get some strange values in the selected columns:

     OrderDate  Units   Total
0   2020010600     95  189.05
1   2020020900     36  179.64
2   2020031500     56  167.44
3   2020040100     60  299.40
4   2020050500     90  449.10
5   2020060800     60  539.40
6   2020071200     29   57.71
7   2020081500     35  174.65
8   2020090100     32  250.00
9   2020100500     28  251.72
10  2020110800     15  299.85
11  2020121200     67   86.43 

Has anyone solved this already?

Answered By: Aida
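
The mismatch in the post above cannot be diagnosed from the text alone, but the date conversion step can at least be checked in isolation. A minimal sketch, assuming the OrderDate cells are day/month/year text as in the table shown (if Excel stores them as real dates, read_excel already returns datetimes and the explicit format is unnecessary). The small orders frame is a hypothetical stand-in for the first three rows of the spreadsheet.

```python
import pandas as pd

# Hypothetical stand-in for three rows of the spreadsheet,
# with dates held as day/month/year strings.
orders = pd.DataFrame(
    {
        "OrderDate": ["15/01/2021", "01/02/2021", "07/03/2021"],
        "Units": [46, 87, 27],
    }
)

# Spell out the day-first format so 01/02/2021 is read as 1 February, not 2 January.
parsed = pd.to_datetime(orders["OrderDate"], format="%d/%m/%Y")

# Render as YYYYMMDD and append the literal "00" suffix used in the post above.
orders["OrderDate"] = parsed.dt.strftime("%Y%m%d") + "00"
print(orders["OrderDate"].tolist())
# → ['2021011500', '2021020100', '2021030700']
```

If the values printed for the real file still do not correspond to the rows in the sheet, the cause lies elsewhere, for example in which sheet or which header row is being read.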