pandas read_csv and filter columns with usecols
Question:
I have a csv file which isn’t coming in correctly with pandas.read_csv when I filter the columns with usecols and use multiple indexes.
import pandas as pd

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

f = open('foo.csv', 'w')
f.write(csv)
f.close()

df1 = pd.read_csv('foo.csv',
                  header=0,
                  names=["dummy", "date", "loc", "x"],
                  index_col=["date", "loc"],
                  usecols=["dummy", "date", "loc", "x"],
                  parse_dates=["date"])
print df1

# Ignore the dummy columns
df2 = pd.read_csv('foo.csv',
                  index_col=["date", "loc"],
                  usecols=["date", "loc", "x"],   # <----------- Changed
                  parse_dates=["date"],
                  header=0,
                  names=["dummy", "date", "loc", "x"])
print df2
I expect that df1 and df2 should be the same except for the missing dummy column, but the columns come in mislabeled. Also, the date isn’t getting parsed as a date in df2.
In [118]: %run test.py
                dummy  x
date       loc
2009-01-01 a      bar  1
2009-01-02 a      bar  3
2009-01-03 a      bar  5
2009-01-01 b      bar  1
2009-01-02 b      bar  3
2009-01-03 b      bar  5
              date
date loc
a    1    20090101
     3    20090102
     5    20090103
b    1    20090101
     3    20090102
     5    20090103
Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step (see the sketch below), but I’m trying to understand what is going wrong. I’m using pandas 0.10.1.
edit: fixed bad header usage.
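For reference, a minimal sketch of that workaround (df3 is just an illustrative name; the unwanted column is simply dropped after reading):
df3 = pd.read_csv('foo.csv',
                  header=0,
                  index_col=["date", "loc"],
                  parse_dates=["date"])
del df3['dummy']  # drop the unwanted column after the read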
Answers:
This code achieves what you want, though it's weird and certainly buggy:
I observed that it works when:
a) you specify index_col relative to the number of columns you really use, so it's three columns in this example, not four (you drop dummy and start counting from then onwards)
b) the same goes for parse_dates
c) but not for usecols, for obvious reasons 😉
d) here I adapted the names to mirror this behaviour (see the comments in the code below)
import pandas as pd
from StringIO import StringIO

csv = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

df = pd.read_csv(StringIO(csv),
                 index_col=[0, 1],    # a) counted within the used columns: date, loc
                 usecols=[1, 2, 3],   # c) counted within the original file: date, loc, x
                 parse_dates=[0],     # b) again counted within the used columns
                 header=0,
                 names=["date", "loc", "", "x"])  # d) adapted to mirror this behaviour
print df
which prints
              x
date       loc
2009-01-01 a  1
2009-01-02 a  3
2009-01-03 a  5
2009-01-01 b  1
2009-01-02 b  3
2009-01-03 b  5
If your csv file contains extra data, columns can be deleted from the DataFrame after import.
import pandas as pd
from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
                 index_col=["date", "loc"],
                 usecols=["dummy", "date", "loc", "x"],
                 parse_dates=["date"],
                 header=0,
                 names=["dummy", "date", "loc", "x"])
del df['dummy']
Which gives us:
              x
date       loc
2009-01-01 a  1
2009-01-02 a  3
2009-01-03 a  5
2009-01-01 b  1
2009-01-02 b  3
2009-01-03 b  5
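Equivalently, if you prefer not to delete in place, DataFrame.drop returns a copy without the column (a minor variant, same result):
df = df.drop('dummy', axis=1)  # returns a new frame without the column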
The solution lies in understanding these two keyword arguments:
- names is only necessary when there is no header row in your file and you want to specify other arguments (such as usecols) using column names rather than integer indices (a sketch of that headerless case follows at the end of this answer).
- usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.
So because you have a header row, passing header=0 is sufficient, and additionally passing names appears to be confusing pd.read_csv. Removing names from the second call gives the desired output:
import pandas as pd
from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
                 header=0,
                 index_col=["date", "loc"],
                 usecols=["date", "loc", "x"],
                 parse_dates=["date"])
Which gives us:
              x
date       loc
2009-01-01 a  1
2009-01-02 a  3
2009-01-03 a  5
2009-01-01 b  1
2009-01-02 b  3
2009-01-03 b  5
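For contrast, here is a minimal sketch of the headerless case the first bullet describes, where names is genuinely needed (written for a recent pandas and Python 3, hence io.StringIO; under the 0.10.1 discussed in this thread the combination may still misbehave):
from io import StringIO
import pandas as pd

# No header row this time, so names supplies the labels and
# usecols/index_col can then refer to columns by those names.
csv_noheader = """bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5"""

df = pd.read_csv(StringIO(csv_noheader),
                 header=None,
                 names=["dummy", "date", "loc", "x"],
                 usecols=["date", "loc", "x"],
                 index_col=["date", "loc"],
                 parse_dates=["date"])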
You just have to add the index_col=False parameter, which tells pandas not to treat any column as the index:
df1 = pd.read_csv('foo.csv',
                  header=0,
                  index_col=False,
                  names=["dummy", "date", "loc", "x"],
                  usecols=["dummy", "date", "loc", "x"],
                  parse_dates=["date"])
print df1
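Note that with index_col=False the frame keeps a default integer index, so df1 has dummy, date, loc and x as ordinary columns. If you still want the MultiIndex from the question, it can be built afterwards (a sketch using set_index):
df1 = df1.set_index(["date", "loc"])  # build the MultiIndex after reading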
Did anyone solve this? I am getting the same problem, and none of these answers helped.
import pandas as pd

# Read the file and specify which column is the date
customer_calls = pd.read_excel("sales.xlsx",
                               usecols=['OrderDate', 'Units', 'Total'],
                               parse_dates=['OrderDate'],
                               header=0,
                               index_col=False,
                               names=["OrderDate", "Region", "Rep", "Item", "Units", "UnitCost", "Total", "Shipped"])

# Output with dates converted to YYYYMMDD00 strings
customer_calls["OrderDate"] = pd.to_datetime(customer_calls["OrderDate"]).dt.strftime("%Y%m%d" + "00")
customer_calls.to_excel("sales_date.xlsx", index=False, header=False)
print(customer_calls)
Starting from a table like this:
OrderDate   Region   Rep      Item     Units  UnitCost     Total  Shipped
15/01/2021  Central  Gill     Binder      46      8.99    413.54  TRUE
01/02/2021  Central  Smith    Binder      87     15.00  1,305.00  TRUE
07/03/2021  West     Sorvino  Binder      27     19.99    139.93  TRUE
10/04/2021  Central  Andrews  Pencil      66      1.99    131.34  FALSE
14/05/2021  Central  Gill     Pencil      53      1.29     68.37  FALSE
17/06/2021  Central  Tom      Desk        15    125.00    625.00  TRUE
04/07/2021  East     Jones    Pen Set     62      4.99    309.38  TRUE
07/08/2021  Central  Tom      Pen Set     42     23.95  1,005.90  TRUE
10/09/2021  Central  Gill     Pencil      47      1.29      9.03  TRUE
14/10/2021  West     Thomp    Binder      57     19.99  1,139.43  FALSE
17/11/2021  Central  Jardine  Binder      11      4.99     54.89  FALSE
04/12/2021  Central  Jardine  Binder      94     19.99  1,879.06  FALSE
I get some strange values in the used columns:
     OrderDate  Units   Total
0   2020010600     95  189.05
1   2020020900     36  179.64
2   2020031500     56  167.44
3   2020040100     60  299.40
4   2020050500     90  449.10
5   2020060800     60  539.40
6   2020071200     29   57.71
7   2020081500     35  174.65
8   2020090100     32  250.00
9   2020100500     28  251.72
10  2020110800     15  299.85
11  2020121200     67   86.43
Has anyone solved this already?
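If it helps: the advice from the names/usecols answer above applies here as well. sales.xlsx already has a header row, so dropping names and letting usecols select by the real header names should avoid the mislabeling. A sketch, untested against that file:
import pandas as pd

# header=0 alone: the real header names are used, no names remapping.
customer_calls = pd.read_excel("sales.xlsx",
                               header=0,
                               usecols=["OrderDate", "Units", "Total"],
                               parse_dates=["OrderDate"])
print(customer_calls)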