Why Is My Column Data Off by One in Pandas?

Question:

I’m using the Pandas library to work with text because I find it far easier than the csv module. Here’s the problem. I have a .csv file with multiple columns: subtitle, title, and description. Here’s how I access the row content within each column.

import pandas

colnames = ['subtitle', 'description', 'title']
data = pandas.read_csv('C:UsersBcwitems.csv', names=colnames)
subtit = list(data.subtitle)
desc = list(data.description)
title = list(data.title)

for line in zip(subtit, desc, title):
    print line

The issue is that, for whatever reason, when I print line, the expected subtitle isn’t printed. When I print desc, the title shows up, and when I print subtit by itself, the description is printed. So it appears that each column is off by one. Can anyone explain this behavior? Is it expected, and how do I avoid it?

Asked By: Bee Smears


Answers:

Not sure if this is an answer, but it was too long for a comment. Feel free to ignore it.

>>> from itertools import izip_longest
>>> 
>>> l1 = [1,2]
>>> l2 = [1,2,3,4,5]
>>> l3 = [1,2,3]
>>> 
>>> for line in izip_longest(l1,l2,l3):
...     print line

This will print:

(1, 1, 1)
(2, 2, 2)
(None, 3, 3)
(None, 4, None)
(None, 5, None)

Answered By: Vor

It appears that I solved the problem, though I didn’t find this anywhere in the docs, so perhaps a more experienced Pandas user can explain why/how. I certainly cannot.

Here’s what I did: I deleted an unused column (the last column in my .csv file), and that reset the indices to their proper/expected order. I have no idea what explains the behavior (or its correction), whether it’s related to my .csv file or whether it’s a Pandas thing (and perhaps only a Pandas issue when working with text). I don’t know.

Either way, I really appreciate all of the help! I got lucky this time.

Answered By: Bee Smears

I think you were trying to load a file with 4 columns but only gave 3 column names. If you only need to load the first 3 columns, use:

data = pandas.read_csv('C:UsersBcwitems.csv', names=colnames, usecols=[0,1,2])

You don’t have to delete the unused column in the file.

By default, read_csv loads all columns. In your case the file has one more column than colnames has names, so the first column is used as the DataFrame index and each remaining column is shifted over by one name.
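To make the shift concrete, here is a minimal sketch that reproduces it with an in-memory file. The sample rows (s1, d1, t1, x1, and so on) are made up for illustration, and io.StringIO stands in for the real .csv on disk.

```python
import io
import pandas as pd

# Hypothetical 4-column data with no header row (values invented for this sketch).
csv_text = "s1,d1,t1,x1\ns2,d2,t2,x2\n"
colnames = ['subtitle', 'description', 'title']

# Three names for four columns: the first column becomes the index,
# so every name lands one column to the right of where it was intended.
shifted = pd.read_csv(io.StringIO(csv_text), names=colnames)

# Restricting the read to the first three columns keeps the names aligned.
fixed = pd.read_csv(io.StringIO(csv_text), names=colnames, usecols=[0, 1, 2])

print(shifted)
print(fixed)
```

In the first frame the subtitle values end up in the index and the subtitle column holds the descriptions, which matches the symptom in the question.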

Answered By: Happy001

I had a similar problem. It turned out the .csv I was trying to load had no comma at the end of the header row, but did have a comma at the end of every other row. Passing index_col=False (not index_col=None, the default) forces pandas not to use the first column as the index and to create a default integer index instead, which got my data to line up correctly.
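A small sketch of that trailing-comma case, assuming a header row without a trailing comma and data rows with one. The column names are borrowed from the question and the row values are invented for illustration.

```python
import io
import pandas as pd

# Header has 3 fields; each data row has a trailing comma, i.e. 4 fields.
csv_text = "subtitle,description,title\ns1,d1,t1,\ns2,d2,t2,\n"

# Default (index_col=None): pandas infers that the first column is the index,
# so the header names are applied one column too far to the right.
default = pd.read_csv(io.StringIO(csv_text))

# index_col=False: the first column is kept as data and a default
# integer index is used instead.
aligned = pd.read_csv(io.StringIO(csv_text), index_col=False)

print(default)
print(aligned)
```

With the default settings the subtitle column holds the descriptions and title is empty. With index_col=False the three named columns line up with the first three fields of each row.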

Answered By: Beware_the_Leopard

I added index_col=False to the pd.read_csv call, and it works now.

Answered By: ah bon