It looks like a List but I can't index into it: ValueError: Length of values (2) does not match length of index (279999)

Question:

I am importing the CSV file from here: https://raw.githubusercontent.com/kwartler/Harvard_DataMining_Business_Student/master/BookDataSets/LaptopSales.csv

This code works:

from dfply import *
import pandas as pd
df = pd.read_csv("LaptopSales.csv")
(df >> select(X["Date"]) >> mutate(AdjDate = (X.Date.str.split(" "))) >> head(3))

and produces this result:

    Date                AdjDate
0   01-01-2008 00:01    [01-01-2008, 00:01]
1   01-01-2008 00:02    [01-01-2008, 00:02]
2   01-01-2008 00:04    [01-01-2008, 00:04]

But when I try to extract the first element in the list:

from dfply import *
import pandas as pd
df = pd.read_csv("LaptopSales.csv")
(df >> select(X["Date"]) >> mutate(AdjDate = (X.Date.str.split(" ")[0])) >> head(3))

I get a wall of error output culminating in:

ValueError: Length of values (2) does not match length of index (279999)

Asked By: nicomp


Answers:

AdjDate = (X.Date.str.split(" ")[0])

Here [0] indexes the Series returned by split, so it returns the value in the first row, a list of 2 elements, rather than a Series with the same length as the original.

pandas cannot store a value of length 2 in a column with 279999 rows, so it raises the error.
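A minimal sketch of the difference, using plain pandas and a three-row stand-in for the Date column (the sample values are illustrative, not read from the real file). It contrasts [0] on the Series with the element-wise .str[0] accessor:

```python
import pandas as pd

# Tiny stand-in for the Date column of LaptopSales.csv (illustrative values).
df = pd.DataFrame(
    {"Date": ["01-01-2008 00:01", "01-01-2008 00:02", "01-01-2008 00:04"]}
)

parts = df["Date"].str.split(" ")

# [0] on the Series looks up row label 0, so this is a single two-element list.
first_row = parts[0]

# .str[0] indexes into each row's list and returns a Series of the same length.
first_elements = parts.str[0]

print(first_row)
print(first_elements.tolist())
```

Assuming dfply forwards chained attribute access the same way, the pipeline version would presumably be mutate(AdjDate = X.Date.str.split(" ").str[0]); that form has not been run against the full data set here.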

Answered By: Alireza

The answer is that one of the rows in the CSV file has a NaN value in the Date column. That value can’t be split on " ". NaN is a float, so the split does not produce a list for that row, and the indexing operation then fails. It’s row 2913 in the CSV file: ",51,SE14 6LA,SE8 3JD,460,15,4,2,1.5,Yes,80,Yes,536682,177068,537175,177885"
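A short sketch of how such a row could be located and handled, again with plain pandas and made-up sample data that contains one missing Date. In this sketch the vectorized .str accessor leaves the missing entry as NaN rather than raising, and dropna is shown as one optional way to discard incomplete rows:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing Date, mimicking the row described above.
df = pd.DataFrame({"Date": ["01-01-2008 00:01", np.nan, "01-01-2008 00:04"]})

# Index labels of the rows whose Date is missing.
missing_rows = df.index[df["Date"].isna()].tolist()

# Element-wise extraction; the missing entry stays missing in the result.
adj_date = df["Date"].str.split(" ").str[0]

# Optionally drop rows without a Date before further processing.
clean = df.dropna(subset=["Date"])

print(missing_rows)
print(adj_date.tolist())
print(len(clean))
```

Whether to drop or keep the incomplete row depends on the analysis; keeping it only requires that later steps tolerate a missing AdjDate.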

I didn’t simply delete the question because the data set is publicly available and appears to be part of a course offered through Harvard University: https://github.com/kwartler/Harvard_DataMining_Business_Student

Answered By: nicomp