Pandas read_excel wrong output

Question:

Update: so I’m messing around with the two files, I copied the data from DOB column from the 2nd file to the 1st file to make the files look visually identical. However I notice some really interesting behaviour when using Ctrl+F in Microsoft Excel. When I leave the search box blank in the first file it finds no matches. However when I repeat the same operation with the 2nd file it finds 21 matches from each cell between E1 to G7. I suppose somehow there are some blank/invisible cells in the 2nd file – and that’s what’s causing read_excel to behave differently.

My goal is to simply execute the pandas read_excel function. However, I’m running into a very strange situation in which I am trying to run pandas.read_excel on two very similar excel files but am getting substantially different results.

Code

import pandas

data1 = pandas.read_excel(r'D:User FilesDownloadsProgrammingstackoverflowtest_1.xlsx')
print(data1)
data2 = pandas.read_excel(r'D:User FilesDownloadsProgrammingstackoverflowtest_2.xlsx')
print(data2)

Output

        Name        DOB Class Year  GPA
0  Redeye438 2008-09-22      Fresh    1
1  Redeye439 2009-09-20       Soph    2
2  Redeye440 2010-09-22     Junior    3
3  Redeye441 2011-09-20     Senior    4
4  Redeye442 2008-09-20      Fresh    4
5  Redeye443 2009-09-22       Soph    3
                             Name  DOB  Class Year  GPA
Redeye438 2011-09-20 Fresh      1  NaN         NaN  NaN
Redeye439 2010-09-22 Soph       2  NaN         NaN  NaN
Redeye440 2009-09-20 Junior     3  NaN         NaN  NaN
Redeye441 2008-09-22 Senior     4  NaN         NaN  NaN
Redeye442 2011-09-22 Fresh      4  NaN         NaN  NaN
Redeye443 2010-09-20 Soph       3  NaN         NaN  NaN

Why are the columns mapped incorrectly for data2?

The excel files in question (the only difference is the data in the DOB column):
Excel file downloads
Screenshot of test_1.xlsx
Screenshot of test_2.xlsx

Asked By: holts-shoe

||

Answers:

It looks like you’re using an older version of Pandas because I can’t reproduce the issue on the latest version.

import pandas as pd 

pd.show_versions()

INSTALLED VERSIONS
------------------
python           : 3.10.5.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
pandas           : 1.5.0

As you can see below, the columns of data2 are mapped correctly.

import pandas as pd

data1 = pd.read_excel(r"C:UsersabokeyDownloadstest_1.xlsx")
print(data1)
data2 = pd.read_excel(r"C:UsersabokeyDownloadstest_2.xlsx")
print(data2)

        Name        DOB Class Year  GPA
0  Redeye438 2008-09-22      Fresh    1
1  Redeye439 2009-09-20       Soph    2
2  Redeye440 2010-09-22     Junior    3
3  Redeye441 2011-09-20     Senior    4
4  Redeye442 2008-09-20      Fresh    4
5  Redeye443 2009-09-22       Soph    3
        Name        DOB Class Year  GPA
0  Redeye438 2011-09-20      Fresh    1
1  Redeye439 2010-09-22       Soph    2
2  Redeye440 2009-09-20     Junior    3
3  Redeye441 2008-09-22     Senior    4
4  Redeye442 2011-09-22      Fresh    4
5  Redeye443 2010-09-20       Soph    3

However, you’re right because the format of the two Excel files are not the same. In fact, when looking at test2.xlsx more carefully, this one seems to carry 7×3 blank cells (rows/cols). The latest version of pandas seems to handle this kind of Excel file since the empty cells are ignored when calling pandas.read_excel.

So in principle, upgrading your pandas version should fix the problem :

pip install --upgrade pandas

If the problem persists, clean your Excel file test2.xlsx by doing this :

  1. Open up the file in Excel
  2. Click on Find & Select
  3. Choose Go To Special and Blanks
  4. Click on Clear All
  5. Save your changes

enter image description here

Answered By: abokey
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.