Pandas won't open weird TSV file

Question:

The TSV file here
https://www.imf.org/-/media/Files/Publications/WEO/WEO-Database/2022/WEOApr2022all.ashx
which comes from
https://www.imf.org/en/Publications/WEO/weo-database/2022/April/download-entire-database
will not open on Pandas. I have tried several things: using a separator for tabs, opening it with read_excel (the website says it’s compatible with all modern systems).

import pandas as pd

path = "C:..\WEOApr2022all.xls"

dataframe = pd.read_csv(path, sep="\t", encoding='windows-1252')

Error: chunks = self._reader.read_low_memory(nrows)
File "pandas_libsparsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas_libsparsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas_libsparsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas_libsparsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 14

Asked By: flowover9

||

Answers:

Despite its file extension it is a tab delimited, UTF16 encoded text file and can be read using

dataframe = pd.read_csv(path, sep='\t', encoding='utf_16le', engine='python', skipfooter=2)

8624 rows × 58 columns

Note: You need to specify the delimiter as a regex ('\t‘ or r't') instead of a single character ('t') and hence use the python engine because each lines ends with a delimiter. If you use 't' you’ll get an extra empty column at the end.

Answered By: Stef

That’s a tab-separated file with a fake extension with extra problems:

  • it uses the UTF16-LE encoding without a BOM
  • it has a trailing tab at the end which should result in an empty column.
  • Finally, it has a one-line footer at the end
International Monetary Fund, World Economic Outlook Database, April 2022

You can read this using :

df=pd.read_table(r"C:UserspankanDownloadsWEOApr2022all.xls",
                 sep='t',
                 encoding='utf_16le',
                 usecols=lambda c: not c.startswith('Unnamed:'),
                 skipfooter=1)

or

df=pd.read_table(r"C:UserspankanDownloadsWEOApr2022all.xls",
                 encoding='utf_16le',
                 usecols=lambda c: not c.startswith('Unnamed:'),
                 skipfooter=1)

read_csv drops empty lines by default

Some extremely lazy developers try to fake Excel files by generating a CSV or even HTML table and send it as a file with a fake .xls extension. This may fool users, but not Excel itself. Excel will recognize this is a text file and try to import it using the user’s defaults for separators, number and date formats. If those don’t match the developer’s formats, you’ll get errors.

Answered By: Panagiotis Kanavos
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.