Pandas won't open weird TSV file
Question:
The TSV file here
https://www.imf.org/-/media/Files/Publications/WEO/WEO-Database/2022/WEOApr2022all.ashx
which comes from
https://www.imf.org/en/Publications/WEO/weo-database/2022/April/download-entire-database
will not open on Pandas. I have tried several things: using a separator for tabs, opening it with read_excel (the website says it’s compatible with all modern systems).
import pandas as pd
path = "C:..\WEOApr2022all.xls"
dataframe = pd.read_csv(path, sep="\t", encoding='windows-1252')
Error: chunks = self._reader.read_low_memory(nrows)
File "pandas_libsparsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas_libsparsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas_libsparsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas_libsparsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 14
Answers:
Despite its file extension it is a tab delimited, UTF16 encoded text file and can be read using
dataframe = pd.read_csv(path, sep='\t', encoding='utf_16le', engine='python', skipfooter=2)
8624 rows × 58 columns
Note: You need to specify the delimiter as a regex ('\t
‘ or r't'
) instead of a single character ('t'
) and hence use the python
engine because each lines ends with a delimiter. If you use 't'
you’ll get an extra empty column at the end.
That’s a tab-separated file with a fake extension with extra problems:
- it uses the UTF16-LE encoding without a BOM
- it has a trailing tab at the end which should result in an empty column.
- Finally, it has a one-line footer at the end
International Monetary Fund, World Economic Outlook Database, April 2022
You can read this using :
df=pd.read_table(r"C:UserspankanDownloadsWEOApr2022all.xls",
sep='t',
encoding='utf_16le',
usecols=lambda c: not c.startswith('Unnamed:'),
skipfooter=1)
or
df=pd.read_table(r"C:UserspankanDownloadsWEOApr2022all.xls",
encoding='utf_16le',
usecols=lambda c: not c.startswith('Unnamed:'),
skipfooter=1)
read_csv
drops empty lines by default
Some extremely lazy developers try to fake Excel files by generating a CSV or even HTML table and send it as a file with a fake .xls extension. This may fool users, but not Excel itself. Excel will recognize this is a text file and try to import it using the user’s defaults for separators, number and date formats. If those don’t match the developer’s formats, you’ll get errors.
The TSV file here
https://www.imf.org/-/media/Files/Publications/WEO/WEO-Database/2022/WEOApr2022all.ashx
which comes from
https://www.imf.org/en/Publications/WEO/weo-database/2022/April/download-entire-database
will not open on Pandas. I have tried several things: using a separator for tabs, opening it with read_excel (the website says it’s compatible with all modern systems).
import pandas as pd
path = "C:..\WEOApr2022all.xls"
dataframe = pd.read_csv(path, sep="\t", encoding='windows-1252')
Error: chunks = self._reader.read_low_memory(nrows)
File "pandas_libsparsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas_libsparsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas_libsparsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas_libsparsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 14
Despite its file extension it is a tab delimited, UTF16 encoded text file and can be read using
dataframe = pd.read_csv(path, sep='\t', encoding='utf_16le', engine='python', skipfooter=2)
8624 rows × 58 columns
Note: You need to specify the delimiter as a regex ('\t
‘ or r't'
) instead of a single character ('t'
) and hence use the python
engine because each lines ends with a delimiter. If you use 't'
you’ll get an extra empty column at the end.
That’s a tab-separated file with a fake extension with extra problems:
- it uses the UTF16-LE encoding without a BOM
- it has a trailing tab at the end which should result in an empty column.
- Finally, it has a one-line footer at the end
International Monetary Fund, World Economic Outlook Database, April 2022
You can read this using :
df=pd.read_table(r"C:UserspankanDownloadsWEOApr2022all.xls",
sep='t',
encoding='utf_16le',
usecols=lambda c: not c.startswith('Unnamed:'),
skipfooter=1)
or
df=pd.read_table(r"C:UserspankanDownloadsWEOApr2022all.xls",
encoding='utf_16le',
usecols=lambda c: not c.startswith('Unnamed:'),
skipfooter=1)
read_csv
drops empty lines by default
Some extremely lazy developers try to fake Excel files by generating a CSV or even HTML table and send it as a file with a fake .xls extension. This may fool users, but not Excel itself. Excel will recognize this is a text file and try to import it using the user’s defaults for separators, number and date formats. If those don’t match the developer’s formats, you’ll get errors.