Python – Pandas "BadGzipFile" Error When Reading in ".json.gz" File

Question:

I am trying to read data from a ".json.gz" file into a dataframe. I keep getting an error indicating that it is a "BadGzipFile". However, when I unzip the file manually (i.e., by double-clicking it in Finder), I can open the JSON file successfully. This leads me to believe that the file is fine, but when I run the code below in Python, I receive the "BadGzipFile" error.

I am very new to .gzip files and have done a fair bit of research trying to figure out what the issue is. So far, I have been unsuccessful. Any help would be greatly appreciated!

Here is my code:

import os
import json
import gzip

import pandas as pd

file_path = '/data/data_0_0_0.json.gz'

with gzip.open(file_path, 'rb') as f:
    df = pd.read_json(f, compression='gzip', lines=True)

And here is the error I am receiving:

BadGzipFile: Not a gzipped file (b'{"')
Asked By: welcometotheshire


Answers:

Here is what is happening in your code:

with gzip.open(file_path, 'rb') as f:
    df = pd.read_json(f, compression='gzip', lines=True)

gzip.open already decompresses the file at file_path, so f yields plain JSON. Passing compression='gzip' then tells Pandas that f is itself another Gzip file. It isn’t; it’s the JSON text. The b'{"' in the BadGzipFile message shows the first two bytes that were found: an opening brace and a quote, where the Gzip magic number (the bytes 1f 8b) was expected.
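To see the magic-number check for yourself, here is a minimal sketch. The helper looks_gzipped and the throwaway sample file are hypothetical names made up for illustration; only the standard library is used. It compares the first two bytes of the raw file with the first two bytes of the stream that gzip.open hands back.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


# Build a small throwaway .json.gz file to inspect (hypothetical sample data).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.json.gz")
with gzip.open(sample_path, "wb") as f:
    f.write(b'{"a": 1}\n{"a": 2}\n')

# The file on disk is a real gzip file.
raw_is_gzip = looks_gzipped(sample_path)

# The stream returned by gzip.open is already decompressed, so it starts
# with JSON text rather than the magic number.
with gzip.open(sample_path, "rb") as f:
    first_two = f.read(2)

print(raw_is_gzip, first_two)
```

The two bytes read from the decompressed stream are the same b'{"' that appears in your error message, which is why a second round of Gzip decompression fails.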


You should either open the file with gzip yourself and pass the decompressed stream to Pandas without a compression argument, or skip gzip.open and let Pandas decompress the file from its path.

The first would be:

with gzip.open(file_path, 'rb') as f:
    df = pd.read_json(f, lines=True)

The second is actually easier. By default (compression='infer'), pd.read_json infers the compression format from the file extension, and your file ends with .gz, so you can pass the path directly. Keep lines=True from your original call:

df = pd.read_json(file_path, lines=True)
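As a quick check that both approaches give the same result, here is a small self-contained sketch. The sample file and its contents are made up for illustration, and it assumes the data is line-delimited JSON, as the lines=True in the question suggests.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a tiny line-delimited JSON file, gzipped (hypothetical sample data).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')

# Option 1: decompress with gzip, then hand the stream to pandas
# without a compression argument.
with gzip.open(sample_path, "rb") as f:
    df_manual = pd.read_json(f, lines=True)

# Option 2: pass the path and let pandas infer gzip from the .gz suffix.
df_inferred = pd.read_json(sample_path, lines=True)

print(df_manual.equals(df_inferred))
```

Either way the file is decompressed exactly once, which is the whole fix.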
Answered By: ifly6