Pandas read_json: skip the first lines of the file

Question:

Say I have a JSON file with lines of data like this:

file.json:

{"ID":"098656", "query":"query_file.txt"}

{"A":1, "B":2}
{"A":3, "B":6}
{"A":0, "B":4}
...

where the first line only describes the file and how it was created.
I would like to open it with something like:

import pandas as pd
df = pd.read_json('file.json', lines=True)

However, how do I read the data starting at line 3? I know that pd.read_csv has a skiprows argument, but pd.read_json does not seem to have one.

I would like something that returns a DataFrame with only the columns A and B, ideally without loading the whole file and then dropping the first row and the ID and query columns.

Asked By: Cyril Vallez


Answers:

You can read the lines of the file, skip the first n of them, then pass the rest to pandas:

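import json

import pandas as pd

# Read every line, then drop the first two (the description line and the blank line).
with open('file.json') as f:
    lines = f.read().splitlines()[2:]

# Parse each remaining non-empty line as one JSON object.
records = [json.loads(line) for line in lines if line.strip()]

# Build a DataFrame with one column per key (A and B here).
df = pd.json_normalize(records)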
Answered By: RJ Adriaansen
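
A variant of the same idea, as a minimal sketch: skip the first lines with itertools.islice and hand the remaining text to pd.read_json(lines=True) through io.StringIO, so that pandas does the parsing. The helper name read_json_lines and the temporary sample file are hypothetical and used only for illustration.

```python
import io
import itertools
import os
import tempfile

import pandas as pd


def read_json_lines(path, skiprows=0):
    """Load a JSON Lines file into a DataFrame, ignoring the first `skiprows` lines."""
    with open(path) as f:
        # islice drops the first `skiprows` lines; the rest are joined back into one string.
        remaining = "".join(itertools.islice(f, skiprows, None))
    # Wrap the text in StringIO so pandas treats it as a file-like object.
    return pd.read_json(io.StringIO(remaining), lines=True)


# Hypothetical sample file mirroring the layout described in the question.
sample = (
    '{"ID": "098656", "query": "query_file.txt"}\n'
    '\n'
    '{"A": 1, "B": 2}\n'
    '{"A": 3, "B": 6}\n'
    '{"A": 0, "B": 4}\n'
)
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

df = read_json_lines(sample_path, skiprows=2)
os.remove(sample_path)
print(df)
```

This still reads the whole file into memory once, but the description line never reaches pandas, so no ID or query column is created.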

pandas.read_json also accepts an open file object. If we consume some lines from it first, only the remaining lines are converted to a DataFrame.

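import pandas as pd

def read_json(file, skiprows=0):
    with open(file) as f:
        # Consume exactly `skiprows` lines, one readline() call per line.
        for _ in range(skiprows):
            f.readline()
        # pandas reads from the current position of the file object.
        df = pd.read_json(f, lines=True)
    return df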
Answered By: Vitalizzare
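
One detail worth checking when skipping lines on an open file: the argument of readlines is a size hint (roughly a character count), not a number of lines, so it cannot be relied on to skip exactly n lines. The small sketch below compares it with calling readline() once per line. It uses io.StringIO as a stand-in for an open text file, and the sample text is made up for illustration.

```python
import io

text = (
    '{"ID": "098656", "query": "query_file.txt"}\n'
    '\n'
    '{"A": 1, "B": 2}\n'
    '{"A": 3, "B": 6}\n'
)

# readlines(hint) stops as soon as the lines read so far reach the size hint,
# so a hint of 2 returns after the first line here.
f = io.StringIO(text)
skipped_by_hint = f.readlines(2)

# Calling readline() n times consumes exactly n lines.
g = io.StringIO(text)
skipped_by_count = [g.readline() for _ in range(2)]
rest = g.read()

print(len(skipped_by_hint))   # number of lines consumed with the size hint
print(len(skipped_by_count))  # number of lines consumed with readline()
print(rest)                   # the data lines left for pandas
```

With a blank second line, as in the question, the difference is easy to miss, because pd.read_json(lines=True) ignores empty lines anyway.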