Pandas read_json with int64 values raises ValueError: Value is too big

Question:

I’m trying to read JSON files into DataFrames.

df = pd.read_json('test.log', lines=True)

However, some of the values are int64, and pandas raises:

ValueError: Value is too big

I tried setting precise_float to True, but this didn’t solve it.

It works when I do it line by line:

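import json
import pandas as pd

df = pd.DataFrame()
with open('test.log') as f:
    for line in f:
        data = json.loads(line)
        # DataFrame.append was removed in pandas 2.0; this needs an older pandas
        df = df.append(data, ignore_index=True)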

However, this is very slow. Even for files of around 50k lines, it takes a very long time.

Is there a way I can set the value of certain columns to use int64?

Asked By: user3605780


Answers:

After updating pandas to a newer version (tested with 1.0.3), you can apply this workaround by artdgn. It overwrites the loads() function in pandas.io.json._json, which pd.read_json() ultimately calls.

Copying the workaround in case the links above stop working:


import json

import pandas as pd
import simplejson

# Note: pd.io.json._json is a private module, so this may break in other pandas versions.
# Pick ONE of the three options below; each assignment overwrites the previous one.

# Option 1: monkeypatch using the standard Python json module
pd.io.json._json.loads = lambda s, *a, **kw: json.loads(s)

# Option 2: monkeypatch using the faster simplejson module
pd.io.json._json.loads = lambda s, *a, **kw: simplejson.loads(s)

# Option 3: normalising (unnesting) at the same time (for nested JSON)
pd.io.json._json.loads = lambda s, *a, **kw: pd.json_normalize(simplejson.loads(s))

After overwriting the loads() function with one of the three methods described by artdgn, read_json() also works with int64.
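
If patching a private pandas module is not an option, another possibility is to parse each line with the standard json module and build the DataFrame once at the end, instead of calling append for every line as in the question. The following is only a minimal sketch, assuming the file holds one JSON object per line; the helper name read_json_lines is hypothetical and not part of pandas:

```python
import json

import pandas as pd


def read_json_lines(path):
    """Hypothetical helper: parse a JSON Lines file with the standard json module."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                # json.loads returns arbitrary-precision Python ints,
                # so large integer values do not overflow here.
                records.append(json.loads(line))
    # Build the DataFrame once; appending row by row copies the frame each time.
    return pd.DataFrame(records)
```

It would be called as df = read_json_lines('test.log'). Columns whose values do not fit a native integer type should come back with object dtype, so it is worth checking df.dtypes afterwards.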

Answered By: Marcel

This is a well-known issue.
Decoding of big numbers is still not implemented in pandas’ fork of the ultrajson library. The closest implementation was not merged. In any case, you can use the workarounds provided in the other answers.

I solved the problem with the duckdb library:

import duckdb

df = duckdb.query('''
SELECT * 
FROM read_json('test.log', auto_detect=True, sample_size=100000)
''').to_df() 

If needed, you can find the read_json options that fit your purpose at https://duckdb.org/docs/extensions/json.html

Answered By: Pavel Prochazka