Pandas read_json with int64 values raises ValueError: Value is too big
Question:
I’m trying to read JSON files into dataframes.
df = pd.read_json('test.log', lines=True)
However, there are values which are int64, and pandas raises:
ValueError: Value is too big
I tried setting precise_float to True, but this didn’t solve it.
It works when I do it line by line:
import json
import pandas as pd

df = pd.DataFrame()
with open('test.log') as f:
    for line in f:
        data = json.loads(line)
        df = df.append(data, ignore_index=True)
However, this is very slow: even for files of around 50k lines it takes a very long time.
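Most of that time goes into the repeated df.append, which copies the whole frame on every iteration. A minimal sketch of the same line-by-line parsing that builds the DataFrame once at the end (same test.log as above):

import json
import pandas as pd

# parse each line with the stdlib json module (no 64-bit integer limit there)
with open('test.log') as f:
    records = [json.loads(line) for line in f]

# build the DataFrame in one go instead of appending row by row
df = pd.DataFrame(records)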
Is there a way I can set the value of certain columns to use int64?
Answers:
After updating pandas to a newer version (tested with 1.0.3), this workaround by artdgn can be applied to overwrite the loads() function in pandas.io.json._json, which is ultimately used when pd.read_json() is called.
Copying the workaround in case the links above stop working:
import pandas as pd

# Option 1: monkeypatch using the standard python json module
import json
pd.io.json._json.loads = lambda s, *a, **kw: json.loads(s)

# Option 2: monkeypatch using the faster simplejson module
import simplejson
pd.io.json._json.loads = lambda s, *a, **kw: simplejson.loads(s)

# Option 3: normalise (unnest) nested JSON at the same time
pd.io.json._json.loads = lambda s, *a, **kw: pd.json_normalize(simplejson.loads(s))
After overwriting the loads() function with one of the three methods described by artdgn, read_json() also works with int64.
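For example, with the first (stdlib json) patch applied, the original read_json call from the question goes through; a minimal sketch:

import json
import pandas as pd

# route pandas' internal JSON decoding through the stdlib json module,
# which handles arbitrarily large integers
pd.io.json._json.loads = lambda s, *a, **kw: json.loads(s)

df = pd.read_json('test.log', lines=True)
print(df.dtypes)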
This is a well-known issue. Decoding of big numbers is still not implemented in pandas’ fork of the ujson (ultrajson) library, and the closest implementation was never merged. In any case, you can use the workarounds provided in the other answers.
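For reference, a minimal reproduction sketch; the file name and value are made up, and the exact point at which the parser gives up depends on the pandas version, but anything above the signed 64-bit range triggers it on affected releases:

import pandas as pd

# a single JSON line containing an integer larger than int64 can hold
with open('big.jsonl', 'w') as f:
    f.write('{"id": 18446744073709551615}\n')

# on affected pandas versions this raises: ValueError: Value is too big
df = pd.read_json('big.jsonl', lines=True)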
I solved the problem with the duckdb library:
import duckdb
df = duckdb.query('''
SELECT *
FROM read_json('test.log', auto_detect=True, sample_size=100000)
''').to_df()
If needed, find the read_json options fitting your purpose at https://duckdb.org/docs/extensions/json.html
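If you also want to pin particular columns to a 64-bit integer type (as asked in the question), you can cast them explicitly in the query; big_id is a hypothetical column name here:

import duckdb

# BIGINT is DuckDB's signed 64-bit integer type, which maps to int64 in pandas
df = duckdb.query('''
    SELECT CAST(big_id AS BIGINT) AS big_id
    FROM read_json('test.log', auto_detect=True, sample_size=100000)
''').to_df()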