_corrupt_record error when reading a JSON file into Spark
Question:
I’ve got this JSON file
{
"a": 1,
"b": 2
}
which was written with Python's json.dump method.
Now, I want to read this file into a DataFrame in Spark, using pyspark. Following the documentation, I'm doing this:
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json')
print df.show()
The print statement spits out this though:
+---------------+
|_corrupt_record|
+---------------+
| {|
| "a": 1, |
| "b": 2|
| }|
+---------------+
Does anyone know what's going on, and why the file is not being interpreted correctly?
Answers:
You need one JSON object per line in your input file; see http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
If your JSON file looks like this, it will give you the expected DataFrame:
{ "a": 1, "b": 2 }
{ "a": 3, "b": 4 }
....
df.show()
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
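As a quick sanity check (not part of the original answer), each line of a file in this line-delimited format must parse as a standalone JSON document; a sketch with Python's standard json module:

```python
import json

# Each line of a line-delimited (JSON Lines) file must be a complete,
# standalone JSON document, which is what Spark's default reader expects.
lines = ['{ "a": 1, "b": 2 }', '{ "a": 3, "b": 4 }']

records = [json.loads(line) for line in lines]
print(records)  # [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
```

If any line raises json.JSONDecodeError here, Spark will likewise flag it as `_corrupt_record`.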
Adding to @Bernhard's great answer:
import json

# original file was written with pretty-print inside a list
with open("pretty-printed.json") as jsonfile:
    js = json.load(jsonfile)

# write a new file with one object per line
with open("flattened.json", 'a') as outfile:
    for d in js:
        json.dump(d, outfile)
        outfile.write('\n')
If you want to leave your JSON file as it is (without stripping the newline characters \n), include the multiLine=True keyword argument:
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json', multiLine=True)
print df.show()
In Spark 2.2+ you can read a multiline JSON file using the following command:
val dataframe = spark.read.option("multiline", true).json("filePath")
If there is one JSON object per line, then:
val dataframe = spark.read.json(filepath)
I want to share my experience: I had a column of JSON strings, but in Python notation, meaning it contained None instead of null, False instead of false, and True instead of true.
When parsing this column, Spark returned a column named _corrupt_record. So, before parsing the JSON string, I had to replace the Python notation with standard JSON notation:
df.withColumn("json_notation",
    F.regexp_replace(F.regexp_replace(F.regexp_replace("_corrupt_record", "None", "null"), "False", "false"), "True", "true"))
After this transformation I was able to use, for example, F.from_json() on the json_notation column, and PySpark then parsed the JSON object correctly.
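The substitution itself can be illustrated outside Spark: the sketch below uses Python's re module in place of F.regexp_replace (which applies the same per-row logic) to show that the rewritten string becomes parseable JSON. The sample string is hypothetical:

```python
import json
import re

# A JSON-ish string in Python notation, e.g. produced by str() on a dict.
python_notation = '{"a": None, "b": True, "c": False}'

# Replace Python literals with their JSON equivalents, mirroring the
# chained F.regexp_replace calls above (word boundaries avoid partial hits).
json_notation = re.sub(r'\bNone\b', 'null', python_notation)
json_notation = re.sub(r'\bTrue\b', 'true', json_notation)
json_notation = re.sub(r'\bFalse\b', 'false', json_notation)

print(json.loads(json_notation))  # {'a': None, 'b': True, 'c': False}
```

Note that a plain substring substitution can also rewrite occurrences of None/True/False inside string values; word boundaries help but are not bulletproof, so treat this as a pragmatic fix rather than a general parser.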
Another reason this can happen is file encoding. If the file you are reading is, for example, in Latin encoding, you will get this issue. Try using .option("encoding", "cp1252") when reading the file. This resolved the issue for me.
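The encoding point can be reproduced without Spark: bytes written in cp1252 do not decode as UTF-8 once the text contains non-ASCII characters, which is what makes Spark flag the row as _corrupt_record. A minimal sketch using plain Python file I/O (the file name is made up):

```python
import json
import os
import tempfile

# Write a JSON object containing a non-ASCII character using cp1252.
record = {"name": "café"}
path = os.path.join(tempfile.gettempdir(), "latin_encoded.json")
with open(path, "w", encoding="cp1252") as f:
    json.dump(record, f, ensure_ascii=False)

# Reading it back as UTF-8 fails on the 0xE9 byte for "é" ...
try:
    with open(path, encoding="utf-8") as f:
        f.read()
except UnicodeDecodeError:
    print("utf-8 decode failed")

# ... while specifying the actual encoding succeeds, analogous to
# .option("encoding", "cp1252") on Spark's reader.
with open(path, encoding="cp1252") as f:
    print(json.load(f))  # {'name': 'café'}
```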