_corrupt_record error when reading a JSON file into Spark

Question:

I’ve got this JSON file

{
    "a": 1, 
    "b": 2
}

which was written with Python's json.dump method.
Now I want to read this file into a DataFrame in Spark, using PySpark. Following the documentation, I'm doing this:

sc = SparkContext()

sqlc = SQLContext(sc)

df = sqlc.read.json('my_file.json')

print df.show()

The print statement spits out this though:

+---------------+
|_corrupt_record|
+---------------+
|              {|
|       "a": 1, |
|         "b": 2|
|              }|
+---------------+

Does anyone know what's going on and why the file is not being interpreted correctly?

Asked By: mar tin


Answers:

You need to have one JSON object per line in your input file; see http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json

If your JSON file looks like this, you will get the expected DataFrame:

{ "a": 1, "b": 2 }
{ "a": 3, "b": 4 }

....
df.show()
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+
Answered By: Bernhard

Adding to @Bernhard’s great answer

import json

# original file was written with pretty-print inside a list
with open("pretty-printed.json") as jsonfile:
    js = json.load(jsonfile)

# write a new file with one object per line
with open("flattened.json", 'a') as outfile:
    for d in js:
        json.dump(d, outfile)
        outfile.write('\n')
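
With the file rewritten as one object per line, reading it as in the question should then work (a small sketch, reusing the sqlc context from the question and the flattened.json file written above):

# read the newline-delimited JSON produced above
df = sqlc.read.json('flattened.json')
df.show()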
Answered By: George Fisher

If you want to leave your JSON file as it is (pretty-printed, without converting it to one object per line), include the multiLine=True keyword argument:

sc = SparkContext() 
sqlc = SQLContext(sc)

df = sqlc.read.json('my_file.json', multiLine=True)

print df.show()
Answered By: wiggy

In Spark 2.2+ you can read a multiline JSON file using the following command:

val dataframe = spark.read.option("multiline", true).json("filePath")

If there is one JSON object per line, then:

val dataframe = spark.read.json(filepath)
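
For reference, a PySpark sketch of the same two cases, assuming a SparkSession named spark is already available (filePath is a placeholder):

# multiline (pretty-printed) JSON file
dataframe = spark.read.option("multiline", True).json("filePath")

# one JSON object per line
dataframe = spark.read.json("filePath")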
Answered By: Murtaza Zaveri

I want to share my experience with a column containing JSON as a string, but in Python notation, which means I had None instead of null, False instead of false and True instead of true.

When parsing this column, Spark returned a column named _corrupt_record. So before parsing the JSON string I had to replace the Python notation with standard JSON notation:

df.withColumn("json_notation",
    F.regexp_replace(F.regexp_replace(F.regexp_replace("_corrupt_record", "None", "null"), "False", "false") ,"True", "true")

After this transformation I was then able to use, for example, F.from_json() on the json_notation column, and PySpark was able to correctly parse the JSON object.
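
For completeness, here is a minimal sketch of that last step; the schema and the column name json_notation are illustrative and assume the cleaned strings hold objects like {"a": 1, "b": 2}:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType

# illustrative schema for objects like {"a": 1, "b": 2}
schema = StructType([
    StructField("a", LongType()),
    StructField("b", LongType()),
])

# parse the cleaned JSON strings into a struct column
parsed = df.withColumn("parsed", F.from_json("json_notation", schema))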

Answered By: Vzzarr

Another reason this can happen is the file encoding. If the file you are reading is, for example, in a Latin encoding, you will get this issue. Try using .option("encoding", "cp1252") while reading the file. This resolved the issue for me.
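
As a sketch, assuming a SparkSession named spark and that the file really is Windows-1252 encoded:

# specify the file encoding explicitly when reading
df = spark.read.option("encoding", "cp1252").json("my_file.json")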

Answered By: Sundeep Yadav