_corrupt_record error when reading a JSON file into Spark
Question:
I’ve got this JSON file
{
"a": 1,
"b": 2
}
which was written with Python's json.dump method.
Now, I want to read this file into a DataFrame in Spark, using pyspark. Following the documentation, I'm doing this:
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json')
print df.show()
The print statement spits out this though:
+---------------+
|_corrupt_record|
+---------------+
| {|
| "a": 1, |
| "b": 2|
| }|
+---------------+
Does anyone know what's going on, and why the file is not being interpreted correctly?
Answers:
You need one JSON object per line in your input file; see http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
If your JSON file looks like this, it will give you the expected DataFrame:
{ "a": 1, "b": 2 }
{ "a": 3, "b": 4 }
....
df.show()
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
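As a quick sanity check (not part of the original answer), each line of a file in this line-delimited format must parse as a standalone JSON document; a sketch with Python's standard json module:

```python
import json

# Each line of a line-delimited (JSON Lines) file must be a complete,
# standalone JSON document, which is what Spark's default reader expects.
lines = ['{ "a": 1, "b": 2 }', '{ "a": 3, "b": 4 }']

records = [json.loads(line) for line in lines]
print(records)  # [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
```

If any line raises json.JSONDecodeError here, Spark will likewise flag it as `_corrupt_record`.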
Adding to @Bernhard's great answer:
import json

# original file was written with pretty-print inside a list
with open("pretty-printed.json") as jsonfile:
    js = json.load(jsonfile)

# write a new file with one object per line
with open("flattened.json", 'a') as outfile:
    for d in js:
        json.dump(d, outfile)
        outfile.write('\n')
If you want to leave your JSON file as it is (without stripping the newline characters \n), include the multiLine=True keyword argument:
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json', multiLine=True)
print df.show()
In Spark 2.2+ you can read a multiline JSON file using the following command:
val dataframe = spark.read.option("multiline", true).json("filePath")
If there is one JSON object per line, then:
val dataframe = spark.read.json(filepath)
I want to share my experience: I had a column of JSON strings, but in Python notation, meaning it contained None instead of null, False instead of false, and True instead of true.
When parsing this column, Spark returned a column named _corrupt_record. So, before parsing the JSON string, I had to replace the Python notation with standard JSON notation:
df.withColumn("json_notation",
    F.regexp_replace(F.regexp_replace(F.regexp_replace("_corrupt_record", "None", "null"), "False", "false"), "True", "true"))
After this transformation I was able to use, for example, F.from_json() on the json_notation column, and PySpark then parsed the JSON object correctly.
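The substitution itself can be illustrated outside Spark: the sketch below uses Python's re module in place of F.regexp_replace (which applies the same per-row logic) to show that the rewritten string becomes parseable JSON. The sample string is hypothetical:

```python
import json
import re

# A JSON-ish string in Python notation, e.g. produced by str() on a dict.
python_notation = '{"a": None, "b": True, "c": False}'

# Replace Python literals with their JSON equivalents, mirroring the
# chained F.regexp_replace calls above (word boundaries avoid partial hits).
json_notation = re.sub(r'\bNone\b', 'null', python_notation)
json_notation = re.sub(r'\bTrue\b', 'true', json_notation)
json_notation = re.sub(r'\bFalse\b', 'false', json_notation)

print(json.loads(json_notation))  # {'a': None, 'b': True, 'c': False}
```

Note that a plain substring substitution can also rewrite occurrences of None/True/False inside string values; word boundaries help but are not bulletproof, so treat this as a pragmatic fix rather than a general parser.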
Another reason this can happen is file encoding. If the file you are reading is, for example, in Latin encoding, you will get this issue. Try using .option("encoding", "cp1252") when reading the file. This resolved the issue for me.
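The encoding point can be reproduced without Spark: bytes written in cp1252 do not decode as UTF-8 once the text contains non-ASCII characters, which is what makes Spark flag the row as _corrupt_record. A minimal sketch using plain Python file I/O (the file name is made up):

```python
import json
import os
import tempfile

# Write a JSON object containing a non-ASCII character using cp1252.
record = {"name": "café"}
path = os.path.join(tempfile.gettempdir(), "latin_encoded.json")
with open(path, "w", encoding="cp1252") as f:
    json.dump(record, f, ensure_ascii=False)

# Reading it back as UTF-8 fails on the 0xE9 byte for "é" ...
try:
    with open(path, encoding="utf-8") as f:
        f.read()
except UnicodeDecodeError:
    print("utf-8 decode failed")

# ... while specifying the actual encoding succeeds, analogous to
# .option("encoding", "cp1252") on Spark's reader.
with open(path, encoding="cp1252") as f:
    print(json.load(f))  # {'name': 'café'}
```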