How to read a JSON file in Azure Databricks from Azure Data Lake Store

Question:

I am using Azure Data Lake Store for storing simple JSON files with the following JSON:

{
  "email": "[email protected]",
  "id": "823956724385"
}

The JSON file's name is myJson1.json. The Azure Data Lake Store is successfully mounted in Azure Databricks.

I am able to load the JSON file successfully via

df = spark.read.option("multiline", "true").json(fi.path)

fi.path is a FileInfo object pointing to the myJson1.json file from above.

When I do

df = spark.read.option("multiline", "true").json(fi.path)
df.show()

I get the JSON object printed out correctly as a DataFrame:

+---------------------+------------+
|                email|          id|
+---------------------+------------+
|[email protected]|823956724385|
+---------------------+------------+

What I want to do is load the JSON file with json.load(filename), so that I can work with the JSON object within Python.

When I do

import json

with open('adl://.../myJson1.json', 'r') as file:
  jsonObject0 = json.load(file)

then I get the following error:

[Errno 2] No such file or directory: 'adl://.../myJson1.json'

When I try (the mount point is correct; I can list the file, and reading it with spark.read into a DataFrame also works)

    jsonObject = json.load("/mnt/adls/data/myJson1.json")

then I get the following error:

'str' object has no attribute 'read'

The second error makes sense, since json.load expects an open file object rather than a path string, but I still have no idea how to get the JSON loaded. My goal is to read the JSON object and iterate through the keys and their values.

Asked By: STORM


Answers:

The trick was to use the following syntax for the file URL:

/dbfs/mnt/adls/data/myJson1.json

I had to add /dbfs/ at the beginning of the URL, or respectively replace dbfs:/ with /dbfs/.
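To sanity-check this local-path view before opening the file, something like the following should work (a minimal sketch; the path is assumed from the question):

    import os

    # DBFS mounts are exposed on the driver's local filesystem under /dbfs,
    # so standard Python file APIs can reach them
    print(os.path.exists('/dbfs/mnt/adls/data/myJson1.json'))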

Then I could use

    import json

    with open('/dbfs/mnt/adls/ingress/marketo/update/leads/leads-json1.json', 'r') as f:
      data = f.read()

    jsonObject = json.loads(data)

Maybe there is an easier way, but this works for now.
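One slightly shorter variant (a sketch, assuming the mounted path from the question) is to pass the open file handle straight to json.load and then iterate over the parsed dictionary:

    import json

    # open the file through the /dbfs/ local view of the mount
    with open('/dbfs/mnt/adls/data/myJson1.json', 'r') as f:
        jsonObject = json.load(f)  # parse directly from the file handle

    # iterate through the keys and their values, as the question asks
    for key, value in jsonObject.items():
        print(key, value)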

Answered By: STORM

To keep the JSON structure and work directly with the JSON-formatted data, you can try loading it with the following code:

    import json

    # dbutils.fs.head returns the file contents as a string,
    # which json.loads parses into a Python dict
    df = json.loads(dbutils.fs.head(fi.path))

To check the count of key-value pairs:

    print(len(df))

Then to loop through the keys and their values:

    # iterate over each key in the parsed dict and print its value
    for key in df:
        print(key, df[key])
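Note that dbutils.fs.head only returns the beginning of the file (65,536 bytes by default), so for a larger JSON file you may need to pass an explicit maxBytes, for example:

    # hypothetical 1 MB limit; head() truncates the file at maxBytes
    df = json.loads(dbutils.fs.head(fi.path, 1024 * 1024))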

Hope this helps.

Answered By: standy