How to read a JSON file in Azure Databricks from Azure Data Lake Store
Question:
I am using Azure Data Lake Store for storing simple JSON files with the following JSON:
{
"email": "[email protected]",
"id": "823956724385"
}
The JSON file's name is myJson1.json. The Azure Data Lake Store is successfully mounted to Azure Databricks.
I am able to successfully load the JSON file via
df = spark.read.option("multiline", "true").json(fi.path)
fi.path is a FileInfo object pointing to the myJson1.json file from above.
When I do
df = spark.read.option("multiline", "true").json(fi.path)
df.show()
I get the DataFrame printed out correctly as
+------------------------------+------------+
|                         email|          id|
+------------------------------+------------+
|[email protected]|823956724385|
+------------------------------+------------+
What I want to do is load the JSON file with json.load(filename), so that I can work with the JSON object within Python.
When I do
with open('adl://.../myJson1.json', 'r') as file:
jsonObject0 = json.load(file)
then I get the following error:
[Errno 2] No such file or directory: 'adl://…/myJson1.json'
When I try the following (the mount point is correct; I can list the file, and I can also read it with spark.read into a DataFrame):
jsonObject = json.load("/mnt/adls/data/myJson1.json")
then I get the following error:
'str' object has no attribute 'read'
I have no idea what else to do to get the JSON loaded. My goal is to read the JSON object and iterate through its keys and values.
Answers:
The trick was to use the following syntax for the file URL:
/dbfs/mnt/adls/data/myJson1.json
I had to prepend /dbfs/ to the path, or equivalently replace dbfs:/ with /dbfs/ at the beginning of the URL.
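This works because Databricks exposes DBFS, including mount points, to local Python processes through a FUSE mount at /dbfs. As a minimal sketch of the conversion (to_local_path is a hypothetical helper, not a Databricks API):
def to_local_path(dbfs_path):
    # Convert a dbfs:/ URI (as returned by dbutils.fs.ls) into the
    # local /dbfs/ FUSE path that Python's built-in open() understands.
    if dbfs_path.startswith("dbfs:/"):
        return dbfs_path.replace("dbfs:/", "/dbfs/", 1)
    # Otherwise assume a bare mount path such as /mnt/adls/...
    return "/dbfs" + dbfs_path

print(to_local_path("dbfs:/mnt/adls/data/myJson1.json"))  # -> /dbfs/mnt/adls/data/myJson1.json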
Then I could use
import json

with open('/dbfs/mnt/adls/ingress/marketo/update/leads/leads-json1.json', 'r') as f:
    data = f.read()
jsonObject = json.loads(data)
Maybe there is an easier way, but this works for now.
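One small simplification of the same approach: the standard library's json.load accepts a file object directly, so the intermediate read()/loads() pair can be dropped:
import json

# json.load consumes the file object directly, skipping the
# intermediate f.read() / json.loads() step.
with open('/dbfs/mnt/adls/data/myJson1.json', 'r') as f:
    jsonObject = json.load(f)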
To keep the JSON structure and work directly with the JSON-formatted data, you can try loading it with the following code:
import json

data = json.loads(dbutils.fs.head(fi.path))
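Note that dbutils.fs.head returns only the start of the file (65,536 bytes by default), so this works for small files like the one above. For larger files you can pass an explicit byte limit, assuming the whole file still fits in driver memory:
# dbutils.fs.head truncates at 65536 bytes unless told otherwise;
# pass the maxBytes argument explicitly for larger files.
data = json.loads(dbutils.fs.head(fi.path, 1024 * 1024))  # read up to 1 MB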
To check the count of key-value pairs:
print(len(data))
Then, to loop through the keys and values:
for key in data:
    value = data[key]
    print(key, value)
Hope this helps.