Python Pandas read csv from DataLake

Question:

I’m trying to read a CSV file stored on Azure Data Lake Gen 2; the Python code runs in Databricks.
Here are 2 lines of code: the first one works, the second one fails.
Do I really have to mount the ADLS to let Pandas access it?

data1 = spark.read.option("header",False).format("csv").load("abfss://[email protected]/belgium/dessel/c3/kiln/temp/Auto202012101237.TXT")
data2 = pd.read_csv("abfss://[email protected]/belgium/dessel/c3/kiln/temp/Auto202012101237.TXT")

Any suggestions ?

Asked By: Harry Leboeuf


Answers:

Pandas doesn’t know about cloud storage and works with local files only. On Databricks you can copy the file locally and then open it with Pandas. This can be done either with %fs cp abfss://.... file:/your-location or with dbutils.fs.cp("abfss://....", "file:/your-location") (see docs).
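The copy-then-read pattern can be sketched as follows. Note that dbutils is only available inside a Databricks runtime, so this sketch uses a plain local copy (shutil.copy) to stand in for dbutils.fs.cp; the file name and sample contents are made up for illustration:

```python
import os
import shutil
import tempfile

import pandas as pd

tmpdir = tempfile.mkdtemp()

# Stand-in for the file sitting on ADLS Gen 2 (hypothetical contents).
remote_stand_in = os.path.join(tmpdir, "Auto202012101237.TXT")
with open(remote_stand_in, "w") as f:
    f.write("20.5,21.0,19.8\n20.7,21.2,19.9\n")

# On Databricks this step would instead be:
#   dbutils.fs.cp("abfss://container@account.dfs.core.windows.net/path/file.TXT",
#                 "file:/tmp/file.TXT")
local_copy = os.path.join(tmpdir, "local_copy.TXT")
shutil.copy(remote_stand_in, local_copy)

# header=None matches spark.read.option("header", False) in the question.
df = pd.read_csv(local_copy, header=None)
print(df.shape)  # (2, 3)
```

Once the file is on local disk, pd.read_csv behaves exactly as it would for any other local path.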

Another possibility is to use the Koalas library instead of Pandas; it provides a Pandas-compatible API on top of Spark. Besides the ability to access data in the cloud, you also gain the possibility of running your code in a distributed fashion.

Answered By: Alex Ott

I solved it by mounting the cloud storage as a drive. It works fine now.

Answered By: Harry Leboeuf