How to read a file stored in ADLS Gen2 using pandas?

Question:

I am trying to read a parquet file through pandas in a Databricks notebook. The cluster has permission to access ADLS.

import pandas as pd 
pdf = pd.read_parquet("abfss://abc.parquet")

But pandas is not able to read it and throws the following error.

ValueError                                Traceback (most recent call last)
<command-2342282971496650> in <module>
      1 import pandas as pd
      2 parquet_file = 'abfss://abc.parquet'
----> 3 pd.read_parquet(parquet_file)

/databricks/python/lib/python3.8/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, use_nullable_dtypes, **kwargs)
    457     """
    458     impl = get_engine(engine)
--> 459     return impl.read(
    460         path, columns=columns, use_nullable_dtypes=use_nullable_dtypes, **kwargs
    461     )

/databricks/python/lib/python3.8/site-packages/pandas/io/parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
    212                 )
    213 
--> 214         path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
    215             path,
    216             kwargs.pop("filesystem", None),

/databricks/python/lib/python3.8/site-packages/pandas/io/parquet.py in _get_path_or_handle(path, fs, storage_options, mode, is_dir)
     64         fsspec = import_optional_dependency("fsspec")
     65 
---> 66         fs, path_or_handle = fsspec.core.url_to_fs(
     67             path_or_handle, **(storage_options or {})
     68         )

/databricks/python/lib/python3.8/site-packages/fsspec/core.py in url_to_fs(url, **kwargs)
    369     else:
    370         protocol = split_protocol(url)[0]
--> 371         cls = get_filesystem_class(protocol)
    372 
    373         options = cls._get_kwargs_from_urls(url)

/databricks/python/lib/python3.8/site-packages/fsspec/registry.py in get_filesystem_class(protocol)
    206     if protocol not in registry:
    207         if protocol not in known_implementations:
--> 208             raise ValueError("Protocol not known: %s" % protocol)
    209         bit = known_implementations[protocol]
    210         try:

ValueError: Protocol not known: abfss

I tried the following workaround.

import pandas as pd
import pyspark.pandas as ps 
pdf = ps.read_parquet("abfss://abc.parquet").to_pandas() 

The above query works, but it takes a long time to convert the pyspark.pandas DataFrame to a pandas DataFrame.
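
If the slow part is the Spark-to-pandas conversion itself, enabling Arrow-based transfer usually helps. Below is a minimal sketch, assuming a Spark 3.x Databricks notebook where spark is predefined; the full abfss path is a hypothetical placeholder.

# Arrow transfers columnar batches instead of pickling rows one by one.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Hypothetical path: a real abfss URI needs container@account spelled out.
sdf = spark.read.parquet("abfss://mycontainer@myaccount.dfs.core.windows.net/abc.parquet")
pdf = sdf.toPandas()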

NOTE: I cannot mount ADLS to DBFS because DBFS has been disabled by the platform team, so all operations need to be done directly on ADLS.

I am looking for a faster or simpler way to read files from ADLS Gen2 using Python pandas.

Any leads would be highly appreciated.

Asked By: user19930511

Answers:

Finally, the problem is resolved, and I am now able to read the data in ADLS using the pandas library, with no need for Spark or a Koalas conversion.

pd.read_parquet("file_path", storage_options={...})  # storage_options takes a dict, not a string

Follow this article for the storage_options details:

https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/tutorial-use-pandas-spark-pool
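
As a minimal sketch, the call can look like the following, assuming the adlfs package is installed (fsspec resolves the abfs/abfss protocols through it, and its absence is what raises "Protocol not known: abfss"). The account, container, and key below are hypothetical placeholders.

import pandas as pd

# storage_options is passed through fsspec to adlfs; account_key could be
# swapped for sas_token or service-principal credentials.
pdf = pd.read_parquet(
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/abc.parquet",
    storage_options={
        "account_name": "mystorageaccount",
        "account_key": "<storage-account-access-key>",
    },
)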

Answered By: user19930511

Databricks released pandas on Spark, which replaces the prior Koalas library.

https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html

A search of the documentation shows how to read a parquet file.

https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_parquet.html
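
A short sketch of that API, using a hypothetical path:

import pyspark.pandas as ps

# Returns a distributed pandas-on-Spark DataFrame, not a plain pandas one.
psdf = ps.read_parquet("abfss://mycontainer@myaccount.dfs.core.windows.net/abc.parquet")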

Assuming all other things are equal, is there more overhead using pandas-on-Spark versus native Spark DataFrames? Yes. I have used the library in the past when functions I wanted were only available on a pandas DataFrame, not a PySpark SQL DataFrame.

I have some existing parquet files from the AdventureWorks sample database (SQL Server) that were saved in the data lake.

The pandas-on-Spark code takes 0.84 seconds to return. We can see the type is pyspark.pandas.frame.DataFrame.

The native Spark code takes 0.66 seconds to return. We can see the type is pyspark.sql.dataframe.DataFrame.
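
For reference, a sketch of the two reads being compared; the paths are hypothetical, and the timings above came from my own files.

import pyspark.pandas as ps

# pandas-on-Spark read
psdf = ps.read_parquet("abfss://lake@myaccount.dfs.core.windows.net/sales.parquet")
print(type(psdf))   # <class 'pyspark.pandas.frame.DataFrame'>

# native Spark read of the same file (spark is predefined in a notebook)
sdf = spark.read.parquet("abfss://lake@myaccount.dfs.core.windows.net/sales.parquet")
print(type(sdf))    # <class 'pyspark.sql.dataframe.DataFrame'>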

To recap, the idea that pandas-on-Spark would be faster is probably not true. The code is almost the same in design, and these files are small. I wonder what the timings would be for larger files …

Answered By: CRAFTY DBA