How to read a file stored in ADLS Gen2 using pandas?
Question:
I am trying to read a parquet file with pandas in a Databricks notebook. The cluster has permission to access ADLS.
import pandas as pd
pdf = pd.read_parquet("abfss://abc.parquet")
But pandas cannot read it and throws the error below.
ValueError Traceback (most recent call last)
<command-2342282971496650> in <module>
1 import pandas as pd
2 parquet_file = 'abfss://abc.parquet'
----> 3 pd.read_parquet(parquet_file)
/databricks/python/lib/python3.8/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, use_nullable_dtypes, **kwargs)
457 """
458 impl = get_engine(engine)
--> 459 return impl.read(
460 path, columns=columns, use_nullable_dtypes=use_nullable_dtypes, **kwargs
461 )
/databricks/python/lib/python3.8/site-packages/pandas/io/parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
212 )
213
--> 214 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
215 path,
216 kwargs.pop("filesystem", None),
/databricks/python/lib/python3.8/site-packages/pandas/io/parquet.py in _get_path_or_handle(path, fs, storage_options, mode, is_dir)
64 fsspec = import_optional_dependency("fsspec")
65
---> 66 fs, path_or_handle = fsspec.core.url_to_fs(
67 path_or_handle, **(storage_options or {})
68 )
/databricks/python/lib/python3.8/site-packages/fsspec/core.py in url_to_fs(url, **kwargs)
369 else:
370 protocol = split_protocol(url)[0]
--> 371 cls = get_filesystem_class(protocol)
372
373 options = cls._get_kwargs_from_urls(url)
/databricks/python/lib/python3.8/site-packages/fsspec/registry.py in get_filesystem_class(protocol)
206 if protocol not in registry:
207 if protocol not in known_implementations:
--> 208 raise ValueError("Protocol not known: %s" % protocol)
209 bit = known_implementations[protocol]
210 try:
ValueError: Protocol not known: abfss
I tried the following workaround:
import pandas as pd
import pyspark.pandas as ps
pdf = ps.read_parquet("abfss://abc.parquet").to_pandas()
This works, but converting the pyspark.pandas DataFrame to a pandas DataFrame takes a long time.
NOTE: I cannot mount the ADLS account to DBFS because DBFS is disabled by the platform team, so all operations need to be done on ADLS directly.
I am looking for a faster or simpler way to read files from ADLS Gen2 using Python pandas.
Any leads would be highly appreciated.
Answers:
Finally the problem is resolved, and I am now able to read the data in ADLS with the pandas library directly. No Spark or Koalas conversion is needed.
pd.read_parquet("file_path", storage_options={...})  # storage_options must be a dict of credentials, not a string
Follow this article for the storage_options details:
https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/tutorial-use-pandas-spark-pool
Databricks released pandas on Spark, which replaces the earlier Koalas library.
https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html
A search of the documentation shows how to read a parquet file.
All things being equal, is there more overhead using pandas DataFrames versus Spark DataFrames? Yes. I have used the pandas libraries in the past when there were functions I wanted that existed only for pandas DataFrames, not PySpark SQL DataFrames.
I have some existing parquet files, exported from the AdventureWorks sample database (SQL Server) and saved in the data lake.
The pandas-on-Spark code takes 0.84 seconds to return; the resulting type is pyspark.pandas.frame.DataFrame.
The native Spark code takes 0.66 seconds to return; the resulting type is pyspark.sql.dataframe.DataFrame.
To recap, the idea that pandas would be faster is probably not true. The two code paths are almost identical in design, and these files are small. I wonder what the timings would be for larger files.