Databricks: Issue while creating spark data frame from pandas

Question:

I have a pandas DataFrame that I want to convert into a Spark DataFrame. I usually use the code below to do this, but all of a sudden I started getting the error shown. I am aware that pandas removed iteritems(), but my current pandas version is 2.0.0, and I also tried installing a lower version and creating the Spark DataFrame again, but I still get the same error. The error is raised inside the Spark function. What is the solution? Which pandas version should I install in order to create the Spark DataFrame? I also tried changing the Databricks cluster runtime and re-running, but I still get the same error.

import pandas as pd
spark.createDataFrame(pd.DataFrame({'i':[1,2,3],'j':[1,2,3]}))

error:-
UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  'DataFrame' object has no attribute 'iteritems'
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)
AttributeError: 'DataFrame' object has no attribute 'iteritems'
Asked By: data en


Answers:

The Arrow optimization is failing because of the missing ‘iteritems’ attribute.
You should try disabling the Arrow optimization in your Spark session and creating the DataFrame without it.

Here is how it would work:

import pandas as pd
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("Pandas to Spark DataFrame") \
    .getOrCreate()

# Disable Arrow optimization
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")

# Create a pandas DataFrame
pdf = pd.DataFrame({'i': [1, 2, 3], 'j': [1, 2, 3]})

# Convert pandas DataFrame to Spark DataFrame
sdf = spark.createDataFrame(pdf)

# Show the Spark DataFrame
sdf.show()

That should work. Alternatively, if you want to keep the Arrow optimization, you can downgrade your pandas version, e.g. pip install pandas==1.2.5

Answered By: Saxtheowl

It’s related to the Databricks Runtime (DBR) version used – the Spark versions in up to DBR 12.2 rely on the .iteritems function to construct a Spark DataFrame from a pandas DataFrame. This issue was fixed in Spark 3.4, which is available as DBR 13.x.

If you can’t upgrade to DBR 13.x, then you need to downgrade pandas to the latest 1.x version (1.5.3 right now) by using the %pip install -U pandas==1.5.3 command in your notebook. Although it’s better to just use the pandas version shipped with your DBR – it was tested for compatibility with the other packages in the DBR.
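If neither upgrading the DBR nor downgrading pandas is an option, a common stopgap (a sketch, not an official fix) is to alias the removed method back onto pandas before calling createDataFrame. pandas renamed iteritems() to items() and removed the old name in 2.0, and the two behave identically, so restoring the alias lets older Spark versions iterate over the columns again:

```python
import pandas as pd

# pandas 2.x removed DataFrame.iteritems (it was renamed to .items);
# restore it as an alias so Spark < 3.4 can still call it.
if not hasattr(pd.DataFrame, "iteritems"):
    pd.DataFrame.iteritems = pd.DataFrame.items

pdf = pd.DataFrame({'i': [1, 2, 3], 'j': [1, 2, 3]})

# With the alias in place, this call should succeed on DBR <= 12.2:
# sdf = spark.createDataFrame(pdf)

# Verify the alias behaves like the old method did:
cols = [name for name, _ in pdf.iteritems()]
print(cols)  # ['i', 'j']
```

Run the patch once per notebook session, before the first createDataFrame call; it only affects the current Python process, so it is less invasive than pinning a package version cluster-wide.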

Answered By: Alex Ott