PySpark in Databricks error with table conversion to pandas
Question:
I’m using Databricks and want to convert my PySpark DataFrame to a pandas
DataFrame using df.toPandas().
However, I keep getting this error:
/databricks/spark/python/pyspark/sql/pandas/conversion.py:145: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation.
'DataFrame' object has no attribute 'dtype'
warnings.warn(msg)
AttributeError: 'DataFrame' object has no attribute 'dtype'
I tried different things, including:
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
But nothing worked so far (I also checked some of the other posts that have this issue, but none helped).
UPDATE: result of df.printSchema():
root
|-- flight_id: string (nullable = true)
|-- flight_direction: string (nullable = true)
|-- service_type: string (nullable = true)
|-- flight_designator: string (nullable = true)
|-- flight_number: string (nullable = true)
|-- callsign: string (nullable = true)
|-- scheduled_datetime: timestamp (nullable = true)
|-- connecting_flight_designator: string (nullable = true)
|-- airport_iata_codes: array (nullable = true)
| |-- element: string (containsNull = true)
|-- airline_name: string (nullable = true)
|-- airport_names: array (nullable = true)
| |-- element: string (containsNull = true)
|-- country_number: long (nullable = true)
|-- eu_category: string (nullable = true)
|-- safe_town_indicator: boolean (nullable = true)
|-- sibt: timestamp (nullable = true)
|-- aibt: timestamp (nullable = true)
|-- sobt: timestamp (nullable = true)
|-- aibt: timestamp (nullable = true)
|-- tsat: timestamp (nullable = true)
|-- aircraft_name: string (nullable = true)
|-- aircraft_registration: string (nullable = true)
|-- ramp: string (nullable = true)
|-- ramp_previous: string (nullable = true)
|-- seats: long (nullable = true)
|-- actual_total_pax: integer (nullable = true)
|-- handler_apron: string (nullable = true)
|-- occupancy_rate: double (nullable = false)
Answers:
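The problem was in the data filtering step: it produced duplicate column names (the schema above lists aibt twice). When column names are duplicated, selecting one by name in pandas returns a DataFrame instead of a Series, which is why the conversion fails with 'DataFrame' object has no attribute 'dtype'. If anyone hits a similar issue in the future, check for duplicate columns first.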
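A quick way to check for this is to look for repeated names in df.columns before calling toPandas(). The sketch below is only an illustration, not part of the original fix: find_duplicate_columns and make_unique are hypothetical helper names, and the column list is a shortened copy of the schema above. It uses plain Python so it runs without a Spark session.

```python
from collections import Counter


def find_duplicate_columns(columns):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name in counts if counts[name] > 1]


def make_unique(columns):
    """Append _2, _3, ... to repeated names so every name is distinct.

    Assumption: none of the generated names (e.g. "aibt_2") already exists
    in the list; this sketch does not check for that.
    """
    seen = Counter()
    unique = []
    for name in columns:
        seen[name] += 1
        unique.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return unique


# Shortened stand-in for df.columns, taken from the schema in the question.
columns = ["sibt", "aibt", "sobt", "aibt", "tsat"]

duplicates = find_duplicate_columns(columns)
renamed = make_unique(columns)

print(duplicates)
print(renamed)
```

On a real DataFrame you would pass df.columns to these helpers. If duplicates turn up, one option is df.toDF(*make_unique(df.columns)).toPandas(), although fixing the select or join that created the duplicate column is usually the cleaner solution.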