How to Decode GEOHASH Column using PySpark
Question:
I’m trying to decode a GEOHASH column into latitude and longitude using the pygeohash library. Below is my code:
import pygeohash as pgh
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
udf1 = udf(lambda x: pgh.decode(x))
add_latlong = add.withColumn('location', udf1(col('GEOHASH')))
However, I’m getting the result below:
+------------+--------------------+
| GEOHASH| location|
+------------+--------------------+
|w284nyv39qzn|[Ljava.lang.Objec...|
|w0zqyr64nt4v|[Ljava.lang.Objec...|
|w2815pb0yfgr|[Ljava.lang.Objec...|
|w281xv1czv1t|[Ljava.lang.Objec...|
|w2r7cvc0m1bz|[Ljava.lang.Objec...|
+------------+--------------------+
I’ve come across the thread "PySpark UDF Returns [Ljava.lang.Object;@]", which suggests passing StringType as the second parameter of udf, but I’m still seeing the same result as above. How do I get the latitude and longitude from here?
Appreciate your help
Update: I’ve used the solution from Jonathan Lam below. For completeness, here are the code and the resulting dataframe:
from pyspark.sql.types import ArrayType, FloatType
udf1 = udf(lambda x: pgh.decode(x), ArrayType(FloatType()))
add_latlong = add.withColumn('location', udf1(col('GEOHASH'))).withColumn('lat', col('location')[0]).withColumn('long', col('location')[1])
+------------+--------------------+--------+----------+
| GEOHASH| location| lat| long|
+------------+--------------------+--------+----------+
|w2864utg8uyf|[3.189408, 101.73...|3.189408| 101.73035|
|w281hj25hzre|[3.017675, 101.42...|3.017675|101.425995|
|w2830hj8vzrp|[3.010423, 101.60...|3.010423|101.609375|
|w0zf5uepz8uk|[4.596367, 101.06...|4.596367| 101.06768|
|w2rkk6s97gvt|[2.167289, 111.63...|2.167289| 111.63843|
+------------+--------------------+--------+----------+
Answers:
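I’m not sure your case is the same as the one in the thread you linked, since you are using an external package, pgh.decode(x), to do the transformation. Based on the docs:
pgh.decode(geohash='ezs42')
# >>> ('42.6', '-5.6')
decode returns a (latitude, longitude) pair rather than a single string, and a udf falls back to a StringType return type when none is given, so I think you should use ArrayType(FloatType()) instead.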
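To make it clearer why a numeric array type fits here, below is a minimal, dependency-free sketch of what geohash decoding computes. It is not pygeohash's implementation: the name decode_geohash is hypothetical, and it returns the exact center of the geohash cell without any rounding that the library may apply.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def decode_geohash(geohash):
    """Return the (latitude, longitude) center of the geohash cell as floats.

    Hypothetical helper for illustration only; not part of pygeohash.
    """
    lat = [-90.0, 90.0]
    lon = [-180.0, 180.0]
    use_lon = True  # bits alternate between axes, starting with longitude
    for char in geohash:
        value = BASE32.index(char)  # each character carries 5 bits
        for mask in (16, 8, 4, 2, 1):
            rng = lon if use_lon else lat
            mid = (rng[0] + rng[1]) / 2
            if value & mask:
                rng[0] = mid  # bit set: keep the upper half of the range
            else:
                rng[1] = mid  # bit clear: keep the lower half of the range
            use_lon = not use_lon
    return ((lat[0] + lat[1]) / 2, (lon[0] + lon[1]) / 2)


lat, lon = decode_geohash("ezs42")
print(lat, lon)  # close to the 42.6, -5.6 pair quoted from the docs
```

The result is a pair of Python floats, which is the shape ArrayType(FloatType()) describes. FloatType is 32-bit, so DoubleType may be worth considering if that precision turns out to be too coarse for long geohashes.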