pyspark-sql

When clause in pyspark gives an error "name 'when' is not defined"

When clause in pyspark gives an error "name 'when' is not defined" Question: With the below code I am getting an error message, name 'when' is not defined. voter_df = voter_df.withColumn('random_val', when(voter_df.TITLE == 'Councilmember', F.rand()) .when(voter_df.TITLE == 'Mayor', 2) .otherwise(0)) Add a column to voter_df named random_val with the results of the F.rand() method for …

Total answers: 2
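
A minimal sketch of the usual fix, assuming pyspark.sql.functions is imported as F and voter_df already exists with a TITLE column: when is not a builtin name, so it has to be referenced through the functions module (or imported explicitly).

    from pyspark.sql import functions as F

    # when() lives in pyspark.sql.functions, so qualify it with F (or import it by name)
    voter_df = voter_df.withColumn(
        "random_val",
        F.when(voter_df.TITLE == "Councilmember", F.rand())
         .when(voter_df.TITLE == "Mayor", 2)
         .otherwise(0),
    )

Alternatively, from pyspark.sql.functions import when, rand makes the bare names available.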

Spatial Join between pyspark dataframe and polygons (geopandas)

Spatial Join between pyspark dataframe and polygons (geopandas) Question: Problem: I would like to make a spatial join between: a big Spark DataFrame (500M rows) with points (e.g. points on a road) and a small GeoJSON (20,000 shapes) with polygons (e.g. region boundaries). Here is what I have so far, which I find to be …

Total answers: 2
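
One possible approach, as a hedged sketch rather than the asker's own code: broadcast the small polygon layer and run geopandas.sjoin inside mapInPandas. Here points_df, the lon/lat column names, regions.geojson and region_name are all assumptions for illustration.

    import geopandas as gpd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The ~20k polygons easily fit on the driver, so read and broadcast them once
    regions = gpd.read_file("regions.geojson")              # assumed path
    regions_bc = spark.sparkContext.broadcast(regions)

    def join_with_regions(batches):
        # Runs on the executors; each batch is a pandas DataFrame of points
        for pdf in batches:
            pts = gpd.GeoDataFrame(
                pdf,
                geometry=gpd.points_from_xy(pdf["lon"], pdf["lat"]),   # assumed columns
                crs=regions_bc.value.crs,
            )
            joined = gpd.sjoin(pts, regions_bc.value, how="left", predicate="within")
            yield joined[["lon", "lat", "region_name"]]                # assumed attribute

    result = points_df.mapInPandas(
        join_with_regions, schema="lon double, lat double, region_name string"
    )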

Split count results of different events into different columns in pyspark

Split count results of different events into different columns in pyspark Question: I have an rdd from which I need to extract counts of multiple events. The initial rdd looks like this +----------+--------+-----+ | event| user| day| +----------+--------+-----+ |event_x |user_A | 0| |event_y |user_A | 2| |event_x |user_B | 2| |event_y |user_B | 1| |event_x …

Total answers: 2
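
A sketch of one common answer, assuming the rdd is first turned into a DataFrame df with event, user and day columns: groupBy plus pivot yields one count column per event.

    # One row per (user, day), one count column per event value
    counts = (
        df.groupBy("user", "day")
          .pivot("event", ["event_x", "event_y"])   # listing the values skips a discovery pass
          .count()
          .fillna(0)
    )
    counts.show()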

How to determine what are the columns I need since ApplyMapping isn't case sensitive?

How to determine what are the columns I need since ApplyMapping isn't case sensitive? Question: I'm updating a Pyspark script with a new database model and I've encountered some problems calling/updating columns, since PySpark apparently brings all columns in uppercase, but when I use ApplyMapping it is not case sensitive, BUT when I join (by left) …

Total answers: 2
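
One workaround sketch, assuming an AWS Glue job where dyf is a DynamicFrame and glueContext already exists: normalise every column name to a single case before ApplyMapping and the joins, so casing can no longer diverge.

    from awsglue.dynamicframe import DynamicFrame

    # Lower-case every column name on the Spark side
    df = dyf.toDF()
    df = df.toDF(*[c.lower() for c in df.columns])

    # Convert back if the rest of the job expects a DynamicFrame
    dyf_normalised = DynamicFrame.fromDF(df, glueContext, "dyf_normalised")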

How do I convert timestamp to unix format with pyspark

How do I convert timestamp to unix format with pyspark Question: I have a dataframe with timestamp values, like this one: 2018-02-15T11:39:13.000Z I want to have it in UNIX format, using Pyspark. I tried something like data = datasample.withColumn('timestamp_cast', datasample['timestamp'].cast('date')) but I lose a lot of information, since I only get day/month/year when I have …

Total answers: 2
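
A small sketch, assuming the column holds ISO-8601 strings like the one shown: cast to a timestamp (which keeps the time of day, unlike a date cast) and take unix_timestamp for the epoch seconds.

    from pyspark.sql import functions as F

    data = (
        datasample
        # ISO-8601 strings such as 2018-02-15T11:39:13.000Z parse with a plain timestamp cast
        .withColumn("ts", F.col("timestamp").cast("timestamp"))
        .withColumn("unix_ts", F.unix_timestamp("ts"))   # seconds since the epoch
    )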

Pyspark- Subquery in a case statement

Pyspark- Subquery in a case statement Question: I am trying to run a subquery inside a case statement in Pyspark and it is throwing an exception. I am trying to create a new flag if the id in one table is present in a different table. Is this even possible in pyspark? temp_df = spark.sql("select *, case when …

Total answers: 1
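
Spark SQL generally does not allow a correlated subquery inside a CASE expression; a common rewrite, sketched here with assumed table and column names, is a left join plus a flag.

    temp_df = spark.sql("""
        SELECT a.*,
               CASE WHEN b.id IS NOT NULL THEN 1 ELSE 0 END AS flag
        FROM table_a a
        LEFT JOIN (SELECT DISTINCT id FROM table_b) b
          ON a.id = b.id
    """)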

PySpark: How to judge column type of dataframe

PySpark: How to judge column type of dataframe Question: Suppose we have a dataframe called df. I know there is a way of using df.dtypes. However I prefer something similar to type(123) == int # note that here the int is not a string. I wonder if there is something like: type(df.select(<column_name>).collect()[0][1]) == IntegerType Basically I want …

Total answers: 3
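
A sketch of checking the type straight from the schema, which avoids collecting any rows (the column name age is assumed):

    from pyspark.sql.types import IntegerType

    # schema["age"] is a StructField; its dataType compares against IntegerType()
    is_int = df.schema["age"].dataType == IntegerType()

    # Equivalent check via the (name, type-string) pairs from df.dtypes
    is_int_alt = dict(df.dtypes)["age"] == "int"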

Pyspark convert a standard list to data frame

Pyspark convert a standard list to data frame Question: The case is really simple: I need to convert a python list into a data frame with the following code from pyspark.sql.types import StructType from pyspark.sql.types import StructField from pyspark.sql.types import StringType, IntegerType schema = StructType([StructField("value", IntegerType(), True)]) my_list = [1, 2, 3, 4] rdd = sc.parallelize(my_list) df …

Total answers: 2
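
A sketch of the usual cause and fix: a one-field StructType expects row-like records, so bare ints need to be wrapped in tuples (or Rows) before createDataFrame.

    from pyspark.sql.types import StructType, StructField, IntegerType

    schema = StructType([StructField("value", IntegerType(), True)])
    my_list = [1, 2, 3, 4]

    # Wrap each value so every record matches the one-field schema
    df = spark.createDataFrame([(x,) for x in my_list], schema)
    df.show()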

Spark SQL search inside an array for a struct

Spark SQL search inside an array for a struct Question: My data structure is defined approximately as follows: schema = StructType([ # … fields skipped StructField("extra_features", ArrayType(StructType([ StructField("key", StringType(), False), StructField("value", StringType(), True) ])), nullable = False)], ) Now, I'd like to search for entries in a data frame where a struct {"key": "somekey", "value": …

Total answers: 2
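
One way to express the search, sketched with assumed literals: a higher-order exists over the array (available since Spark 2.4), written here through expr.

    from pyspark.sql import functions as F

    # Keep rows where some element of extra_features has the wanted key/value pair
    matches = df.filter(
        F.expr("exists(extra_features, x -> x.key = 'somekey' AND x.value = 'somevalue')")
    )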

Filtering a pyspark dataframe using isin by exclusion

Filtering a pyspark dataframe using isin by exclusion Question: I am trying to get all rows within a dataframe where a column's value is not within a list (so filtering by exclusion). As an example: df = sqlContext.createDataFrame([('1','a'),('2','b'),('3','b'),('4','c'),('5','d')], schema=('id','bar')) I get the data frame: +---+---+ | id|bar| +---+---+ | 1| a| | 2| b| | …

Total answers: 4
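
A short sketch of the usual pattern: negate isin with the ~ operator.

    from pyspark.sql import functions as F

    # Keep only rows whose bar is NOT in the exclusion list
    filtered = df.filter(~F.col("bar").isin(["a", "b"]))
    filtered.show()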