apache-spark-sql

Explode dates and backfill rows in pyspark dataframe

Question: I have this dataframe:

+---+----------+------+
| id|      date|amount|
+---+----------+------+
|123|2022-11-11|100.00|
|123|2022-11-12|100.00|
|123|2022-11-13|100.00|
|123|2022-11-14|200.00|
|456|2022-11-14|300.00|
|456|2022-11-15|300.00|
|456|2022-11-16|300.00|
|789|2022-11-11|400.00|
|789|2022-11-12|500.00|
+---+----------+------+

I need to create new records for each date until current_date() - 2, and the value that gets populated must be the most recent one. For …
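A minimal sketch of one way to do this, assuming date is already a DateType column: build the full calendar per id with sequence(), left-join the known amounts back, and forward-fill the gaps with last(..., ignorenulls=True):

from pyspark.sql import functions as F, Window

# one row per id and per calendar day, from the id's first date
# up to current_date() - 2
bounds = df.groupBy("id").agg(F.min("date").alias("start"))
calendar = bounds.withColumn(
    "date", F.explode(F.sequence("start", F.date_sub(F.current_date(), 2)))
).drop("start")

# join the known amounts back, then forward-fill with the most recent
# non-null amount per id
w = Window.partitionBy("id").orderBy("date") \
          .rowsBetween(Window.unboundedPreceding, 0)
result = (calendar.join(df, ["id", "date"], "left")
                  .withColumn("amount", F.last("amount", ignorenulls=True).over(w)))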

Total answers: 1

Fill nulls with values from another column in PySpark

Question: I have a dataset

col_id col_2 col_3 col_id_b
ABC111 shfhs 34775 null
ABC112 shfhe 34775 DEF345
ABC112 shfhs 34775 GFR563
ABC112 shfgh 34756 TRS572
ABC113 shfdh 34795 null
ABC114 shfhs 34770 null

I am trying to create a new column that is identical to col_id_b, …
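Since the excerpt is cut off, here is a minimal sketch of the most common reading of the title: copy col_id_b into a new column (called col_id_c here, a hypothetical name) and fall back to col_id wherever col_id_b is null:

from pyspark.sql import functions as F

# coalesce takes the first non-null value: col_id_b when present,
# otherwise col_id (assumed fallback column; the question is truncated)
df2 = df.withColumn("col_id_c", F.coalesce("col_id_b", "col_id"))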

Total answers: 1

Fill column value based on join in Pyspark dataframe

Question: I have a dataframe created with the code

df = sc.parallelize([
    (123, 2345, 25, ""), (123, 2345, 29, "NY"), (123, 5422, 67, "NY"),
    (123, 9422, 67, "NY"), (123, 3581, 98, "NY"), (231, 4322, 77, ""),
    (231, 4322, 99, "Paris"), (231, 8342, 45, "Paris")
]).toDF(["userid", "transactiontime", "zip", "location"])

+------+---------------+---+--------+
|userid|transactiontime|zip|location|
+------+---------------+---+--------+
|   123|           2345| 25|        |
|   123|           2345| 29|      NY|
|   123|           5422| 67|      NY|
|   123|           9422| 67|      NY|
|   123|           3581| 98| …
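A minimal sketch, assuming the goal is to replace an empty location with the non-empty location already known for the same userid:

from pyspark.sql import functions as F, Window

# for each userid, take the "largest" location string in the group;
# any non-empty value sorts above the empty string
w = Window.partitionBy("userid")
df2 = df.withColumn(
    "location",
    F.when(F.col("location") == "", F.max("location").over(w))
     .otherwise(F.col("location"))
)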

Total answers: 1

Do consecutive window functions with the same partitioning cause additional shuffles?

Question: Suppose I have two different windows with the same partitioning:

window1 = Window.partitionBy("id")
window2 = Window.partitionBy("id").orderBy("date")

And then I call several consecutive window functions using them:

(df.withColumn("col1", F.sum("x").over(window1))
   .withColumn("col2", F.first("x").over(window2)))

And suppose df is not partitioned by id. Will the computation of col2 …
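One way to check empirically is to build both window columns and count the Exchange nodes in the physical plan; with an identical partitioning key the shuffle is typically reused and only an extra sort is added, but the plan is the authority:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

window1 = Window.partitionBy("id")
window2 = Window.partitionBy("id").orderBy("date")

out = (df.withColumn("col1", F.sum("x").over(window1))
         .withColumn("col2", F.first("x").over(window2)))

# each Exchange node in the plan corresponds to a shuffle
out.explain(mode="formatted")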

Total answers: 1

PySpark dataframe : Add new column For Each Unique ID and Column Condition

Question: I am trying to assign a value of 1 in a new column "new_col" based on a condition over another column and the id column. Here's my dataframe: I'd like to add a new column that would get 1 if "l1" or "l3" is in …
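The excerpt is truncated, so as a sketch assume the values "l1"/"l3" live in a column called label (a hypothetical name) and the flag should apply to every row of the matching id:

from pyspark.sql import functions as F, Window

# flag the rows where label is l1 or l3, then propagate the flag to the
# whole id group with a windowed max
w = Window.partitionBy("id")
df2 = df.withColumn(
    "new_col",
    F.max(F.when(F.col("label").isin("l1", "l3"), 1).otherwise(0)).over(w)
)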

Total answers: 1

How to use SparkSQL Function inside DataFrame Where/Filter condition?

Question: I'm using PySpark. For example, I have a simple DataFrame "df" with 1 column "Col1" which contains lots of blank spaces, as below:

Col1
" - "
"abc "
" xy"

I want to take all the rows that are not "-" after trim. In …
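A sketch of the usual pattern: SparkSQL functions can be called directly inside where/filter, either through the functions module or as a SQL expression string:

from pyspark.sql import functions as F

# keep only the rows whose trimmed value is not "-"
result = df.where(F.trim(F.col("Col1")) != "-")

# equivalent SQL-expression form
result = df.where("trim(Col1) != '-'")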

Total answers: 1

Remove any row with at least 1 NA with PySpark

Question: I have a PySpark dataframe and I would like to remove any row containing at least one NA. I know how to do so only for one column (code below). How can I do the same for all columns of the dataframe? Reproducible example # …
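A minimal sketch, assuming "NA" means null values: dropna (or na.drop) with how="any" removes a row as soon as any column is null, which generalizes the single-column case to the whole dataframe:

# how="any" is the default: drop a row if at least one column is null
df_clean = df.dropna(how="any")

# same thing through the na interface
df_clean = df.na.drop(how="any")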

Total answers: 2

PySpark – assigning group id based on group member count

Question: I have a dataframe where I want to assign an id for each window partition and for every 5 rows. Meaning, the id should increase/change when the partition has a different value or when the number of rows in a partition exceeds 5. …
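A sketch of one approach, with hypothetical column names part_col (the partition value) and order_col (the ordering): number the rows inside each partition, bucket them in groups of 5, then rank the (partition, bucket) pairs into a global id:

from pyspark.sql import functions as F, Window

w = Window.partitionBy("part_col").orderBy("order_col")
df2 = (df
       .withColumn("rn", F.row_number().over(w))
       # 0-based bucket index: rows 1-5 -> 0, rows 6-10 -> 1, ...
       .withColumn("bucket", F.floor((F.col("rn") - 1) / 5))
       # one increasing id per (part_col, bucket); note the unpartitioned
       # window pulls everything onto a single partition
       .withColumn("group_id",
                   F.dense_rank().over(Window.orderBy("part_col", "bucket"))))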

Total answers: 2

Calendarized cost by year and month in Spark

Question: I am fairly new to PySpark and looking for the best way to perform the following calculations: I have the following data frame:

+-------------+------------+--------------+------------+------------+-----+
|invoice_month|invoice_year|start_date_key|end_date_key|invoice_days| cost|
+-------------+------------+--------------+------------+------------+-----+
|           11|        2007|      20071022|    20071120|          30|  100|
|           12|        2007|      20071121|    20071220|          30|  160|
|            5|        2014|      20140423|    20140522| …
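The question is cut off, but a common version of a "calendarized" cost is to spread each invoice over the calendar days it covers and re-aggregate by year and month; a rough sketch, assuming start_date_key/end_date_key are yyyyMMdd integers:

from pyspark.sql import functions as F

days = (df
    .withColumn("start", F.to_date(F.col("start_date_key").cast("string"), "yyyyMMdd"))
    .withColumn("end", F.to_date(F.col("end_date_key").cast("string"), "yyyyMMdd"))
    # one row per covered day, each carrying its share of the invoice cost
    .withColumn("day", F.explode(F.sequence("start", "end")))
    .withColumn("daily_cost", F.col("cost") / F.col("invoice_days")))

monthly = (days
    .groupBy(F.year("day").alias("year"), F.month("day").alias("month"))
    .agg(F.round(F.sum("daily_cost"), 2).alias("calendarized_cost")))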

Total answers: 2