Dynamically create pyspark dataframes according to a condition

Question:

I have a pyspark dataframe store_df :-

store ID Div
637 4000000970 Pac
637 4000000435 Pac
637 4000055542 Pac
637 4000042206 Pac
638 2200015935 Pac
638 2200000483 Pac
638 4000014114 Pac
640 4000000162 Pac
640 2200000067 Pac
642 2200000067 Mac
642 4000044148 Mac
642 4000014114 Mac

For each store in store_df, I want to dynamically remove its IDs (the ones present in store_df) from the dataframe final_list, matching on Div.

final_list pyspark df :-

Div ID Rank Category
Pac 4000000970 1 A
Pac 4000000432 2 A
Pac 4000000405 3 A
Pac 4000042431 4 A
Pac 2200028596 5 B
Pac 4000000032 6 A
Pac 2200028594 7 B
Pac 4000014114 8 B
Pac 2230001789 9 D
Pac 2200001789 10 C
Pac 2200001787 11 D
Pac 2200001786 12 C
Mac 2200001789 1 C
Mac 2200001787 2 D
Mac 2200001786 3 C

For example, for store 637 the upd_final_list should look like this (ID 4000000970 eliminated):-

Div ID Rank Category
Pac 4000000432 2 A
Pac 4000000405 3 A
Pac 4000042431 4 A
Pac 2200028596 5 B
Pac 4000000032 6 A
Pac 2200028594 7 B
Pac 4000014114 8 B
Pac 2230001789 9 D
Pac 2200001789 10 C
Pac 2200001787 11 D
Pac 2200001786 12 C

Likewise, this list is to be customised for every other store based on its IDs.
How do I do this?

Asked By: Scope


Answers:

I can’t test it, but if I understood correctly it should be something like this:

from pyspark.sql import functions as f

store_ids = [637, 638, 640]  # stores to process
for store_id in store_ids:
    # Div of this store (matching on store_df's "store" column, not "ID")
    div_type = store_df.select("Div").where(f.col("store") == store_id).collect()[0][0]
    # IDs carried by this store, to be removed from final_list
    store_item_ids = [r["ID"] for r in store_df.where(f.col("store") == store_id).select("ID").collect()]
    upd_final_list = final_list.where(
        (f.col("Div") == div_type) & (~f.col("ID").isin(store_item_ids))
    )
Answered By: Axeltherabbit
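The per-store rule the answer above is aiming at can be sketched without Spark, using a subset of the sample data from the question (plain Python, illustrative names; the row shapes follow the question's tables):

```python
# Rows from the question's store_df: (store, ID, Div)
store_rows = [
    (637, "4000000970", "Pac"), (637, "4000000435", "Pac"),
    (637, "4000055542", "Pac"), (637, "4000042206", "Pac"),
]
# A subset of final_list: (Div, ID, Rank)
final_rows = [
    ("Pac", "4000000970", 1), ("Pac", "4000000432", 2), ("Pac", "4000000405", 3),
]

def upd_final_list(store_id):
    # Divs and IDs belonging to this store
    divs = {div for s, _id, div in store_rows if s == store_id}
    ids = {_id for s, _id, div in store_rows if s == store_id}
    # keep rows of the store's Div whose ID the store does not already have
    return [(div, _id, rank) for div, _id, rank in final_rows
            if div in divs and _id not in ids]

print(upd_final_list(637))  # ID 4000000970 is dropped; the other Pac rows survive
```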
from pyspark.sql.functions import col

store_div = store_df.select('store', 'Div').distinct().collect()

fc = 0
for i in store_div:
    # rows of store_df for this (store, Div) pair
    store_filter = store_df.filter((col('store') == i[0]) & (col('Div') == i[1]))
    if fc == 0:
        Updated_final_list = final_list.join(store_filter, ["ID", "Div"], "left_anti")
    else:
        Updated_final_list = Updated_final_list.join(store_filter, ["ID", "Div"], "left_anti")
    fc += 1
Answered By: Karuppaiya

This works. Broadcast variables are read-only shared variables that are cached and made available on every node in the cluster so that tasks can access them.

from pyspark.sql.functions import broadcast, col, collect_set, explode, expr

# store_list_added was not defined in the original post; a likely construction
# is one row per Div holding the list of stores that carry that Div:
store_list_added = store_df.groupBy('Div').agg(collect_set('store').alias('store_list'))

# attach each Div's stores to final_list, then produce one row per (store, final_list row)
exploded_df = (final_list.join(broadcast(store_list_added), ['Div'], 'left')
               .withColumn('store', explode('store_list')).drop('store_list'))
# drop the (store, ID) pairs the store already has, then re-rank what is left per store
Updated_final_list = exploded_df.join(store_df, ['store', 'ID'], 'left_anti')
Updated_final_list = (Updated_final_list.withColumn('Rank', col('Rank').cast('int'))
                      .withColumn('New_Rank', expr('row_number() over (partition by store order by Rank asc)'))
                      .drop('Rank'))
Answered By: Scope