Dynamically create pyspark dataframes according to a condition
Question:
I have a PySpark dataframe `store_df`:
store | ID | Div |
---|---|---|
637 | 4000000970 | Pac |
637 | 4000000435 | Pac |
637 | 4000055542 | Pac |
637 | 4000042206 | Pac |
638 | 2200015935 | Pac |
638 | 2200000483 | Pac |
638 | 4000014114 | Pac |
640 | 4000000162 | Pac |
640 | 2200000067 | Pac |
642 | 2200000067 | Mac |
642 | 4000044148 | Mac |
642 | 4000014114 | Mac |
For each store in `store_df`, I want to dynamically remove that store's `ID` values from the dataframe `final_list`, matching on `Div`.
`final_list` PySpark df:
Div | ID | Rank | Category |
---|---|---|---|
Pac | 4000000970 | 1 | A |
Pac | 4000000432 | 2 | A |
Pac | 4000000405 | 3 | A |
Pac | 4000042431 | 4 | A |
Pac | 2200028596 | 5 | B |
Pac | 4000000032 | 6 | A |
Pac | 2200028594 | 7 | B |
Pac | 4000014114 | 8 | B |
Pac | 2230001789 | 9 | D |
Pac | 2200001789 | 10 | C |
Pac | 2200001787 | 11 | D |
Pac | 2200001786 | 12 | C |
Mac | 2200001789 | 1 | C |
Mac | 2200001787 | 2 | D |
Mac | 2200001786 | 3 | C |
For example, for store 637 the `upd_final_list` should look like this (`ID` 4000000970 eliminated):
Div | ID | Rank | Category |
---|---|---|---|
Pac | 4000000432 | 2 | A |
Pac | 4000000405 | 3 | A |
Pac | 4000042431 | 4 | A |
Pac | 2200028596 | 5 | B |
Pac | 4000000032 | 6 | A |
Pac | 2200028594 | 7 | B |
Pac | 4000014114 | 8 | B |
Pac | 2230001789 | 9 | D |
Pac | 2200001789 | 10 | C |
Pac | 2200001787 | 11 | D |
Pac | 2200001786 | 12 | C |
Likewise, this list is to be customised for the other stores based on their `ID` values.
How do I do this?
Answers:
I can't test it, but if I understood correctly, it should be something like this:
```python
from pyspark.sql import functions as f

store_ids = [637, 123, 865]
for store_id in store_ids:
    # Div of this store (first matching row in store_df).
    div_type = store_df.select("Div").where(f.col("store") == store_id).collect()[0][0]
    # IDs this store already carries.
    store_items = store_df.where(f.col("store") == store_id).select("ID")
    upd_final_list = (final_list
                      .where(f.col("Div") == div_type)
                      .join(store_items, on="ID", how="left_anti"))
```
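For reference, the intent of that loop — look up each store's `Div`, then drop the store's own IDs from that division's list — can be sketched in plain Python, with made-up in-memory rows standing in for the DataFrames (no Spark needed to follow the logic):

```python
# Hypothetical in-memory stand-ins for store_df and final_list.
store_df = [
    {"store": 637, "ID": 4000000970, "Div": "Pac"},
    {"store": 642, "ID": 2200001789, "Div": "Mac"},
]
final_list = [
    {"Div": "Pac", "ID": 4000000970, "Rank": 1, "Category": "A"},
    {"Div": "Pac", "ID": 4000000432, "Rank": 2, "Category": "A"},
    {"Div": "Mac", "ID": 2200001789, "Rank": 1, "Category": "C"},
]

per_store = {}
for store in {r["store"] for r in store_df}:
    # IDs and divisions belonging to this store.
    own_ids = {r["ID"] for r in store_df if r["store"] == store}
    divs = {r["Div"] for r in store_df if r["store"] == store}
    # Keep the division's rows, minus the store's own IDs.
    per_store[store] = [
        r for r in final_list if r["Div"] in divs and r["ID"] not in own_ids
    ]

print([r["ID"] for r in per_store[637]])  # → [4000000432]
```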
```python
from pyspark.sql.functions import col

store_div = store_df.select('store', 'Div').distinct().collect()
fc = 0
for i in store_div:
    store_filter = store_df.filter((col('store') == i[0]) & (col('Div') == i[1]))
    if fc == 0:
        Updated_final_list = final_list.join(store_filter, ["ID", "Div"], "left_anti")
    else:
        Updated_final_list = Updated_final_list.join(store_filter, ["ID", "Div"], "left_anti")
    fc += 1
```
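What the `left_anti` join above does: keep the rows of the left table whose join key has no match in the right table. A plain-Python sketch with a couple of made-up rows:

```python
# Hypothetical in-memory rows standing in for final_list.
final_list = [
    {"Div": "Pac", "ID": 4000000970, "Rank": 1, "Category": "A"},
    {"Div": "Pac", "ID": 4000000432, "Rank": 2, "Category": "A"},
    {"Div": "Pac", "ID": 4000014114, "Rank": 8, "Category": "B"},
]
# Rows of store_df for one store (the per-store filter in the loop above).
store_filter = [
    {"store": 637, "ID": 4000000970, "Div": "Pac"},
]

# left_anti on ["ID", "Div"]: drop left rows whose key appears on the right.
keys = {(r["ID"], r["Div"]) for r in store_filter}
updated_final_list = [r for r in final_list if (r["ID"], r["Div"]) not in keys]

print([r["ID"] for r in updated_final_list])  # → [4000000432, 4000014114]
```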
This works. Broadcast variables are read-only shared variables that are cached and made available on every node in the cluster so that tasks can access them.
```python
from pyspark.sql.functions import broadcast, col, collect_set, explode, expr

# One row per Div with the list of stores that carry that Div.
store_list_added = store_df.groupBy('Div').agg(collect_set('store').alias('store_list'))
# Pair every final_list row with every store of its Div, then drop what the store already carries.
exploded_df = final_list.join(broadcast(store_list_added), ['Div'], 'left').withColumn('store', explode('store_list')).drop('store_list')
Updated_final_list = exploded_df.join(store_df, ['store', 'ID'], 'left_anti')
# Re-rank what is left per store, ordered by the old Rank.
Updated_final_list = Updated_final_list.withColumn('Rank', col('Rank').cast('int')).withColumn('New_Rank', expr('row_number() over (partition by store order by Rank asc)')).drop('Rank')
```
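The explode-then-anti-join approach can be traced in plain Python, with made-up rows and `itertools.groupby` standing in for the `row_number()` window function:

```python
from itertools import groupby

# Hypothetical in-memory stand-ins for the DataFrames above.
store_df = [
    {"store": 637, "ID": 4000000970, "Div": "Pac"},
    {"store": 638, "ID": 4000014114, "Div": "Pac"},
]
final_list = [
    {"Div": "Pac", "ID": 4000000970, "Rank": 1},
    {"Div": "Pac", "ID": 4000000432, "Rank": 2},
    {"Div": "Pac", "ID": 4000014114, "Rank": 8},
]

# 1. "Explode": pair every final_list row with every store of its Div.
stores_per_div = {}
for row in store_df:
    stores_per_div.setdefault(row["Div"], set()).add(row["store"])
exploded = [
    {**r, "store": s} for r in final_list for s in stores_per_div.get(r["Div"], ())
]

# 2. Anti-join on (store, ID): drop pairs the store already carries.
carried = {(r["store"], r["ID"]) for r in store_df}
kept = [r for r in exploded if (r["store"], r["ID"]) not in carried]

# 3. Re-rank per store, ordered by the old Rank (row_number over a partition).
kept.sort(key=lambda r: (r["store"], r["Rank"]))
result = []
for _, rows in groupby(kept, key=lambda r: r["store"]):
    for new_rank, r in enumerate(rows, start=1):
        result.append({"store": r["store"], "ID": r["ID"], "New_Rank": new_rank})
```

Each store ends up with the division's list minus its own IDs, renumbered from 1.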