Dynamically create pyspark dataframes according to a condition
Question:
I have a PySpark dataframe `store_df`:
store | ID | Div |
---|---|---|
637 | 4000000970 | Pac |
637 | 4000000435 | Pac |
637 | 4000055542 | Pac |
637 | 4000042206 | Pac |
638 | 2200015935 | Pac |
638 | 2200000483 | Pac |
638 | 4000014114 | Pac |
640 | 4000000162 | Pac |
640 | 2200000067 | Pac |
642 | 2200000067 | Mac |
642 | 4000044148 | Mac |
642 | 4000014114 | Mac |
For each store in `store_df`, I want to dynamically remove that store's `ID` values from the dataframe `final_list`, matching on `Div`.
`final_list` PySpark df:
Div | ID | Rank | Category |
---|---|---|---|
Pac | 4000000970 | 1 | A |
Pac | 4000000432 | 2 | A |
Pac | 4000000405 | 3 | A |
Pac | 4000042431 | 4 | A |
Pac | 2200028596 | 5 | B |
Pac | 4000000032 | 6 | A |
Pac | 2200028594 | 7 | B |
Pac | 4000014114 | 8 | B |
Pac | 2230001789 | 9 | D |
Pac | 2200001789 | 10 | C |
Pac | 2200001787 | 11 | D |
Pac | 2200001786 | 12 | C |
Mac | 2200001789 | 1 | C |
Mac | 2200001787 | 2 | D |
Mac | 2200001786 | 3 | C |
For example, for store 637 the `upd_final_list` should look like this (`ID` 4000000970 eliminated):
Div | ID | Rank | Category |
---|---|---|---|
Pac | 4000000432 | 2 | A |
Pac | 4000000405 | 3 | A |
Pac | 4000042431 | 4 | A |
Pac | 2200028596 | 5 | B |
Pac | 4000000032 | 6 | A |
Pac | 2200028594 | 7 | B |
Pac | 4000014114 | 8 | B |
Pac | 2230001789 | 9 | D |
Pac | 2200001789 | 10 | C |
Pac | 2200001787 | 11 | D |
Pac | 2200001786 | 12 | C |
Likewise, this list is to be customised for the other stores based on their `ID` values.
How do I do this?
Answers:
I can't test it, but if I understood correctly, it should be something like this:
```python
from pyspark.sql import functions as f

store_ids = [637, 123, 865]
for store_id in store_ids:
    # Div of this store (first matching row in store_df).
    div_type = store_df.select("Div").where(f.col("store") == store_id).collect()[0][0]
    # IDs this store already carries.
    store_items = store_df.where(f.col("store") == store_id).select("ID")
    upd_final_list = (final_list
                      .where(f.col("Div") == div_type)
                      .join(store_items, on="ID", how="left_anti"))
```
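For reference, the intent of that loop — look up each store's `Div`, then drop the store's own IDs from that division's list — can be sketched in plain Python, with made-up in-memory rows standing in for the DataFrames (no Spark needed to follow the logic):

```python
# Hypothetical in-memory stand-ins for store_df and final_list.
store_df = [
    {"store": 637, "ID": 4000000970, "Div": "Pac"},
    {"store": 642, "ID": 2200001789, "Div": "Mac"},
]
final_list = [
    {"Div": "Pac", "ID": 4000000970, "Rank": 1, "Category": "A"},
    {"Div": "Pac", "ID": 4000000432, "Rank": 2, "Category": "A"},
    {"Div": "Mac", "ID": 2200001789, "Rank": 1, "Category": "C"},
]

per_store = {}
for store in {r["store"] for r in store_df}:
    # IDs and divisions belonging to this store.
    own_ids = {r["ID"] for r in store_df if r["store"] == store}
    divs = {r["Div"] for r in store_df if r["store"] == store}
    # Keep the division's rows, minus the store's own IDs.
    per_store[store] = [
        r for r in final_list if r["Div"] in divs and r["ID"] not in own_ids
    ]

print([r["ID"] for r in per_store[637]])  # → [4000000432]
```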
```python
from pyspark.sql.functions import col

store_div = store_df.select('store', 'Div').distinct().collect()
fc = 0
for i in store_div:
    store_filter = store_df.filter((col('store') == i[0]) & (col('Div') == i[1]))
    if fc == 0:
        Updated_final_list = final_list.join(store_filter, ["ID", "Div"], "left_anti")
    else:
        Updated_final_list = Updated_final_list.join(store_filter, ["ID", "Div"], "left_anti")
    fc += 1
```
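What the `left_anti` join above does: keep the rows of the left table whose join key has no match in the right table. A plain-Python sketch with a couple of made-up rows:

```python
# Hypothetical in-memory rows standing in for final_list.
final_list = [
    {"Div": "Pac", "ID": 4000000970, "Rank": 1, "Category": "A"},
    {"Div": "Pac", "ID": 4000000432, "Rank": 2, "Category": "A"},
    {"Div": "Pac", "ID": 4000014114, "Rank": 8, "Category": "B"},
]
# Rows of store_df for one store (the per-store filter in the loop above).
store_filter = [
    {"store": 637, "ID": 4000000970, "Div": "Pac"},
]

# left_anti on ["ID", "Div"]: drop left rows whose key appears on the right.
keys = {(r["ID"], r["Div"]) for r in store_filter}
updated_final_list = [r for r in final_list if (r["ID"], r["Div"]) not in keys]

print([r["ID"] for r in updated_final_list])  # → [4000000432, 4000014114]
```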
This works. Broadcast variables are read-only shared variables that are cached and made available on every node in the cluster so that tasks can access them.
```python
from pyspark.sql.functions import broadcast, col, collect_set, explode, expr

# One row per Div with the list of stores that carry that Div.
store_list_added = store_df.groupBy('Div').agg(collect_set('store').alias('store_list'))
# Pair every final_list row with every store of its Div, then drop what the store already carries.
exploded_df = final_list.join(broadcast(store_list_added), ['Div'], 'left').withColumn('store', explode('store_list')).drop('store_list')
Updated_final_list = exploded_df.join(store_df, ['store', 'ID'], 'left_anti')
# Re-rank what is left per store, ordered by the old Rank.
Updated_final_list = Updated_final_list.withColumn('Rank', col('Rank').cast('int')).withColumn('New_Rank', expr('row_number() over (partition by store order by Rank asc)')).drop('Rank')
```
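The explode-then-anti-join approach can be traced in plain Python, with made-up rows and `itertools.groupby` standing in for the `row_number()` window function:

```python
from itertools import groupby

# Hypothetical in-memory stand-ins for the DataFrames above.
store_df = [
    {"store": 637, "ID": 4000000970, "Div": "Pac"},
    {"store": 638, "ID": 4000014114, "Div": "Pac"},
]
final_list = [
    {"Div": "Pac", "ID": 4000000970, "Rank": 1},
    {"Div": "Pac", "ID": 4000000432, "Rank": 2},
    {"Div": "Pac", "ID": 4000014114, "Rank": 8},
]

# 1. "Explode": pair every final_list row with every store of its Div.
stores_per_div = {}
for row in store_df:
    stores_per_div.setdefault(row["Div"], set()).add(row["store"])
exploded = [
    {**r, "store": s} for r in final_list for s in stores_per_div.get(r["Div"], ())
]

# 2. Anti-join on (store, ID): drop pairs the store already carries.
carried = {(r["store"], r["ID"]) for r in store_df}
kept = [r for r in exploded if (r["store"], r["ID"]) not in carried]

# 3. Re-rank per store, ordered by the old Rank (row_number over a partition).
kept.sort(key=lambda r: (r["store"], r["Rank"]))
result = []
for _, rows in groupby(kept, key=lambda r: r["store"]):
    for new_rank, r in enumerate(rows, start=1):
        result.append({"store": r["store"], "ID": r["ID"], "New_Rank": new_rank})
```

Each store ends up with the division's list minus its own IDs, renumbered from 1.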