PySpark: multiply some column values only when a condition is met, otherwise keep the same value
Question:
I have the following dataset
id  col1  ...  col10  quantity
0   2          3      0
1   1          4      2
2   0          4      2
3   2          2      0
I would like to multiply the values of col1 through col10 by 2, but only when quantity is equal to 2; otherwise I would like to keep the original value. Here is an example of the desired result:
id  col1  ...  col10  quantity
0   2          3      0
1   2          8      2
2   0          8      2
3   2          2      0
I wrote the following code for now:
cols_names = df.drop('id','quantity').columns
df = df.withColumn(
    "arr",
    F.when(F.col('quantity') == 2,
           F.struct(*[(F.col(x) * 2).alias(x) for x in cols_names]))
).select("id", "quantity", "arr.*")
The only problem with this approach is that when the condition is not met I get null values instead of the original ones. How can I keep the old value when the condition is not met? An easier way to do this would also be welcome.
Answers:
You need to use the otherwise clause together with the when clause. If you don't provide an otherwise clause, the column defaults to None (null) for rows where the condition does not match.
df = df.withColumn(
    "arr",
    F.when(
        F.col('quantity') == 2,
        F.struct(*[(F.col(x) * 2).alias(x) for x in cols_names])
    ).otherwise(
        F.struct(*[F.col(x).alias(x) for x in cols_names])
    )
).select("id", "quantity", "arr.*")