How to subtract all column values of two PySpark dataframes?
Question:
Hi, I have a case where I need to subtract all column values between two PySpark dataframes, like this:
df1:
col1 col2 ... col100
1 2 ... 100
df2:
col1 col2 ... col100
5 4 ... 20
And I want to get the final dataframe from df1 - df2:
new df:
col1 col2 ... col100
-4 -2 ... 80
I know a possible solution is to subtract two columns, like:
new_df = df1.withColumn('col1', df1['col1'] - df2['col1'])
But I have 101 columns; how can I simply traverse all of them and avoid writing 101 similar expressions?
Any answers are super appreciated!
In short: for 101 columns, how do I iterate over every column and subtract its values?
Answers:
You can use a for loop to iterate over the columns and replace each one with the subtracted values. Here’s one way to do it in PySpark:
columns = df1.columns
for col in columns:
    df1 = df1.withColumn(col, df1[col] - df2[col])
This will create a new dataframe with the subtracted values for each column.
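One caveat worth flagging: Spark can usually only resolve df2[col] inside a transformation on df1 when the two dataframes share a lineage; for two independently created dataframes this tends to fail with an AnalysisException. A minimal sketch of one workaround, assuming both dataframes have the same row count and row order (the row_id name below is just illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Give each dataframe a positional row index so rows can be matched up.
# monotonically_increasing_id() is not contiguous, so rank it with
# row_number(); note this window pulls all rows into a single partition.
w = Window.orderBy(F.monotonically_increasing_id())
a = df1.withColumn("row_id", F.row_number().over(w)).alias("a")
b = df2.withColumn("row_id", F.row_number().over(w)).alias("b")

# Join on the index, then subtract column by column, using qualified
# names so the duplicated column names stay unambiguous.
diff = a.join(b, on="row_id").select(
    *[(F.col("a." + c) - F.col("b." + c)).alias(c) for c in df1.columns]
)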
Edit: (to address @Kay’s comments)
The error you’re encountering is due to a duplicate column name in the output dataframe. You can resolve it by giving each new column a distinct name, for example by appending a suffix inside withColumn:
columns = df1.columns
for col in columns:
    df1 = df1.withColumn(col + "_diff", df1[col] - df2[col])
That way you will add a suffix "_diff" to the new columns in the output dataframe to avoid the duplicate column name issue.
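If you would rather keep the original column names in the final schema, a small follow-up sketch (assuming the loop above has already added the "_diff" columns) is to select just those columns and strip the suffix:

# Keep only the "_diff" columns and rename them back to the originals.
result = df1.select(*[df1[c + "_diff"].alias(c) for c in columns])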
Within a single select with a Python list comprehension:
columns = df1.columns
df1 = df1.select(*[(df1[col] - df2[col]).alias(col) for col in columns])
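A quick self-contained demo with three columns instead of 101, mirroring the question’s sample data (crossJoin works as a row-pairing trick here only because each side has exactly one row; multi-row dataframes would need the row-index join sketched above):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# One-row dataframes mirroring the question's example, trimmed to 3 columns.
df1 = spark.createDataFrame([(1, 2, 100)], ["col1", "col2", "col100"])
df2 = spark.createDataFrame([(5, 4, 20)], ["col1", "col2", "col100"])

a, b = df1.alias("a"), df2.alias("b")
result = a.crossJoin(b).select(
    *[(F.col("a." + c) - F.col("b." + c)).alias(c) for c in df1.columns]
)
result.show()
# +----+----+------+
# |col1|col2|col100|
# +----+----+------+
# |  -4|  -2|    80|
# +----+----+------+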