How to subtract all column values of two PySpark dataframes?

Question:

Hi, I have a case where I need to subtract the values of every column between two PySpark dataframes, like this:
df1:

col1 col2 ... col100
 1    2   ...  100

df2:

col1 col2 ... col100
5     4   ...  20

And I want to get the final dataframe with df1 - df2:
new df:

col1 col2  ... col100
-4     -2  ...   80

I found that one possible solution is to subtract the columns one at a time, like:

new_df = df1.withColumn('col1', df1['col1'] - df2['col1'])

But I have 101 columns. How can I simply traverse all of them and avoid writing 101 near-identical expressions?
Any answers are super appreciated!


Asked By: Kay


Answers:

You can use a for loop to iterate over the columns and overwrite each column in the dataframe with the subtracted values. Here’s one way to do it in PySpark:

# Iterate over every column name and replace each column in df1
# with the element-wise difference df1[col] - df2[col].
columns = df1.columns

for col in columns:
    df1 = df1.withColumn(col, df1[col] - df2[col])

This rebinds df1 to a new dataframe in which every column holds the subtracted values.
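
Note that referencing df2[col] inside df1.withColumn only resolves if the two dataframes share lineage; in general, Spark rows have no positional order across dataframes, so you must first join them on some key. Below is a minimal, self-contained sketch under the assumption that the rows of both dataframes line up in order; the fabricated row index _rid and the "_2" suffix are illustrative names, not part of the original answer:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for df1 and df2 (two columns instead of 101).
df1 = spark.createDataFrame([(1, 2)], ["col1", "col2"])
df2 = spark.createDataFrame([(5, 4)], ["col1", "col2"])

# Spark rows have no inherent order, so fabricate a row index on each
# side purely to pair rows up. This assumes both dataframes are already
# in the intended order; if you have a real join key, use that instead.
w = Window.orderBy(F.monotonically_increasing_id())
left = df1.withColumn("_rid", F.row_number().over(w))
right = (df2.select([F.col(c).alias(c + "_2") for c in df2.columns])
            .withColumn("_rid", F.row_number().over(w)))

joined = left.join(right, "_rid")

# Subtract column by column, then keep only the original column names.
for c in df1.columns:
    joined = joined.withColumn(c, F.col(c) - F.col(c + "_2"))
result = joined.select(df1.columns)

result.show()  # col1 = -4, col2 = -2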

Edit (to address @Kay’s comments):
The error you’re encountering is caused by a duplicate column name in the output dataframe. You can resolve it by giving each new column a distinct name, for example by appending a suffix to the column name passed to withColumn:

columns = df1.columns

for col in columns:
    # Write the difference into a new column named "<col>_diff"
    # instead of overwriting "col", avoiding the name clash.
    df1 = df1.withColumn(col + "_diff", df1[col] - df2[col])

That way the new columns carry a "_diff" suffix in the output dataframe, which avoids the duplicate column name issue.
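
If you then want the result back under the original column names, one way (a sketch, assuming the loop above has already produced the *_diff columns) is to select them with an alias:

from pyspark.sql import functions as F

# Keep only the *_diff columns, renamed back to the original names.
diff_df = df1.select([F.col(c + "_diff").alias(c) for c in columns])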

Answered By: Sold Out

Within a single select, using a Python list comprehension:

columns = df1.columns

# Build all subtracted columns at once and alias each back to its name.
df1 = df1.select(*[(df1[col] - df2[col]).alias(col) for col in columns])
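
The same caveat applies as with the loop answer: df1 and df2 must be resolvable in one plan, typically after a join. A sketch of the comprehension run over the joined dataframe from the earlier example, where df2’s columns carry the illustrative "_2" suffix:

from pyspark.sql import functions as F

# One select over the joined dataframe: subtract each pair of columns
# and alias the result back to the original name.
result = joined.select(
    *[(F.col(c) - F.col(c + "_2")).alias(c) for c in df1.columns]
)
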
Answered By: Steven