Using .expr() and arithmetics: how to add multiple (calculated) columns to dataframe within one expression

Question

So I have a spark dataframe with some columns and I want to add some new columns which are the product of the initial columns: new_col1 = col_1 * col_2 & new_col2 = col_3 * col_4.
See the data frames below as an example.

df=

| id | col_1| col_2| col_3| col_4|
|:---|:----:|:-----|:-----|:-----|
|1   | a    | x    |  d1  |  u   |
|2   | b    | y    |  e1  |  v   |
|3   | c    | z    |  f1  |  w   |

df_new =

| id | col_1| col_2| col_3| col_4| new_col1 | new_col2 |
|:---|:----:|:-----|:-----|:-----|:--------:|:--------:| 
|1   | a    | x    |  d1  |  u   |   a*x    |  d1*u    |
|2   | 2    | 3    |  e1  |  v   |   6      |  e1*v    |
|3   | c    | z    |  4   |  2.5 |   c*z    |  10      |

Of course, this would be rather straightforward using

df_new = (
df
.withColumn(newcol_1, col(col_1)*col(col_2))
.withColumn(newcol_2, col(col_3)*col(col_4))
)

However, the number of times that this operation is variable; so the number of new_col’s is variable. Besides this happens in a join. So I would really like to do this all in 1 expression.

My solution was this, I have a config file with a dictionary with columns part of the operations (this is the place where I can add more columns to be calculated) (don’t mind the nesting of the dictionary)

"multiplied_parameters": {
        "mult_parameter1": {"name": "new_col1", "col_parts": ["col_1","col_2"]},
        "mult_parameter2": {"name": "new_col2", "col_parts": ["col_3, col_4"]},
    },

Then I use this for loop to create an expression which produces the expression:
col_1*col_2 as new_col1, ``col_3*col_4 as new_col2

        newcol_lst = []
        for keyval in dictionary["multiplied_parameters"].items():
            newcol_lst.append(
                f'{"*".join(keyval[1]["col_parts"])} as {keyval[1]["name"]}'
                )
        operation = f'{", ".join(newcol_lst)}'

col_lst = ["col_1", "col_2", "col_3", "col_4"]
df_new = (
            df
            .select(
                *col_lst, 
                expr(operation),
            )

This gives me the error.

ParseException: 
mismatched input ',' expecting {<EOF>, '-'}(line 1, pos 33)

== SQL ==
col_1*col_2 as new_col1, col_3*col_4 as new_col2
-----------------------^^^

So the problem is in the way that I concatenate the two operations. I also know that this the problem because when the dictionary only has 1 key (mult_parameter1) then I don’t have any problem.

The question is thus, in essence, how can I use .expr() with two different arithmetics to determine two different calculated columns.

Asked By: Cbru

||

Source

Answer 1

I don’t think that expr can do what you are trying to do. However, you don’t have to concatenate all your expressions and use a single expr, instead you can do something like this

df_new = (
            df
            .select(
                *(col_lst + [expr(nc) for nc in new_col_list])
            )

The above code is untested, but in general a technique of creating a list of columns is common in Spark.

Answered By: Vladimir Prus

Answer 2

In the end is used .selectExpr(), which did the job.
This looks like this:

    col_lst = ["col_1", "col_2", "col_3", "col_4"]
    df_new = (
            df
            .selectExpr(
                *col_lst, 
                *newcol_lst
            )

This works like a charm.

I tested the solution of @vladimir prus, and that works as well, thanks for your input!

Answered By: Cbru

Using .expr() and arithmetics: how to add multiple (calculated) columns to dataframe within one expression

Question:

Answers: