PySpark row-wise function composition

Question:

As a simplified example, I have a dataframe “df” with columns “col1” and “col2”, and I want to compute a row-wise maximum after applying a function to each column:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def f(x):
    return (x + 1)

max_udf = udf(lambda x, y: max(x, y), IntegerType())
f_udf = udf(f, IntegerType())

df2 = df.withColumn("result", max_udf(f_udf(df.col1), f_udf(df.col2)))

So if df:

col1   col2
1      2
3      0

Then

df2:

col1   col2  result
1      2     3
3      0     4

The above doesn’t seem to work and produces “Cannot evaluate expression: PythonUDF#f…”

I’m absolutely positive “f_udf” works just fine on my table, and the main issue is with the max_udf.

Without creating extra columns or using basic map/reduce, is there a way to do the above entirely using dataframes and udfs? How should I modify “max_udf”?

I’ve also tried:

max_udf=udf(max, IntegerType())

which produces the same error.

I’ve also confirmed that the following works:

df2 = (df.withColumn("temp1", f_udf(df.col1))
         .withColumn("temp2", f_udf(df.col2)))

df2 = df2.withColumn("result", max_udf(df2.temp1, df2.temp2))

Why is it that I can’t do these in one go?

I would like to see an answer that generalizes to any function “f_udf” and “max_udf.”

Asked By: Alex R.


Answers:

UserDefinedFunction raises an error when a UDF is passed as an argument to another UDF.

You can modify max_udf as below to make it work.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = sc.parallelize([(1, 2), (3, 0)]).toDF(["col1", "col2"])

max_udf = udf(lambda x, y: max(x + 1, y + 1), IntegerType())

df2 = df.withColumn("result", max_udf(df.col1, df.col2))

Or

def f_udf(x):
    return (x + 1)

max_udf = udf(lambda x, y: max(x, y), IntegerType())
## f_udf=udf(f, IntegerType())

df2 = df.withColumn("result", max_udf(f_udf(df.col1), f_udf(df.col2)))

Note:

The second approach is valid if and only if the internal function (here f_udf) generates valid SQL expressions.

It works here because f_udf(df.col1) and f_udf(df.col2) are evaluated as Column<b'(col1 + 1)'> and Column<b'(col2 + 1)'> respectively before being passed to max_udf. It wouldn't work with an arbitrary function.

It wouldn't work if we tried, for example, something like this:

from math import exp

df.withColumn("result", max_udf(exp(df.col1), exp(df.col2)))
Answered By: Mohan

I had a similar problem and found the solution in the answer to this Stack Overflow question.

To pass multiple columns or a whole row to a UDF, use a struct:

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))

count_empty_columns = udf(lambda row: len([x for x in row if x is None]), IntegerType())

new_df = df.withColumn("null_count", count_empty_columns(struct([df[x] for x in df.columns])))

new_df.show()

returns:

+----+----+----------+
|   a|   b|null_count|
+----+----+----------+
|null|null|         2|
|   1|null|         1|
|null|   2|         1|
+----+----+----------+
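
Applied to the original question, the same struct trick lets a single UDF see both columns at once. A sketch (not from the original answer), assuming the asker's df with integer columns col1 and col2:

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType

# Sketch only: pass both columns as one struct and compute max(f(col1), f(col2)) inside the UDF
max_f_udf = udf(lambda row: max(row["col1"] + 1, row["col2"] + 1), IntegerType())

df2 = df.withColumn("result", max_f_udf(struct(df["col1"], df["col2"])))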
Answered By: Christoph Hösler

Below is a helper made to create any new column by simply calling a top-level business rule, keeping it completely isolated from the heavy technical Spark machinery (no need to spend money or depend on Databricks libraries anymore).
My advice: in your organization, try to do things simply and cleanly, for the benefit of top-level data users:

def createColumnFromRule(df, columnName, ruleClass, ruleName, inputColumns=None, inputValues=None, columnType=None):
    from pyspark.sql import functions as F
    from pyspark.sql import types as T
    def _getSparkClassType(shortType):
        defaultSparkClassType = "StringType"
        typesMapping = {
            "bigint"    : "LongType",
            "binary"    : "BinaryType",
            "boolean"   : "BooleanType",
            "byte"      : "ByteType",
            "date"      : "DateType",
            "decimal"   : "DecimalType",
            "double"    : "DoubleType",
            "float"     : "FloatType",
            "int"       : "IntegerType",
            "integer"   : "IntegerType",
            "long"      : "LongType",
            "numeric"   : "NumericType",
            "string"    : defaultSparkClassType,
            "timestamp" : "TimestampType"
        }
        sparkClassType = None
        try:
            sparkClassType = typesMapping[shortType]
        except KeyError:
            sparkClassType = defaultSparkClassType
        return sparkClassType
    if (columnType != None): sparkClassType = _getSparkClassType(columnType)
    else: sparkClassType = "StringType"
    aUdf = eval("F.udf(ruleClass." + ruleName + ", T." + sparkClassType + "())")
    columns = None
    values = None
    if (inputColumns != None): columns = F.struct([df[column] for column in inputColumns])
    if (inputValues != None): values = F.struct([F.lit(value) for value in inputValues])
    # Call the rule
    if (inputColumns != None and inputValues != None): df = df.withColumn(columnName, aUdf(columns, values))
    elif (inputColumns != None): df = df.withColumn(columnName, aUdf(columns, F.lit(None)))
    elif (inputValues != None): df = df.withColumn(columnName, aUdf(F.lit(None), values))
    # Create a Null column otherwise
    else:
        if (columnType != None):
            df = df.withColumn(columnName, F.lit(None).cast(columnType))
        else:
            df = df.withColumn(columnName, F.lit(None))
    # Return the resulting dataframe
    return df

Usage example:

# Define your business rule (you can get columns and values)
class CustomerRisk:
    def churnRisk(self, columns=None, values=None):
        isChurnRisk = False
        # ... Rule implementation starts here
        if (values != None):
            if (values[0] == "FORCE_CHURN=true"): isChurnRisk = True
        if (isChurnRisk == False and columns != None):
            if (columns["AGE"]) <= 25): isChurnRisk = True
        # ...
        return isChurnRisk

# Execute the rule; it creates your new column in one line of code. That's all, easy isn't it?
# And look at how columns and values are passed, it's really easy!
df = createColumnFromRule(df, columnName="CHURN_RISK", ruleClass=CustomerRisk(), ruleName="churnRisk", columnType="boolean", inputColumns=["NAME", "AGE", "ADDRESS"], inputValues=["FORCE_CHURN=true", "CHURN_RISK=100%"])
Answered By: prossblad

The best way to handle this is to escape the pyspark.sql.DataFrame representation and use pyspark.RDDs via pyspark.sql.Row.asDict() and [pyspark.RDD.map()](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.map.html#pyspark.RDD.map).

import typing

# Save yourself some pain and always import these things: functions as F and types as T
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import Row, SparkSession, SQLContext


spark = (
    SparkSession.builder.appName("Stack Overflow Example")
    .getOrCreate()
)
sc = spark.sparkContext

# sqlContext is needed sometimes to create DataFrames from RDDs
sqlContext = SQLContext(sc)

df = sc.parallelize([Row(**{"a": "hello", "b": 1, "c": 2}), Row(**{"a": "goodbye", "b": 2, "c": 1})]).toDF(["a", "b", "c"])


def to_string(record:dict) -> Row:
    """Create a readable string representation of the record"""
    
    record["readable"] = f'Word: {record["a"]} A: {record["b"]} B: {record["c"]}'
    return Row(**record)


# Apply the function with a map after converting the Row to a dict
readable_rdd = df.rdd.map(lambda x: x.asDict()).map(to_string)

# Test the function without running the entire DataFrame through it
print(readable_rdd.first())

# This results in: Row(a='hello', b=1, c=2, readable='Word: hello A: 1 B: 2')

# Sometimes you can use `toDF()` to get a dataframe
readable_df = readable_rdd.toDF()

readable_df.show()

# +-------+---+---+--------------------+
# |      a|  b|  c|            readable|
# +-------+---+---+--------------------+
# |  hello|  1|  2|Word: hello A: 1 ...|
# |goodbye|  2|  1|Word: goodbye A: ...|
# +-------+---+---+--------------------+

# Sometimes you have to use createDataFrame with a specified schema
schema = T.StructType(
    [
        T.StructField("a", T.StringType(), True),
        T.StructField("b", T.IntegerType(), True),
        T.StructField("c", T.StringType(), True),
        T.StructField("readable", T.StringType(), True),
    ]
)

# This is more reliable, you should use it in production!
readable_df = sqlContext.createDataFrame(readable_rdd, schema)

readable_df.show()

# +-------+---+---+--------------------+
# |      a|  b|  c|            readable|
# +-------+---+---+--------------------+
# |  hello|  1|  2|Word: hello A: 1 ...|
# |goodbye|  2|  1|Word: goodbye A: ...|
# +-------+---+---+--------------------+

Sometimes RDD.map() functions can’t use certain Python libraries because the mappers get serialized, so you need to partition the data into enough partitions to occupy all the cores of the cluster and then use pyspark.RDD.mapPartitions() to process an entire partition (just an Iterable of dicts) at a time. This enables you to instantiate an expensive object once – like a spaCy Language model – and apply it to one record at a time without recreating it.

def to_string_partition(partition:typing.Iterable[dict]) -> typing.Iterable[Row]:
    """Add a readable string form to an entire partition"""
    # Instantiate expensive objects here
    
    # Apply these objects' methods here
    for record in partition:
        record["readable"] = f'Word: {record["a"]} A: {record["b"]} B: {record["c"]}'
        yield Row(**record)


readable_rdd = df.rdd.map(lambda x: x.asDict()).mapPartitions(to_string_partition)

print(readable_rdd.first())

# Row(a='hello', b=1, c=2, readable='Word: hello A: 1 B: 2')

# mapPartitions are more likely to require a specified schema
schema = T.StructType(
    [
        T.StructField("a", T.StringType(), True),
        T.StructField("b", T.IntegerType(), True),
        T.StructField("c", T.StringType(), True),
        T.StructField("readable", T.StringType(), True),
    ]
)

# This is more reliable, you should use it in production!
readable_df = sqlContext.createDataFrame(readable_rdd, schema)

readable_df.show()

# +-------+---+---+--------------------+
# |      a|  b|  c|            readable|
# +-------+---+---+--------------------+
# |  hello|  1|  2|Word: hello A: 1 ...|
# |goodbye|  2|  1|Word: goodbye A: ...|
# +-------+---+---+--------------------+

The DataFrame APIs are good because they make SQL-like operations faster, but sometimes you need the power of direct Python without limitations, and it will greatly benefit your analytics practice to learn to employ RDDs. For example, you can group records and then evaluate the entire group in RAM, as long as it fits – which you can arrange by altering the partition key and by limiting workers or increasing their RAM.

import numpy as np


def median_b(x):
    """Process a group and determine the median value"""
    
    key = x[0]
    values = x[1]
    
    # Get the median value
    m = np.median([record["b"] for record in values])

    # Return a Row of the median for each group
    return Row(**{"a": key, "median_b": m})


median_b_rdd = df.rdd.map(lambda x: x.asDict()).groupBy(lambda x: x["a"]).map(median_b)
median_b_rdd.first()

# Row(a='hello', median_b=1.0)
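
To get the grouped result back into a DataFrame, the same toDF() / createDataFrame() patterns shown above apply. A sketch, reusing median_b_rdd from above (numpy scalars may need casting to plain Python floats, depending on your PySpark version):

# Sketch: cast the numpy median to a plain float, then let Spark infer the schema
median_b_df = median_b_rdd.map(lambda r: Row(a=r["a"], median_b=float(r["median_b"]))).toDF()

median_b_df.show()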
Answered By: rjurney