Why is agg() in PySpark only able to summarize one column of a DataFrame at a time?


For the DataFrame below:

df = spark.createDataFrame(data=[('Alice',4.300),('Bob',7.677)], schema=['name','High'])

When I try to find both the min and the max, I only get the min value in the output.

|min(High)  |
|    2094900|

Why can’t agg() give both max & min like in Pandas?

Asked By: GeorgeOfTheRF



As the documentation for agg() states:


Compute aggregates and returns the result as a DataFrame.

The available aggregate functions are avg, max, min, sum, count.

If exprs is a single dict mapping from string to string, then the key is the column to perform aggregation on, and the value is the aggregate function.

Alternatively, exprs can also be a list of aggregate Column expressions.

Parameters: exprs – a dict mapping from column name (string) to aggregate functions (string), or a list of Column.
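
The call that produced the single min(High) column is not shown in the question, so the following is only a guess at the cause. Assuming the dict form was used with the same column name as the key twice (the dict literal below is a hypothetical reconstruction), plain Python already collapses the duplicate key before agg() ever sees it:

```python
# Hypothetical reconstruction of the dict argument; the original call
# is not shown in the question.
exprs = {'High': 'max', 'High': 'min'}  # duplicate key: the last value wins

print(exprs)       # {'High': 'min'}
print(len(exprs))  # 1
```

If that is what happened, agg() received a single entry, 'High' -> 'min'. A dict can hold only one function per column name, so several aggregates of the same column cannot be expressed in the dict form.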

You can pass a list of Column expressions and apply whichever functions you need to each column, like this:

>>> from pyspark.sql import functions as F

>>> df.agg(F.min(df.High),F.max(df.High),F.avg(df.High),F.sum(df.High)).show()
|min(High)|max(High)|avg(High)|sum(High)|
|      4.3|    7.677|   5.9885|   11.977|
Answered By: titiro89