How to transform a big dataset into a basic metrics dataset with a date-based rollup using PySpark

Question:

I have a dataset that looks like this.

Date        Time   Stock-a  Stock-b  Stock-c
2023-01-01  10:30  10       20       30
2023-01-01  11:30  11       21       31
2023-01-02  01:30  15       19       18
2023-01-02  12:30  6        25       8

I want to convert that into a dataset that looks like this

Date        Stock Name  Mean        Stddev
2023-01-01  Stock-a     mean value  standard deviation
2023-01-01  Stock-b     mean value  standard deviation
2023-01-02  Stock-a     mean value  standard deviation

This is my code

from pyspark.sql import SparkSession

# Create spark session
spark = SparkSession.builder.getOrCreate()

data = [("2023-01-01", "10:30", 10, 20, 30), ("2023-01-01", "11:30", 11, 21, 31),
        ("2023-01-01", "13:30", 1, 2, 3), ("2023-01-01", "14:30", 110, 210, 310),
        ("2023-01-02", "01:30", 21, 21, 21), ("2023-01-02", "08:30", 11, 21, 31),
        ("2023-01-02", "11:30", 110, 210, 131), ("2023-01-03", "11:30", 10, 20, 30),
        ("2023-01-03", "12:30", 11, 21, 31), ("2023-01-03", "14:30", 8, 12, 13),
        ("2023-01-03", "15:30", 11, 21, 31)]

columns = ["Date", "Time", "Stock-a", "Stock-b", "Stock-c"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()

from pyspark.sql.functions import mean, stddev

# per-date mean and stddev for each stock column (still one wide row per Date)
columns = ["Stock-a", "Stock-b", "Stock-c"]
metrics_aggs = df.groupBy('Date').agg(
  *[mean(col).alias("mean_" + col) for col in columns],
  *[stddev(col).alias('std_' + col) for col in columns]
)
metrics_aggs.show()

Somehow I need to pivot on the stock column names so that each stock becomes its own row, with the mean and standard deviation shown as columns. Any pointers or ideas on how to solve this?

Asked By: pramodh


Answers:

You can use stack (aka "unpivot") to transform the data into a dataframe consisting of four columns: Date, Time, Stock and Value:

+----------+-----+-------+-----+
|      Date| Time|  Stock|Value|
+----------+-----+-------+-----+
|2023-01-01|10:30|Stock-a|   10|
|2023-01-01|10:30|Stock-b|   20|
|2023-01-01|10:30|Stock-c|   30|
|2023-01-01|11:30|Stock-a|   11|
|2023-01-01|11:30|Stock-b|   21|
...

Then this table can be grouped by Date and Stock to get the expected result:

from pyspark.sql import functions as F

df = ...

# ignore Date and Time when stacking
value_cols = df.columns
value_cols.remove('Date')
value_cols.remove('Time')

# prepare the parameters for stack
value_col_names = ",".join([f'"{c}", `{c}`' for c in value_cols])
expr = ['Date', 'Time', f'stack({len(value_cols)}, {value_col_names}) as (Stock, Value)']

# stack the data and group it by Date and Stock
df.selectExpr(expr) \
    .groupBy('Date', 'Stock') \
    .agg(F.mean('Value'), F.stddev('Value')) \
    .orderBy('Date', 'Stock') \
    .show()
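
For the three stock columns in the question, the list built above works out to roughly the following (expanded by hand, just to show what stack receives):

expr = [
    'Date',
    'Time',
    'stack(3, "Stock-a", `Stock-a`, "Stock-b", `Stock-b`, "Stock-c", `Stock-c`) as (Stock, Value)'
]

stack's first argument is the number of rows to produce per input row (here one per stock column), followed by alternating label/column pairs; that is what yields the long Date/Time/Stock/Value table shown above.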

Result:

+----------+-------+------------------+------------------+
|      Date|  Stock|        avg(Value)|stddev_samp(Value)|
+----------+-------+------------------+------------------+
|2023-01-01|Stock-a|              33.0|51.529926579933466|
|2023-01-01|Stock-b|             63.25| 98.22211224227127|
|2023-01-01|Stock-c|              93.5|144.91491756659607|
|2023-01-02|Stock-a|47.333333333333336| 54.50076452063157|
|2023-01-02|Stock-b|              84.0|109.11920087683927|
...
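
A note for newer Spark versions: since Spark 3.4 the DataFrame API has a built-in unpivot (also exposed as melt) that builds the same long table without hand-writing a stack expression. A minimal sketch, assuming the same df and the F alias from above:

long_df = df.unpivot(
    ids=["Date", "Time"],
    values=["Stock-a", "Stock-b", "Stock-c"],
    variableColumnName="Stock",
    valueColumnName="Value",
)

long_df.groupBy("Date", "Stock") \
    .agg(F.mean("Value").alias("Mean"), F.stddev("Value").alias("Stddev")) \
    .orderBy("Date", "Stock") \
    .show()

The grouping and aggregation step is identical to the selectExpr version; only the reshaping call changes.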
Answered By: werner