How to transform a big dataset into a basic metrics dataset with date-based rollups using PySpark
Question:
I have a dataset that looks like this:
Date | Time | Stock-a | Stock-b | Stock-c |
---|---|---|---|---|
2023-01-01 | 10:30 | 10 | 20 | 30 |
2023-01-01 | 11:30 | 11 | 21 | 31 |
2023-01-02 | 01:30 | 15 | 19 | 18 |
2023-01-02 | 12:30 | 6 | 25 | 8 |
I want to convert that into a dataset that looks like this:
Date | Stock Name | Mean | Stddev |
---|---|---|---|
2023-01-01 | Stock-a | mean value | standard deviation |
2023-01-01 | Stock-b | mean value | standard deviation |
2023-01-02 | Stock-a | mean value | standard deviation |
This is my code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean, stddev

# Create spark session
spark = SparkSession.builder.getOrCreate()

data = [("2023-01-01", "10:30", 10, 20, 30), ("2023-01-01", "11:30", 11, 21, 31),
        ("2023-01-01", "13:30", 1, 2, 3), ("2023-01-01", "14:30", 110, 210, 310),
        ("2023-01-02", "01:30", 21, 21, 21), ("2023-01-02", "08:30", 11, 21, 31),
        ("2023-01-02", "11:30", 110, 210, 131), ("2023-01-03", "11:30", 10, 20, 30),
        ("2023-01-03", "12:30", 11, 21, 31), ("2023-01-03", "14:30", 8, 12, 13),
        ("2023-01-03", "15:30", 11, 21, 31)]
columns = ["Date", "Time", "Stock-a", "Stock-b", "Stock-c"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()

# compute mean and stddev per date, one wide column per stock
stock_cols = ["Stock-a", "Stock-b", "Stock-c"]
metrics_aggs = df.groupBy('Date').agg(
    *[mean(c).alias("mean_" + c) for c in stock_cols],
    *[stddev(c).alias('std_' + c) for c in stock_cols]
)
metrics_aggs.show()
Somehow I need to pivot on the stock column names and show the mean and standard deviation values as columns.
Any pointers or ideas on how to solve this?
Answers:
You can use stack (aka "unpivot") to transform the data into a dataframe consisting of the four columns Date, Time, Stock and Value:
+----------+-----+-------+-----+
| Date| Time| Stock|Value|
+----------+-----+-------+-----+
|2023-01-01|10:30|Stock-a| 10|
|2023-01-01|10:30|Stock-b| 20|
|2023-01-01|10:30|Stock-c| 30|
|2023-01-01|11:30|Stock-a| 11|
|2023-01-01|11:30|Stock-b| 21|
...
Then this table can be grouped by Date and Stock to get the expected result:
from pyspark.sql import functions as F

df = ...

# ignore Date and Time when stacking
value_cols = df.columns
value_cols.remove('Date')
value_cols.remove('Time')

# prepare the parameters for stack: each value column contributes
# a ("column name", column value) pair to the stack expression
value_col_names = ",".join([f'"{c}", `{c}`' for c in value_cols])
expr = ['Date', 'Time', f'stack({len(value_cols)}, {value_col_names}) as (Stock, Value)']

# stack the data and group it by Date and Stock
df.selectExpr(expr) \
    .groupBy('Date', 'Stock') \
    .agg(F.mean('Value'), F.stddev('Value')) \
    .orderBy('Date', 'Stock') \
    .show()
Result:
+----------+-------+------------------+------------------+
| Date| Stock| avg(Value)|stddev_samp(Value)|
+----------+-------+------------------+------------------+
|2023-01-01|Stock-a| 33.0|51.529926579933466|
|2023-01-01|Stock-b| 63.25| 98.22211224227127|
|2023-01-01|Stock-c| 93.5|144.91491756659607|
|2023-01-02|Stock-a|47.333333333333336| 54.50076452063157|
|2023-01-02|Stock-b| 84.0|109.11920087683927|
...
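The avg(Value) and stddev_samp(Value) headers above are just Spark's default aggregate names. If you want the exact Stock Name / Mean / Stddev headers from your desired output, alias the aggregates and rename the stack column. A minimal variant of the grouping step, reusing the df and expr defined above:
# Same grouping as above, but with column names matching the desired output.
result = (df.selectExpr(expr)
    .groupBy('Date', 'Stock')
    .agg(F.mean('Value').alias('Mean'),
         F.stddev('Value').alias('Stddev'))
    .withColumnRenamed('Stock', 'Stock Name')
    .orderBy('Date', 'Stock Name'))
result.show()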
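As an aside: if you are on Spark 3.4 or newer, DataFrame.unpivot (also exposed as melt) does the same reshaping without hand-building the stack() expression string. A sketch, assuming the same df as above:
from pyspark.sql import functions as F

# Spark 3.4+: melt the stock columns into (Stock, Value) rows.
long_df = df.unpivot(
    ids=['Date', 'Time'],                      # columns kept as identifiers
    values=['Stock-a', 'Stock-b', 'Stock-c'],  # columns turned into rows
    variableColumnName='Stock',
    valueColumnName='Value',
)

long_df.groupBy('Date', 'Stock') \
    .agg(F.mean('Value').alias('Mean'), F.stddev('Value').alias('Stddev')) \
    .orderBy('Date', 'Stock') \
    .show()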