How to transform a big dataset into a basic metrics dataset with date-based rollups using PySpark
Question:
I have a dataset that looks like this:
Date | Time | Stock-a | Stock-b | Stock-c |
---|---|---|---|---|
2023-01-01 | 10:30 | 10 | 20 | 30 |
2023-01-01 | 11:30 | 11 | 21 | 31 |
2023-01-02 | 01:30 | 15 | 19 | 18 |
2023-01-02 | 12:30 | 6 | 25 | 8 |
I want to convert that into a dataset that looks like this:
Date | Stock Name | Mean | Stddev |
---|---|---|---|
2023-01-01 | Stock-a | mean value | standard deviation |
2023-01-01 | Stock-b | mean value | standard deviation |
2023-01-02 | Stock-a | mean value | standard deviation |
This is my code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean, stddev

# Create spark session
spark = SparkSession.builder.getOrCreate()

data = [("2023-01-01", "10:30", 10, 20, 30), ("2023-01-01", "11:30", 11, 21, 31),
        ("2023-01-01", "13:30", 1, 2, 3), ("2023-01-01", "14:30", 110, 210, 310),
        ("2023-01-02", "01:30", 21, 21, 21), ("2023-01-02", "08:30", 11, 21, 31),
        ("2023-01-02", "11:30", 110, 210, 131), ("2023-01-03", "11:30", 10, 20, 30),
        ("2023-01-03", "12:30", 11, 21, 31), ("2023-01-03", "14:30", 8, 12, 13),
        ("2023-01-03", "15:30", 11, 21, 31)]
columns = ["Date", "Time", "Stock-a", "Stock-b", "Stock-c"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()

# compute mean and stddev per date, one wide column per stock
stock_cols = ["Stock-a", "Stock-b", "Stock-c"]
metrics_aggs = df.groupBy('Date').agg(
    *[mean(c).alias("mean_" + c) for c in stock_cols],
    *[stddev(c).alias('std_' + c) for c in stock_cols]
)
metrics_aggs.show()
Somehow I need to pivot on the stock column names and show the mean and standard deviation values as columns.
Any pointers or ideas on how to solve this?
Answers:
You can use stack (aka "unpivot") to transform the data into a dataframe consisting of the four columns Date, Time, Stock and Value:
+----------+-----+-------+-----+
| Date| Time| Stock|Value|
+----------+-----+-------+-----+
|2023-01-01|10:30|Stock-a| 10|
|2023-01-01|10:30|Stock-b| 20|
|2023-01-01|10:30|Stock-c| 30|
|2023-01-01|11:30|Stock-a| 11|
|2023-01-01|11:30|Stock-b| 21|
...
Then this table can be grouped by Date and Stock to get the expected result:
from pyspark.sql import functions as F

df = ...

# ignore Date and Time when stacking
value_cols = df.columns
value_cols.remove('Date')
value_cols.remove('Time')

# prepare the parameters for stack: each value column contributes
# a ("column name", column value) pair to the stack expression
value_col_names = ",".join([f'"{c}", `{c}`' for c in value_cols])
expr = ['Date', 'Time', f'stack({len(value_cols)}, {value_col_names}) as (Stock, Value)']

# stack the data and group it by Date and Stock
df.selectExpr(expr) \
    .groupBy('Date', 'Stock') \
    .agg(F.mean('Value'), F.stddev('Value')) \
    .orderBy('Date', 'Stock') \
    .show()
Result:
+----------+-------+------------------+------------------+
| Date| Stock| avg(Value)|stddev_samp(Value)|
+----------+-------+------------------+------------------+
|2023-01-01|Stock-a| 33.0|51.529926579933466|
|2023-01-01|Stock-b| 63.25| 98.22211224227127|
|2023-01-01|Stock-c| 93.5|144.91491756659607|
|2023-01-02|Stock-a|47.333333333333336| 54.50076452063157|
|2023-01-02|Stock-b| 84.0|109.11920087683927|
...
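The avg(Value) and stddev_samp(Value) headers above are just Spark's default aggregate names. If you want the exact Stock Name / Mean / Stddev headers from your desired output, alias the aggregates and rename the stack column. A minimal variant of the grouping step, reusing the df and expr defined above:
# Same grouping as above, but with column names matching the desired output.
result = (df.selectExpr(expr)
    .groupBy('Date', 'Stock')
    .agg(F.mean('Value').alias('Mean'),
         F.stddev('Value').alias('Stddev'))
    .withColumnRenamed('Stock', 'Stock Name')
    .orderBy('Date', 'Stock Name'))
result.show()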
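As an aside: if you are on Spark 3.4 or newer, DataFrame.unpivot (also exposed as melt) does the same reshaping without hand-building the stack() expression string. A sketch, assuming the same df as above:
from pyspark.sql import functions as F

# Spark 3.4+: melt the stock columns into (Stock, Value) rows.
long_df = df.unpivot(
    ids=['Date', 'Time'],                      # columns kept as identifiers
    values=['Stock-a', 'Stock-b', 'Stock-c'],  # columns turned into rows
    variableColumnName='Stock',
    valueColumnName='Value',
)

long_df.groupBy('Date', 'Stock') \
    .agg(F.mean('Value').alias('Mean'), F.stddev('Value').alias('Stddev')) \
    .orderBy('Date', 'Stock') \
    .show()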