How can I add to a date in a loop 13 times using PySpark?
Question:
Please help me solve this issue, as I am still new to Python/PySpark.
I want to loop 13 times, adding multiples of 7 days to a date, in the same column.
I have a master table like this:

id | date |
---|---|
1 | 2019-02-21 10:00:00 |
2 | 2019-02-27 09:50:33 |

etc.
I created a script, but it is wrong: it created a new column, whereas I want the results as new rows.
I want this kind of result:

id | x | date |
---|---|---|
1 | 7 | 2019-02-21 |
1 | 14 | 2019-02-28 |
1 | 21 | 2019-03-14 |
1 | 28 | 2019-03-21 |
1 | 35 | 2019-03-28 |
1 | 42 | 2019-04-04 |
1 | 49 | 2019-04-11 |
1 | 56 | 2019-04-18 |
1 | 63 | 2019-04-25 |
1 | 70 | 2019-05-02 |
1 | 77 | 2019-05-09 |
1 | 84 | 2019-05-16 |
1 | 91 | 2019-05-16 |

etc.
Answers:
Use the `sequence` function and then `explode`.
Example:
from pyspark.sql.functions import col, explode, expr, to_date

df = spark.createDataFrame([('1', '2019-02-21 10:00:00'), ('2', '2019-02-27 09:50:33')], ['id', 'date']) \
    .withColumn("seq", expr("sequence(0, 91, 7)")) \
    .withColumn("date", to_date(col("date")))
df = df.select("id", "date", explode(col("seq")))
df.withColumn("date", expr("date_add(date, col)")).show(200, False)
#+---+----------+---+
#|id |date |col|
#+---+----------+---+
#|1 |2019-02-21|0 |
#|1 |2019-02-28|7 |
#|1 |2019-03-07|14 |
#|1 |2019-03-14|21 |
#|1 |2019-03-21|28 |
#|1 |2019-03-28|35 |
#|1 |2019-04-04|42 |
#|1 |2019-04-11|49 |
#|1 |2019-04-18|56 |
#|1 |2019-04-25|63 |
#|1 |2019-05-02|70 |
#|1 |2019-05-09|77 |
#|1 |2019-05-16|84 |
#|1 |2019-05-23|91 |
#|2 |2019-02-27|0 |
#|2 |2019-03-06|7 |
#|2 |2019-03-13|14 |
#|2 |2019-03-20|21 |
#|2 |2019-03-27|28 |
#|2 |2019-04-03|35 |
#|2 |2019-04-10|42 |
#|2 |2019-04-17|49 |
#|2 |2019-04-24|56 |
#|2 |2019-05-01|63 |
#|2 |2019-05-08|70 |
#|2 |2019-05-15|77 |
#|2 |2019-05-22|84 |
#|2 |2019-05-29|91 |
#+---+----------+---+
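As a sanity check on the date arithmetic, the same 13 weekly offsets can be reproduced in plain Python with `datetime` (a minimal sketch outside Spark, using the first id's timestamp). Note that if you want the offsets to start at 7 rather than 0, as in the desired output above, you would use `sequence(7, 91, 7)` in the Spark version.

```python
from datetime import datetime, timedelta

# Parse the first id's timestamp and keep only the date part.
start = datetime.strptime("2019-02-21 10:00:00", "%Y-%m-%d %H:%M:%S").date()

# 13 offsets in multiples of 7 days: x = 7, 14, ..., 91.
rows = [(7 * k, start + timedelta(days=7 * k)) for k in range(1, 14)]

for x, d in rows:
    print(x, d.isoformat())  # "7 2019-02-28" through "91 2019-05-23"
```

This matches rows 2 through 14 of the answer's output for id 1, confirming that `date_add` with a step of 7 lands on the same dates.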