Join two rows iteratively to create a new table in Spark, with one row for every two rows of the original table

Question:

I have a table that I want to process in ranges of two rows:

id  | col b | message
1   | abc   | hello
2   | abc   | world
3   | abc 1 | morning
4   | abc   | night
... | ...   | ...
100 | abc1  | Monday
101 | abc1  | Tuesday

How do I create the table below in Spark, which goes over the rows in ranges of two and shows the first row's id together with the second row's col b and message?

The final table will look like this:

id  | full message
1   | 01:02,abc,world
3   | 03:04,abc,night
... | ...
100 | 100:101,abc1,Tuesday
Asked By: lunbox


Answers:

With pandas, you can use:

import numpy as np  # df is the input DataFrame from the question

# One group key per pair of consecutive rows: 1, 1, 3, 3, 5, 5, ...
group = np.arange(len(df))//2*2+1

(df.astype({'id': 'str'})
   .groupby(group)
   .agg(**{'id': ('id', ':'.join),        # join the two ids with ':'
           'first': ('col b', 'first'),   # col b of the first row in the pair
           'last': ('message', 'last'),   # message of the second row in the pair
          })
   .agg(','.join, axis=1)                 # join the three parts with ','
   .reset_index(name='full message')
)

Output:

   id          full message
0   1         1:2,abc,world
1   3       3:4,abc 1,night
2   5  100:101,abc1,Tuesday
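For reference, here is a minimal, self-contained sketch of the same approach; the sample DataFrame is an assumption reconstructed from the rows shown in the question, not part of the original answer:

import numpy as np
import pandas as pd

# Sample rows taken from the question (assumed; only the rows that are shown)
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 100, 101],
    'col b': ['abc', 'abc', 'abc 1', 'abc', 'abc1', 'abc1'],
    'message': ['hello', 'world', 'morning', 'night', 'Monday', 'Tuesday'],
})

# One group key per pair of consecutive rows: 1, 1, 3, 3, 5, 5
group = np.arange(len(df)) // 2 * 2 + 1

out = (df.astype({'id': 'str'})
         .groupby(group)
         .agg(**{'id': ('id', ':'.join),
                 'first': ('col b', 'first'),
                 'last': ('message', 'last')})
         .agg(','.join, axis=1)
         .reset_index(name='full message'))

print(out)  # one row per pair; the first column holds the group key (position of the first row)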
Answered By: mozway

In PySpark you can use a Window, for example:

from pyspark.sql import Window
import pyspark.sql.functions as F

# Each row's frame spans the row itself and the next row (ordered by id).
window = Window.orderBy('id').rowsBetween(Window.currentRow, 1)

(df
 .withColumn('ids', F.concat_ws(':', F.first('id').over(window), F.last('id').over(window)))
 .withColumn('messages', F.concat_ws(',', F.first('col b').over(window), F.last('message').over(window)))
 .withColumn('full_message', F.concat_ws(',', 'ids', 'messages'))
 # keep only every other row (the first of each pair), regardless of the id values
 .withColumn('seq_id', F.row_number().over(Window.orderBy('id')))
 .filter(F.col('seq_id') % 2 != 0)
 .select('id', 'full_message')
)

Output:

id  full_message
1   1:2,abc,world
3   3:4,abc 1,night
100 100:101,abc1,Tuesday
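For completeness, a minimal sketch that runs the window approach end to end; the SparkSession setup and the sample rows are assumptions added for illustration, not part of the original answer:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Sample rows taken from the question (assumed; only the rows that are shown)
df = spark.createDataFrame(
    [(1, 'abc', 'hello'), (2, 'abc', 'world'),
     (3, 'abc 1', 'morning'), (4, 'abc', 'night'),
     (100, 'abc1', 'Monday'), (101, 'abc1', 'Tuesday')],
    ['id', 'col b', 'message'])

# Pair each row with the next one, then keep only the first row of each pair.
window = Window.orderBy('id').rowsBetween(Window.currentRow, 1)

result = (df
    .withColumn('ids', F.concat_ws(':', F.first('id').over(window), F.last('id').over(window)))
    .withColumn('messages', F.concat_ws(',', F.first('col b').over(window), F.last('message').over(window)))
    .withColumn('full_message', F.concat_ws(',', 'ids', 'messages'))
    .withColumn('seq_id', F.row_number().over(Window.orderBy('id')))
    .filter(F.col('seq_id') % 2 != 0)
    .select('id', 'full_message'))

result.show(truncate=False)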
Answered By: metravod