Join two rows iteratively to create a new table in Spark, with one row for each two rows of the original
Question:
I have a table where I want to go in ranges of two rows:
id  | col b | message
1   | abc   | hello
2   | abc   | world
3   | abc 1 | morning
4   | abc   | night
... | ...   | ...
100 | abc1  | Monday
101 | abc1  | Tuesday
How do I create the table below in Spark, going in ranges of two and showing the first row's id with the second row's col b and message? The final table will look like this:
id  | full message
1   | 01:02,abc,world
3   | 03:04,abc,night
... | ...
100 | 100:101,abc1,Tuesday
Answers:
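For reproducibility, the pandas snippet below assumes a small frame holding just the six sample rows (a hypothetical reconstruction of the asker's data):

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 100, 101],
                   'col b': ['abc', 'abc', 'abc 1', 'abc', 'abc1', 'abc1'],
                   'message': ['hello', 'world', 'morning', 'night', 'Monday', 'Tuesday']})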
With pandas, you can use:
import numpy as np

# label consecutive pairs of rows with the same key: 1, 1, 3, 3, 5, 5, ...
group = np.arange(len(df)) // 2 * 2 + 1

(df.astype({'id': 'str'})
   .groupby(group)
   .agg(**{'id': ('id', ':'.join),        # "first_id:second_id"
           'first': ('col b', 'first'),   # col b of the first row in the pair
           'last': ('message', 'last'),   # message of the second row
          })
   .agg(','.join, axis=1)                 # glue the three parts together with commas
   .rename_axis('id')                     # name the group-label index so reset_index exposes it as "id"
   .reset_index(name='full message')
)
Output:
   id          full message
0   1         1:2,abc,world
1   3       3:4,abc 1,night
2   5  100:101,abc1,Tuesday

(The 1, 3, 5 in the id column are positional group labels rather than values from the original id column, which is why the last pair shows 5 and not 100.)
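The PySpark version below assumes the equivalent Spark DataFrame (a sketch; spark is an existing SparkSession):

df = spark.createDataFrame(
    [(1, 'abc', 'hello'), (2, 'abc', 'world'), (3, 'abc 1', 'morning'),
     (4, 'abc', 'night'), (100, 'abc1', 'Monday'), (101, 'abc1', 'Tuesday')],
    ['id', 'col b', 'message'])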
In PySpark you can use a Window, for example:
from pyspark.sql import Window
import pyspark.sql.functions as F

# each window covers the current row and the one after it
# (an ordered window without partitionBy pulls all rows onto a single partition)
window = Window.orderBy('id').rowsBetween(Window.currentRow, 1)

(df
 .withColumn('ids', F.concat_ws(':', F.first('id').over(window), F.last('id').over(window)))
 .withColumn('messages', F.concat_ws(',', F.first('col b').over(window), F.last('message').over(window)))
 .withColumn('full_message', F.concat_ws(',', 'ids', 'messages'))
 # keep only the first row of each pair, regardless of the id values
 .withColumn('seq_id', F.row_number().over(Window.orderBy('id')))
 .filter(F.col('seq_id') % 2 != 0)
 .select('id', 'full_message')
)
Output:
id   full_message
1    1:2,abc,world
3    3:4,abc 1,night
100  100:101,abc1,Tuesday
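Two small gaps remain between this output and the question's expected table: the expected table zero-pads single-digit ids ("01:02" rather than "1:2"), and it takes col b from the second row of each pair ("abc" rather than "abc 1" for the 3:4 pair). If either matters, both are one-line changes to the snippet above, sketched here with F.format_string handling the padding:

.withColumn('ids', F.concat_ws(':',
                               F.format_string('%02d', F.first('id').over(window)),
                               F.format_string('%02d', F.last('id').over(window))))
.withColumn('messages', F.concat_ws(',', F.last('col b').over(window), F.last('message').over(window)))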