Using monotonically_increasing_id() to assign row numbers to a PySpark dataframe
Question:
I am using monotonically_increasing_id() to assign row numbers to a PySpark dataframe with the syntax below:
df1 = df1.withColumn("idx", monotonically_increasing_id())
Now df1 has 26,572,528 records, so I was expecting idx values from 0 to 26,572,527.
But when I select max(idx), its value is strangely huge: 335,008,054,165.
What's going on with this function?
Is it reliable to use this function for merging with another dataset having a similar number of records?
I have some 300 dataframes which I want to combine into a single dataframe. One dataframe contains IDs and the others contain different records corresponding to them row-wise.
Answers:
Edit: Full examples of the ways to do this and the risks can be found here
From the documentation
A column that generates monotonically increasing 64-bit integers.
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
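Per the quoted layout, an id can be decoded by splitting off the lower 33 bits. A minimal sketch in plain Python (hypothetical helper names, assuming exactly the documented 31/33-bit split):

```python
# Helpers mirroring the documented bit layout:
# upper 31 bits = partition ID, lower 33 bits = record number within the partition.
PARTITION_BITS = 33

def make_id(partition_id, record_number):
    return (partition_id << PARTITION_BITS) | record_number

def split_id(generated_id):
    return (generated_id >> PARTITION_BITS,
            generated_id & ((1 << PARTITION_BITS) - 1))

# The questioner's max idx decodes to partition 39, record 605,077 --
# which is why it far exceeds the 26.5M row count.
print(split_id(335008054165))  # (39, 605077)
```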
Thus, it is not like an auto-increment ID in RDBs, and it is not reliable for merging.
If you need auto-increment behavior like in RDBs and your data is sortable, then you can use row_number:
df.createOrReplaceTempView('df')
spark.sql('select row_number() over (order by some_column) as num, * from df')
+---+-----------+
|num|some_column|
+---+-----------+
| 1| ....... |
| 2| ....... |
| 3| ..........|
+---+-----------+
If your data is not sortable and you don’t mind using rdds to create the indexes and then fall back to dataframes, you can use rdd.zipWithIndex()
An example can be found here
In short:
# since you have a dataframe, use the rdd interface to create indexes with zipWithIndex()
df = df.rdd.zipWithIndex()
# return back to dataframe
df = df.toDF()
df.show()
# your data | indexes
+---------------------+---+
| _1 | _2|
+---------------------+---+
|[data col1,data col2]| 0|
|[data col1,data col2]| 1|
|[data col1,data col2]| 2|
+---------------------+---+
You will probably need some more transformations after that to get your dataframe to what you need it to be. Note: not a very performant solution.
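The follow-up transformation usually just flattens each (row, index) tuple into one flat record. Locally, zipWithIndex behaves like Python's enumerate with the pair swapped, so the reshaping can be sketched without a Spark session (toy data, hypothetical column values):

```python
rows = [("data col1", "data col2"), ("other col1", "other col2")]

# rdd.zipWithIndex() pairs each row with its index as (row, idx)
indexed = [(row, idx) for idx, row in enumerate(rows)]

# Flatten (row, idx) back into one record per row: (*row, idx)
flat = [row + (idx,) for row, idx in indexed]
print(flat)  # [('data col1', 'data col2', 0), ('other col1', 'other col2', 1)]
```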
Hope this helps. Good luck!
Edit: Come to think of it, you can combine monotonically_increasing_id with row_number:
# create a monotonically increasing id
df = df.withColumn("idx", monotonically_increasing_id())
# then since the id is increasing but not consecutive, it means you can sort by it, so you can use the `row_number`
df.createOrReplaceTempView('df')
new_df = spark.sql('select row_number() over (order by idx) as num, * from df')
Not sure about performance though.
Using the API functions, you can simply do the following:
from pyspark.sql.window import Window as W
from pyspark.sql import functions as F
df1 = df1.withColumn("idx", F.monotonically_increasing_id())
windowSpec = W.orderBy("idx")
df1 = df1.withColumn("idx", F.row_number().over(windowSpec))
df1.show()
I hope the answer is helpful.
I found the solution by @mkaran useful, but in my case there was no ordering column to use with the window function. I wanted to maintain the order of the rows of the dataframe as their indexes (what you would see in a pandas dataframe). Hence the solution in the edit section came of use. Since it is a good solution (if performance is not a concern), I would like to share it as a separate answer.
from pyspark.sql import functions as F
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.window import Window

df_index = sdf_drop.withColumn("idx", monotonically_increasing_id())
# Create the window specification
w = Window.orderBy("idx")
# Use row number with the window specification
df_index = df_index.withColumn("index", F.row_number().over(w))
# Drop the created increasing-id column
df2_index = df_index.drop("idx")
Here sdf_drop is your original dataframe and df2_index is the new dataframe with an index column.
To merge dataframes of the same size, use zip on RDDs:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.master("local").getOrCreate()
df1 = spark.sparkContext.parallelize([(1, "a"), (2, "b"), (3, "c")]).toDF(["id", "name"])
df2 = spark.sparkContext.parallelize([(7, "x"), (8, "y"), (9, "z")]).toDF(["age", "address"])
schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda rows: rows[0] + rows[1])
spark.createDataFrame(df1df2, schema).show()
But note the following from the method's help:
Assumes that the two RDDs have the same number of partitions and the same
number of elements in each partition (e.g. one was made through
a map on the other).
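That precondition can be modelled locally: think of each RDD as a list of partitions (lists of rows). zip only lines up correctly when both layouts match exactly (a toy model in plain Python, not the Spark API):

```python
# Two "RDDs" with identical partition layouts: 2 partitions, of sizes 1 and 2
rdd1 = [[(1, "a")], [(2, "b"), (3, "c")]]
rdd2 = [[(7, "x")], [(8, "y"), (9, "z")]]

# zip is only well-defined when partition counts and per-partition sizes match
assert len(rdd1) == len(rdd2)
assert all(len(p) == len(q) for p, q in zip(rdd1, rdd2))

# Pair up partitions, then concatenate row tuples element-wise
zipped = [[a + b for a, b in zip(p, q)] for p, q in zip(rdd1, rdd2)]
print(zipped)  # [[(1, 'a', 7, 'x')], [(2, 'b', 8, 'y'), (3, 'c', 9, 'z')]]
```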
Building on @mkaran's answer:
df.coalesce(1).withColumn("idx", monotonically_increasing_id())
Using .coalesce(1) puts the DataFrame into a single partition, so the index column is monotonically increasing and consecutive. Make sure the DataFrame is reasonably sized for one partition, so you avoid potential problems afterwards.
Worth noting that I sorted my DataFrame in ascending order beforehand.
Here’s a preview comparison of what it looked like for me, with and without coalesce, where I had a summary Dataframe of 50 rows,
df.coalesce(1).withColumn("No", monotonically_increasing_id()).show(60)
| startTimes | endTimes | No |
|---|---|---|
| 2019-11-01 05:39:50 | 2019-11-01 06:12:50 | 0 |
| 2019-11-01 06:23:10 | 2019-11-01 06:23:50 | 1 |
| 2019-11-01 06:26:49 | 2019-11-01 06:46:29 | 2 |
| 2019-11-01 07:00:29 | 2019-11-01 07:04:09 | 3 |
| 2019-11-01 15:24:29 | 2019-11-01 16:04:59 | 4 |
| 2019-11-01 16:23:38 | 2019-11-01 17:27:58 | 5 |
| 2019-11-01 17:32:18 | 2019-11-01 17:47:58 | 6 |
| 2019-11-01 17:54:18 | 2019-11-01 18:00:00 | 7 |
| 2019-11-02 04:42:40 | 2019-11-02 04:49:20 | 8 |
| 2019-11-02 05:11:40 | 2019-11-02 05:22:00 | 9 |
df.withColumn("runNo", monotonically_increasing_id()).show(60)
| startTimes | endTimes | runNo |
|---|---|---|
| 2019-11-01 05:39:50 | 2019-11-01 06:12:50 | 0 |
| 2019-11-01 06:23:10 | 2019-11-01 06:23:50 | 8589934592 |
| 2019-11-01 06:26:49 | 2019-11-01 06:46:29 | 17179869184 |
| 2019-11-01 07:00:29 | 2019-11-01 07:04:09 | 25769803776 |
| 2019-11-01 15:24:29 | 2019-11-01 16:04:59 | 34359738368 |
| 2019-11-01 16:23:38 | 2019-11-01 17:27:58 | 42949672960 |
| 2019-11-01 17:32:18 | 2019-11-01 17:47:58 | 51539607552 |
| 2019-11-01 17:54:18 | 2019-11-01 18:00:00 | 60129542144 |
| 2019-11-02 04:42:40 | 2019-11-02 04:49:20 | 68719476736 |
| 2019-11-02 05:11:40 | 2019-11-02 05:22:00 | 77309411328 |
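The ids in the second table are consistent with the documented layout: with (apparently) one row per partition, each id is just the partition index shifted left by 33 bits. A quick check in plain Python:

```python
observed = [0, 8589934592, 17179869184, 25769803776, 34359738368,
            42949672960, 51539607552, 60129542144, 68719476736, 77309411328]

# record number 0 in partition p gives id p * 2**33
expected = [p << 33 for p in range(10)]
print(observed == expected)  # True
```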
If you have a large DataFrame and you don’t want OOM error problems, I suggest using zipWithIndex():
from pyspark.sql.functions import col

df1 = df.rdd.zipWithIndex().toDF()
df2 = df1.select(col("_1.*"), col("_2").alias("increasing_id"))
df2.show()
where df is your initial DataFrame.
More solutions are shown in the Databricks documentation. Be careful with the row_number() function, which moves all the rows into one partition and can cause an OutOfMemoryError.