PySpark Drop Rows
Question:
how do you drop rows from an RDD in PySpark? Particularly the first row, since that tends to contain column names in my datasets. From perusing the API, I can’t seem to find an easy way to do this. Of course I could do this via Bash / HDFS, but I just want to know if this can be done from within PySpark.
Answers:
AFAIK there’s no ‘easy’ way to do this.
This should do the trick, though:
val header = data.first
val rows = data.filter(line => line != header)
Personally I think just using a filter to get rid of this stuff is the easiest way. But per your comment I have another approach. Glom the RDD so each partition becomes an array (I’m assuming you have one file per partition, and each file has the offending row on top), and then just skip the first element (this is with the Scala API).
data.glom().map(x => for (elem <- x.drop(1)) { /*do stuff*/ }) // x is an array, so just skip the 0th index
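Since the question is about PySpark, a rough Python translation of the glom idea (a minimal sketch, under the same one-header-per-partition assumption) could be:
# each element produced by glom() is a full partition as a list,
# so dropping the header is just slicing off the first element;
# only safe if every partition really begins with a header row
noHeader = data.glom().flatMap(lambda part: part[1:])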
Keep in mind that one of the big features of RDDs is that they are immutable, so naturally removing a row is a tricky thing to do.
UPDATE:
Better solution.
rdd.mapPartitions(x => for (elem <- x.drop(1)) { /*do stuff*/ })
Same as the glom approach, but without the overhead of materializing each partition into an array, since x is an iterator in this case.
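A PySpark sketch of that same mapPartitions idea, under the same assumption that every partition starts with a header line:
from itertools import islice

# lazily skip the first element of every partition, without building a list;
# only valid if each partition begins with a header row
noHeader = rdd.mapPartitions(lambda it: islice(it, 1, None))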
Specific to PySpark:
As per @maasg, you could do this:
header = rdd.first()
rdd.filter(lambda line: line != header)
but it’s not technically correct, as it may also exclude data lines that happen to match the header. However, this seems to work for me:
def remove_header(itr_index, itr):
    return iter(list(itr)[1:]) if itr_index == 0 else itr
rdd.mapPartitionsWithIndex(remove_header)
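For example, on a small test RDD (hypothetical data, just to illustrate the call):
rdd = sc.parallelize(["col_a,col_b", "1,2", "3,4"], 2)
print(rdd.mapPartitionsWithIndex(remove_header).collect())
# ['1,2', '3,4'] -- only partition 0 loses its first line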
Similarly:
rdd.zipWithIndex().filter(lambda tup: tup[1] > 0).map(lambda tup: tup[0])
I’m new to Spark, so can’t intelligently comment about which will be fastest.
A straightforward way to achieve this in PySpark (Python API), assuming you are using Python 3:
noHeaderRDD = rawRDD.zipWithIndex().filter(lambda row_index: row_index[1] > 0).keys()
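Here, zipWithIndex() turns each record into a (record, index) pair, the filter drops index 0, and keys() strips the index back off. A tiny illustration with made-up data:
rawRDD = sc.parallelize(["header", "row1", "row2"])
print(rawRDD.zipWithIndex().collect())  # [('header', 0), ('row1', 1), ('row2', 2)]
print(rawRDD.zipWithIndex().filter(lambda row_index: row_index[1] > 0).keys().collect())  # ['row1', 'row2']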
I have tested this with Spark 2.1. Let’s say you want to remove the first 14 rows, without needing to know how many columns the file has.
sc = spark.sparkContext
lines = sc.textFile("s3://location_of_csv")
parts = lines.map(lambda l: l.split(","))
parts.zipWithIndex().filter(lambda tup: tup[1] >= 14).map(lambda x: x[0])  # indices are 0-based, so >= 14 skips exactly the first 14 rows
withColumn is a DataFrame function, so the following will not work in RDD style as used in the case above.
parts.withColumn("index",monotonically_increasing_id()).filter(index > 14)
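For completeness, a hedged sketch of the DataFrame route: withColumn and monotonically_increasing_id() do exist, but the generated ids are only guaranteed to be increasing, not consecutive, so filtering on them is not a reliable way to drop exactly the first 14 rows (zipWithIndex on the RDD remains the safer option):
from pyspark.sql.functions import monotonically_increasing_id, col

df = spark.read.csv("s3://location_of_csv")  # reading the same source as a DataFrame
# ids are increasing but not consecutive across partitions, so this is only approximate
df = df.withColumn("index", monotonically_increasing_id()).filter(col("index") >= 14).drop("index")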
I did some profiling of the various solutions and got the following results.
Cluster Configuration
Clusters
- Cluster 1 : 4 Cores 16 GB
- Cluster 2 : 4 Cores 16 GB
- Cluster 3 : 4 Cores 16 GB
- Cluster 4 : 2 Cores 8 GB
Data
7 million rows, 4 columns
#Solution 1
# Time Taken : 40 ms
data = sc.textFile('file1.txt')
firstRow=data.first()
data=data.filter(lambda row:row != firstRow)
#Solution 2
# Time Taken : 3 seconds
data = sc.textFile('file1.txt')
def dropFirstRow(index, iterator):
    return iter(list(iterator)[1:]) if index == 0 else iterator
data=data.mapPartitionsWithIndex(dropFirstRow)
#Solution 3
# Time Taken : 0.3 seconds
data = sc.textFile('file1.txt')

def dropFirstRow(index, iterator):
    if index == 0:
        for subIndex, item in enumerate(iterator):
            if subIndex > 0:
                yield item
    else:
        # pass every element of the other partitions through unchanged
        for item in iterator:
            yield item

data = data.mapPartitionsWithIndex(dropFirstRow)
I think that Solution 3 is the most scalable, since it streams each partition instead of materializing it as a list.
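For what it’s worth, Solution 3 can be written more compactly with itertools.islice, which keeps the same lazy, partition-streaming behaviour (a sketch only; it was not part of the benchmark above):
from itertools import islice

def dropFirstRow(index, iterator):
    # skip one element in partition 0; pass every other partition through lazily
    return islice(iterator, 1, None) if index == 0 else iterator

data = sc.textFile('file1.txt').mapPartitionsWithIndex(dropFirstRow)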