PySpark Drop Rows

Question:

How do you drop rows from an RDD in PySpark? Particularly the first row, since that tends to contain column names in my datasets. From perusing the API, I can't seem to find an easy way to do this. Of course I could do this via Bash / HDFS, but I just want to know if this can be done from within PySpark.

Asked By: Jack


Answers:

AFAIK there’s no ‘easy’ way to do this.

This should do the trick, though:

val header = data.first
val rows = data.filter(line => line != header)
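In PySpark the same idea reads as follows (a minimal sketch, assuming data is an RDD of text lines):

# Grab the first line, then filter out every line equal to it
header = data.first()
rows = data.filter(lambda line: line != header)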
Answered By: maasg

Personally I think just using a filter to get rid of this stuff is the easiest way. But per your comment I have another approach: glom the RDD so each partition is an array (I'm assuming you have one file per partition, and each file has the offending row on top) and then just skip the first element (this is with the Scala API).

data.glom().map(x => for (elem <- x.drop(1)) {/*do stuff*/}) // x is an array, so just skip the 0th index

Keep in mind that one of the big features of RDDs is that they are immutable, so naturally removing a row is a tricky thing to do.

UPDATE:
Better solution.
rdd.mapPartitions(x => for (elem <- x.drop(1)) {/*do stuff*/})
Same as the glom approach, but without the overhead of putting everything into an array, since x is an iterator in this case.
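A rough PySpark counterpart of that mapPartitions idea, as a hedged sketch (assuming rdd is an RDD of text lines and that each partition starts with a header line, e.g. because each input file became exactly one partition):

from itertools import islice

# Drop the first element of every partition without materialising the partition into a list
noHeaders = rdd.mapPartitions(lambda it: islice(it, 1, None))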

Answered By: aaronman

Specific to PySpark:

As per @maasg, you could do this:

header = rdd.first()
rdd.filter(lambda line: line != header)

but it's not technically correct, since any data lines that happen to be identical to the header would be dropped as well. However, this seems to work for me:

def remove_header(itr_index, itr):
    return iter(list(itr)[1:]) if itr_index == 0 else itr
rdd.mapPartitionsWithIndex(remove_header)

Similarly:

rdd.zipWithIndex().filter(lambda tup: tup[1] > 0).map(lambda tup: tup[0])

I’m new to Spark, so can’t intelligently comment about which will be fastest.
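A quick sanity check of both variants on a toy RDD (a hedged sketch; the sample data and the two-partition split are made up for illustration):

sample = sc.parallelize(["col_a,col_b", "1,2", "3,4"], 2)
print(sample.mapPartitionsWithIndex(remove_header).collect())
print(sample.zipWithIndex().filter(lambda tup: tup[1] > 0).map(lambda tup: tup[0]).collect())
# both should print ['1,2', '3,4']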

Answered By: user4081921

A straightforward way to achieve this in PySpark (Python API), assuming you are using Python 3:

noHeaderRDD = rawRDD.zipWithIndex().filter(lambda row_index: row_index[1] > 0).keys()
Answered By: noleto

I have tested this with Spark 2.1. Let's say you want to remove the first 14 rows, without needing to know how many columns the file has.

sc = spark.sparkContext
lines = sc.textFile("s3://location_of_csv")
parts = lines.map(lambda l: l.split(","))
parts.zipWithIndex().filter(lambda tup: tup[1] >= 14).map(lambda x: x[0])

withColumn is a DataFrame function, so the following will not work in RDD style as used in the case above:

parts.withColumn("index",monotonically_increasing_id()).filter(index > 14)
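
For reference, the DataFrame version that line is gesturing at would look roughly like this (a hedged sketch; note that monotonically_increasing_id() does not produce consecutive row numbers, so this illustrates the syntax rather than a guaranteed way to drop exactly the first 14 rows):

from pyspark.sql.functions import monotonically_increasing_id, col

df = spark.read.csv("s3://location_of_csv")
df_skipped = (df.withColumn("index", monotonically_increasing_id())
                .filter(col("index") >= 14)
                .drop("index"))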
Answered By: kartik

I did some profiling of the various solutions and got the following results.

Cluster Configuration

Clusters

  • Cluster 1 : 4 Cores 16 GB
  • Cluster 2 : 4 Cores 16 GB
  • Cluster 3 : 4 Cores 16 GB
  • Cluster 4 : 2 Cores 8 GB

Data

7 million rows, 4 columns

# Solution 1
# Time Taken : 40 ms
data = sc.textFile('file1.txt')
firstRow = data.first()
data = data.filter(lambda row: row != firstRow)

# Solution 2
# Time Taken : 3 seconds
data = sc.textFile('file1.txt')
def dropFirstRow(index, iterator):
    return iter(list(iterator)[1:]) if index == 0 else iterator
data = data.mapPartitionsWithIndex(dropFirstRow)

# Solution 3
# Time Taken : 0.3 seconds
data = sc.textFile('file1.txt')
def dropFirstRow(index, iterator):
    if index == 0:
        for subIndex, item in enumerate(iterator):
            if subIndex > 0:
                yield item
    else:
        for item in iterator:
            yield item

data = data.mapPartitionsWithIndex(dropFirstRow)

I think that Solution 3 is the most scalable.

Answered By: Anant Gupta