PySpark Drop Rows
Question:
how do you drop rows from an RDD in PySpark? Particularly the first row, since that tends to contain column names in my datasets. From perusing the API, I can’t seem to find an easy way to do this. Of course I could do this via Bash / HDFS, but I just want to know if this can be done from within PySpark.
Answers:
AFAIK there’s no ‘easy’ way to do this.
This should do the trick, though:
val header = data.first
val rows = data.filter(line => line != header)
Personally I think just using a filter to get rid of this stuff is the easiest way. But per your comment I have another approach. Glom the RDD so each partition becomes an array (I’m assuming you have one file per partition, and each file has the offending row on top), and then just skip the first element (this is with the Scala API).
data.glom().map(x => for (elem <- x.drop(1)) { /*do stuff*/ }) // x is an array, so just skip the 0th index
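Since the question is about PySpark, a rough Python translation of the glom idea (a minimal sketch, under the same one-header-per-partition assumption) could be:
# each element produced by glom() is a full partition as a list,
# so dropping the header is just slicing off the first element;
# only safe if every partition really begins with a header row
noHeader = data.glom().flatMap(lambda part: part[1:])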
Keep in mind that one of the big features of RDDs is that they are immutable, so naturally removing a row is a tricky thing to do.
UPDATE:
Better solution.
rdd.mapPartitions(x => for (elem <- x.drop(1)) { /*do stuff*/ })
Same as the glom approach, but without the overhead of materializing each partition into an array, since x is an iterator in this case.
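A PySpark sketch of that same mapPartitions idea, under the same assumption that every partition starts with a header line:
from itertools import islice

# lazily skip the first element of every partition, without building a list;
# only valid if each partition begins with a header row
noHeader = rdd.mapPartitions(lambda it: islice(it, 1, None))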
Specific to PySpark:
As per @maasg, you could do this:
header = rdd.first()
rdd.filter(lambda line: line != header)
but it’s not technically correct, as it may also exclude data lines that happen to match the header. However, this seems to work for me:
def remove_header(itr_index, itr):
    return iter(list(itr)[1:]) if itr_index == 0 else itr
rdd.mapPartitionsWithIndex(remove_header)
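For example, on a small test RDD (hypothetical data, just to illustrate the call):
rdd = sc.parallelize(["col_a,col_b", "1,2", "3,4"], 2)
print(rdd.mapPartitionsWithIndex(remove_header).collect())
# ['1,2', '3,4'] -- only partition 0 loses its first line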
Similarly:
rdd.zipWithIndex().filter(lambda tup: tup[1] > 0).map(lambda tup: tup[0])
I’m new to Spark, so can’t intelligently comment about which will be fastest.
A straightforward way to achieve this in PySpark (Python API), assuming you are using Python 3:
noHeaderRDD = rawRDD.zipWithIndex().filter(lambda row_index: row_index[1] > 0).keys()
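Here, zipWithIndex() turns each record into a (record, index) pair, the filter drops index 0, and keys() strips the index back off. A tiny illustration with made-up data:
rawRDD = sc.parallelize(["header", "row1", "row2"])
print(rawRDD.zipWithIndex().collect())  # [('header', 0), ('row1', 1), ('row2', 2)]
print(rawRDD.zipWithIndex().filter(lambda row_index: row_index[1] > 0).keys().collect())  # ['row1', 'row2']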
I have tested this with Spark 2.1. Let’s say you want to remove the first 14 rows, without needing to know how many columns the file has.
sc = spark.sparkContext
lines = sc.textFile("s3://location_of_csv")
parts = lines.map(lambda l: l.split(","))
parts.zipWithIndex().filter(lambda tup: tup[1] >= 14).map(lambda x: x[0])  # indices are 0-based, so >= 14 skips exactly the first 14 rows
withColumn is a DataFrame function, so the following will not work in RDD style as used in the case above.
parts.withColumn("index",monotonically_increasing_id()).filter(index > 14)
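For completeness, a hedged sketch of the DataFrame route: withColumn and monotonically_increasing_id() do exist, but the generated ids are only guaranteed to be increasing, not consecutive, so filtering on them is not a reliable way to drop exactly the first 14 rows (zipWithIndex on the RDD remains the safer option):
from pyspark.sql.functions import monotonically_increasing_id, col

df = spark.read.csv("s3://location_of_csv")  # reading the same source as a DataFrame
# ids are increasing but not consecutive across partitions, so this is only approximate
df = df.withColumn("index", monotonically_increasing_id()).filter(col("index") >= 14).drop("index")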
I did some profiling of the various solutions and got the following results.
Cluster Configuration
Clusters
- Cluster 1 : 4 Cores 16 GB
- Cluster 2 : 4 Cores 16 GB
- Cluster 3 : 4 Cores 16 GB
- Cluster 4 : 2 Cores 8 GB
Data
7 million rows, 4 columns
#Solution 1
# Time Taken : 40 ms
data = sc.textFile('file1.txt')
firstRow=data.first()
data=data.filter(lambda row:row != firstRow)
#Solution 2
# Time Taken : 3 seconds
data = sc.textFile('file1.txt')
def dropFirstRow(index, iterator):
    return iter(list(iterator)[1:]) if index == 0 else iterator
data=data.mapPartitionsWithIndex(dropFirstRow)
#Solution 3
# Time Taken : 0.3 seconds
data = sc.textFile('file1.txt')

def dropFirstRow(index, iterator):
    if index == 0:
        for subIndex, item in enumerate(iterator):
            if subIndex > 0:
                yield item
    else:
        # pass every element of the other partitions through unchanged
        for item in iterator:
            yield item

data = data.mapPartitionsWithIndex(dropFirstRow)
I think that Solution 3 is the most scalable, since it streams each partition instead of materializing it as a list.
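For what it’s worth, Solution 3 can be written more compactly with itertools.islice, which keeps the same lazy, partition-streaming behaviour (a sketch only; it was not part of the benchmark above):
from itertools import islice

def dropFirstRow(index, iterator):
    # skip one element in partition 0; pass every other partition through lazily
    return islice(iterator, 1, None) if index == 0 else iterator

data = sc.textFile('file1.txt').mapPartitionsWithIndex(dropFirstRow)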