How to Set Pyspark Dataframe Headers to another Row?

Question:

I have a dataframe that looks like this:

# +----+------+---------+
# |col1| col2 |  col3   |
# +----+------+---------+
# |  id| name |    val  |
# |  1 |  a01 |    X    |
# |  2 |  a02 |    Y    |
# +----+------+---------+

I need to create a new dataframe from it, using row[1] as the new column headers and ignoring or dropping the col1, col2, etc. row. The new table should look like this:

# +----+------+---------+
# | id | name |   val   |
# +----+------+---------+
# |  1 |  a01 |    X    |
# |  2 |  a02 |    Y    |
# +----+------+---------+

The columns can be variable, so I can't use their names to set them explicitly in the new dataframe. This is not a pandas DataFrame.

Asked By: Tibberzz


Answers:

Assuming that there is only one row with id in col1, name in col2, and val in col3, you can use the following logic (commented for clarity):

# select the row that contains the header values
header = df.filter((df['col1'] == 'id') & (df['col2'] == 'name') & (df['col3'] == 'val'))

# keep the rest of the rows, i.e. everything except the header row
restDF = df.subtract(header)

# convert the header row into a Row object
headerColumn = header.first()

# loop over the columns, renaming each one to the value from the header row
for column in restDF.columns:
    restDF = restDF.withColumnRenamed(column, headerColumn[column])

restDF.show(truncate=False)

This should give you:

+---+----+---+
|id |name|val|
+---+----+---+
|1  |a01 |X  |
|2  |a02 |Y  |
+---+----+---+
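
If the header values aren't known up front (the question says the columns can be variable), the same idea can be written generically. The following is a minimal sketch, not from the original answer, assuming the header row is the one returned by df.first() and that no data row repeats the header values exactly:

from functools import reduce

# take the header row (assumed to be the first row Spark returns)
header_row = df.first()  # e.g. Row(col1='id', col2='name', col3='val')

# build a predicate that matches the header row across every column
is_header = reduce(lambda a, b: a & b,
                   [df[c] == header_row[c] for c in df.columns])

# drop the header row and rename each column to the value it held in the header
new_df = df.filter(~is_header).toDF(*[header_row[c] for c in df.columns])
new_df.show(truncate=False)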

But the best option would be to set the header option to true when reading the dataframe from the source in the first place.

Answered By: Ramesh Maharjan

Did you try setting header=True?

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

df = spark.read.csv("TSCAINV_062020.csv", header=True)

PySpark sets the column names to _c0, _c1, _c2, ... if header is not set to True, and the real header row gets pushed down into the data as the first row.
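
For illustration, here is a minimal sketch (reusing the same file name as above) that contrasts the two reads:

# without header=True, Spark invents _c0, _c1, ... and the real header
# lands in the first data row
df_no_header = spark.read.csv("TSCAINV_062020.csv")

# with header=True, the first line of the file becomes the column names
df_with_header = spark.read.csv("TSCAINV_062020.csv", header=True)

df_no_header.printSchema()
df_with_header.printSchema()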

Answered By: Sai Kiran

Thanks to @Sai Kiran!
header=True works for me:

df = spark.read.csv("TSCAINV_062020.csv",header=True)

Answered By: Ryan Xu