How to Set PySpark DataFrame Headers to Another Row?
Question:
I have a dataframe that looks like this:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |  id|name| val|
# |   1| a01|   X|
# |   2| a02|   Y|
# +----+----+----+
I need to create a new dataframe from it, using the first data row (id, name, val) as the new column headers and dropping the original col1, col2, etc. header. The new table should look like this:
# +---+----+---+
# | id|name|val|
# +---+----+---+
# |  1| a01|  X|
# |  2| a02|  Y|
# +---+----+---+
The columns can vary, so I can’t set the names explicitly in the new dataframe. I am not using pandas DataFrames.
Answers:
Assuming that there is only one row with id in col1, name in col2 and val in col3, you can use the following logic (commented for clarity and explanation):
# select the row that holds the header names
header = df.filter((df['col1'] == 'id') & (df['col2'] == 'name') & (df['col3'] == 'val'))
# select the remaining rows, i.e. everything except the header row
# (subtract() behaves like EXCEPT DISTINCT, so duplicate data rows are collapsed too)
restDF = df.subtract(header)
# convert the header row into a Row object
headerColumn = header.first()
# rename each column to the value found in the header row
for column in restDF.columns:
    restDF = restDF.withColumnRenamed(column, headerColumn[column])
restDF.show(truncate=False)
This should give you:
+---+----+---+
|id |name|val|
+---+----+---+
|1 |a01 |X |
|2 |a02 |Y |
+---+----+---+
However, the best option is to set the header option to true when reading the dataframe from the source (for example with sqlContext), so that the column names are taken from the first line.
Did you try header=True?
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName("Python Spark SQL basic example")
    .getOrCreate()
)
# header=True takes the column names from the first line of the file
df = spark.read.csv("TSCAINV_062020.csv", header=True)
PySpark names the columns _c0, _c1, _c2, and so on if header is not set to True, and the header line is pushed down into the data as the first row.
Thanks to @Sai Kiran!
The header=True option works for me:
df = spark.read.csv("TSCAINV_062020.csv",header=True)