How to change dataframe column names in PySpark?

Question:

I come from pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command:

df.columns = new_column_name_list

However, the same doesn’t work in PySpark dataframes created using sqlContext.
The only solution I could figure out to do this easily is the following:

df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='t').load("data.txt")
oldSchema = df.schema
for i,k in enumerate(oldSchema.fields):
  k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='t').load("data.txt", schema=oldSchema)

This is basically defining the variable twice and inferring the schema first then renaming the column names and then loading the dataframe again with the updated schema.

Is there a better and more efficient way to do this like we do in pandas?

My Spark version is 1.5.0

Asked By: Shubhanshu Mishra

||

Answers:

There are many ways to do that:

  • Option 1. Using selectExpr.

     data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)], 
                                       ["Name", "askdaosdka"])
     data.show()
     data.printSchema()
    
     # Output
     #+-------+----------+
     #|   Name|askdaosdka|
     #+-------+----------+
     #|Alberto|         2|
     #| Dakota|         2|
     #+-------+----------+
    
     #root
     # |-- Name: string (nullable = true)
     # |-- askdaosdka: long (nullable = true)
    
     df = data.selectExpr("Name as name", "askdaosdka as age")
     df.show()
     df.printSchema()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    
     #root
     # |-- name: string (nullable = true)
     # |-- age: long (nullable = true)
    
  • Option 2. Using withColumnRenamed, notice that this method allows you to "overwrite" the same column. For Python3, replace xrange with range.

     from functools import reduce
    
     oldColumns = data.schema.names
     newColumns = ["name", "age"]
    
     df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), xrange(len(oldColumns)), data)
     df.printSchema()
     df.show()
    
  • Option 3. using
    alias, in Scala you can also use as.

     from pyspark.sql.functions import col
    
     data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
     data.show()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    
  • Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.

     sqlContext.registerDataFrameAsTable(data, "myTable")
     df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
    
     df2.show()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    
Answered By: Alberto Bonsanto
df = df.withColumnRenamed("colName", "newColName")
       .withColumnRenamed("colName2", "newColName2")

Advantage of using this way: With long list of columns you would like to change only few column names. This can be very convenient in these scenarios. Very useful when joining tables with duplicate column names.

Answered By: Pankaj Kumar

If you want to rename a single column and keep the rest as it is:

from pyspark.sql.functions import col
new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])
Answered By: Ratul Ghosh

If you want to change all columns names, try df.toDF(*cols)

Answered By: user8117731

For a single column rename, you can still use toDF(). For example,

df1.selectExpr("SALARY*2").toDF("REVISED_SALARY").show()
Answered By: ganeiy

we can use col.alias for renaming the column:

from pyspark.sql.functions import col
df.select(['vin',col('timeStamp').alias('Date')]).show()
Answered By: mike

In case you would like to apply a simple transformation on all column names, this code does the trick: (I am replacing all spaces with underscore)

new_column_name_list= list(map(lambda x: x.replace(" ", "_"), df.columns))

df = df.toDF(*new_column_name_list)

Thanks to @user8117731 for toDf trick.

Answered By: pbahr

Another way to rename just one column (using import pyspark.sql.functions as F):

df = df.select( '*', F.col('count').alias('new_count') ).drop('count')
Answered By: scottlittle

df.withColumnRenamed('age', 'age2')

Answered By: Sahan Jayasumana

this is the approach that I used:

create pyspark session:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('changeColNames').getOrCreate()

create dataframe:

df = spark.createDataFrame(data = [('Bob', 5.62,'juice'),  ('Sue',0.85,'milk')], schema = ["Name", "Amount","Item"])

view df with column names:

df.show()
+----+------+-----+
|Name|Amount| Item|
+----+------+-----+
| Bob|  5.62|juice|
| Sue|  0.85| milk|
+----+------+-----+

create a list with new column names:

newcolnames = ['NameNew','AmountNew','ItemNew']

change the column names of the df:

for c,n in zip(df.columns,newcolnames):
    df=df.withColumnRenamed(c,n)

view df with new column names:

df.show()
+-------+---------+-------+
|NameNew|AmountNew|ItemNew|
+-------+---------+-------+
|    Bob|     5.62|  juice|
|    Sue|     0.85|   milk|
+-------+---------+-------+
Answered By: Grant Shannon

I made an easy to use function to rename multiple columns for a pyspark dataframe,
in case anyone wants to use it:

def renameCols(df, old_columns, new_columns):
    for old_col,new_col in zip(old_columns,new_columns):
        df = df.withColumnRenamed(old_col,new_col)
    return df

old_columns = ['old_name1','old_name2']
new_columns = ['new_name1', 'new_name2']
df_renamed = renameCols(df, old_columns, new_columns)

Be careful, both lists must be the same length.

Answered By: Manrique

You can use the following function to rename all the columns of your dataframe.

def df_col_rename(X, to_rename, replace_with):
    """
    :param X: spark dataframe
    :param to_rename: list of original names
    :param replace_with: list of new names
    :return: dataframe with updated names
    """
    import pyspark.sql.functions as F
    mapping = dict(zip(to_rename, replace_with))
    X = X.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])
    return X

In case you need to update only a few columns’ names, you can use the same column name in the replace_with list

To rename all columns

df_col_rename(X,['a', 'b', 'c'], ['x', 'y', 'z'])

To rename a some columns

df_col_rename(X,['a', 'b', 'c'], ['a', 'y', 'z'])
Answered By: Clock Slave

We can use various approaches to rename the column name.

First, let create a simple DataFrame.

df = spark.createDataFrame([("x", 1), ("y", 2)], 
                                  ["col_1", "col_2"])

Now let’s try to rename col_1 to col_3. PFB a few approaches to do the same.

# Approach - 1 : using withColumnRenamed function.
df.withColumnRenamed("col_1", "col_3").show()

# Approach - 2 : using alias function.
df.select(df["col_1"].alias("col3"), "col_2").show()

# Approach - 3 : using selectExpr function.
df.selectExpr("col_1 as col_3", "col_2").show()

# Rename all columns
# Approach - 4 : using toDF function. Here you need to pass the list of all columns present in DataFrame.
df.toDF("col_3", "col_2").show()

Here is the output.

+-----+-----+
|col_3|col_2|
+-----+-----+
|    x|    1|
|    y|    2|
+-----+-----+

I hope this helps.

Answered By: Neeraj Bhadani

There are multiple approaches you can use:

  1. df1=df.withColumn("new_column","old_column").drop(col("old_column"))

  2. df1=df.withColumn("new_column","old_column")

  3. df1=df.select("old_column".alias("new_column"))

Answered By: pankajs

You can put into for loop, and use zip to pairs each column name in two array.

new_name = ["id", "sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "species"]

new_df = df
for old, new in zip(df.columns, new_name):
    new_df = new_df.withColumnRenamed(old, new)
Answered By: Haha TTpro

I like to use a dict to rename the df.

rename = {'old1': 'new1', 'old2': 'new2'}
for col in df.schema.names:
    df = df.withColumnRenamed(col, rename[col])
Answered By: Michael H.

Method 1:

df = df.withColumnRenamed("old_column_name", "new_column_name")

Method 2:
If you want to do some computation and rename the new values

df = df.withColumn("old_column_name", F.when(F.col("old_column_name") > 1, F.lit(1)).otherwise(F.col("old_column_name"))
df = df.drop("new_column_name", "old_column_name")
Answered By: Gourav Bansal

from pyspark.sql.types import StructType,StructField, StringType, IntegerType

CreatingDataFrame = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]

schema = StructType([ 
    StructField("employee_name",StringType(),True), 
    StructField("department",StringType(),True), 
    StructField("state",StringType(),True), 
    StructField("salary", IntegerType(), True), 
    StructField("age", StringType(), True), 
    StructField("bonus", IntegerType(), True) 
  ])

 
OurData = spark.createDataFrame(data=CreatingDataFrame,schema=schema)

OurData.show()

# COMMAND ----------

GrouppedBonusData=OurData.groupBy("department").sum("bonus")


# COMMAND ----------

GrouppedBonusData.show()


# COMMAND ----------

GrouppedBonusData.printSchema()

# COMMAND ----------

from pyspark.sql.functions import col

BonusColumnRenamed = GrouppedBonusData.select(col("department").alias("department"), col("sum(bonus)").alias("Total_Bonus"))
BonusColumnRenamed.show()

# COMMAND ----------

GrouppedBonusData.groupBy("department").count().show()

# COMMAND ----------

GrouppedSalaryData=OurData.groupBy("department").sum("salary")

# COMMAND ----------

GrouppedSalaryData.show()

# COMMAND ----------

from pyspark.sql.functions import col

SalaryColumnRenamed = GrouppedSalaryData.select(col("department").alias("Department"), col("sum(salary)").alias("Total_Salary"))
SalaryColumnRenamed.show()

Answered By: Aviral Bhardwaj

A way that you can use ‘alias’ to change the column name:

col('my_column').alias('new_name')

Another way that you can use ‘alias’ (possibly not mentioned):

df.my_column.alias('new_name')
Answered By: lfvv

Try the following method. The following method can allow you rename columns of multiple files

Reference: https://www.linkedin.com/pulse/pyspark-methods-rename-columns-kyle-gibson/

df_initial = spark.read.load('com.databricks.spark.csv')
    
    rename_dict = {
      'Alberto':'Name',
      'Dakota':'askdaosdka'
    }
    
    df_renamed = df_initial 
    .select([col(c).alias(rename_dict.get(c, c)) for c in df_initial.columns])

    
     rename_dict = {
       'FName':'FirstName',
       'LName':'LastName',
       'DOB':'BirthDate'
        }

     return df.select([col(c).alias(rename_dict.get(c, c)) for c in df.columns])


df_renamed = spark.read.load('/mnt/datalake/bronze/testData') 
.transform(renameColumns)
Answered By: Dicer

The simplest solution is using withColumnRenamed:

renamed_df = df.withColumnRenamed(‘name_1’, ‘New_name_1’).withColumnRenamed(‘name_2’, ‘New_name_2’)
renamed_df.show()

And if you would like to do this like we do with Pandas, you can use toDF:

Create an order of list of new columns and pass it to toDF

df_list = ["newName_1", “newName_2", “newName_3", “newName_4"]
renamed_df = df.toDF(*df_list)
renamed_df.show()

Answered By: sargupta

This is an easy way to rename multiple columns with a loop:

cols_to_rename = ["col1","col2","col3"]

for col in cols_to_rename:
  df = df.withColumnRenamed(col,"new_{}".format(col))
Answered By: Rayanaay

List comprehension + f-string:

df = df.toDF(*[f'n_{c}' for c in df.columns])

Simple list comprehension:

df = df.toDF(*[c.lower() for c in df.columns])
Answered By: ZygD

The closest statement to df.columns = new_column_name_list is:

import pyspark.sql.functions as F
df = df.select(*[F.col(name_old).alias(name_new) 
                 for (name_old, name_new) 
                 in zip(df.columns, new_column_name_list)]

This doesn’t require any rarely-used functions, and emphasizes some patterns that are very helpful in Spark. You could also break up the steps if you find this one-liner to be doing too many things:

import pyspark.sql.functions as F
column_mapping = [F.col(name_old).alias(name_new) 
                  for (name_old, name_new) 
                  in zip(df.columns, new_column_name_list)]
df = df.select(*column_mapping)
Answered By: John Haberstroh

f.withColumnRenamed(‘name’, ‘rename’)