Give prefix to all columns when selecting with 'struct_name.*'

Question:

The DataFrame below is registered as a temp table named 'table_name'.
How would you use spark.sql() to give a prefix to all of its columns?

root
 |-- MAIN_COL: struct (nullable = true)
 |    |-- a: string (nullable = true)
 |    |-- b: string (nullable = true)
 |    |-- c: string (nullable = true)
 |    |-- d: string (nullable = true)
 |    |-- f: long (nullable = true)
 |    |-- g: long (nullable = true)
 |    |-- h: long (nullable = true)
 |    |-- j: long (nullable = true)

The query below

spark.sql("select MAIN_COL.* from table_name")

gives back columns named a, b, c, …, but how can I make them all look like e.g. pre_a, pre_b, pre_c?
I want to avoid selecting and aliasing them one by one. What if I have 30 columns?

I was hoping a custom UDF usable in SQL could solve this, but I'm really not sure how to handle it.

# Generate a pandas DataFrame
import pandas as pd
a_dict={
    'a':[1,2,3,4,5],
    'b':[1,2,3,4,5],
    'c':[1,2,3,4,5],
    'e':list('abcde'),
    'f':list('abcde'),
    'g':list('abcde')
}
pandas_df=pd.DataFrame(a_dict)
# Create a Spark DataFrame from a pandas DataFrame using Arrow
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df = spark.createDataFrame(pandas_df)

# Wrap all the columns into a single struct column named MAIN_COL
from pyspark.sql.functions import struct
main=df.select(struct(df.columns).alias("MAIN_COL"))
Asked By: Chris


Answers:

You can try this: add all the columns you need to schema2 (this answer uses Scala):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val schema2 = new StructType()
    .add("pre_a", StringType)
    .add("pre_b", StringType)
    .add("pre_c", StringType)

Now select the column with a cast to that schema:

df.select(col("MAIN_COL").cast(schema2)).show()

It will give you all the updated column names.

Answered By: Mahesh Gupta

Here is one way to go through the fields and modify their names dynamically. First, use main.schema.fields[0].dataType.fields to access the target fields. Next, use Python's map to prepend pre_ to each field name:

from pyspark.sql.types import *
from pyspark.sql.functions import col

inner_fields = main.schema.fields[0].dataType.fields

# [StructField(a,LongType,true),
#  StructField(b,LongType,true),
#  StructField(c,LongType,true),
#  StructField(e,StringType,true),
#  StructField(f,StringType,true),
#  StructField(g,StringType,true)]

pre_cols = list(map(lambda sf: StructField(f"pre_{sf.name}", sf.dataType, sf.nullable), inner_fields))

new_schema = StructType(pre_cols)

main.select(col("MAIN_COL").cast(new_schema)).printSchema()

# root
#  |-- MAIN_COL: struct (nullable = false)
#  |    |-- pre_a: long (nullable = true)
#  |    |-- pre_b: long (nullable = true)
#  |    |-- pre_c: long (nullable = true)
#  |    |-- pre_e: string (nullable = true)
#  |    |-- pre_f: string (nullable = true)
#  |    |-- pre_g: string (nullable = true)

Finally, you can use cast with the new schema as @Mahesh already mentioned.
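If you also want the prefixed fields back as top-level columns, as in the original MAIN_COL.* query, a small follow-up sketch (reusing main and new_schema from above):

main.select(col("MAIN_COL").cast(new_schema)).select("MAIN_COL.*").printSchema()
# prints top-level columns pre_a, pre_b, pre_c, pre_e, pre_f, pre_g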

Answered By: abiratsis

The beauty of Spark is that you can programmatically manipulate metadata.

This is an example that continues the original code snippet:

main.createOrReplaceTempView("table_name")

new_cols_select = ", ".join(["MAIN_COL." + col + " as pre_" + col for col in spark.sql("select MAIN_COL.* from table_name").columns])

new_df = spark.sql(f"select {new_cols_select} from table_name")

Due to Spark's laziness, and because all the manipulations are metadata-only, this code has almost no performance cost and will work the same for 10 columns or 500 columns (we actually do something similar on about 1,000 columns).

It is also possible to get the original column names in a more elegant way with the df.schema object.
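For example, a minimal sketch of that variant, assuming the main DataFrame and MAIN_COL struct from the question's snippet:

inner_names = [f.name for f in main.schema["MAIN_COL"].dataType.fields]
new_cols_select = ", ".join(f"MAIN_COL.{c} as pre_{c}" for c in inner_names)
new_df = spark.sql(f"select {new_cols_select} from table_name")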

Answered By: Vapira

You can also do this with PySpark:

from pyspark.sql.functions import col
df.select([col(col_name).alias('prefix' + col_name) for col_name in df.columns])
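A possible way to apply this to the original struct question is to flatten MAIN_COL first and then alias every resulting column (reusing the col import above; table and column names assumed from the question):

flat = spark.sql("select MAIN_COL.* from table_name")
flat.select([col(c).alias('pre_' + c) for c in flat.columns])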

Answered By: sargupta

The following expands all struct columns, adding the parent column name as a prefix.

struct_cols = [c for c, t in df.dtypes if t.startswith('struct')]
for c in struct_cols:
    schema = T.StructType([T.StructField(f"{c}_{f.name}", f.dataType, f.nullable) for f in df.schema[c].dataType.fields])
    df = df.withColumn(c, F.col(c).cast(schema))
df = df.select([f"{c}.*" if c in struct_cols else c for c in df.columns])

Test input:

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = spark.createDataFrame([((1, 2), 5)], 'c1:struct<f1:int,f2:int>, c2:int')
print(df.dtypes)
# [('c1', 'struct<f1:int,f2:int>'), ('c2', 'int')]

Result:

struct_cols = [c for c, t in df.dtypes if t.startswith('struct')]
for c in struct_cols:
    schema = T.StructType([T.StructField(f"{c}_{f.name}", f.dataType, f.nullable) for f in df.schema[c].dataType.fields])
    df = df.withColumn(c, F.col(c).cast(schema))
df = df.select([f"{c}.*" if c in struct_cols else c for c in df.columns])

print(df.dtypes)
# [('c1_f1', 'int'), ('c1_f2', 'int'), ('c2', 'int')]
Answered By: ZygD