Give prefix to all columns when selecting with 'struct_name.*'
Question:
The dataframe below is a temp_table named: ‘table_name’.
How would you use spark.sql() to give a prefix to all columns?
root
|-- MAIN_COL: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: string (nullable = true)
| |-- c: string (nullable = true)
| |-- d: string (nullable = true)
| |-- f: long (nullable = true)
| |-- g: long (nullable = true)
| |-- h: long (nullable = true)
| |-- j: long (nullable = true)
The below query
spark.sql("select MAIN_COL.* from table_name")
gives back columns named a, b, c…, but how can I make them all look like e.g. pre_a, pre_b, pre_c?
I want to avoid selecting them and giving them an alias one by one. What if I have 30 columns?
I hope a custom UDF that can be used in SQL could solve it, but I am really not sure how to handle this.
# Generate a pandas DataFrame
import pandas as pd
a_dict={
'a':[1,2,3,4,5],
'b':[1,2,3,4,5],
'c':[1,2,3,4,5],
'e':list('abcde'),
'f':list('abcde'),
'g':list('abcde')
}
pandas_df=pd.DataFrame(a_dict)
# Create a Spark DataFrame from a pandas DataFrame using Arrow
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df = spark.createDataFrame(pandas_df)
#struct
from pyspark.sql.functions import struct
main=df.select(struct(df.columns).alias("MAIN_COL"))
Answers:
You can try this: add all the columns you need, with their new names, to schema2
val schema2 = new StructType()
.add("pre_a",StringType)
.add("pre_b",StringType)
.add("pre_c",StringType)
Now select the column like this:
df.select(col("MAIN_COL").cast(schema2)).show()
It will give you all the updated column names.
Here is one way to go through the fields and modify their names dynamically. First use main.schema.fields[0].dataType.fields to access the target fields. Next use Python's map to prepend pre_ to each field name:
from pyspark.sql.types import *
from pyspark.sql.functions import col
inner_fields = main.schema.fields[0].dataType.fields
# [StructField(a,LongType,true),
# StructField(b,LongType,true),
# StructField(c,LongType,true),
# StructField(e,StringType,true),
# StructField(f,StringType,true),
# StructField(g,StringType,true)]
pre_cols = list(map(lambda sf: StructField(f"pre_{sf.name}", sf.dataType, sf.nullable), inner_fields))
new_schema = StructType(pre_cols)
main.select(col("MAIN_COL").cast(new_schema)).printSchema()
# root
# |-- MAIN_COL: struct (nullable = false)
# | |-- pre_a: long (nullable = true)
# | |-- pre_b: long (nullable = true)
# | |-- pre_c: long (nullable = true)
# | |-- pre_e: string (nullable = true)
# | |-- pre_f: string (nullable = true)
# | |-- pre_g: string (nullable = true)
Finally, you can use cast with the new schema, as @Mahesh already mentioned.
The beauty of Spark is that you can programmatically manipulate metadata.
This is an example that continues the original code snippet:
main.createOrReplaceTempView("table_name")
# Loop variable named c so it does not shadow pyspark.sql.functions.col if that is imported
new_cols_select = ", ".join(["MAIN_COL." + c + " as pre_" + c for c in spark.sql("select MAIN_COL.* from table_name").columns])
new_df = spark.sql(f"select {new_cols_select} from table_name")
Because Spark is lazy and all of these manipulations are metadata only, this code has almost no performance cost and works the same for 10 columns or 500 columns (we are actually doing something similar with about 1k columns).
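The string-building part of this approach can be tried without a cluster. The sketch below is an assumption-laden illustration, not the answerer's code: build_prefixed_select is a hypothetical helper, the field list is typed in by hand instead of being read from the table, and backticks are added around names so that fields containing spaces or dots still parse (names that themselves contain a backtick are not handled):

```python
def build_prefixed_select(struct_col, field_names, prefix):
    """Build a SELECT list that aliases every struct field with a prefix.

    Hypothetical helper: returns plain text, it does not touch Spark.
    """
    return ", ".join(
        f"{struct_col}.`{name}` AS `{prefix}{name}`" for name in field_names
    )


# Field names written out by hand here; in practice they would come from
# the DataFrame's columns or schema.
select_list = build_prefixed_select("MAIN_COL", ["a", "b", "e"], "pre_")
query = f"SELECT {select_list} FROM table_name"
print(query)
# SELECT MAIN_COL.`a` AS `pre_a`, MAIN_COL.`b` AS `pre_b`, MAIN_COL.`e` AS `pre_e` FROM table_name
```

The resulting string can be handed to spark.sql() in the same way as the f-string query in the answer.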
It is also possible to get the original column names in a more elegant way from the df.schema object.
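You can also do this with PySpark:
from pyspark.sql.functions import col
# Iterate over df.columns (the column names), not over the DataFrame itself
df.select([col(col_name).alias('prefix' + col_name) for col_name in df.columns])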
The following expands all struct columns, adding the parent column name as a prefix.
struct_cols = [c for c, t in df.dtypes if t.startswith('struct')]
for c in struct_cols:
    schema = T.StructType([T.StructField(f"{c}_{f.name}", f.dataType, f.nullable) for f in df.schema[c].dataType.fields])
    df = df.withColumn(c, F.col(c).cast(schema))
df = df.select([f"{c}.*" if c in struct_cols else c for c in df.columns])
Test input:
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.createDataFrame([((1, 2), 5)], 'c1:struct<f1:int,f2:int>, c2:int')
print(df.dtypes)
# [('c1', 'struct<f1:int,f2:int>'), ('c2', 'int')]
Result:
struct_cols = [c for c, t in df.dtypes if t.startswith('struct')]
for c in struct_cols:
    schema = T.StructType([T.StructField(f"{c}_{f.name}", f.dataType, f.nullable) for f in df.schema[c].dataType.fields])
    df = df.withColumn(c, F.col(c).cast(schema))
df = df.select([f"{c}.*" if c in struct_cols else c for c in df.columns])
print(df.dtypes)
# [('c1_f1', 'int'), ('c1_f2', 'int'), ('c2', 'int')]