Iterate through each column and find the max length
Question:
I want to get the maximum length from each column from a pyspark dataframe.
Following is the sample dataframe:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James","","Smith","36636","M",3000),
("Michael","Rose","","40288","M",4000),
("Robert","","Williams","42114","M",4000),
("Maria","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1)
]
schema = StructType([
StructField("firstname",StringType(),True),
StructField("middlename",StringType(),True),
StructField("lastname",StringType(),True),
StructField("id", StringType(), True),
StructField("gender", StringType(), True),
StructField("salary", IntegerType(), True)
])
df = spark.createDataFrame(data=data2,schema=schema)
I tried to implement the solution provided in Scala but could not convert it.
Answers:
This would work
from pyspark.sql.functions import col, length, max
df=df.select([max(length(col(name))) for name in df.schema.names])
Result
Edit: For reference: Converting to Rows (As asked here, updated there as well – pyspark max string length for each column in the dataframe)
df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
row=df.first().asDict()
df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names], ['col', 'length'])
Output:
I want to get the maximum length from each column from a pyspark dataframe.
Following is the sample dataframe:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James","","Smith","36636","M",3000),
("Michael","Rose","","40288","M",4000),
("Robert","","Williams","42114","M",4000),
("Maria","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1)
]
schema = StructType([
StructField("firstname",StringType(),True),
StructField("middlename",StringType(),True),
StructField("lastname",StringType(),True),
StructField("id", StringType(), True),
StructField("gender", StringType(), True),
StructField("salary", IntegerType(), True)
])
df = spark.createDataFrame(data=data2,schema=schema)
I tried to implement the solution provided in Scala but could not convert it.
This would work
from pyspark.sql.functions import col, length, max
df=df.select([max(length(col(name))) for name in df.schema.names])
Result
Edit: For reference: Converting to Rows (As asked here, updated there as well – pyspark max string length for each column in the dataframe)
df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
row=df.first().asDict()
df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names], ['col', 'length'])
Output: