Error in defining PySpark data structure variables with a for loop

Question:

I would like to define a set of PySpark schema fields at run time from a list of column names (features).
I tried the code below, but it throws an error. Could you please help with this?

colNames = ['colA', 'colB', 'colC', 'colD', 'colE']  
tsfresh_feature_set = StructType(
      [
        
        StructField('field1', StringType(), True),     
        StructField('field2', StringType(), True),        
        StructField(item, DoubleType(), False) for item in colNames
        
      ]
    )

Error that I get:

SyntaxError: invalid syntax
  File "<command-621368>", line 9
    StructField(item, DoubleType(), False) for item in colNames
                                             ^
SyntaxError: invalid syntax
Asked By: Arun


Answers:

You are trying to use a list comprehension to build the schema for your DataFrame from a list of column names:

StructField(item, DoubleType(), False) for item in colNames

But the problem is with the syntax:

  1. Wrap the comprehension in [] so it produces a list:
[StructField(item, DoubleType(), False) for item in colNames]
  2. Unpack the elements of that list into the outer list using *:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

colNames = ['colA', 'colB', 'colC', 'colD', 'colE']
tsfresh_feature_set = StructType(
    [
        StructField('field1', StringType(), True),
        StructField('field2', StringType(), True),
        *[StructField(item, DoubleType(), False) for item in colNames]
    ]
)
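As a quick check (a minimal sketch, assuming the imports above and an active SparkSession named spark), the resulting schema contains the two fixed string fields followed by one DoubleType field per entry in colNames, and can be used like any other schema:

# Inspect the generated schema: the two fixed fields come first,
# followed by one DoubleType field per entry in colNames
print(tsfresh_feature_set.fieldNames())
# ['field1', 'field2', 'colA', 'colB', 'colC', 'colD', 'colE']

# The schema can be passed to createDataFrame as usual, e.g. to build
# an empty DataFrame (assumes an active SparkSession named `spark`)
df = spark.createDataFrame([], schema=tsfresh_feature_set)
df.printSchema()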
Answered By: arudsekaberne