How do I determine which columns I need, since ApplyMapping isn't case sensitive?

Question:

I’m updating a PySpark script for a new database model and I’ve run into problems referencing and updating columns. PySpark apparently brings all the columns in as uppercase, and ApplyMapping is not case sensitive, but a left join with another table IS case sensitive. So I end up with multiple columns sharing the same name, one in uppercase and the other in lowercase, and I want to use the SelectFields function on the result.

I’ve tried using the exact (case-sensitive) column names, but it always returns the same result.

I’ve also tried printing the schema; the only difference between the duplicated columns is their case.

testDF = testDF.join(test2DF, "COLUMN", how='left')

test3DF = test3DF.withColumn("column", test3DF['ATTRB'].substr(4, 4))

fullDF = testDF.join(test3DF, (testDF['ID'] == test3DF['ID']))

…..

applymappingTest = ApplyMapping.apply(frame = fullDF, mappings = [
    ('ID', 'string', 'identifier', 'string'),
    ('column', 'string', 'myColumn', 'string'),
    ('COLUMN', 'string', 'someother', 'string')
    ], transformation_ctx = "applymappingTest")

……

selectfieldsTest = SelectFields.apply(frame = applymappingTest, paths = [
     "identifier",
     "myColumn",
], transformation_ctx = "selectfieldsTest")

Expected result:
myColumn is the column with the lowercase name.

Actual result:
myColumn is the column with the uppercase name.

Answers:

You can set the caseSensitive option in applyMapping.

def applyMapping(
    mappings: Seq[Product4[String, String, String, String]],
    caseSensitive: Boolean = true,
    transformationContext: String = "",
    callSite: CallSite = CallSite("Not provided", ""),
    stageThreshold: Long = 0,
    totalThreshold: Long = 0
): DynamicFrame

https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-dynamicframe-class.html#glue-etl-scala-apis-glue-dynamicframe-class-defs-applyMapping
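On the PySpark side, a minimal sketch of the same idea, assuming your Glue version exposes the flag as a case_sensitive keyword on DynamicFrame.apply_mapping (the name here mirrors the Scala signature above and is an assumption; verify it against the Python docs for your Glue version). glueContext and the frame names follow the question:

from awsglue.dynamicframe import DynamicFrame

# ApplyMapping operates on DynamicFrames; fullDF from the question is a
# plain DataFrame, so convert it first.
fullDynF = DynamicFrame.fromDF(fullDF, glueContext, "fullDynF")

applymappingTest = fullDynF.apply_mapping(
    [
        ('ID', 'string', 'identifier', 'string'),
        ('column', 'string', 'myColumn', 'string'),
        ('COLUMN', 'string', 'someother', 'string'),
    ],
    case_sensitive=True,  # assumption: Python-side name for Scala's caseSensitive
    transformation_ctx="applymappingTest",
)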

By the way, if I am not wrong, applyMapping is case sensitive by default, but Spark SQL is case insensitive by default. To make Spark SQL case sensitive, you may use:
spark_session.sql('set spark.sql.caseSensitive=true')
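
A minimal sketch of what that setting changes, in plain PySpark (the duplicate column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.caseSensitive", "true")  # same effect as the SET statement above

# With case sensitivity on, "ID" and "id" are distinct columns and the
# reference below resolves cleanly; with it off, Spark reports an
# ambiguous column reference.
df = spark.createDataFrame([(1, "a")], ["ID", "id"])
df.select("ID").show()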

Answered By: Sandeep Fatangare

I had the same issue, and in the end I found that the SelectFields component in AWS Glue is case sensitive, as are some others like the Join component.

I am wondering if there is a global case-sensitivity parameter to set or add to the job’s initial parameters?

The script generated by AWS Glue Studio doesn’t account for SelectFields being case sensitive, which caused execution issues where some columns were not found:
The error: Cannot resolve column name "XXX" among ()
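
One workaround sketch: normalize every column name to lowercase before SelectFields, so its case-sensitive lookup cannot miss. This assumes a Glue script where glueContext is available; the frame and column names follow the question:

from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import SelectFields

# Round-trip through a DataFrame to lowercase all column names.
df = applymappingTest.toDF()
df = df.toDF(*[c.lower() for c in df.columns])
normalized = DynamicFrame.fromDF(df, glueContext, "normalized")

selectfieldsTest = SelectFields.apply(
    frame=normalized,
    paths=["identifier", "mycolumn"],  # all-lowercase to match the renamed schema
    transformation_ctx="selectfieldsTest",
)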

Answered By: Messaoud Bahaa