PySpark createDataFrame TypeError: StructType can not accept object 'id' in type <class 'str'>

Question:

An API call returns a dict response similar to the output below:

{'Account': {'id': 123, 'externalIdentifier': None, 'name':
'test acct', 'accountNumber': None, 'Rep': None,
'organizationId': 123, 'streetAddress': '123 Main Road',
'streetAddressCity': 'Town City', 'streetAddressState': 'Texas',
'streetAddressZipCode': '76123', 'contact': [{'id': 10001, 'name':
'Test test', 'extID': '9999999999'}]}}

I am trying to build a DataFrame from the returned Account record, but I keep getting TypeError: StructType can not accept object 'id' in type <class 'str'>. I have tried other approaches, including adding .item(), mapping with a lambda, and converting types, but I always come back to the same error.

account_schema = StructType([
    StructField('id', StringType(), True),
    StructField('externalIdentifier', StringType(), True),
    StructField('name', StringType(), True),
    StructField('Account_number', StringType(), True),
    StructField('Rep', StructType([
        StructField('firstName', StringType(), True),
        StructField('lastName', StringType(), True),
        StructField('email', StringType(), True),
        StructField('id', StringType(), True),
    ])),
    StructField('streetAddress', StringType(), True),
    StructField('streetAddressCity', StringType(), True),
    StructField('streetAddressState', StringType(), True),
    StructField('streetAddressZipCode', StringType(), True)
])


df = spark.createDataFrame(account_response['Account'], schema=account_schema)

Any direction would be appreciated.

Asked By: Alen Giliana


Answers:

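The reason is that the data argument should be an RDD or an iterable of rows, such as a list, as per the official documentation. A bare dict is itself iterable, but iterating over it yields its keys, so Spark treats the string 'id' as the first row. That is exactly the object the error message complains about.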

If you enclose your data in square brackets, you get a Spark DataFrame with one record.

spark.createDataFrame(data=[account_response['Account']], schema=account_schema)
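To see why the brackets matter, it can help to look at what each form yields when iterated. The snippet below is a minimal pure-Python sketch (no Spark needed) using a shortened, illustrative version of the Account dict. It only shows the iteration behaviour that leads to the error and does not reproduce Spark's internals.

```python
# Shortened, illustrative version of the API response from the question.
account_response = {'Account': {'id': 123, 'name': 'test acct'}}
account = account_response['Account']

# Iterating a dict yields its keys, so each "row" is just a string.
rows_from_dict = list(account)

# Wrapping the dict in a list yields one row: the dict itself.
rows_from_list = list([account])

print(rows_from_dict)   # ['id', 'name']
print(rows_from_list)   # [{'id': 123, 'name': 'test acct'}]
```

The first "row" in the unwrapped case is the string 'id', which matches the object named in the TypeError.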

Full working example:

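from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Reuse the active session if one exists (e.g. in a notebook), else create one.
spark = SparkSession.builder.getOrCreate()

account_response = {'Account': {'id': 123, 'externalIdentifier': None, 'name': 'test acct', 'accountNumber': None, 'Rep': None, 'organizationId': 123, 'streetAddress': '123 Main Road', 'streetAddressCity': 'Town City', 'streetAddressState': 'Texas', 'streetAddressZipCode': '76123', 'contact': [{'id': 10001, 'name': 'Test test', 'extID': '9999999999'}]}}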

account_schema = StructType([
    StructField('id', StringType(), True),
    StructField('externalIdentifier', StringType(), True),
    StructField('name', StringType(), True),
    StructField('Account_number', StringType(), True),    
    StructField('Rep', StructType([
        StructField('firstName', StringType(), True),
        StructField('lastName', StringType(), True),
        StructField('email', StringType(), True),
        StructField('id', StringType(), True),
    ])),
    StructField('streetAddress', StringType(), True),   
    StructField('streetAddressCity', StringType(), True),   
    StructField('streetAddressState', StringType(), True),   
    StructField('streetAddressZipCode', StringType(), True)  
])


df = spark.createDataFrame(data=[account_response['Account']], schema=account_schema)

df.show(truncate=False)

Output:

+---+------------------+---------+--------------+----+-------------+-----------------+------------------+--------------------+
|id |externalIdentifier|name     |Account_number|Rep |streetAddress|streetAddressCity|streetAddressState|streetAddressZipCode|
+---+------------------+---------+--------------+----+-------------+-----------------+------------------+--------------------+
|123|null              |test acct|null          |null|123 Main Road|Town City        |Texas             |76123               |
+---+------------------+---------+--------------+----+-------------+-----------------+------------------+--------------------+
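
One thing worth checking in the output above: the schema declares 'Account_number', while the response uses the key 'accountNumber'. As far as I can tell, Spark matches dict rows to schema fields by name and fills a field that has no matching key with null, so that column would stay null even if the API returned a value. The sketch below is a plain-Python check of the two name lists. The variable names are illustrative, and with a real StructType the field list could be taken from account_schema.fieldNames() instead of being typed out.

```python
# Top-level field names as declared in the schema above.
schema_fields = [
    'id', 'externalIdentifier', 'name', 'Account_number', 'Rep',
    'streetAddress', 'streetAddressCity', 'streetAddressState',
    'streetAddressZipCode',
]

# The Account record from the API response.
account = {
    'id': 123, 'externalIdentifier': None, 'name': 'test acct',
    'accountNumber': None, 'Rep': None, 'organizationId': 123,
    'streetAddress': '123 Main Road', 'streetAddressCity': 'Town City',
    'streetAddressState': 'Texas', 'streetAddressZipCode': '76123',
    'contact': [{'id': 10001, 'name': 'Test test', 'extID': '9999999999'}],
}

# Schema fields with no matching key in the record.
missing_in_data = [name for name in schema_fields if name not in account]

# Record keys that the schema does not mention.
unused_in_schema = [key for key in account if key not in schema_fields]

print(missing_in_data)    # ['Account_number']
print(unused_in_schema)   # ['accountNumber', 'organizationId', 'contact']
```

Renaming the schema field to 'accountNumber' would line the two up; 'organizationId' and 'contact' would need their own fields if they should appear in the DataFrame.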
Answered By: Azhar Khan