Dataframe column name with $$ failing in filter condition with parse error

Question:

I have a dataframe with the column names "firstname$$" and "lastname$$":

+-----------+----------+----------+------------------+-----+------+
|firstname$$|middlename|lastname$$|languages         |state|gender|
+-----------+----------+----------+------------------+-----+------+
|James      |          |Smith     |[Java, Scala, C++]|OH   |M     |
|Anna       |Rose      |          |[Spark, Java, C++]|NY   |F     |
|Julia      |          |Williams  |[CSharp, VB]      |OH   |F     |
|Maria      |Anne      |Jones     |[CSharp, VB]      |NY   |M     |
|Jen        |Mary      |Brown     |[CSharp, VB]      |NY   |M     |
|Mike       |Mary      |Williams  |[Python, VB]      |OH   |M     |
+-----------+----------+----------+------------------+-----+------+

I am trying to filter this dataframe for firstname$$ == 'James' and am getting the following error.

---------------------------------------------------------------------------
ParseException                            Traceback (most recent call last)
d:\Users\sample.ipynb Cell 3 in <cell line: 1>()
----> 1 df_sam1 = df_sam.where("firstname$$ == 'James'")
      2 df_sam1.show()

File D:\SparkApp\spark-3.3.1-bin-hadoop3\python\pyspark\sql\dataframe.py:2077, in DataFrame.filter(self, condition)
   2076 if isinstance(condition, str):
-> 2077     jdf = self._jdf.filter(condition)
   2078 elif isinstance(condition, Column):
   2079     jdf = self._jdf.filter(condition._jc)

File D:\SparkApp\spark-3.3.1-bin-hadoop3\python\lib\py4j-0.10.9.5-src.zip\py4j\java_gateway.py:1321, in JavaMember.__call__(self, *args)
   1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)

File D:\SparkApp\spark-3.3.1-bin-hadoop3\python\pyspark\sql\utils.py:196, in capture_sql_exception.<locals>.deco(*a, **kw)
   193 if not isinstance(converted, UnknownException):
   194     # Hide where the exception came from that shows a non-Pythonic
   195     # JVM exception message.
-> 196     raise converted from None
   197 else:
   198     raise

ParseException:
Syntax error at or near '$'(line 1, pos 9)

== SQL ==
firstname$$ == 'James'
---------^^^

Here is the code I am trying:

df_sam1 = df_sam.where("firstname$$ == 'James'")
df_sam1.show()

I tried to rename the column to firstname and am still getting the same error.

Can anyone look into this and help me resolve the issue?

Thanks,
Bab

Asked By: Bab


Answers:

In Spark SQL, the $ character is not allowed in column names when using SQL expressions. To work around this, you can either use the DataFrame API functions to filter the data:

from pyspark.sql.functions import col

# col() resolves the column directly, so the name never goes through the SQL parser
df_sam1 = df_sam.filter(col("firstname$$") == 'James')
df_sam1.show()

or rename the columns first:

df_sam_renamed = (
    df_sam.withColumnRenamed("firstname$$", "firstname")
          .withColumnRenamed("lastname$$", "lastname")
)

df_sam1 = df_sam_renamed.where("firstname == 'James'")
df_sam1.show()
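
If many columns carry the $$ suffix, renaming them one at a time gets tedious. Here is a minimal sketch (assuming, as in the question, that every affected name simply ends in "$$") that strips the suffix from all column names at once:

# Build cleaned names by stripping "$$" from each existing column name,
# then apply them all at once; toDF returns a new dataframe with those names.
cleaned_names = [c.replace("$$", "") for c in df_sam.columns]
df_sam_renamed = df_sam.toDF(*cleaned_names)

df_sam_renamed.where("firstname == 'James'").show()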
Answered By: angwrk

Alternatively, you can mask on the value of 'firstname$$', like so:

df_sam[df_sam['firstname$$'] == 'James'].show()
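
Note that indexing a PySpark DataFrame with a boolean Column like this is shorthand for calling .filter(), so it avoids the SQL string parser in the same way the col() approach above does.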
Answered By: Tanner Firl