Dataframe column name with $$ failing in filter condition with parse error
Question:
I have a dataframe with column names "lastname$$" and "firstname$$":
+-----------+----------+----------+------------------+-----+------+
|firstname$$|middlename|lastname$$|languages |state|gender|
+-----------+----------+----------+------------------+-----+------+
|James | |Smith |[Java, Scala, C++]|OH |M |
|Anna |Rose | |[Spark, Java, C++]|NY |F |
|Julia | |Williams |[CSharp, VB] |OH |F |
|Maria |Anne |Jones |[CSharp, VB] |NY |M |
|Jen |Mary |Brown |[CSharp, VB] |NY |M |
|Mike |Mary |Williams |[Python, VB] |OH |M |
+-----------+----------+----------+------------------+-----+------+
I am trying to filter this dataframe for firstname$$ = "James" and am getting the following error.
ParseException                            Traceback (most recent call last)
d:\Users\sample.ipynb Cell 3 in <cell line: 1>()
----> 1 df_sam1 = df_sam.where("firstname$$ == 'James'")
      2 df_sam1.show()

File D:\SparkApp\spark-3.3.1-bin-hadoop3\python\pyspark\sql\dataframe.py:2077, in DataFrame.filter(self, condition)
   2076 if isinstance(condition, str):
-> 2077     jdf = self._jdf.filter(condition)
   2078 elif isinstance(condition, Column):
   2079     jdf = self._jdf.filter(condition._jc)

File D:\SparkApp\spark-3.3.1-bin-hadoop3\python\lib\py4j-0.10.9.5-src.zip\py4j\java_gateway.py:1321, in JavaMember.__call__(self, *args)
   1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)

File D:\SparkApp\spark-3.3.1-bin-hadoop3\python\pyspark\sql\utils.py:196, in capture_sql_exception.<locals>.deco(*a, **kw)
   194 # Hide where the exception came from that shows a non-Pythonic
   195 # JVM exception message.
--> 196 raise converted from None

ParseException:
Syntax error at or near '$'(line 1, pos 9)

== SQL ==
firstname$$ == 'James'
---------^^^
Here is the code I am trying.
df_sam1 = df_sam.where("firstname$$ == 'James'")
df_sam1.show()
I tried renaming the column to firstname and am still getting the same error.
Can anyone take a look and help me resolve this issue?
Thanks,
Bab
Answers:
In Spark SQL, the $ character is not allowed in column names when using SQL string expressions. To work around this, you can either use the DataFrame API to filter the data:
from pyspark.sql.functions import col
df_sam1 = df_sam.filter(col("firstname$$") == 'James')
df_sam1.show()
or rename the column:
df_sam_renamed = (df_sam
    .withColumnRenamed("firstname$$", "firstname")
    .withColumnRenamed("lastname$$", "lastname"))
df_sam1 = df_sam_renamed.where("firstname == 'James'")
df_sam1.show()
Alternatively, mask by the value of 'firstname$$' like so:
df_sam1 = df_sam[df_sam['firstname$$'] == 'James']
df_sam1.show()