Spark SQL pass variable to query
Question:
I have looked all over for an answer to this and tried everything. Nothing seems to work. I'm trying to reference a variable assignment within a spark.sql query in Python. Running Python 3 and Spark 2.3.1.
bkt = 1
prime = spark.sql(s"SELECT ((year(fdr_date))*100)+month(fdr_date) as fdr_year, count(*) as counts
FROM pwrcrv_tmp
where EXTR_CURR_NUM_CYC_DLQ=$bkt
and EXTR_ACCOUNT_TYPE in('PS','PT','PD','PC','HV','PA')
group by ((year(fdr_date))*100)+month(fdr_date)
order by ((year(fdr_date))*100)+month(fdr_date)")
prime.show(50)
The error:
prime = spark.sql(s"SELECT ((year(fdr_date))*100)+month(fdr_date) as fdr_year, count(*) as counts FROM pwrcrv_tmp where EXTR_CURR_NUM_CYC_DLQ=$bkt and EXTR_ACCOUNT_TYPE in('PS','PT','PD','PC','HV','PA') group by ((year(fdr_date))*100)+month(fdr_date) order by ((year(fdr_date))*100)+month(fdr_date)")
^
SyntaxError: invalid syntax
Answers:
I found the correct syntax buried in this databricks post.
https://forums.databricks.com/questions/115/how-do-i-pass-parameters-to-my-sql-statements.html
You add a lowercase f in front of the query and wrap braces around the name of the variable in the query. Since the query spans multiple lines, the string also needs to be triple-quoted.
bkt = 1
prime = spark.sql(f"""SELECT ((year(fdr_date))*100)+month(fdr_date) as fdr_year, count(*) as counts
FROM pwrcrv_tmp
where EXTR_CURR_NUM_CYC_DLQ={bkt}
and EXTR_ACCOUNT_TYPE in('PS','PT','PD','PC','HV','PA')
group by ((year(fdr_date))*100)+month(fdr_date)
order by ((year(fdr_date))*100)+month(fdr_date)""")
prime.show(50)
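To see why this works, note that the f-string is evaluated by Python before spark.sql ever sees it, so the variable's value is baked into the SQL text. A minimal sketch (no Spark session needed) using a shortened version of the query above:

```python
bkt = 1
# Python substitutes {bkt} before the string is passed anywhere
query = f"""SELECT ((year(fdr_date))*100)+month(fdr_date) as fdr_year, count(*) as counts
FROM pwrcrv_tmp
where EXTR_CURR_NUM_CYC_DLQ={bkt}"""
print(query)
# the WHERE clause now contains the literal value 1:
# where EXTR_CURR_NUM_CYC_DLQ=1
```

Spark never knows a variable was involved; it just receives an ordinary SQL string.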
This should work:
p_filename ='some value'
z='some value'
query = "INSERT into default.loginfordetails (filename,logdesc) values ('{}','{}') ".format(p_filename,z)
spark.sql(query)
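One caveat worth knowing (an observation, not part of the original answer): .format splices values into the SQL text verbatim, so a value containing a single quote will break the statement. A plain-Python check:

```python
p_filename = 'some value'
z = "it's broken"  # note the embedded single quote
# .format performs no SQL escaping; the value is pasted in as-is
query = "INSERT into default.loginfordetails (filename,logdesc) values ('{}','{}') ".format(p_filename, z)
print(query)
# the embedded quote in "it's" terminates the SQL string literal early
```

For trusted, internally generated values this is fine; for anything user-supplied, sanitize or escape the value first.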
Since your query spans multiple lines, it is good coding style to use triple quotes (""" """), which lets you avoid backslash line continuations altogether. Continuations cause a lot of parsing issues in Python and PySpark in general, for example when using nbconvert to convert a notebook to a script. Put the variable in braces {} inside the query (an f-string), or use .format(bkt).

Option: braces {} with an f-string

bkt = 1
prime = spark.sql(f"""SELECT ((year(fdr_date))*100)+month(fdr_date) as fdr_year, count(*) as counts
FROM pwrcrv_tmp
where EXTR_CURR_NUM_CYC_DLQ={bkt}
and EXTR_ACCOUNT_TYPE in('PS','PT','PD','PC','HV','PA')
group by ((year(fdr_date))*100)+month(fdr_date)
order by ((year(fdr_date))*100)+month(fdr_date)""")
prime.show(50)
Option: .format()

bkt = 1
prime = spark.sql("""SELECT ((year(fdr_date))*100)+month(fdr_date) as fdr_year, count(*) as counts
FROM pwrcrv_tmp
where EXTR_CURR_NUM_CYC_DLQ={}
and EXTR_ACCOUNT_TYPE in('PS','PT','PD','PC','HV','PA')
group by ((year(fdr_date))*100)+month(fdr_date)
order by ((year(fdr_date))*100)+month(fdr_date)""".format(bkt))
prime.show(50)
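Both options produce exactly the same SQL text, so the choice is purely stylistic. A quick check in plain Python (no Spark needed), using just the WHERE clause:

```python
bkt = 1
# f-string: the variable name appears inside the braces
q1 = f"where EXTR_CURR_NUM_CYC_DLQ={bkt}"
# .format: the braces are empty placeholders filled positionally
q2 = "where EXTR_CURR_NUM_CYC_DLQ={}".format(bkt)
assert q1 == q2
print(q1)  # where EXTR_CURR_NUM_CYC_DLQ=1
```

f-strings require Python 3.6+; .format works on older interpreters as well.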