Back-ticks in DataFrame.colRegex?
Question:
For PySpark, I find back-ticks enclosing regular expressions for DataFrame.colRegex() here, here, and in this SO question. Here is the example from the DataFrame.colRegex doc string:
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()
+----+
|Col2|
+----+
|   1|
|   2|
|   3|
+----+
The answer to the SO question doesn’t show back-ticks for Scala. It refers to the Java documentation for the Pattern class, but that doesn’t explain back-ticks.
This page indicates the use of back-ticks in Python to represent the string representation of the adorned variable, but that doesn’t apply to a regular expression.
What is the explanation for the back-ticks?
Answers:
The back-ticks are used to delimit the column name in case it includes special characters. For example, if you had a column called column-1 and you tried
SELECT column-1 FROM mytable
you would probably get a non-existent column ‘column’ error, because the interpreter treats the statement as SELECT (column) - 1 FROM mytable. Instead, you can delimit the column name with back-ticks to get around that issue:
SELECT `column-1` FROM mytable
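To see the difference in practice, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. SQLite is not Spark SQL, but it also accepts back-tick quoted identifiers (for MySQL compatibility), so it can illustrate the same parsing behaviour. The table name, the column name col-1, and the helper try_query are hypothetical and chosen only for this illustration.

```python
import sqlite3

# Hypothetical table and column names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (`col-1` INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?)", [(1,), (2,), (3,)])


def try_query(sql):
    """Run a query and return its rows, or the error message if it fails."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"


# Unquoted: parsed as the expression (col) - 1, and no column named col exists.
unquoted = try_query("SELECT col-1 FROM mytable")

# Back-ticked: the whole string col-1 is read as a single identifier.
quoted = try_query("SELECT `col-1` FROM mytable")

print(unquoted)
print(quoted)
```

The first query should report an error about a missing column, while the second should return the three stored rows. Whether Spark reports the failure with the same wording is not something this sketch shows.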