Python functions such as max() don't work in a PySpark application

Question:

The Python function max(3, 6) works in the pyspark shell. But if it is put in an application and submitted, it throws an error:
TypeError: _() takes exactly 1 argument (2 given)

Asked By: user3610141


Answers:

It looks like you have an import conflict in your application, most likely due to a wildcard import from pyspark.sql.functions. Importing max from that module rebinds the name, so the built-in is no longer reachable as max:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.10 (default, Oct 19 2015 18:04:42)
SparkContext available as sc, HiveContext available as sqlContext.

In [1]: max(1, 2)
Out[1]: 2

In [2]: from pyspark.sql.functions import max

In [3]: max(1, 2)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-bb133f5d83e9> in <module>()
----> 1 max(1, 2)

TypeError: _() takes exactly 1 argument (2 given)
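
To confirm that a name has been shadowed, you can compare what it currently points to against the real built-in. The following is only a minimal sketch: is_shadowed and fake_sql_max are hypothetical names invented for illustration, and fake_sql_max merely stands in for pyspark.sql.functions.max so the example runs without Spark.

```python
try:
    import builtins                  # Python 3
except ImportError:
    import __builtin__ as builtins   # Python 2


def is_shadowed(name, namespace):
    """Return True if `name` in `namespace` is bound to something other than the built-in."""
    return name in namespace and namespace[name] is not getattr(builtins, name)


# Hypothetical stand-in for pyspark.sql.functions.max (takes a single column argument).
def fake_sql_max(col):
    return "Column<max(%s)>" % col


clean_ns = {}                        # nothing imported over the built-in
shadowed_ns = {"max": fake_sql_max}  # what a conflicting import leaves behind

print(is_shadowed("max", clean_ns))     # False
print(is_shadowed("max", shadowed_ns))  # True
```

In an application you would pass globals() as the namespace, e.g. is_shadowed("max", globals()), right before the failing call.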

Unless you work in a relatively limited scope, it is best to either prefix:

from pyspark.sql import functions as sqlf

max(1, 2)
## 2

sqlf.max("foo")
## Column<max(foo)>

or alias:

from pyspark.sql.functions import max as max_

max(1, 2)
## 2

max_("foo")
## Column<max(foo)>
Answered By: zero323

If you get this error even after verifying that you have NOT used from pyspark.sql.functions import *, then try the following:

Use import builtins as py_builtin (on Python 2 the module is named __builtin__).
Then call the built-in function with that prefix.
E.g.: py_builtin.max(3, 6)
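
A rough sketch of this workaround is shown here. The module-level max below is a hypothetical stand-in for the Spark function that shadows the built-in (it is not the real pyspark implementation), and the try/except is an assumption added so the snippet runs on both Python 2 and Python 3.

```python
try:
    import builtins as py_builtin      # Python 3
except ImportError:
    import __builtin__ as py_builtin   # Python 2


# Hypothetical stand-in that shadows the built-in, as a conflicting import would.
def max(col):
    return "Column<max(%s)>" % col


# The plain name now refers to the stand-in, which accepts only one argument...
column_expr = max("foo")

# ...while the prefixed name always reaches the real built-in.
largest = py_builtin.max(3, 6)
print(largest)  # 6
```

The prefix keeps working no matter what later imports rebind max or min to, which is why it helps when the conflicting import is not under your control.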

*Adding David Arenburg’s and user3610141’s comments as an answer, as that is what helped me fix my problem in Databricks, where there was a name collision between PySpark's min() and max() and the Python built-ins.

Answered By: DeadLock