How to print only a certain column of a DataFrame in PySpark?

Question:

Can the actions collect or take be used to print only a given column of a DataFrame?

This

df.col.collect()

gives the error

TypeError: 'Column' object is not callable

and this:

df[df.col].take(2)

gives

pyspark.sql.utils.AnalysisException: u"filter expression 'col' of type string is not a boolean.;"

Asked By: mar tin

Answers:

select and show:

df.select("col").show()

or select, flatMap, collect:

df.select("col").rdd.flatMap(list).collect()

Bracket notation with a column expression (df[df.col]) performs logical slicing, that is, filtering, which is why it demands a boolean column. A column by itself (df.col) is not a distributed data structure but a SQL expression, so it cannot be collected.

Answered By: zero323