How to print only a certain column of DataFrame in PySpark?
Question:
Can one use the actions collect or take to print only a given column of a DataFrame?

This:
df.col.collect()
gives the error:
TypeError: 'Column' object is not callable
and this:
df[df.col].take(2)
gives:
pyspark.sql.utils.AnalysisException: u"filter expression 'col' of type string is not a boolean.;"
Answers:
Use select and show:
df.select("col").show()
or select, flatMap, and collect:
df.select("col").rdd.flatMap(list).collect()
Bracket notation (df[df.col]) is used only for logical slicing, and a column by itself (df.col) is not a distributed data structure but a SQL expression, so it cannot be collected.