how I can use count and groupby at the same time?

Question:

I want to count and groupby based on count, how I can do that?

For example I want to groupby col1 on rows whose count is less than 3.

Col1          col2
"sal"         "salis"
"sal"         "salmon"
"bar"         "baran"
"dar"         "darya"
"sal"         "sald"
"bar"         "bard"

result

groupby   collect
"bar"      ["baran","bard"]
"dar"      ["darya"]

I know the following code gives me count of each element:

df1_us2.groupBy('col1').count().orderBy('count',ascending=False)

and use the following for groupby. but how I can combine these two?

df1_us2.groupBy('col1').agg(collect_list('col2').alias('collect'))
Asked By: user15649753

||

Answers:

First groupby and aggregate col2 using collect_list then filter the rows where size of list is less than 3

(
    df1_us2
    .groupby('col1')
    .agg(F.collect_list('col2').alias('col2'))
    .where(F.size('col2') < 3)
)

+-----+-----------------+
| col1|             col2|
+-----+-----------------+
|"bar"|["baran", "bard"]|
|"dar"|        ["darya"]|
+-----+-----------------+
Answered By: Shubham Sharma
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.