PySpark groupBy DataFrame without aggregation or count

Question:

Is it possible to iterate through a PySpark groupBy DataFrame without aggregation or count?

For example, this is how it is done in pandas:

# df2 is a pandas GroupBy object, e.g. df2 = df.groupby("some_col")
for i, d in df2:
    mycode ....
Is there a different way to iterate over a groupBy in PySpark, or do I have to use aggregation and count?
Asked By: Zhafari Irsyad


Answers:

When we do a groupBy we end up with a RelationalGroupedDataset (a GroupedData object in PySpark), which is a fancy name for a DataFrame that has a grouping specified but needs the user to specify an aggregation before it can be queried further.

When you try to call any DataFrame method, such as show(), on that grouped data, it throws an error:

AttributeError: 'GroupedData' object has no attribute 'show'
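
For instance, a minimal sketch that reproduces this (the DataFrame and column names here are only placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["some_col", "val"])

grouped = df.groupBy("some_col")  # GroupedData, not a DataFrame
print(type(grouped))              # <class 'pyspark.sql.group.GroupedData'>
grouped.show()                    # AttributeError: 'GroupedData' object has no attribute 'show'
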
Answered By: DataWrangler

At best you can use .first or .last to get the respective values from the groupBy, but not all of the values in the way you can in pandas.

For example:

from pyspark.sql import functions as f

# keep one row per group: the first col1 and col2 value in each group
df.groupBy(df['some_col']).agg(f.first(df['col1']), f.first(df['col2'])).show()

Since there is a basic difference between the way data is handled in pandas and Spark, not all functionality can be used in the same way.
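
If what you actually need is every value of a column gathered per group into a single row, a related option is the collect_list aggregation (a minimal sketch, reusing the placeholder columns from the snippet above):

from pyspark.sql import functions as f

# collect every col1 value of each group into one array column
df.groupBy(df['some_col']).agg(f.collect_list(df['col1']).alias('col1_values')).show()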

There are a few workarounds to get what you want, like:

For the diamonds DataFrame:

+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|  1| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
|  2| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
|  3| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
|  4| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
|  5| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+

You can use:

import pyspark.sql.functions as f

# collect the distinct group keys to the driver
l = [x.cut for x in diamonds.select("cut").distinct().collect()]

# a "group" is simply the rows that match one key
def groups(df, i):
    return df.filter(f.col("cut") == i)

# for multi-column grouping
def groups_multi(df, i):
    return df.filter((f.col("cut") == i) & (f.col("color") == 'E'))  # use | for or

for i in l:
    groups(diamonds, i).show(2)

Output:

+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|    cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|  2| 0.21|Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
|  4| 0.29|Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 2 rows

+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|  cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
|  1| 0.23|Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
| 12| 0.23|Ideal|    J|    VS1| 62.8| 56.0|  340|3.93| 3.9|2.46|
+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+

...

In the groups function you can decide what kind of grouping you want for the data. It is a simple filter condition, but it will get you all the groups separately.
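
For instance, a minimal sketch of iterating over the groups and handling each one separately, much like the pandas loop in the question (my_group_logic is a hypothetical placeholder for your own per-group code):

import pyspark.sql.functions as f

def my_group_logic(key, group_df):
    # hypothetical placeholder: replace with whatever you need to do per group
    print(key, group_df.count())

# one filtered DataFrame per distinct key, processed in a plain Python loop
for key in [x.cut for x in diamonds.select("cut").distinct().collect()]:
    group_df = diamonds.filter(f.col("cut") == key)
    my_group_logic(key, group_df)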

Answered By: Andy_101

Yes. Don’t use groupBy; use select with distinct instead.

df.select("col1", "col2", ...).distinct()

Then you can do any number of things to iterate through your DataFrame, for example (a combined sketch follows the list):
1- Convert the PySpark DataFrame to pandas.

DataFrame.toPandas()

2- If your DataFrame is small, you can collect it into a list of Rows.

DataFrame.collect()

3- Apply a method with foreach(your_method).

DataFrame.foreach(your_method)

4- Convert to RDD and use map with a lambda.

DataFrame.rdd.map(lambda x: your_method(x))
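
Putting these together, a minimal sketch of iterating over the distinct keys and processing the matching rows (the column name and your_method are placeholders):

# distinct keys, collected to the driver as a small list of Rows
keys = df.select("col1").distinct().collect()

for row in keys:
    # the rows belonging to this key, as their own DataFrame
    group = df.filter(df["col1"] == row["col1"])
    group.foreach(your_method)  # or: group.toPandas(), group.collect(), ...
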
Answered By: Raza