Ratio after a groupby in pyspark

Question

I have a PySpark DataFrame like this:

+------------+-------------+
|Gender      |     Language|
+------------+-------------+
|        Male|      Spanish|
|      Female|      English|
|      Female|       Indian|
|      Female|      Spanish|
|      Female|       Indian|
|        Male|      English|
|        Male|      English|
|      Female|Latin Spanish|
|        Male|      Spanish|
|      Female|      English|
|        Male|       Indian|
|        Male|      Catalan|
|        Male|      Spanish|
|        Male|      Russian|
|        Male|      Spanish|
|        Male|      Spanish|
|      Female|      Russian|
|      Female|      Spanish|
|        Male|      English|
|        Male|      Spanish|
+------------+-------------+

I want to know the male:female ratio (of counts) for each Language. How can I do this in PySpark?

I thought of taking the counts and then looping over each language:

counts = df.groupby('Language', 'Gender').count()
# loop across all languages and both genders
counts.filter((counts.Language == 'Italian') & (counts.Gender == 'Male')).show()

But is there a more elegant way to do this?

Asked By: theodre7

Answer 1

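You can count with a condition.

from pyspark.sql import functions as f

# when() yields null for non-matching rows, and count() skips nulls,
# so each aggregate counts only the rows of one gender
(df.groupBy('Language')
   .agg(
     f.count(f.when(f.col('Gender') == 'Male', True)).alias('Male'),
     f.count(f.when(f.col('Gender') == 'Female', True)).alias('Female')
   )
   .show(truncate=False))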

+-------------+----+------+
|Language     |Male|Female|
+-------------+----+------+
|Indian       |1   |2     |
|English      |3   |2     |
|Spanish      |6   |2     |
|Latin Spanish|0   |1     |
|Catalan      |1   |0     |
|Russian      |1   |1     |
+-------------+----+------+

Or use pivot.

# One column per distinct Gender value; combinations with no rows come back as null
(df.groupBy('Language')
   .pivot('Gender')
   .count()
   .show(truncate=False))

+-------------+------+----+
|Language     |Female|Male|
+-------------+------+----+
|Indian       |2     |1   |
|English      |2     |3   |
|Spanish      |2     |6   |
|Catalan      |null  |1   |
|Russian      |1     |1   |
|Latin Spanish|1     |null|
+-------------+------+----+

Answered By: Lamanus
