Ratio after a groupby in pyspark
Question:
I have a PySpark DataFrame like this:
+------+-------------+
|Gender|     Language|
+------+-------------+
|  Male|      Spanish|
|Female|      English|
|Female|       Indian|
|Female|      Spanish|
|Female|       Indian|
|  Male|      English|
|  Male|      English|
|Female|Latin Spanish|
|  Male|      Spanish|
|Female|      English|
|  Male|       Indian|
|  Male|      Catalan|
|  Male|      Spanish|
|  Male|      Russian|
|  Male|      Spanish|
|  Male|      Spanish|
|Female|      Russian|
|Female|      Spanish|
|  Male|      English|
|  Male|      Spanish|
+------+-------------+
I want to know the male:female ratio (of counts) for each Language. How can I do this in PySpark?
I thought of taking the counts and then looping over each language:
counts = df.groupby('Language', 'Gender').count()
# loop across all languages and both genders
counts.filter((counts.Language == 'Italian') & (counts.Gender == 'Male')).show()
But is there a more elegant way to do this?
Answers:
You can count with a condition:
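from pyspark.sql import functions as f

# when() without otherwise() is null for non-matching rows, and count() ignores nulls
df.groupBy('Language').agg(
    f.count(f.when(f.col('Gender') == 'Male', True)).alias('Male'),
    f.count(f.when(f.col('Gender') == 'Female', True)).alias('Female'),
).show(truncate=False)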
+-------------+----+------+
|Language     |Male|Female|
+-------------+----+------+
|Indian       |1   |2     |
|English      |3   |2     |
|Spanish      |6   |2     |
|Latin Spanish|0   |1     |
|Catalan      |1   |0     |
|Russian      |1   |1     |
+-------------+----+------+
Or use pivot. Note that a Language with no rows for one Gender gets null in that column, not 0:
(
    df.groupBy('Language')
    .pivot('Gender')
    .count()
    .show(truncate=False)
)
+-------------+------+----+
|Language     |Female|Male|
+-------------+------+----+
|Indian       |2     |1   |
|English      |2     |3   |
|Spanish      |2     |6   |
|Catalan      |null  |1   |
|Russian      |1     |1   |
|Latin Spanish|1     |null|
+-------------+------+----+