How can I select rows randomly in proportion to the number of unique values for each group in Python?

Question:

I would like to random select rows proportionate to the number of unique values in column "ID" grouping by column "Team". Further, I would like to only retrieve 8 total rows. I have:

|  ID   |  Team |  Color       |
| ----- | ----- | ------------ |
|  1    |  A    |  Blue        |
|  2    |  B    |  Red         |
|  2    |  B    |  Green       |
|  3    |  A    |  Blue        |
|  6    |  C    |  Red         |
|  1    |  B    |  Yellow      |
|  2    |  B    |  Green       |
|  9    |  A    |  Blue        |
|  6    |  C    |  Red         |
|  1    |  B    |  Yellow      |
|  9    |  A    |  Blue        |
|  1    |  A    |  Purple      |

Only the proportions are looking at unique values. The rows pulled do not necessarily need to be unique in anyway. Using the above table the proportions would be:

|  Team  | Unique IDs |  Proportion |  Number selected |
| ------ | ---------- | ----------- | ---------------- |
|  A     |    3       |  0.500      |       4          |
|  B     |    2       |  0.333      |       3          |
|  C     |    1       |  0.167      |       1          |

So since I want 8 total rows selected proportionately, I should end up with something like the following:

|  ID   |  Team |  Color       |
| ----- | ----- | ------------ |
|  1    |  A    |  Blue        |
|  3    |  A    |  Blue        |
|  9    |  A    |  Blue        |
|  1    |  A    |  Purple      |
|  2    |  B    |  Green       |
|  2    |  B    |  Red         |
|  1    |  B    |  Yellow      |
|  6    |  C    |  Red         |
Asked By: brandooo23

||

Answers:

  • Calculate unique_counts – the number of unique 'ID's in each 'Team' group,
  • convert unique_counts into nums_selected – the number of elements for each group that have to be selected
  • use nums_selected to .sample this many elements from each 'Team' group:
n_total = 8
unique_counts = df.groupby('Team')['ID'].agg('nunique')
nums_selected= np.floor(unique_counts / unique_counts.sum() * n_total).astype(int)  # rounded down

df.groupby('Team', group_keys=False).apply(      # for each 'Team' group:
    lambda x: x.sample(n=nums_selected[x.name],  # sample this many rows
                       replace=True)             # (with replacement)
    )

Note:

The result can contain less elements than n_total because nums_selected are rounded down when converting from float to int. However, you may use any method to do this conversion: np.ceil, pd.Series.round, or any other method you choose to your liking.

Answered By: Vladimir Fokow
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.