Pandas – select rows whose column value appears `n` times

Question:

I am trying to pick a random row from among the rows of a DataFrame whose "letters" value appears exactly n times in that DataFrame, but I am running into a problem.
This is the code I’m using, with n = 2.


import numpy as np
import pandas as pd

d = {
    "letters": ["a", "b", "c", "a", "b", "a", "d", "d"],
    "one": [1, 1, 1, 1, 1, 1, 1, 1],
    "two": [2, 2, 2, 2, 2, 2, 2, 2],
}
df = pd.DataFrame(d)
s = df["letters"].value_counts()
df2 = df.loc[np.where(s.to_numpy() == 2)]
rand = df2.sample(n=1, random_state = 2)

This looks fine at first glance, but inspecting df2 shows that df2["letters"] has two entries, "b" and "c", and "c" clearly does not appear twice in the original df.

I guess the error is in how I express "look only at the entries that appear n times", but I can’t wrap my head around it.

What is going on here, and how can I fix the problem?

Asked By: Antonio Carnevali


Answers:

Use Series.map to map each value of the original letters column to its count, then filter on the resulting boolean mask:

s = df["letters"].value_counts()
df2 = df[df["letters"].map(s) == 2]

print(df2)
  letters  one  two
1       b    1    2
4       b    1    2
6       d    1    2
7       d    1    2
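
As for what goes wrong in the question's code: value_counts returns a Series indexed by letter, with one entry per distinct letter, so np.where on it yields positions inside that counts Series, and df.loc then treats those positions as row labels of df. The sketch below illustrates the mismatch and shows an equivalent filter built with groupby(...).transform("size"); the names counts, positions, and per_row are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "letters": ["a", "b", "c", "a", "b", "a", "d", "d"],
    "one": [1] * 8,
    "two": [2] * 8,
})

# One entry per distinct letter, sorted by count (a: 3, then b and d: 2, c: 1).
counts = df["letters"].value_counts()

# These are positions within `counts`, not row labels of df.
positions = np.where(counts.to_numpy() == 2)[0]

# Using them as row labels selects rows 1 and 2 of df, i.e. "b" and "c".
wrong = df.loc[positions]

# Per-row occurrence count, aligned with df's own index.
per_row = df.groupby("letters")["letters"].transform("size")
df2 = df[per_row == 2]
```

Here per_row has the same index as df, so the comparison produces a mask that lines up row by row, just like the map-based version.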

Then, if you need one random row per letter, use:

rand = df2.groupby('letters').sample(n=1, random_state=2)
print(rand)
  letters  one  two
4       b    1    2
6       d    1    2
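
If a single random row across all matching letters is enough, as in the question, sample can be called on the filtered frame directly. A minimal sketch, assuming a hypothetical helper rows_with_count (not a pandas function) that wraps the map-based filter for an arbitrary n:

```python
import pandas as pd


def rows_with_count(df, column, n):
    """Return the rows of df whose value in `column` occurs exactly n times.

    Hypothetical helper for illustration; it wraps the value_counts + map filter.
    """
    counts = df[column].value_counts()
    return df[df[column].map(counts) == n]


df = pd.DataFrame({
    "letters": ["a", "b", "c", "a", "b", "a", "d", "d"],
    "one": [1] * 8,
    "two": [2] * 8,
})

df2 = rows_with_count(df, "letters", 2)

# One random row drawn from all rows whose letter appears exactly twice.
rand_one = df2.sample(n=1, random_state=2)
```

It is worth checking df2.empty before sampling, since there may be no value that occurs exactly n times.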
Answered By: jezrael