Pandas keep rows where column values change at least twice

Question:

Good day,

I have a large dataset with columns that keep track of the scores each person obtains. Here is a sample of the dataset:

In:

data = [[7, 10, 10, 10, 10], [17, 10, 10, 10, 10], [18, 8, 10, 10, 10], [20, 10, 10, 9, 9], [25, 9, 8, 8, 7]]
df = pd.DataFrame(data, columns = ['person_id', 'score_1', 'score_2', 'score_3', 'score_4'])
df

Out:

    person_id   score_1 score_2 score_3 score_4
0   7           10      10      10      10
1   17          10      10      10      10
2   18          8       10      10      10
3   20          10      10      9       9
4   25          9       8       8       7

I need to find the rows where the column values change at least twice. For example:

  • Row 0: All values are the same, no change
  • Row 1: All values are the same, no change
  • Row 2: 1st and 2nd values differ, 1 change
  • Row 3: 2nd and 3rd values differ, 1 change
  • Row 4: 1st, 2nd and 4th values differ, 2 changes

That means that only row 4 meets my requirements.

Thus, the desired output would be:

    person_id   score_1 score_2 score_3 score_4
4   25          9       8       8       7

All help greatly appreciated!

Asked By: CarlaVdB

||

Answers:

IIUC, you want to count the number of unique values, per rows, limited to the "score*" columns.

You can use nunique on the rows after getting the correct columns with filter. Then slice:

df[df.filter(like='score').nunique(axis=1).gt(2)]

If you really want the changes from left to right so that A->B->A->B counts for 3 changes:

df[df.filter(like='score').diff(axis=1).ne(0).sum(axis=1).gt(2)]

output:

   person_id  score_1  score_2  score_3  score_4
4         25        9        8        8        7
Answered By: mozway

Mozway’s answer for the case A->B->A->B counts is brillant:

df[df.filter(like='score').diff(axis=1).ne(0).sum(axis=1).gt(2)]

A more plain answer, based on Mozway’s one could be :

df[ # following lines build a boolean mask
    df.filter(like='score')     # keep only scores columns
      .diff(axis=1).iloc[:, 1:] # diff the columns and keep all except the first one whose values are all `NaN`
      .ne(0).sum(axis=1)        # count changes: values that differ from 0 
      .ge(2)                    # finish the boolean mask: changes >= 2
  ]
Answered By: hpchavaz
df[((df.set_index('person_id').diff(axis=1)
     .fillna(0)!=0).sum(1)>1).reset_index(drop=True)
]

  person_id  score_1  score_2  score_3  score_4
4         25        9        8        8        7
Answered By: G.G
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.