Deleting consecutive rows that are identical to the previous row for specific columns

Question:

I have a dataframe similar to the following.

import pandas as pd

data = pd.DataFrame({'ind': [111,222,333,444,555,666,777,888,999,000], 
                     'col1': [1,2,2,2,3,4,5,5,6,7], 
                     'col2': [9,2,2,2,9,9,5,5,9,9], 
                     'col3': [11,2,2,2,11,11,5,5,11,11], 
                     'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})

There is an index ind, three columns col1, col2 and col3, and another column holding a value val. Within col1, col2 and col3 there are a number of rows that are exactly the same as the previous row; for instance, the rows with index 333 and 444 are the same as 222. My actual data set is larger, but what I need to do is delete all rows whose values in a given set of columns (col1, col2 and col3 here) are exactly the same as those of the immediately preceding row.

This would give me a dataframe like this, with indices 333, 444 and 888 removed:

data_clean = pd.DataFrame({'ind': [111,222,555,666,777,999,000], 
                     'col1': [1,2,3,4,5,6,7], 
                     'col2': [9,2,9,9,5,9,9], 
                     'col3': [11,2,11,11,5,11,11], 
                     'val': ['a', 'b', 'e', 'f', 'g', 'i', 'j']})

What is the best way to go about this for a larger dataframe?

Asked By: Bas R

Answers:

You can use shift and any for boolean indexing:

cols = ['col1', 'col2', 'col3']
out = data[data[cols].ne(data[cols].shift()).any(axis=1)]
# DeMorgan's equivalent:
# out = data[~data[cols].eq(data[cols].shift()).all(axis=1)]

Output:

   ind  col1  col2  col3 val
0  111     1     9    11   a
1  222     2     2     2   b
4  555     3     9    11   e
5  666     4     9    11   f
6  777     5     5     5   g
8  999     6     9    11   i
9    0     7     9    11   j

Intermediates

# shifted dataset
data[cols].shift()

   col1  col2  col3
0   NaN   NaN   NaN
1   1.0   9.0  11.0
2   2.0   2.0   2.0
3   2.0   2.0   2.0
4   2.0   2.0   2.0
5   3.0   9.0  11.0
6   4.0   9.0  11.0
7   5.0   5.0   5.0
8   5.0   5.0   5.0
9   6.0   9.0  11.0

# comparison
data[cols].ne(data[cols].shift())

    col1   col2   col3
0   True   True   True
1   True   True   True
2  False  False  False
3  False  False  False
4   True   True   True
5   True  False  False
6   True   True   True
7  False  False  False
8   True   True   True
9   True  False  False

# aggregation
data[cols].ne(data[cols].shift()).any(axis=1)

0     True
1     True
2    False
3    False
4     True
5     True
6     True
7    False
8     True
9     True
dtype: bool
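
For reuse on other dataframes, the mask above can be wrapped in a small helper. This is just a sketch; the function name drop_consecutive_duplicates is my own, not a pandas API:

```python
import pandas as pd

def drop_consecutive_duplicates(df, cols):
    """Keep only rows that differ from the immediately preceding row in `cols`."""
    mask = df[cols].ne(df[cols].shift()).any(axis=1)
    return df[mask]

data = pd.DataFrame({'ind': [111, 222, 333, 444, 555, 666, 777, 888, 999, 0],
                     'col1': [1, 2, 2, 2, 3, 4, 5, 5, 6, 7],
                     'col2': [9, 2, 2, 2, 9, 9, 5, 5, 9, 9],
                     'col3': [11, 2, 2, 2, 11, 11, 5, 5, 11, 11],
                     'val': list('abcdefghij')})

out = drop_consecutive_duplicates(data, ['col1', 'col2', 'col3'])
print(out['ind'].tolist())  # [111, 222, 555, 666, 777, 999, 0]
```

Note that the first row is always kept: shift() fills it with NaN, which never compares equal, so ne() is True there.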

Answered By: mozway

One way to do this would be to use the shift() method to compare the values in each row with those of the previous row for the columns you’re interested in, and then use boolean indexing to select only the rows that differ from the previous row. Here’s an example:

# First, let's create a DataFrame with the columns we want to compare
compare_df = data[['col1', 'col2', 'col3']]

# Next, use the shift() method to compare each row with the previous row
comparison_result = compare_df.eq(compare_df.shift())

# Now we can use boolean indexing to select only the rows that have different values
# from the previous row
data_clean = data[~comparison_result.all(axis=1)]

This should give you a DataFrame that only contains rows that are different from the previous row for the specified columns.

Alternatively, you could use the drop_duplicates() method to remove duplicate rows based on the values in the specified columns. This method will remove all rows that have the same values for the specified columns, regardless of whether they are consecutive or not. Here’s an example:

# Use the drop_duplicates() method to remove duplicate rows
data_clean = data.drop_duplicates(subset=['col1', 'col2', 'col3'])

For the question’s data this happens to produce the same rows as the previous approach, but the two are not equivalent in general: drop_duplicates() also removes repeats that are separated by other rows, whereas the shift-based approach only removes consecutive ones.
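
The two approaches only coincide when no (col1, col2, col3) combination recurs non-consecutively, as happens to be the case in the question’s data. A small made-up example where they diverge:

```python
import pandas as pd

# The value combination (1, 1) recurs non-consecutively (rows 0, 1 and 3)
df = pd.DataFrame({'a': [1, 1, 2, 1],
                   'b': [1, 1, 2, 1],
                   'val': ['w', 'x', 'y', 'z']})
cols = ['a', 'b']

# shift-based: keeps row 3, since it differs from the row just before it
consecutive = df[df[cols].ne(df[cols].shift()).any(axis=1)]
print(consecutive['val'].tolist())  # ['w', 'y', 'z']

# drop_duplicates: also removes row 3, since (1, 1) was already seen in row 0
global_dedup = df.drop_duplicates(subset=cols)
print(global_dedup['val'].tolist())  # ['w', 'y']
```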

Answered By: skalthoff