Dealing with NaN while searching for incomplete duplicate rows in the DataFrame

Question:

It’s a bit hard to explain, but bear with me. Suppose we have the following dataset:

import pandas as pd

df = pd.DataFrame({'foo': [1, 1, 1, 8, 1, 5, 5, 5],
                   'bar': [2, float('nan'), 2, 5, 2, 3, float('nan'), 6],
                   'abc': [3, 3, 3, 7, float('nan'), 9, 9, 7],
                   'def': [4, 4, 4, 2, 4, 8, 8, 8]})
print(df)
>>>
   foo  bar  abc  def
0    1  2.0  3.0    4
1    1  NaN  3.0    4
2    1  2.0  3.0    4
3    8  5.0  7.0    2
4    1  2.0  NaN    4
5    5  3.0  9.0    8
6    5  NaN  9.0    8
7    5  6.0  7.0    8

Our goal is to find all duplicate rows. However, some of these duplicates are incomplete, because they have NaN values. Nevertheless, we want to find these duplicates too. So the expected result is:

   foo  bar  abc  def
0    1  2.0  3.0    4
1    1  NaN  3.0    4
2    1  2.0  3.0    4
4    1  2.0  NaN    4
5    5  3.0  9.0    8
6    5  NaN  9.0    8

If we try to do this the straightforward way, we only get the complete duplicates:

print(df[df.duplicated(keep=False)])
>>>
   foo  bar  abc  def
0    1  2.0  3.0    4
2    1  2.0  3.0    4

We can try to circumvent it by using only columns that don’t have any missing values:

print(df[df.duplicated(['foo', 'def'], keep=False)])
>>>
   foo  bar  abc  def
0    1  2.0  3.0    4
1    1  NaN  3.0    4
2    1  2.0  3.0    4
4    1  2.0  NaN    4
5    5  3.0  9.0    8
6    5  NaN  9.0    8
7    5  6.0  7.0    8

Very close, but not quite. It turns out we’re missing a crucial piece of information in the ‘abc’ column that lets us determine that row 7 is not a duplicate. So we’d want to include it:

print(df[df.duplicated(['foo', 'def', 'abc'], keep=False)])
>>>
   foo  bar  abc  def
0    1  2.0  3.0    4
1    1  NaN  3.0    4
2    1  2.0  3.0    4
5    5  3.0  9.0    8
6    5  NaN  9.0    8

And it succeeds in removing row 7. However, it also removes row 4. NaN is considered its own separate value, rather than something that could be equal to anything, so its presence in row 4 prevents us from detecting this duplicate.

Now, I’m aware that we don’t know for sure if row 4 really is [1, 2, 3, 4]. For all we know, it could be something else entirely, like [1, 2, 9, 4]. But suppose values 1 and 4 were actually some other, oddly specific values, for example 34900 and 23893. And suppose there are many more columns that are also exactly the same. Moreover, the complete duplicate rows are not just 0 and 2: there are over two hundred of them, plus another 40 rows that have these same values in all columns except ‘abc’, where they have NaN. For this particular group of duplicates such coincidences are extremely improbable, and that’s how we know for certain that the record [1, 2, 3, 4] is problematic, and that row 4 is almost certainly a duplicate.

However, if [1, 2, 3, 4] is not the only group of duplicates, then it’s possible that some other groups have very unspecific values in the ‘foo’ and ‘def’ columns, like 1 and 500. And it so happens that including the ‘abc’ column in the subset would be extremely helpful in resolving this, because the values in the ‘abc’ column are nearly always very specific and allow us to determine all duplicates with near-certainty. But there’s a drawback: the ‘abc’ column has missing values, so by using it we’re sacrificing detection of some duplicates with NaNs. Some of them we know for a fact to be duplicates (like the aforementioned 40), so it’s a hard dilemma.

What would be the best way to deal with this situation? It would be nice if we could somehow make NaNs equal to everything, rather than nothing, for the duration of duplicate detection; that would resolve this issue. But I doubt this is possible. Am I supposed to just go group by group and check this manually?

Asked By: UchuuStranger


Answers:

Thanks to @cs95 for help in figuring this out. When we sort values, NaNs are placed at the end of their sorting group by default, so if an incomplete record has a duplicate with an actual value where it has NaN, that duplicate ends up directly above the NaN. That means we can fill the NaN with that value using the ffill() method. In other words, we forward-fill missing data with data from the rows closest to them, and can then make a more accurate determination of whether a row is a duplicate.

The code I ended up using (adjusted to this reproducible example) looks like this:

#printing all duplicates
col_list = ['foo', 'def', 'abc', 'bar']
show_mask = df.sort_values(col_list).ffill().duplicated(col_list, keep=False).sort_index()
df[show_mask].sort_values(col_list)

#deleting duplicates, but keeping one record per duplicate group
delete_mask = df.sort_values(col_list).ffill().duplicated(col_list).sort_index()
df = df[~delete_mask].reset_index(drop=True)

It’s possible to use bfill() instead of ffill(), since it’s the same principle applied upside down, but that requires flipping some default parameters to their opposites, namely na_position='first' in sort_values() and keep='last' in duplicated(). sort_index() is used just to put the mask back in the original row order and silence the reindexing warning.
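For illustration, a bfill() version of the same masks might look like this (a sketch based on the parameter changes described above, run against the original df and col_list before the deletion step; it is not code from the original answer):

# bfill() variant: NaNs sort to the top of each group (na_position='first')
# and get filled from the more complete record directly below them
show_mask = (df.sort_values(col_list, na_position='first')
               .bfill()
               .duplicated(col_list, keep=False)
               .sort_index())
df[show_mask].sort_values(col_list)

# deleting duplicates while keeping one record per group: the complete
# record is now at the bottom of each group, hence keep='last'
delete_mask = (df.sort_values(col_list, na_position='first')
                 .bfill()
                 .duplicated(col_list, keep='last')
                 .sort_index())
df = df[~delete_mask].reset_index(drop=True)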

Note that the order in which you list the columns is very important, as it determines the sorting priority. To make sure that the record above a missing value is the correct one to copy from, you have to list all the columns that don’t have any missing values first, and only then the ones that do. For the former the order doesn’t really matter; for the latter it is crucial to start with the column whose values are the most diverse/specific and end with the least diverse/specific one (float -> int -> string -> bool is a good rule of thumb, but it largely depends on what exact kind of variables the columns represent in your dataset). In this example they’re all the same, but even here you won’t get the right answer if you put ‘bar’ before ‘abc’, as the snippet below shows.
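To see the effect of a bad ordering on this very example (an illustrative check using the original df from the question, before the deletion step; bad_order is just a throwaway name):

# wrong ordering: the incomplete 'bar' column is listed before 'abc'
bad_order = ['foo', 'def', 'bar', 'abc']
bad_mask = df.sort_values(bad_order).ffill().duplicated(bad_order, keep=False).sort_index()
print(df[bad_mask])
# rows 5 and 6 are no longer detected: sorting by 'bar' before 'abc'
# separates row 6 from row 5, so row 6's NaN gets filled from row 7 instead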

And even then it’s not a perfect solution. It does a pretty good job of putting the most complete version of a record at the top and transferring its information to the less complete versions below it whenever needed. But it’s possible that a fully complete version of the record simply doesn’t exist. For example, say there are records [5 3 NaN 8] and [5 NaN 9 8] (and no [5 3 9 8] record). This solution is not capable of letting them swap the missing pieces with each other: it will put 9 into the former, but the NaN in the latter will remain empty and will cause these duplicates to go unnoticed.
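A minimal sketch of that failure mode (df_pair and cols are illustrative names, not part of the original code):

import pandas as pd

# two records that are only duplicates if their NaNs "swap" information;
# neither one is a complete version of the other
df_pair = pd.DataFrame({'foo': [5, 5],
                        'bar': [3, float('nan')],
                        'abc': [float('nan'), 9],
                        'def': [8, 8]})
cols = ['foo', 'def', 'abc', 'bar']
mask = df_pair.sort_values(cols).ffill().duplicated(cols, keep=False).sort_index()
print(df_pair[mask])   # empty: the pair is not detected as duplicates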

This is not an issue if you’re dealing with just a single incomplete column, but each added incomplete column will make such cases more and more frequent. However, it is still preferable to include all the columns, because failing to detect some duplicates is better than ending up with false duplicates in your list, which is a distinct possibility unless you’re using all the columns.

Answered By: UchuuStranger

Sorry to bother, but I’m afraid your code doesn’t always work as expected.

An example follows:

import numpy as np
import pandas as pd

column_list = ['c1','c2','c3']
data = [
    [1,2,3],
    [np.nan,2,3],
    [1,np.nan,3],
    [2,3,4],
    [1,1,1],
    [1,2,3],
]

df = pd.DataFrame(
    columns=column_list,
    data=data)
df
+----+------+------+------+
|    |   c1 |   c2 |   c3 |
|----+------+------+------|
|  0 |    1 |    2 |    3 |
|  1 |  nan |    2 |    3 |
|  2 |    1 |  nan |    3 |
|  3 |    2 |    3 |    4 |
|  4 |    1 |    1 |    1 |
|  5 |    1 |    2 |    3 |
+----+------+------+------+
sorted_df = df.sort_values(column_list)
mask = sorted_df.ffill().duplicated(column_list).sort_index()
df[np.logical_not(mask)]

result:

+----+------+------+------+
|    |   c1 |   c2 |   c3 |
|----+------+------+------|
|  0 |    1 |    2 |    3 |
|  1 |  nan |    2 |    3 |
|  3 |    2 |    3 |    4 |
|  4 |    1 |    1 |    1 |
+----+------+------+------+

Answered By: Emanuele Pepe