Splitting dataframe of duplicates based on criteria
Question:
I have a dataframe of duplicated Email Addresses.
ID
EmailAddress
Name
Country
Distance
IDLen
NonNAN
39203920
[email protected]
John
UK
12
8
6
32323
[email protected]
NaN
UK
12
5
5
I have created two additional columns that calculate the length of the ID and how many non NaN fields are for each row. I would like to create 2 new dataframes:
-
df1. Where the duplicate row has either the higher NonNAN
value; or if they’re the same, pick the lowest IDLen
.
-
df2. The remaining rows
I was thinking of using the df.duplicated()
function but it only looks at first or last and I need something more sophisticated.
Thanks in advance.
Answers:
You can create a boolean mask of rows you want to select for df1
– then automatically the inverse of this mask will select the rows for df2
.
For each group where EmailAddress
is the same:
-
mask
selects all the rows where NonNAN
has a maximum value.
-
if mask
selects multiple values –
it must also select the rows where IDLen
is the minimum.
-
if mask
still selects multiple values –
just take the first one.
def f(df):
mask = df['NonNAN'] == df['NonNAN'].max()
if mask.sum() > 1:
mask = mask & (df['IDLen'] == df.loc[mask, 'IDLen'].min())
if mask.sum() > 1:
mask.iloc[mask.argmax()+1 : ] = False
return mask
mask = df.groupby('EmailAddress', group_keys=False).apply(f)
mask = mask.reindex(df.index)
df1 = df[mask]
df2 = df[~mask]
Will this work for you? The key is to sort the data, then apply df.duplicated()
, which has very high efficiency rather than looping through each record like .apply(lambda)
functions
import pandas as pd
import numpy as np
df = pd.DataFrame({
'ID': [39203920, 32323, 22222, 392999],
'EmailAddress': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
'Name': ['John', np.nan, 'Jane', 'John'],
'Country': ['UK', 'UK', 'UK', 'UK'],
'Distance': [12, 12, 12, 12],
'IDLen': [8, 5, 5, 6],
'NonNAN': [6, 5, 6, 6] })
df = df.sort_values(['EmailAddress', 'NonNAN', 'IDLen'], ascending=[True, False, True])
ID EmailAddress Name Country Distance IDLen NonNAN
2 22222 [email protected] Jane UK 12 5 6
1 32323 [email protected] NaN UK 12 5 5
3 392999 [email protected] John UK 12 6 6
0 39203920 [email protected] John UK 12 8 6
Based on your rules, I have sorted the data so that the desired record is located first. When df.duplicated()
is applied on EmailAddress
, the first record will be kept
df1 = df[~df.duplicated('EmailAddress')]
ID EmailAddress Name Country Distance IDLen NonNAN
2 22222 [email protected] Jane UK 12 5 6
3 392999 [email protected] John UK 12 6 6
df2 = df[df.duplicated('EmailAddress')]
ID EmailAddress Name Country Distance IDLen NonNAN
1 32323 [email protected] NaN UK 12 5 5
0 39203920 [email protected] John UK 12 8 6
If your ID
column is numerical (ie, not alphanumeric), you can sort based on ascending ID
, and there is no need for the column IDLen
(because you would like the shortest one if ‘NonNAN’ is the same)
You can absolutely use duplicated
, you just have to sort your data according to your conditions.
Given df
:
ID EmailAddress Name Country Distance IDLen NonNAN
0 39203920 [email protected] John UK 12 8 6
1 32323 [email protected] NaN UK 12 5 5
2 2423 [email protected] Bob AUS 32 2 4
3 24233 [email protected] Robert AUS 32 2 5
Doing:
df = df.sort_values(['NonNAN', 'ID'], ascending=[False, True])
mask = df.duplicated('EmailAddress')
df1 = df[~mask]
df2 = df[mask]
print(df1)
print(df2)
# Output df1:
ID EmailAddress Name Country Distance IDLen NonNAN
0 39203920 [email protected] John UK 12 8 6
3 24233 [email protected] Robert AUS 32 2 5
# Output df2:
ID EmailAddress Name Country Distance IDLen NonNAN
1 32323 [email protected] NaN UK 12 5 5
2 2423 [email protected] Bob AUS 32 2 4
I have a dataframe of duplicated Email Addresses.
ID | EmailAddress | Name | Country | Distance | IDLen | NonNAN |
---|---|---|---|---|---|---|
39203920 | [email protected] | John | UK | 12 | 8 | 6 |
32323 | [email protected] | NaN | UK | 12 | 5 | 5 |
I have created two additional columns that calculate the length of the ID and how many non NaN fields are for each row. I would like to create 2 new dataframes:
-
df1. Where the duplicate row has either the higher
NonNAN
value; or if they’re the same, pick the lowestIDLen
. -
df2. The remaining rows
I was thinking of using the df.duplicated()
function but it only looks at first or last and I need something more sophisticated.
Thanks in advance.
You can create a boolean mask of rows you want to select for df1
– then automatically the inverse of this mask will select the rows for df2
.
For each group where EmailAddress
is the same:
-
mask
selects all the rows whereNonNAN
has a maximum value. -
if
mask
selects multiple values –it must also select the rows where
IDLen
is the minimum. -
if
mask
still selects multiple values –just take the first one.
def f(df):
mask = df['NonNAN'] == df['NonNAN'].max()
if mask.sum() > 1:
mask = mask & (df['IDLen'] == df.loc[mask, 'IDLen'].min())
if mask.sum() > 1:
mask.iloc[mask.argmax()+1 : ] = False
return mask
mask = df.groupby('EmailAddress', group_keys=False).apply(f)
mask = mask.reindex(df.index)
df1 = df[mask]
df2 = df[~mask]
Will this work for you? The key is to sort the data, then apply df.duplicated()
, which has very high efficiency rather than looping through each record like .apply(lambda)
functions
import pandas as pd
import numpy as np
df = pd.DataFrame({
'ID': [39203920, 32323, 22222, 392999],
'EmailAddress': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
'Name': ['John', np.nan, 'Jane', 'John'],
'Country': ['UK', 'UK', 'UK', 'UK'],
'Distance': [12, 12, 12, 12],
'IDLen': [8, 5, 5, 6],
'NonNAN': [6, 5, 6, 6] })
df = df.sort_values(['EmailAddress', 'NonNAN', 'IDLen'], ascending=[True, False, True])
ID EmailAddress Name Country Distance IDLen NonNAN
2 22222 [email protected] Jane UK 12 5 6
1 32323 [email protected] NaN UK 12 5 5
3 392999 [email protected] John UK 12 6 6
0 39203920 [email protected] John UK 12 8 6
Based on your rules, I have sorted the data so that the desired record is located first. When df.duplicated()
is applied on EmailAddress
, the first record will be kept
df1 = df[~df.duplicated('EmailAddress')]
ID EmailAddress Name Country Distance IDLen NonNAN
2 22222 [email protected] Jane UK 12 5 6
3 392999 [email protected] John UK 12 6 6
df2 = df[df.duplicated('EmailAddress')]
ID EmailAddress Name Country Distance IDLen NonNAN
1 32323 [email protected] NaN UK 12 5 5
0 39203920 [email protected] John UK 12 8 6
If your ID
column is numerical (ie, not alphanumeric), you can sort based on ascending ID
, and there is no need for the column IDLen
(because you would like the shortest one if ‘NonNAN’ is the same)
You can absolutely use duplicated
, you just have to sort your data according to your conditions.
Given df
:
ID EmailAddress Name Country Distance IDLen NonNAN
0 39203920 [email protected] John UK 12 8 6
1 32323 [email protected] NaN UK 12 5 5
2 2423 [email protected] Bob AUS 32 2 4
3 24233 [email protected] Robert AUS 32 2 5
Doing:
df = df.sort_values(['NonNAN', 'ID'], ascending=[False, True])
mask = df.duplicated('EmailAddress')
df1 = df[~mask]
df2 = df[mask]
print(df1)
print(df2)
# Output df1:
ID EmailAddress Name Country Distance IDLen NonNAN
0 39203920 [email protected] John UK 12 8 6
3 24233 [email protected] Robert AUS 32 2 5
# Output df2:
ID EmailAddress Name Country Distance IDLen NonNAN
1 32323 [email protected] NaN UK 12 5 5
2 2423 [email protected] Bob AUS 32 2 4