Splitting a dataframe based on condition

Question:

I am trying to split my dataframe into two based on medical_plan_id: if it is empty, the row goes into df1; if not, into df2.

df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
df2 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] is not ""]

The code below works, but if there are no empty fields, my code raises TypeError("invalid type comparison").

df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]

How to handle such situation?

My df_with_medicalplanid looks like below:

wellthie_issuer_identifier       ...       medical_plan_id
0                   UHC99806       ...                  None
1                   UHC99806       ...                  None
Asked By: user1896796


Answers:

Use ==, not is, to test equality

Likewise, use != instead of is not for inequality.

is has a special meaning in Python. It returns True if two variables point to the same object, while == checks if the objects referred to by the variables are equal. See also Is there a difference between == and is in Python?.
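A quick illustration of the difference, in plain Python and independent of Pandas:

```python
# Two separate list objects with equal contents
a = [1, 2, 3]
b = [1, 2, 3]

print(a == b)  # True: equal values
print(a is b)  # False: two distinct objects in memory
print(a is a)  # True: the very same object
```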

Don’t repeat mask calculations

The Boolean masks you are creating are the most expensive part of your logic. It’s also logic you want to avoid repeating manually as your first and second masks are inverses of each other. You can therefore use the bitwise inverse ~ (“tilde”), also accessible via operator.invert, to negate an existing mask.
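As a small sketch (the series here stands in for df['medical_plan_id']), the second mask is just the negation of the first:

```python
import pandas as pd

s = pd.Series(['A', '', 'B', ''])

mask = s == ''            # True where the string is empty
print(s[mask].tolist())   # ['', '']
print(s[~mask].tolist())  # ['A', 'B']
```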

Empty strings are different from null values

Equality against empty strings can be tested via == '', but equality against null values requires a specialized method: pd.Series.isnull. This is because null values in the NumPy arrays that back Pandas are represented by np.nan, and np.nan != np.nan by design.
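You can verify this behaviour directly:

```python
import numpy as np
import pandas as pd

# NaN never compares equal, not even to itself
print(np.nan == np.nan)  # False

# isnull is the reliable way to detect null values
print(pd.Series([np.nan, 1.0]).isnull().tolist())  # [True, False]
```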

If you want to replace empty strings with null values, you can do so:

df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)

Conceptually, it makes sense for missing values to be null (np.nan) rather than empty strings. But the opposite of the above process, i.e. converting null values to empty strings, is also possible:

df['medical_plan_id'] = df['medical_plan_id'].fillna('')

If the difference matters, you need to know your data and apply the appropriate logic.
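For instance, isnull does not catch empty strings until they have been replaced (the values below are made up, mirroring the question's column):

```python
import numpy as np
import pandas as pd

s = pd.Series(['214212', '', None])

print(s.isnull().tolist())                      # [False, False, True] -- '' is not null
print(s.replace('', np.nan).isnull().tolist())  # [False, True, True]
```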

Semi-final solution

Assuming you do indeed have null values, calculate a single Boolean mask and its inverse:

mask = df['medical_plan_id'].isnull()

df1 = df[mask]
df2 = df[~mask]
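A minimal self-contained run of this approach, on made-up data; the two halves always partition the original frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'medical_plan_id': [np.nan, '12251', np.nan],
                   'value': [1, 2, 3]})

mask = df['medical_plan_id'].isnull()
df1 = df[mask]   # rows where medical_plan_id is null
df2 = df[~mask]  # rows where it is present

print(len(df1), len(df2))  # 2 1
```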

Final solution: avoid extra variables

Creating additional variables is something you should generally look to avoid as a programmer. In this case there's no need to create two new variables: you can use GroupBy with dict to build a dictionary of dataframes, with False (== 0) and True (== 1) keys corresponding to your masks:

dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))

Then dfs[0] represents df2 and dfs[1] represents df1 (see also this related answer). A variant of the above, you can forego dictionary construction and use Pandas GroupBy methods:

dfs = df.groupby(df['medical_plan_id'].isnull())

dfs.get_group(False)  # equivalent to dfs[0] from the dict solution
dfs.get_group(True)   # equivalent to dfs[1] from the dict solution
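One caveat worth flagging: if the column is all-null or has no nulls at all, only one group exists, and get_group raises KeyError for the missing key. A guarded version (the frame below is hypothetical, with no nulls) might look like:

```python
import pandas as pd

df = pd.DataFrame({'medical_plan_id': ['12251', '12421'],
                   'value': [1, 2]})

dfs = df.groupby(df['medical_plan_id'].isnull())

# Only the False group exists here, so check membership before get_group
df_null = dfs.get_group(True) if True in dfs.groups else df.iloc[0:0]
print(len(df_null))  # 0
```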

Example

Putting all the above in action:

import numpy as np
import pandas as pd

df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
                   'values': [1, 2, 3, 4, 5, 6, 7]})

df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))

print(dfs[0], dfs[1], sep='\n'*2)

   medical_plan_id  values
2           2134.0       3
3           4325.0       4
4           6543.0       5

   medical_plan_id  values
0              NaN       1
1              NaN       2
5              NaN       6
6              NaN       7
Answered By: jpp

Another variant is to unpack df.groupby, which yields (key, group) tuples: the first item is the group key and the second is the corresponding dataframe.

Like this for instance:

cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)

In Python, _ is conventionally used for variables whose values you don't intend to use. I have split the code across two lines for readability.
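Note that this two-way unpacking assumes both groups exist; if every medical_plan_id is empty, or none is, groupby yields a single tuple and the unpacking raises ValueError. A defensive variant of the same idea:

```python
import pandas as pd

df = pd.DataFrame({'medical_plan_id': ['214212', '', '12251'],
                   'value': [1, 2, 3]})

cond = df['medical_plan_id'] == ''
parts = list(df.groupby(cond))  # list of (key, group) tuples

if len(parts) == 2:
    (_, df1), (_, df2) = parts  # df1: key False (non-empty), df2: key True (empty)
else:
    # Only one group present: the column is all-empty or all-non-empty
    (_, only_group), = parts

print(len(parts))  # 2
```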


Full example

import pandas as pd 

df_with_medicalplanid = pd.DataFrame({
    'medical_plan_id': ['214212','','12251','12421',''],
    'value': 1
})

cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)

print(df1)

Returns:

  medical_plan_id  value
0          214212      1
2           12251      1
3           12421      1
Answered By: Anton vBR
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
# Note: cond must be passed to groupby() on the right-hand side

print(df1)
Answered By: Ecube Analytics