Polars dataframe drop nans
Question:
I need to drop rows that have a NaN value in any column, the same way drop_nulls() works for null values:
df.drop_nulls()
but for NaNs. I have found that the method drop_nans exists for Series but not for DataFrames:
df['A'].drop_nans()
Pandas code that I’m using:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': [0, 0, 0, 1, None, 1],
        'B': [1, 2, 2, 1, 1, np.nan]
    }
)
df.dropna()
Answers:
Try this:
import polars as pl
import numpy as np
# create a DataFrame with some NaN values
df = pl.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': ['foo', 'bar', 'app', 'ctx', 'mpq']
})

# convert to pandas, drop the NaN rows there
df.to_pandas().dropna()
Not sure why it currently only exists as a Series method.
You can use .filter() to emulate the behaviour, then call .drop_nulls():
>>> df.filter(pl.all(pl.col(pl.Float32, pl.Float64).is_not_nan())).drop_nulls()
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 0   ┆ 1.0 │
│ 0   ┆ 2.0 │
│ 0   ┆ 2.0 │
│ 1   ┆ 1.0 │
└─────┴─────┘
If you have mixed nulls and NaNs, the easiest approach is to replace the NaNs with nulls and then use drop_nulls():
df.with_columns(pl.col(pl.Float32, pl.Float64).fill_nan(None)).drop_nulls()
From the inside out:
pl.col(pl.Float32, pl.Float64)
picks all the columns that are floats, and hence the only ones able to contain NaN.
fill_nan(None)
replaces any NaN value with, in this case, None, which is a proper null.
drop_nulls()
does exactly what it says.
As @jqurious suggested, but with explicit column names:
import numpy as np
import polars as pl

df = pl.DataFrame(
    {
        'A': [0, 1.0, 1, np.nan, 2],
        'B': ['1', '1', '1', '1', None]
    }
)

# get all columns that have a float dtype
float_col = [c for c in df.columns if df[c].dtype in [pl.Float64, pl.Float32]]
df.filter(pl.all(pl.col(float_col).is_not_nan())).drop_nulls()