Find out the percentage of missing values in each column in the given dataset
Question:
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
percent= 100*(len(df.loc[:,df.isnull().sum(axis=0)>=1 ].index) / len(df.index))
print(round(percent,2))
input is https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0
and the output should be
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.24
Discount 0.65
Order_Quantity 0.65
Profit 0.65
Shipping_Cost 0.65
Product_Base_Margin 1.30
dtype: float64
Answers:
How about this? I think I actually found something similar on here once before, but I’m not seeing it now…
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
'percent_missing': percent_missing})
And if you want the missing percentages sorted, follow the above with:
missing_value_df.sort_values('percent_missing', inplace=True)
As mentioned in the comments, you may also be able to get by with just the first line in my code above, i.e.:
percent_missing = df.isnull().sum() * 100 / len(df)
Update let’s use mean
with isnull
:
df.isnull().mean() * 100
Output:
Ord_id 0.000000
Prod_id 0.000000
Ship_id 0.000000
Cust_id 0.000000
Sales 0.238124
Discount 0.654840
Order_Quantity 0.654840
Profit 0.654840
Shipping_Cost 0.654840
Product_Base_Margin 1.297774
dtype: float64
IIUC:
df.isnull().sum() / df.shape[0] * 100.00
Output:
Ord_id 0.000000
Prod_id 0.000000
Ship_id 0.000000
Cust_id 0.000000
Sales 0.238124
Discount 0.654840
Order_Quantity 0.654840
Profit 0.654840
Shipping_Cost 0.654840
Product_Base_Margin 1.297774
dtype: float64
To cover all missing values and round the results:
((df.isnull() | df.isna()).sum() * 100 / df.index.size).round(2)
The output:
Out[556]:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.24
Discount 0.65
Order_Quantity 0.65
Profit 0.65
Shipping_Cost 0.65
Product_Base_Margin 1.30
dtype: float64
import numpy as np
import pandas as pd
raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'],
'age': [22, np.nan, 23, 24, 25],
'sex': ['m', np.nan, 'f', 'm', 'f'],
'Test1_Score': [4, np.nan, 0, 0, 0],
'Test2_Score': [25, np.nan, np.nan, 0, 0]}
results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])
results
first_name last_name age sex Test1_Score Test2_Score
0 Jason Miller 22.0 m 4.0 25.0
1 NaN NaN NaN NaN NaN NaN
2 Tina NaN 23.0 f 0.0 NaN
3 Jake Milner 24.0 m 0.0 0.0
4 Amy Cooze 25.0 f 0.0 0.0
You can use following function, which will give you output in Dataframe
- Zero Values
- Missing Values
- % of Total Values
- Total Zero Missing Values
- % Total Zero Missing Values
- Data Type
Just copy and paste following function and call it by passing your pandas Dataframe
def missing_zero_values_table(df):
zero_val = (df == 0.00).astype(int).sum(axis=0)
mis_val = df.isnull().sum()
mis_val_percent = 100 * df.isnull().sum() / len(df)
mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
mz_table = mz_table.rename(
columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
mz_table['Data Type'] = df.dtypes
mz_table = mz_table[
mz_table.iloc[:,1] != 0].sort_values(
'% of Total Values', ascending=False).round(1)
print ("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.n"
"There are " + str(mz_table.shape[0]) +
" columns that have missing values.")
# mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
return mz_table
missing_zero_values_table(results)
Output
Your selected dataframe has 6 columns and 5 Rows.
There are 6 columns that have missing values.
Zero Values Missing Values % of Total Values Total Zero Missing Values % Total Zero Missing Values Data Type
last_name 0 2 40.0 2 40.0 object
Test2_Score 2 2 40.0 4 80.0 float64
first_name 0 1 20.0 1 20.0 object
age 0 1 20.0 1 20.0 float64
sex 0 1 20.0 1 20.0 object
Test1_Score 3 1 20.0 4 80.0 float64
If you want to keep it simple then you can use following function to get missing values in %
def missing(dff):
print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))
missing(results)
Test2_Score 40.0
last_name 40.0
Test1_Score 20.0
sex 20.0
age 20.0
first_name 20.0
dtype: float64
The solution you’re looking for is :
round(df.isnull().mean()*100,2)
This will round up the percentage upto 2 decimal places
Another way to do this is
round((df.isnull().sum()*100)/len(df),2)
but this is not efficient as using mean() is.
If there are multiple dataframe below is the function to calculate number of missing value in each column with percentage
def miss_data(df):
x = ['column_name','missing_data', 'missing_in_percentage']
missing_data = pd.DataFrame(columns=x)
columns = df.columns
for col in columns:
icolumn_name = col
imissing_data = df[col].isnull().sum()
imissing_in_percentage = (df[col].isnull().sum()/df[col].shape[0])*100
missing_data.loc[len(missing_data)] = [icolumn_name, imissing_data, imissing_in_percentage]
print(missing_data)
By this following code, you can get the corresponding percentage values from every columns. Just switch the name train_data with df, in case of yours.
Input:
In [1]:
all_data_na = (train_data.isnull().sum() / len(train_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)
Output :
Out[1]:
Missing Ratio
left_eyebrow_outer_end_x 68.435239
left_eyebrow_outer_end_y 68.435239
right_eyebrow_outer_end_y 68.279189
right_eyebrow_outer_end_x 68.279189
left_eye_outer_corner_x 67.839410
left_eye_outer_corner_y 67.839410
right_eye_inner_corner_x 67.825223
right_eye_inner_corner_y 67.825223
right_eye_outer_corner_x 67.825223
right_eye_outer_corner_y 67.825223
mouth_left_corner_y 67.811037
mouth_left_corner_x 67.811037
left_eyebrow_inner_end_x 67.796851
left_eyebrow_inner_end_y 67.796851
right_eyebrow_inner_end_y 67.796851
mouth_right_corner_x 67.796851
mouth_right_corner_y 67.796851
right_eyebrow_inner_end_x 67.796851
left_eye_inner_corner_x 67.782664
left_eye_inner_corner_y 67.782664
For me I did it like that :
def missing_percent(df):
# Total missing values
mis_val = df.isnull().sum()
# Percentage of missing values
mis_percent = 100 * df.isnull().sum() / len(df)
# Make a table with the results
mis_table = pd.concat([mis_val, mis_percent], axis=1)
# Rename the columns
mis_columns = mis_table.rename(
columns = {0 : 'Missing Values', 1 : 'Percent of Total Values'})
# Sort the table by percentage of missing descending
mis_columns = mis_columns[
mis_columns.iloc[:,1] != 0].sort_values(
'Percent of Total Values', ascending=False).round(2)
# Print some summary information
print ("Your selected dataframe has " + str(df.shape[1]) + " columns.n"
"There are " + str(mis_columns.shape[0]) +
" columns that have missing values.")
# Return the dataframe with missing information
return mis_columns
Let’s break down your ask
- you want the percentage of missing value
- it should be sorted in ascending order and the values to be rounded to 2 floating point
Explanation:
- dhr[fill_cols].isnull().sum() – gives the total number of missing values column wise
- dhr.shape[0] – gives the total number of rows
- (dhr[fill_cols].isnull().sum()/dhr.shape[0]) – gives you a series with percentage as values and column names as index
- since the output is a series you can round and sort based on the values
code:
(dhr[fill_cols].isnull().sum()/dhr.shape[0]).round(2).sort_values()
single line solution
df.isnull().mean().round(4).mul(100).sort_values(ascending=False)
import numpy as np
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df.loc[np.isnan(df['Product_Base_Margin']),['Product_Base_Margin']]=df['Product_Base_Margin'].mean()
print(round(100*(df.isnull().sum()/len(df.index)), 2))
Try this solution
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
print(round(100*(df.isnull().sum()/len(df.index)),2))
The best solution I have found – (Only shows the missing columns)
missing_values = [feature for feature in df.columns if df[feature].isnull().sum() > 1]
for feature in missing_values:
print(f"{feature} {np.round(df[feature].isnull().mean(), 4)}% missing values")
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df.isna().sum()
Output:
Ord_id 0
Prod_id 0
Ship_id 0
Cust_id 0
Sales 20
Discount 55
Order_Quantity 55
Profit 55
Shipping_Cost 55
Product_Base_Margin 109
dtype: int64
df.shape
Output: (8399, 10)
# for share [0; 1] of nan in each column
df.isna().sum() / df.shape[0]
Output:
Ord_id 0.0000
Prod_id 0.0000
Ship_id 0.0000
Cust_id 0.0000
Sales 0.0024 # (20 / 8399)
Discount 0.0065 # (55 / 8399)
Order_Quantity 0.0065 # (55 / 8399)
Profit 0.0065 # (55 / 8399)
Shipping_Cost 0.0065 # (55 / 8399)
Product_Base_Margin 0.0130 # (109 / 8399)
dtype: float64
# for percent [0; 100] of nan in each column
df.isna().sum() / (df.shape[0] / 100)
Output:
Ord_id 0.0000
Prod_id 0.0000
Ship_id 0.0000
Cust_id 0.0000
Sales 0.2381 # (20 / (8399 / 100))
Discount 0.6548 # (55 / (8399 / 100))
Order_Quantity 0.6548 # (55 / (8399 / 100))
Profit 0.6548 # (55 / (8399 / 100))
Shipping_Cost 0.6548 # (55 / (8399 / 100))
Product_Base_Margin 1.2978 # (109 / (8399 / 100))
dtype: float64
# for share [0; 1] of nan in dataframe
df.isna().sum() / (df.shape[0] * df.shape[1])
Output:
Ord_id 0.0000
Prod_id 0.0000
Ship_id 0.0000
Cust_id 0.0000
Sales 0.0002 # (20 / (8399 * 10))
Discount 0.0007 # (55 / (8399 * 10))
Order_Quantity 0.0007 # (55 / (8399 * 10))
Profit 0.0007 # (55 / (8399 * 10))
Shipping_Cost 0.0007 # (55 / (8399 * 10))
Product_Base_Margin 0.0013 # (109 / (8399 * 10))
dtype: float64
# for percent [0; 100] of nan in dataframe
df.isna().sum() / ((df.shape[0] * df.shape[1]) / 100)
Output:
Ord_id 0.0000
Prod_id 0.0000
Ship_id 0.0000
Cust_id 0.0000
Sales 0.0238 # (20 / ((8399 * 10) / 100))
Discount 0.0655 # (55 / ((8399 * 10) / 100))
Order_Quantity 0.0655 # (55 / ((8399 * 10) / 100))
Profit 0.0655 # (55 / ((8399 * 10) / 100))
Shipping_Cost 0.0655 # (55 / ((8399 * 10) / 100))
Product_Base_Margin 0.1298 # (109 / ((8399 * 10) / 100))
dtype: float64
One-liner
I’m wondering nobody takes advantage of the size and count? It seems the shortest (and probably fastest) way to do it.
df.apply(lambda x: 1-(x.count()/x.size))
Resulting in:
Ord_id 0.000000
Prod_id 0.000000
Ship_id 0.000000
Cust_id 0.000000
Sales 0.002381
Discount 0.006548
Order_Quantity 0.006548
Profit 0.006548
Shipping_Cost 0.006548
Product_Base_Margin 0.012978
dtype: float64
If you find any reason why this is not a good way, please comment
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
percent= 100*(len(df.loc[:,df.isnull().sum(axis=0)>=1 ].index) / len(df.index))
print(round(percent,2))
input is https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0
and the output should be
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.24
Discount 0.65
Order_Quantity 0.65
Profit 0.65
Shipping_Cost 0.65
Product_Base_Margin 1.30
dtype: float64
How about this? I think I actually found something similar on here once before, but I’m not seeing it now…
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
'percent_missing': percent_missing})
And if you want the missing percentages sorted, follow the above with:
missing_value_df.sort_values('percent_missing', inplace=True)
As mentioned in the comments, you may also be able to get by with just the first line in my code above, i.e.:
percent_missing = df.isnull().sum() * 100 / len(df)
Update let’s use mean
with isnull
:
df.isnull().mean() * 100
Output:
Ord_id 0.000000
Prod_id 0.000000
Ship_id 0.000000
Cust_id 0.000000
Sales 0.238124
Discount 0.654840
Order_Quantity 0.654840
Profit 0.654840
Shipping_Cost 0.654840
Product_Base_Margin 1.297774
dtype: float64
IIUC:
df.isnull().sum() / df.shape[0] * 100.00
Output:
Ord_id 0.000000
Prod_id 0.000000
Ship_id 0.000000
Cust_id 0.000000
Sales 0.238124
Discount 0.654840
Order_Quantity 0.654840
Profit 0.654840
Shipping_Cost 0.654840
Product_Base_Margin 1.297774
dtype: float64
To cover all missing values and round the results:
((df.isnull() | df.isna()).sum() * 100 / df.index.size).round(2)
The output:
Out[556]:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.24
Discount 0.65
Order_Quantity 0.65
Profit 0.65
Shipping_Cost 0.65
Product_Base_Margin 1.30
dtype: float64
import numpy as np
import pandas as pd
raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'],
'age': [22, np.nan, 23, 24, 25],
'sex': ['m', np.nan, 'f', 'm', 'f'],
'Test1_Score': [4, np.nan, 0, 0, 0],
'Test2_Score': [25, np.nan, np.nan, 0, 0]}
results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])
results
first_name last_name age sex Test1_Score Test2_Score
0 Jason Miller 22.0 m 4.0 25.0
1 NaN NaN NaN NaN NaN NaN
2 Tina NaN 23.0 f 0.0 NaN
3 Jake Milner 24.0 m 0.0 0.0
4 Amy Cooze 25.0 f 0.0 0.0
You can use following function, which will give you output in Dataframe
- Zero Values
- Missing Values
- % of Total Values
- Total Zero Missing Values
- % Total Zero Missing Values
- Data Type
Just copy and paste following function and call it by passing your pandas Dataframe
def missing_zero_values_table(df):
zero_val = (df == 0.00).astype(int).sum(axis=0)
mis_val = df.isnull().sum()
mis_val_percent = 100 * df.isnull().sum() / len(df)
mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
mz_table = mz_table.rename(
columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
mz_table['Data Type'] = df.dtypes
mz_table = mz_table[
mz_table.iloc[:,1] != 0].sort_values(
'% of Total Values', ascending=False).round(1)
print ("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.n"
"There are " + str(mz_table.shape[0]) +
" columns that have missing values.")
# mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
return mz_table
missing_zero_values_table(results)
Output
Your selected dataframe has 6 columns and 5 Rows.
There are 6 columns that have missing values.
Zero Values Missing Values % of Total Values Total Zero Missing Values % Total Zero Missing Values Data Type
last_name 0 2 40.0 2 40.0 object
Test2_Score 2 2 40.0 4 80.0 float64
first_name 0 1 20.0 1 20.0 object
age 0 1 20.0 1 20.0 float64
sex 0 1 20.0 1 20.0 object
Test1_Score 3 1 20.0 4 80.0 float64
If you want to keep it simple then you can use following function to get missing values in %
def missing(dff):
print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))
missing(results)
Test2_Score 40.0
last_name 40.0
Test1_Score 20.0
sex 20.0
age 20.0
first_name 20.0
dtype: float64
The solution you’re looking for is :
round(df.isnull().mean()*100,2)
This will round up the percentage upto 2 decimal places
Another way to do this is
round((df.isnull().sum()*100)/len(df),2)
but this is not efficient as using mean() is.
If there are multiple dataframe below is the function to calculate number of missing value in each column with percentage
def miss_data(df):
x = ['column_name','missing_data', 'missing_in_percentage']
missing_data = pd.DataFrame(columns=x)
columns = df.columns
for col in columns:
icolumn_name = col
imissing_data = df[col].isnull().sum()
imissing_in_percentage = (df[col].isnull().sum()/df[col].shape[0])*100
missing_data.loc[len(missing_data)] = [icolumn_name, imissing_data, imissing_in_percentage]
print(missing_data)
By this following code, you can get the corresponding percentage values from every columns. Just switch the name train_data with df, in case of yours.
Input:
In [1]:
all_data_na = (train_data.isnull().sum() / len(train_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)
Output :
Out[1]:
Missing Ratio
left_eyebrow_outer_end_x 68.435239
left_eyebrow_outer_end_y 68.435239
right_eyebrow_outer_end_y 68.279189
right_eyebrow_outer_end_x 68.279189
left_eye_outer_corner_x 67.839410
left_eye_outer_corner_y 67.839410
right_eye_inner_corner_x 67.825223
right_eye_inner_corner_y 67.825223
right_eye_outer_corner_x 67.825223
right_eye_outer_corner_y 67.825223
mouth_left_corner_y 67.811037
mouth_left_corner_x 67.811037
left_eyebrow_inner_end_x 67.796851
left_eyebrow_inner_end_y 67.796851
right_eyebrow_inner_end_y 67.796851
mouth_right_corner_x 67.796851
mouth_right_corner_y 67.796851
right_eyebrow_inner_end_x 67.796851
left_eye_inner_corner_x 67.782664
left_eye_inner_corner_y 67.782664
For me I did it like that :
def missing_percent(df):
# Total missing values
mis_val = df.isnull().sum()
# Percentage of missing values
mis_percent = 100 * df.isnull().sum() / len(df)
# Make a table with the results
mis_table = pd.concat([mis_val, mis_percent], axis=1)
# Rename the columns
mis_columns = mis_table.rename(
columns = {0 : 'Missing Values', 1 : 'Percent of Total Values'})
# Sort the table by percentage of missing descending
mis_columns = mis_columns[
mis_columns.iloc[:,1] != 0].sort_values(
'Percent of Total Values', ascending=False).round(2)
# Print some summary information
print ("Your selected dataframe has " + str(df.shape[1]) + " columns.n"
"There are " + str(mis_columns.shape[0]) +
" columns that have missing values.")
# Return the dataframe with missing information
return mis_columns
Let’s break down your ask
- you want the percentage of missing value
- it should be sorted in ascending order and the values to be rounded to 2 floating point
Explanation:
- dhr[fill_cols].isnull().sum() – gives the total number of missing values column wise
- dhr.shape[0] – gives the total number of rows
- (dhr[fill_cols].isnull().sum()/dhr.shape[0]) – gives you a series with percentage as values and column names as index
- since the output is a series you can round and sort based on the values
code:
(dhr[fill_cols].isnull().sum()/dhr.shape[0]).round(2).sort_values()
single line solution
df.isnull().mean().round(4).mul(100).sort_values(ascending=False)
import numpy as np
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df.loc[np.isnan(df['Product_Base_Margin']),['Product_Base_Margin']]=df['Product_Base_Margin'].mean()
print(round(100*(df.isnull().sum()/len(df.index)), 2))
Try this solution
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
print(round(100*(df.isnull().sum()/len(df.index)),2))
The best solution I have found – (Only shows the missing columns)
missing_values = [feature for feature in df.columns if df[feature].isnull().sum() > 1]
for feature in missing_values:
print(f"{feature} {np.round(df[feature].isnull().mean(), 4)}% missing values")
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df.isna().sum()
Output:
Ord_id 0
Prod_id 0
Ship_id 0
Cust_id 0
Sales 20
Discount 55
Order_Quantity 55
Profit 55
Shipping_Cost 55
Product_Base_Margin 109
dtype: int64
df.shape
Output: (8399, 10)
# for share [0; 1] of nan in each column
df.isna().sum() / df.shape[0]
Output:
Ord_id 0.0000
Prod_id 0.0000
Ship_id 0.0000
Cust_id 0.0000
Sales 0.0024 # (20 / 8399)
Discount 0.0065 # (55 / 8399)
Order_Quantity 0.0065 # (55 / 8399)
Profit 0.0065 # (55 / 8399)
Shipping_Cost 0.0065 # (55 / 8399)
Product_Base_Margin 0.0130 # (109 / 8399)
dtype: float64
# for percent [0; 100] of nan in each column
df.isna().sum() / (df.shape[0] / 100)
Output:
Ord_id 0.0000
Prod_id 0.0000
Ship_id 0.0000
Cust_id 0.0000
Sales 0.2381 # (20 / (8399 / 100))
Discount 0.6548 # (55 / (8399 / 100))
Order_Quantity 0.6548 # (55 / (8399 / 100))
Profit 0.6548 # (55 / (8399 / 100))
Shipping_Cost 0.6548 # (55 / (8399 / 100))
Product_Base_Margin 1.2978 # (109 / (8399 / 100))
dtype: float64
# for share [0; 1] of nan in dataframe
df.isna().sum() / (df.shape[0] * df.shape[1])
Output:
Ord_id 0.0000
Prod_id 0.0000
Ship_id 0.0000
Cust_id 0.0000
Sales 0.0002 # (20 / (8399 * 10))
Discount 0.0007 # (55 / (8399 * 10))
Order_Quantity 0.0007 # (55 / (8399 * 10))
Profit 0.0007 # (55 / (8399 * 10))
Shipping_Cost 0.0007 # (55 / (8399 * 10))
Product_Base_Margin 0.0013 # (109 / (8399 * 10))
dtype: float64
# for percent [0; 100] of nan in dataframe
df.isna().sum() / ((df.shape[0] * df.shape[1]) / 100)
Output:
Ord_id 0.0000
Prod_id 0.0000
Ship_id 0.0000
Cust_id 0.0000
Sales 0.0238 # (20 / ((8399 * 10) / 100))
Discount 0.0655 # (55 / ((8399 * 10) / 100))
Order_Quantity 0.0655 # (55 / ((8399 * 10) / 100))
Profit 0.0655 # (55 / ((8399 * 10) / 100))
Shipping_Cost 0.0655 # (55 / ((8399 * 10) / 100))
Product_Base_Margin 0.1298 # (109 / ((8399 * 10) / 100))
dtype: float64
One-liner
I’m wondering nobody takes advantage of the size and count? It seems the shortest (and probably fastest) way to do it.
df.apply(lambda x: 1-(x.count()/x.size))
Resulting in:
Ord_id 0.000000
Prod_id 0.000000
Ship_id 0.000000
Cust_id 0.000000
Sales 0.002381
Discount 0.006548
Order_Quantity 0.006548
Profit 0.006548
Shipping_Cost 0.006548
Product_Base_Margin 0.012978
dtype: float64
If you find any reason why this is not a good way, please comment