Skipping the row if more than 2 fields are empty

Question:

First, skip a row of data if it has more than 2 empty columns. After this step, the rows with more than 2 missing values will have been filtered out.

Then, since some rows still have 1 or 2 empty columns, I will fill each empty cell with the mean value of its row.

I can run the second step with my code below; however, I am not sure how to filter out the rows with more than 2 missing values.

I have tried using dropna, but it deleted all the columns of the table.

My code:

import numpy as np
import pandas as pd

import matplotlib 
import matplotlib.pyplot as pp

%matplotlib inline

# high-technology exports as a percentage of manufactured exports
hightech_export = pd.read_csv('hightech_export_1.csv') 

# skip a row of data if it has more than 2 empty columns
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)

# Fill in missing data with the mean value of the row.
# fillna aligns the Series m on the row index, so each column is filled row-wise.
m = hightech_export.mean(axis=1)
for i, col in enumerate(hightech_export):
    hightech_export.iloc[:, i] = hightech_export.iloc[:, i].fillna(m)
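The loop works because fillna with a Series aligns on the row index, so each column picks up the mean of its own row. As a minimal sketch, an equivalent vectorized version (assuming the year columns are numeric; numeric_only skips the Country Name column):

import numpy as np
import pandas as pd

# Hypothetical frame with the same shape as the real data
df = pd.DataFrame({'Country Name': ['A', 'B'],
                   '2001': [10.0, np.nan],
                   '2002': [np.nan, 30.0]})

# Row means over the numeric columns only
row_means = df.mean(axis=1, numeric_only=True)

# fillna aligns row_means on the row index, column by column
df = df.apply(lambda col: col.fillna(row_means))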

My dataset:

Country Name  2001  2002  2003  2004
Philippines     71
Malta           62    58    60    58
Singapore       60                56
Malaysia        58    57          55
Ireland         47    41    34    34
Georgia         38    41    24    38
Costa Rica

Asked By: codegekJohn


Answers:

Try this

hightech_export.dropna(thresh=2, inplace=True)

in place of the line of code

hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
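Note that thresh counts the non-null values a row must keep, not the number of missing ones, and the Country Name column counts too; with 5 columns, thresh=3 corresponds to "at most 2 missing". A minimal sketch on a hypothetical two-row frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Country Name': ['Philippines', 'Malta'],
                   '2001': [71.0, 62.0],
                   '2002': [np.nan, 58.0],
                   '2003': [np.nan, 60.0],
                   '2004': [np.nan, 58.0]})

# thresh=3: keep rows with at least 3 non-null values (at most 2 missing)
print(df.dropna(thresh=3))  # drops Philippines, which has 3 missing values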
Answered By: Rakesh Kumbi

Ok try this …

import pandas as pd
import numpy as np

data1 = {'Name': ['Tom', np.nan, 'Mary', 'Jane'],
         'Age': [20, np.nan, 40, 30],
         'Pay': [np.nan, np.nan, 20, 25]}

df1 = pd.DataFrame.from_records(data1)

Check the df

df1

    Age  Name   Pay
0  20.0   Tom   NaN
1   NaN   NaN   NaN
2  40.0  Mary  20.0
3  30.0  Jane  25.0

The record with index 1 has 3 missing values.

Replace the missing values with None:

df1 = df1.replace({np.nan: None})

Now write a function to count the missing values per row and build a list:

def count_na(lst):
    # Count the None entries (after the replace above, missing values are None);
    # checking "is None" avoids miscounting legitimate falsy values such as 0
    missing = [n for n in lst if n is None]
    return len(missing)

missing_data = []
for index, row in df1.iterrows():
    missing_data.append(count_na(list(row)))

Use this list as a new column in the DataFrame:

df1['missing'] = missing_data

df1 should look like this:

    Age  Name   Pay  missing
0    20   Tom  None        1
1  None  None  None        3
2    40  Mary    20        0
3    30  Jane    25        0

So filtering becomes easy:

# Now only take records with <2 missing
df1[df1.missing<2]
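For comparison, the same count can be produced without the helper function; isnull() also detects the None values created above, so a vectorized sketch of the whole filter is:

# Vectorized equivalent: count missing cells per row, then filter
df1['missing'] = df1.isnull().sum(axis=1)
df1[df1.missing < 2]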

Hope that helps…

Answered By: Tim Seed

You can make use of the .isnull() method for your first task.

Replace this:

hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)

with:

hightech_export = hightech_export.loc[hightech_export.isnull().sum(axis=1) <= 2]
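Here isnull().sum(axis=1) gives the number of missing cells in each row, so the comparison keeps only rows with at most 2 missing values. A minimal sketch on a hypothetical toy frame:

import numpy as np
import pandas as pd

# Rows with 0, 1 and 3 missing values
df = pd.DataFrame({'a': [1.0, 2.0, np.nan],
                   'b': [1.0, np.nan, np.nan],
                   'c': [1.0, 2.0, np.nan]})

print(df.isnull().sum(axis=1))               # 0, 1, 3
print(df.loc[df.isnull().sum(axis=1) <= 2])  # the last row is dropped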

A simple way is to compare, row by row, the count of non-null values with the number of columns of the dataframe. You can then replace each NaN with the mean of its column.

Code could be:

result = df.loc[df.apply(lambda x: x.count(), axis=1) >= (len(df.columns) - 2)].replace(
             np.nan, df.agg('mean'))

With your example data, it gives as expected:

  Country Name  2001   2002       2003  2004
1        Malta  62.0  58.00  60.000000  58.0
2    Singapore  60.0  49.25  39.333333  56.0
3     Malaysia  58.0  57.00  39.333333  55.0
4      Ireland  47.0  41.00  34.000000  34.0
5      Georgia  38.0  41.00  24.000000  38.0
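The same two steps can also be spelled out separately, as a sketch (assuming a pandas version where mean accepts numeric_only, so the Country Name column is skipped):

# Step 1: keep rows with at least len(columns) - 2 non-null values,
# i.e. at most 2 missing values per row
filtered = df.dropna(thresh=len(df.columns) - 2)

# Step 2: fill the remaining gaps with the per-column means
result = filtered.fillna(df.mean(numeric_only=True))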
Answered By: Serge Ballesta