Filtering pandas dataframe when column contains stings

Question:

I have a dataframe that preexists in this structure:

import pandas as pd
d={'colA':['1','2','3','3','3'],'colB':['NaN','4','2','this','that']}
mydata=pd.DataFrame(data=d)

ColA is integers saved as strings
ColB are all strings but contain a mix of integers, NaN and real strings.

I want to create a new column (colC) that checks if the integers in colB are greater than the integers in colA. But I can’t figure out how to deal with the strings and NaNs.

The final dataframe should look like this:

d={'colA':[1,2,3,3,3],'colB':['NaN',4,2,'this','that'],'colC':['NaN','Yes','No','NaN','NaN']}
mydata_new=pd.DataFrame(data=d) 

Thanks

Asked By: person

||

Answers:

Use to_numeric with errors='coerce' for numeric and compare by Series.gt and Series.le in numpy.select:

s1 = pd.to_numeric(mydata.colA, errors='coerce')
s2 = pd.to_numeric(mydata.colB, errors='coerce')

mydata['colC'] = np.select([s2.gt(s1), s2.le(s1)], ['Yes', 'No'], None)
print (mydata)
  colA  colB  colC
0    1   NaN  None
1    2     4   Yes
2    3     2    No
3    3  this  None
4    3  that  None
Answered By: jezrael

First you can convert all the df to string class for compare different class values correctly and then you can compare them one by one.
One solution can be:

mydata = mydata.astype(str)
colC = []
i = 0
while i<len(mydata):
    if "nan" in mydata.loc[i].values:
        colC.append(np.nan)
    else:
        # Here you have two options. The first one is more abstract but faster than comparing strings
        if len(set(mydata.loc[i].values)) == 1:
        # if mydata["colA"].values[i] == mydata["colB"].values[i]:
            colC.append("Yes")
        else:
            colC.append("No")
    i = i + 1
mydata["colC"] = colC

RESULTING ( I changed value of index two in order to get one "YES"):

  colA  colB colC
0    1   nan  NaN
1    2     2  Yes
2    3     2   No
3    3  this   No
4    3  that   No
Answered By: Javier Chornet

Using apply is a good way of handling this kind of computation:

def compareIntAndStrings(x, y):
    try:
        x = int(x)
        y = int(y)
    except ValueError:
        return "NaN"
    return "Yes" if x < y else "No"

mydata['colC'] = mydata.apply(lambda x: compareIntAndStrings(x['colA'], x['colB']), axis=1)

While this solution is easier ton understand and reuse, it should be considered that the performance cost is not negligible.
It is also a better practice to replace the "NaN" strings to missing datas.

Answered By: Zamani Maxime