Filtering pandas dataframe when column contains stings
Question:
I have a dataframe that preexists in this structure:
import pandas as pd
d={'colA':['1','2','3','3','3'],'colB':['NaN','4','2','this','that']}
mydata=pd.DataFrame(data=d)
ColA is integers saved as strings
ColB are all strings but contain a mix of integers, NaN and real strings.
I want to create a new column (colC) that checks if the integers in colB are greater than the integers in colA. But I can’t figure out how to deal with the strings and NaNs.
The final dataframe should look like this:
d={'colA':[1,2,3,3,3],'colB':['NaN',4,2,'this','that'],'colC':['NaN','Yes','No','NaN','NaN']}
mydata_new=pd.DataFrame(data=d)
Thanks
Answers:
Use to_numeric
with errors='coerce'
for numeric and compare by Series.gt
and Series.le
in numpy.select
:
s1 = pd.to_numeric(mydata.colA, errors='coerce')
s2 = pd.to_numeric(mydata.colB, errors='coerce')
mydata['colC'] = np.select([s2.gt(s1), s2.le(s1)], ['Yes', 'No'], None)
print (mydata)
colA colB colC
0 1 NaN None
1 2 4 Yes
2 3 2 No
3 3 this None
4 3 that None
First you can convert all the df to string class for compare different class values correctly and then you can compare them one by one.
One solution can be:
mydata = mydata.astype(str)
colC = []
i = 0
while i<len(mydata):
if "nan" in mydata.loc[i].values:
colC.append(np.nan)
else:
# Here you have two options. The first one is more abstract but faster than comparing strings
if len(set(mydata.loc[i].values)) == 1:
# if mydata["colA"].values[i] == mydata["colB"].values[i]:
colC.append("Yes")
else:
colC.append("No")
i = i + 1
mydata["colC"] = colC
RESULTING ( I changed value of index two in order to get one "YES"):
colA colB colC
0 1 nan NaN
1 2 2 Yes
2 3 2 No
3 3 this No
4 3 that No
Using apply is a good way of handling this kind of computation:
def compareIntAndStrings(x, y):
try:
x = int(x)
y = int(y)
except ValueError:
return "NaN"
return "Yes" if x < y else "No"
mydata['colC'] = mydata.apply(lambda x: compareIntAndStrings(x['colA'], x['colB']), axis=1)
While this solution is easier ton understand and reuse, it should be considered that the performance cost is not negligible.
It is also a better practice to replace the "NaN" strings to missing datas.
I have a dataframe that preexists in this structure:
import pandas as pd
d={'colA':['1','2','3','3','3'],'colB':['NaN','4','2','this','that']}
mydata=pd.DataFrame(data=d)
ColA is integers saved as strings
ColB are all strings but contain a mix of integers, NaN and real strings.
I want to create a new column (colC) that checks if the integers in colB are greater than the integers in colA. But I can’t figure out how to deal with the strings and NaNs.
The final dataframe should look like this:
d={'colA':[1,2,3,3,3],'colB':['NaN',4,2,'this','that'],'colC':['NaN','Yes','No','NaN','NaN']}
mydata_new=pd.DataFrame(data=d)
Thanks
Use to_numeric
with errors='coerce'
for numeric and compare by Series.gt
and Series.le
in numpy.select
:
s1 = pd.to_numeric(mydata.colA, errors='coerce')
s2 = pd.to_numeric(mydata.colB, errors='coerce')
mydata['colC'] = np.select([s2.gt(s1), s2.le(s1)], ['Yes', 'No'], None)
print (mydata)
colA colB colC
0 1 NaN None
1 2 4 Yes
2 3 2 No
3 3 this None
4 3 that None
First you can convert all the df to string class for compare different class values correctly and then you can compare them one by one.
One solution can be:
mydata = mydata.astype(str)
colC = []
i = 0
while i<len(mydata):
if "nan" in mydata.loc[i].values:
colC.append(np.nan)
else:
# Here you have two options. The first one is more abstract but faster than comparing strings
if len(set(mydata.loc[i].values)) == 1:
# if mydata["colA"].values[i] == mydata["colB"].values[i]:
colC.append("Yes")
else:
colC.append("No")
i = i + 1
mydata["colC"] = colC
RESULTING ( I changed value of index two in order to get one "YES"):
colA colB colC
0 1 nan NaN
1 2 2 Yes
2 3 2 No
3 3 this No
4 3 that No
Using apply is a good way of handling this kind of computation:
def compareIntAndStrings(x, y):
try:
x = int(x)
y = int(y)
except ValueError:
return "NaN"
return "Yes" if x < y else "No"
mydata['colC'] = mydata.apply(lambda x: compareIntAndStrings(x['colA'], x['colB']), axis=1)
While this solution is easier ton understand and reuse, it should be considered that the performance cost is not negligible.
It is also a better practice to replace the "NaN" strings to missing datas.