Equivalent of R/ifelse in Python/Pandas? Compare string columns?
Question:
My goal is to compare two columns and add the result as a new column. R uses ifelse for this, but I need to know the pandas way.
R
> head(mau.payment)
log_month user_id install_month payment
1 2013-06 1 2013-04 0
2 2013-06 2 2013-04 0
3 2013-06 3 2013-04 14994
> mau.payment$user.type <-ifelse(mau.payment$install_month == mau.payment$log_month, "install", "existing")
> head(mau.payment)
log_month user_id install_month payment user.type
1 2013-06 1 2013-04 0 existing
2 2013-06 2 2013-04 0 existing
3 2013-06 3 2013-04 14994 existing
4 2013-06 4 2013-04 0 existing
5 2013-06 6 2013-04 0 existing
6 2013-06 7 2013-04 0 existing
Pandas
>>> maupayment
user_id log_month install_month
1 2013-06 2013-04 0
2013-07 2013-04 0
2 2013-06 2013-04 0
3 2013-06 2013-04 14994
I tried a few approaches, but none of them worked. It seems that string comparison does not work.
>>>np.where(maupayment['log_month'] == maupayment['install_month'], 'install', 'existing')
TypeError: 'str' object cannot be interpreted as an integer
Could you help me please?
Pandas and numpy version.
>>> pd.version.version
'0.16.2'
>>> np.version.full_version
'1.9.2'
After updating the versions, it worked!
>>> np.where(maupayment['log_month'] == maupayment['install_month'], 'install', 'existing')
array(['existing', 'install', 'existing', ..., 'install', 'install',
'install'],
dtype='<U8')
Answers:
You need to upgrade pandas to the latest version; in version 0.17.1 this works as expected.
Sample (the first value in column install_month is changed so that it matches log_month):
print(maupayment)
log_month user_id install_month payment
1 2013-06 1 2013-06 0
2 2013-06 2 2013-04 0
3 2013-06 3 2013-04 14994
print(np.where(maupayment['log_month'] == maupayment['install_month'], 'install', 'existing'))
['install' 'existing' 'existing']
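To mirror the R snippet, which stores the result in a new column, the np.where result can be assigned straight back to the DataFrame. The following is a minimal sketch on toy data modelled on the sample above; the column name user_type is an assumption (chosen to echo R's user.type), not something from the original post.

```python
import numpy as np
import pandas as pd

# Toy data modelled on the sample above (values are illustrative).
maupayment = pd.DataFrame({
    "log_month": ["2013-06", "2013-06", "2013-06"],
    "user_id": [1, 2, 3],
    "install_month": ["2013-06", "2013-04", "2013-04"],
    "payment": [0, 0, 14994],
})

# np.where(condition, value_if_true, value_if_false) plays the role of R's ifelse.
# "user_type" is a hypothetical column name.
maupayment["user_type"] = np.where(
    maupayment["install_month"] == maupayment["log_month"],
    "install",
    "existing",
)

print(maupayment["user_type"].tolist())
```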
One option is to use an anonymous function in combination with the pandas apply method.
First, set up some branching logic in a function:
def if_this_else_that(x, list_of_checks, yes_label, no_label):
    # Return yes_label when x is one of the values to look for, else no_label.
    if x in list_of_checks:
        res = yes_label
    else:
        res = no_label
    return res
This takes the x from lambda (see below), a list of things to look for, the yes label, and the no label.
For example, say we are looking at the IMDB dataset (imdb_df):
…and I want to add a new column called “new_rating” that shows whether the movie is mature or not.
I can use the pandas apply method along with my branching logic above:
imdb_df['new_rating'] = imdb_df['Rated'].apply(lambda x: if_this_else_that(x, ['PG', 'PG-13'], 'not mature', 'mature'))
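For comparison, the same membership check can probably be written without apply by combining Series.isin with np.where, which is closer in spirit to R's ifelse. This is only a sketch: the small imdb_df below is made-up stand-in data, not the real IMDB dataset.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the IMDB data; only the 'Rated' column is needed here.
imdb_df = pd.DataFrame({"Rated": ["PG", "R", "PG-13", "NC-17"]})

# isin builds a boolean mask; np.where picks a label for each row from it.
imdb_df["new_rating"] = np.where(
    imdb_df["Rated"].isin(["PG", "PG-13"]),
    "not mature",
    "mature",
)

print(imdb_df["new_rating"].tolist())
```

The apply version is more flexible when the logic grows, while the vectorized form tends to be faster on large frames.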
There are also times we need to combine this with another check. For example, some entries in the IMDB dataset are NaN. I can check for both NaN and the maturity rating as follows:
imdb_df['new_rating'] = imdb_df['Rated'].apply(lambda x: 'not provided' if x in ['nan'] else if_this_else_that(x, ['PG', 'PG-13'], 'not mature', 'mature'))
In this case my NaN values had first been converted to the string 'nan'. With genuine NaN values the test x in ['nan'] will not match, so use a check such as pd.isnull(x) instead.
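When the column holds genuine NaN values rather than the string 'nan', one way to handle them is to test with pd.isna before the membership check. The sketch below assumes a made-up imdb_df and a hypothetical helper named label_rating; neither comes from the original answer.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data with one genuinely missing rating.
imdb_df = pd.DataFrame({"Rated": ["PG", np.nan, "R", "PG-13"]})

def label_rating(x):
    # Hypothetical helper: handle missing values first, then the maturity check.
    if pd.isna(x):
        return "not provided"
    return "not mature" if x in ["PG", "PG-13"] else "mature"

imdb_df["new_rating"] = imdb_df["Rated"].apply(label_rating)

print(imdb_df["new_rating"].tolist())
```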