Adding a df column based on another column, with multiple values mapping to the same new column value
Question:
I have a dataframe like this:
df1 = pd.DataFrame({'col1' : ['cat', 'cat', 'dog', 'green', 'blue']})
and I want a new column that gives the category, like this:
dfoutput = pd.DataFrame({'col1' : ['cat', 'cat', 'dog', 'green', 'blue'],
                         'col2' : ['animal', 'animal', 'animal', 'color', 'color']})
I know I could do it inefficiently using .loc:
df1.loc[df1['col1'] == 'cat','col2'] = 'animal'
df1.loc[df1['col1'] == 'dog','col2'] = 'animal'
How do I combine cat and dog to both be animal? This doesn’t work:
df1.loc[df1['col1'] == 'cat' | df1['col1'] == 'dog','col2'] = 'animal'
Answers:
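On why the attempt in the question fails: in Python, | binds more tightly than ==, so the unparenthesised expression tries to evaluate 'cat' | df1['col1'] first, which typically raises an error. Wrapping each comparison in parentheses, or using isin, avoids this. A minimal sketch reusing the question's df1:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['cat', 'cat', 'dog', 'green', 'blue']})

# Option 1: parenthesise each comparison before combining them with |
df1.loc[(df1['col1'] == 'cat') | (df1['col1'] == 'dog'), 'col2'] = 'animal'

# Option 2: isin tests membership in a list of values in a single step
df1.loc[df1['col1'].isin(['green', 'blue']), 'col2'] = 'color'

print(df1)
```

This still needs one statement per category, so the dictionary-based answers below scale better as the number of categories grows.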
Build your dict, then use map:
d = {'dog': 'ani', 'cat': 'ani', 'green': 'color', 'blue': 'color'}
df1['col2'] = df1.col1.map(d)
df1
    col1   col2
0    cat    ani
1    cat    ani
2    dog    ani
3  green  color
4   blue  color
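One thing to keep in mind with map: a value that has no key in the dict comes back as NaN. A small sketch of that behaviour, where 'fish' and the fallback label 'other' are made up for illustration and are not part of the question:

```python
import pandas as pd

# 'fish' is a hypothetical value with no entry in the dict
df = pd.DataFrame({'col1': ['cat', 'dog', 'green', 'fish']})
d = {'dog': 'ani', 'cat': 'ani', 'green': 'color', 'blue': 'color'}

mapped = df['col1'].map(d)           # 'fish' has no key in d, so it maps to NaN
df['col2'] = mapped.fillna('other')  # 'other' is an illustrative fallback label

print(df)
```

Whether to fill the gaps or leave them as NaN depends on how you want unknown values to show up downstream.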
Since multiple items may belong to a single category, I suggest you start with a dictionary mapping each category to its items:
cat_item = {'animal': ['cat', 'dog'], 'color': ['green', 'blue']}
You’ll probably find this easier to maintain. Then reverse your dictionary using a dictionary comprehension, followed by pd.Series.map:
item_cat = {w: k for k, v in cat_item.items() for w in v}
df1['col2'] = df1['col1'].map(item_cat)
print(df1)
    col1    col2
0    cat  animal
1    cat  animal
2    dog  animal
3  green   color
4   blue   color
You can also use pd.Series.replace, but this will generally be less efficient.
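Besides speed, the two methods differ in how they treat values missing from the dictionary. A minimal sketch, assuming the same item_cat mapping written out by hand and a made-up extra value 'fish':

```python
import pandas as pd

# 'fish' is a hypothetical value that is absent from the mapping
df1 = pd.DataFrame({'col1': ['cat', 'cat', 'dog', 'green', 'blue', 'fish']})
item_cat = {'cat': 'animal', 'dog': 'animal', 'green': 'color', 'blue': 'color'}

via_map = df1['col1'].map(item_cat)          # unmapped 'fish' becomes NaN
via_replace = df1['col1'].replace(item_cat)  # unmapped 'fish' is left as 'fish'

print(pd.DataFrame({'map': via_map, 'replace': via_replace}))
```

So replace can be the better fit when unknown values should pass through unchanged, while map makes them easy to spot.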
You could also try np.select, like this:
import numpy as np
# one boolean condition per category, checked in order
options = [df1.col1.str.contains('cat|dog'),
           df1.col1.str.contains('green|blue')]
settings = ['animal', 'color']
df1['setting'] = np.select(options, settings)
I’ve found this works quite fast, even with very big dataframes.
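Two caveats with the np.select approach: str.contains does regex substring matching, so a value such as 'category' would also match 'cat|dog', and rows that match no condition get the default (0 unless you pass one). A hedged variant using exact matching with isin and an explicit default, where 'category' and 'unknown' are illustrative values not taken from the question:

```python
import numpy as np
import pandas as pd

# 'category' is a hypothetical value that merely contains the substring 'cat'
df1 = pd.DataFrame({'col1': ['cat', 'cat', 'dog', 'green', 'blue', 'category']})

# Substring matching flags 'category' as an animal
substring_hit = df1['col1'].str.contains('cat|dog')

# Exact matching with isin; conditions are checked in order
conditions = [df1['col1'].isin(['cat', 'dog']),
              df1['col1'].isin(['green', 'blue'])]
choices = ['animal', 'color']

# An explicit string default keeps the result consistently string-typed
df1['setting'] = np.select(conditions, choices, default='unknown')

print(df1)
```

If substring matching is what you actually want, the original str.contains conditions can stay; the explicit default is still worth keeping.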