Pandas: if row in column A contains "x", write "y" to row in column B
Question:
In pandas, I’m looking for a way to write conditional values to each row in column B, based on substrings in the corresponding rows of column A.
So if a cell in A contains "BULL", write "Long" to B. Or if a cell in A contains "BEAR", write "Short" to B.
Desired output:
A B
"BULL APPLE X5" "Long"
"BEAR APPLE X5" "Short"
"BULL APPLE X5" "Long"
B is initially empty: df = pd.DataFrame([['BULL APPLE X5',''],['BEAR APPLE X5',''],['BULL APPLE X5','']],columns=['A','B'])
Answers:
To populate df['B'], you can try the following method:
def applyFunc(s):
    # Use a substring test so values such as 'BULL APPLE X5' also match
    if 'BULL' in s:
        return 'Long'
    elif 'BEAR' in s:
        return 'Short'
    return ''
df['B'] = df['A'].apply(applyFunc)
df
>>
A B
0 BULL Long
1 BEAR Short
2 BULL Long
For each value in df['A'], apply calls applyFunc with that value as its argument, and the returned value ends up in the same row of df['B']. What happens behind the scenes is slightly different, though: the values are not written into df['B'] one by one. Instead, a new Series is built and, at the end, that new Series is assigned to df['B'].
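To make that last point concrete, here is a minimal sketch. The name label_for is a hypothetical stand-in for applyFunc (using a substring test), and the sample rows are taken from the question. It shows that apply hands back a new Series and that nothing is written to the DataFrame until the result is assigned:

```python
import pandas as pd

df = pd.DataFrame({'A': ['BULL APPLE X5', 'BEAR APPLE X5', 'BULL APPLE X5']})

def label_for(s):
    # Hypothetical helper: substring check on a single cell value.
    if 'BULL' in s:
        return 'Long'
    if 'BEAR' in s:
        return 'Short'
    return ''

result = df['A'].apply(label_for)   # a new Series on the same index as df['A']
print(type(result).__name__)        # Series
print('B' in df.columns)            # False: df itself has not been changed yet

df['B'] = result                    # the assignment is what creates column B
print(df)
```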
Your code would raise an error because you are creating the DataFrame incorrectly. Just create a single column A, then add B based on A:
import pandas as pd
df = pd.DataFrame(["BULL","BEAR","BULL"], columns=['A'])
df["B"] = ["Long" if ele == "BULL" else "Short" for ele in df["A"]]
print(df)
A B
0 BULL Long
1 BEAR Short
2 BULL Long
Or do your logic on the data before you create the DataFrame:
import pandas as pd
data = ["BULL","BEAR","BULL"]
data2 = ["Long" if ele == "BULL" else "Short" for ele in data]
df = pd.DataFrame(list(zip(data, data2)), columns=['A','B'])
print(df)
A B
0 BULL Long
1 BEAR Short
2 BULL Long
For your edit:
df = pd.DataFrame([['BULL APPLE X5',''],['BEAR APPLE X5',''],['BULL APPLE X5','']], columns=['A','B'])
df["B"] = df["A"].map(lambda x: "Long" if "BULL" in x else "Short" if "BEAR" in x else "")
print(df)
A B
0 BULL APPLE X5 Long
1 BEAR APPLE X5 Short
2 BULL APPLE X5 Long
Or just add the column after:
df = pd.DataFrame(['BULL APPLE X5','BEAR APPLE X5','BLL APPLE X5'], columns=['A'])
df["B"] = df["A"].map(lambda x: "Long" if "BULL" in x else "Short" if "BEAR" in x else "")
print(df)
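If more keywords are likely to be added later, the nested conditional expression gets hard to read. One possible variation, sketched here under the assumption that a lookup table is acceptable: LABELS and label_for are hypothetical names introduced only for this illustration, and the first keyword found in the dict's order wins.

```python
import pandas as pd

# Hypothetical keyword -> label table; earlier entries take priority.
LABELS = {'BULL': 'Long', 'BEAR': 'Short'}

def label_for(text, labels=LABELS, default=''):
    # Return the label of the first keyword contained in text, else the default.
    for keyword, label in labels.items():
        if keyword in text:
            return label
    return default

df = pd.DataFrame({'A': ['BULL APPLE X5', 'BEAR APPLE X5', 'BLL APPLE X5']})
df['B'] = df['A'].map(label_for)
print(df)
```

The third row contains neither keyword, so it falls through to the empty default.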
Or using contains:
df = pd.DataFrame([['BULL APPLE X5',''],['BEAR APPLE X5',''],['BULL APPLE X5','']], columns=['A','B'])
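# Assign through .loc with a boolean mask; chained indexing like df["B"][mask] = ... may not update df
df.loc[df['A'].str.contains("BULL"), "B"] = "Long"
df.loc[df['A'].str.contains("BEAR"), "B"] = "Short"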
print(df)
A B
0 BULL APPLE X5 Long
1 BEAR APPLE X5 Short
2 BULL APPLE X5 Long
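One case the contains approach does not address is missing values in A. The sketch below assumes A may hold a missing entry (the None row is made up for illustration). With object-dtype strings, str.contains typically returns NaN for such rows, and a mask containing NaN cannot be used for boolean indexing, so na=False is passed to treat them as non-matches. regex=False is an optional extra here, since BULL and BEAR are plain substrings rather than patterns.

```python
import pandas as pd

df = pd.DataFrame({'A': ['BULL APPLE X5', None, 'BEAR APPLE X5'], 'B': ''})

# na=False: rows where A is missing count as "does not contain".
is_bull = df['A'].str.contains('BULL', regex=False, na=False)
is_bear = df['A'].str.contains('BEAR', regex=False, na=False)

df.loc[is_bull, 'B'] = 'Long'
df.loc[is_bear, 'B'] = 'Short'
print(df)
```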
You could use str.extract to search for the regex pattern BULL|BEAR, and then use Series.map to replace those strings with Long or Short:
In [50]: df = pd.DataFrame([['BULL APPLE X5',''],['BEAR APPLE X5',''],['BULL APPLE X5','']],columns=['A','B'])
In [51]: df['B'] = df['A'].str.extract(r'(BULL|BEAR)', expand=False).map({'BULL':'Long', 'BEAR':'Short'})
In [55]: df
Out[55]:
A B
0 BULL APPLE X5 Long
1 BEAR APPLE X5 Short
2 BULL APPLE X5 Long
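A short sketch of what happens to rows that match neither keyword, assuming a pandas version where str.extract returns a DataFrame unless expand=False is passed. The 'BLL APPLE X5' row is borrowed from the earlier answer to provide a non-matching value, and the fillna('') step is an assumption chosen to match the initially empty B from the question.

```python
import pandas as pd

df = pd.DataFrame({'A': ['BULL APPLE X5', 'BEAR APPLE X5', 'BLL APPLE X5']})

# expand=False asks for a Series (single capture group) instead of a DataFrame.
keyword = df['A'].str.extract(r'(BULL|BEAR)', expand=False)

# Values not found in the dict, including the NaN from a failed match, map to NaN;
# fillna('') turns those back into the empty default.
df['B'] = keyword.map({'BULL': 'Long', 'BEAR': 'Short'}).fillna('')
print(df)
```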
However, forming the intermediate Series with str.extract is quite slow compared to df['A'].map(lambda x: ...). Using IPython’s %timeit to time the benchmarks:
In [5]: df = pd.concat([df]*10000)
In [6]: %timeit df['A'].str.extract(r'(BULL|BEAR)', expand=False).map({'BULL':'Long', 'BEAR':'Short'})
10 loops, best of 3: 39.7 ms per loop
In [7]: %timeit df["A"].map(lambda x: "Long" if "BULL" in x else "Short" if "BEAR" in x else "")
100 loops, best of 3: 4.98 ms per loop
The majority of time is spent in str.extract:
In [8]: %timeit df['A'].str.extract(r'(BULL|BEAR)', expand=False)
10 loops, best of 3: 37.1 ms per loop
while the call to Series.map is relatively fast:
In [9]: x = df['A'].str.extract(r'(BULL|BEAR)', expand=False)
In [10]: %timeit x.map({'BULL':'Long', 'BEAR':'Short'})
1000 loops, best of 3: 1.82 ms per loop
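The timings above come from one machine and one pandas version, so they are worth re-measuring locally. A rough sketch of how that could be done outside IPython with the standard timeit module; the names base, big, via_extract and via_lambda are hypothetical, and no particular result is implied.

```python
import timeit
import pandas as pd

base = pd.DataFrame({'A': ['BULL APPLE X5', 'BEAR APPLE X5', 'BULL APPLE X5']})
big = pd.concat([base] * 10000, ignore_index=True)  # 30,000 rows, as in the benchmark

def via_extract():
    extracted = big['A'].str.extract(r'(BULL|BEAR)', expand=False)
    return extracted.map({'BULL': 'Long', 'BEAR': 'Short'})

def via_lambda():
    return big['A'].map(lambda x: 'Long' if 'BULL' in x else 'Short' if 'BEAR' in x else '')

# Check that both approaches agree on this data before comparing their speed.
assert via_extract().tolist() == via_lambda().tolist()

for name, fn in [('str.extract + map', via_extract), ('map(lambda)', via_lambda)]:
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f'{name}: {best * 1e3:.2f} ms per call')
```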
Alternatively, you can use np.select if you don’t mind using NumPy:
import numpy as np
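df['B'] = np.select(
    [df['A'].str.contains('BULL'), df['A'].str.contains('BEAR')],
    ['Long', 'Short'],
    default='',  # keep the default a string so it shares a dtype with the choices
)
If neither BULL nor BEAR is found, B is left as an empty string. A default of np.nan does not mix with string choices: depending on the NumPy version, it is either stored as the string 'nan' or raises a TypeError.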
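np.select evaluates the conditions in order and takes the choice for the first one that is true, which matters when a row contains more than one keyword. A small sketch: the 'BULL BEAR SPREAD' and 'APPLE X5' rows are made up for illustration, and default='' is an assumption chosen to match the initially empty B.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['BULL APPLE X5', 'BEAR APPLE X5', 'BULL BEAR SPREAD', 'APPLE X5']})

# Order matters: a row matching both conditions gets the first matching choice.
conditions = [df['A'].str.contains('BULL'), df['A'].str.contains('BEAR')]
choices = ['Long', 'Short']

df['B'] = np.select(conditions, choices, default='')
print(df)
```

Swapping the two entries in conditions and choices would label the 'BULL BEAR SPREAD' row as Short instead.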