How to insert a list in a dataframe pandas
Question:
I have the following dataframe
id rule1 rule2 rule3
1 True True False
2 True True True
3 False False False
4 False True False
5 True False True
..
and a dictionary:
{'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
And I want to get an additional column list_of_rules
, which is a list of rules from the dictionary that are True
in the dataframe above.
id rule1 rule2 rule3 list_of_rules
1 True True False ['Rule one', 'Rule two']
2 True True True ['Rule one', 'Rule two', 'Rule three']
3 False False False ['']
4 False True False ['Rule two']
5 True False True ['Rule one', 'Rule three']
..
So far, I have the following solution:
df.loc[df['rule1'] == True, 'rule1'] = 'Rule one'
df.loc[df['rule2'] == True, 'rule2'] = 'Rule two'
df.loc[df['rule3'] == True, 'rule3'] = 'Rule three'
df.loc[df['rule1'] == False, 'rule1'] = ''
df.loc[df['rule2'] == False, 'rule2'] = ''
df.loc[df['rule3'] == False, 'rule3'] = ''
df['list_of_rules'] = df[['rule1', 'rule2', 'rule3']].apply("-".join, axis=1).str.strip('-').str.split('-')
df
which gives the following output:
id rule1 rule2 rule3 list_of_rules
1 True True False ['Rule one', 'Rule two']
2 True True True ['Rule one', 'Rule two', 'Rule three']
3 False False False ['']
4 False True False ['Rule two']
5 True False True ['Rule one', , 'Rule three']
..
Is there a way to fix fifth line, so there would be no double commas? Also, I would like to use the dictionary that I have above directly.
Thank you in advance
Answers:
Try this using a little trick with pandas.Dataframe.dot
:
import pandas as pd
data_dict = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'rule1': {0: True, 1: True, 2: False, 3: False, 4: True},
'rule2': {0: True, 1: True, 2: False, 3: True, 4: False},
'rule3': {0: False, 1: True, 2: False, 3: False, 4: True}}
df = pd.DataFrame(data_dict)
df = df.set_index('id')
d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
dfr = df.rename(columns=d)
df['list_of_rules'] = dfr.dot(dfr.columns+'-').str.strip('-').str.split('-')
df.reset_index()
Output:
id rule1 rule2 rule3 list_of_rules
0 1 True True False [Rule one, Rule two]
1 2 True True True [Rule one, Rule two, Rule three]
2 3 False False False []
3 4 False True False [Rule two]
4 5 True False True [Rule one, Rule three]
Let’s examine timings impacts on dataframe size:
from timeit import timeit
def bitflip_apply(df):
d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
return df.apply(lambda row: [d[rule] for rule in df.columns if row[rule]], axis=1)
def bitflip_mul(df):
d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
return df.mul(d).apply(lambda row: list(filter(None, row)), axis=1)
def scottb_dot(df):
d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
dfr = df.rename(columns=d)
return dfr.dot(dfr.columns+'-').str.strip('-').str.split('-')
res = pd.DataFrame(
index=[1, 10, 30, 100, 300, 1000],
columns='bitflip_apply bitflip_mul scottb_dot'.split(),
dtype=float
)
for i in res.index:
data_dict = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'rule1': {0: True, 1: True, 2: False, 3: False, 4: True},
'rule2': {0: True, 1: True, 2: False, 3: True, 4: False},
'rule3': {0: False, 1: True, 2: False, 3: False, 4: True}}
df = pd.DataFrame(data_dict)
df = df.set_index('id')
d = pd.concat([df]*i)
for j in res.columns:
stmt = '{}(d)'.format(j)
setp = 'from __main__ import d, {}'.format(j)
res.at[i, j] = timeit(stmt, setp, number=100)
res.plot(loglog=True)
Output:
Given:
df = pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'rule1': {0: True, 1: True, 2: False, 3: False, 4: True}, 'rule2': {0: True, 1: True, 2: False, 3: True, 4: False}, 'rule3': {0: False, 1: True, 2: False, 3: False, 4: True}})
df.set_index("id", inplace=True)
d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
You can use this:
df.apply(lambda row: [d[rule] for rule in df.columns if row[rule]], axis=1)
Or can use the fact that True evaluates to 1 and False to 0.
df.mul(d).apply(lambda row: list(filter(None, row)), axis=1)
Both gives you the desired output without having to deal with join and strip and split.
I have the following dataframe
id rule1 rule2 rule3
1 True True False
2 True True True
3 False False False
4 False True False
5 True False True
..
and a dictionary:
{'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
And I want to get an additional column list_of_rules
, which is a list of rules from the dictionary that are True
in the dataframe above.
id rule1 rule2 rule3 list_of_rules
1 True True False ['Rule one', 'Rule two']
2 True True True ['Rule one', 'Rule two', 'Rule three']
3 False False False ['']
4 False True False ['Rule two']
5 True False True ['Rule one', 'Rule three']
..
So far, I have the following solution:
df.loc[df['rule1'] == True, 'rule1'] = 'Rule one'
df.loc[df['rule2'] == True, 'rule2'] = 'Rule two'
df.loc[df['rule3'] == True, 'rule3'] = 'Rule three'
df.loc[df['rule1'] == False, 'rule1'] = ''
df.loc[df['rule2'] == False, 'rule2'] = ''
df.loc[df['rule3'] == False, 'rule3'] = ''
df['list_of_rules'] = df[['rule1', 'rule2', 'rule3']].apply("-".join, axis=1).str.strip('-').str.split('-')
df
which gives the following output:
id rule1 rule2 rule3 list_of_rules
1 True True False ['Rule one', 'Rule two']
2 True True True ['Rule one', 'Rule two', 'Rule three']
3 False False False ['']
4 False True False ['Rule two']
5 True False True ['Rule one', , 'Rule three']
..
Is there a way to fix fifth line, so there would be no double commas? Also, I would like to use the dictionary that I have above directly.
Thank you in advance
Try this using a little trick with pandas.Dataframe.dot
:
import pandas as pd
data_dict = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'rule1': {0: True, 1: True, 2: False, 3: False, 4: True},
'rule2': {0: True, 1: True, 2: False, 3: True, 4: False},
'rule3': {0: False, 1: True, 2: False, 3: False, 4: True}}
df = pd.DataFrame(data_dict)
df = df.set_index('id')
d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
dfr = df.rename(columns=d)
df['list_of_rules'] = dfr.dot(dfr.columns+'-').str.strip('-').str.split('-')
df.reset_index()
Output:
id rule1 rule2 rule3 list_of_rules
0 1 True True False [Rule one, Rule two]
1 2 True True True [Rule one, Rule two, Rule three]
2 3 False False False []
3 4 False True False [Rule two]
4 5 True False True [Rule one, Rule three]
Let’s examine timings impacts on dataframe size:
from timeit import timeit
def bitflip_apply(df):
d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
return df.apply(lambda row: [d[rule] for rule in df.columns if row[rule]], axis=1)
def bitflip_mul(df):
d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
return df.mul(d).apply(lambda row: list(filter(None, row)), axis=1)
def scottb_dot(df):
d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
dfr = df.rename(columns=d)
return dfr.dot(dfr.columns+'-').str.strip('-').str.split('-')
res = pd.DataFrame(
index=[1, 10, 30, 100, 300, 1000],
columns='bitflip_apply bitflip_mul scottb_dot'.split(),
dtype=float
)
for i in res.index:
data_dict = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'rule1': {0: True, 1: True, 2: False, 3: False, 4: True},
'rule2': {0: True, 1: True, 2: False, 3: True, 4: False},
'rule3': {0: False, 1: True, 2: False, 3: False, 4: True}}
df = pd.DataFrame(data_dict)
df = df.set_index('id')
d = pd.concat([df]*i)
for j in res.columns:
stmt = '{}(d)'.format(j)
setp = 'from __main__ import d, {}'.format(j)
res.at[i, j] = timeit(stmt, setp, number=100)
res.plot(loglog=True)
Output:
Given:
df = pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'rule1': {0: True, 1: True, 2: False, 3: False, 4: True}, 'rule2': {0: True, 1: True, 2: False, 3: True, 4: False}, 'rule3': {0: False, 1: True, 2: False, 3: False, 4: True}})
df.set_index("id", inplace=True)
d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
You can use this:
df.apply(lambda row: [d[rule] for rule in df.columns if row[rule]], axis=1)
Or can use the fact that True evaluates to 1 and False to 0.
df.mul(d).apply(lambda row: list(filter(None, row)), axis=1)
Both gives you the desired output without having to deal with join and strip and split.