How to insert a list in a dataframe pandas

Question:

I have the following dataframe

id  rule1 rule2 rule3
1   True  True  False
2   True  True  True
3   False False False
4   False True  False
5   True  False True
..

and a dictionary:

{'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}

And I want to get an additional column list_of_rules, which is a list of rules from the dictionary that are True in the dataframe above.

id  rule1 rule2 rule3  list_of_rules
1   True  True  False  ['Rule one', 'Rule two']
2   True  True  True   ['Rule one', 'Rule two', 'Rule three']
3   False False False  ['']
4   False True  False  ['Rule two']
5   True  False True   ['Rule one', 'Rule three']
..

So far, I have the following solution:

df.loc[df['rule1'] == True, 'rule1'] = 'Rule one'
df.loc[df['rule2'] == True, 'rule2'] = 'Rule two'
df.loc[df['rule3'] == True, 'rule3'] = 'Rule three'

df.loc[df['rule1'] == False, 'rule1'] = ''
df.loc[df['rule2'] == False, 'rule2'] = ''
df.loc[df['rule3'] == False, 'rule3'] = ''

df['list_of_rules'] = df[['rule1', 'rule2', 'rule3']].apply("-".join, axis=1).str.strip('-').str.split('-')

df

which gives the following output:

id  rule1 rule2 rule3  list_of_rules
1   True  True  False  ['Rule one', 'Rule two']
2   True  True  True   ['Rule one', 'Rule two', 'Rule three']
3   False False False  ['']
4   False True  False  ['Rule two']
5   True  False True   ['Rule one', , 'Rule three']
..

Is there a way to fix fifth line, so there would be no double commas? Also, I would like to use the dictionary that I have above directly.

Thank you in advance

Asked By: Matas M

||

Answers:

Try this using a little trick with pandas.Dataframe.dot:

import pandas as pd

data_dict = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
             'rule1': {0: True, 1: True, 2: False, 3: False, 4: True},
             'rule2': {0: True, 1: True, 2: False, 3: True, 4: False},
             'rule3': {0: False, 1: True, 2: False, 3: False, 4: True}}
df = pd.DataFrame(data_dict)
df = df.set_index('id')
d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
dfr = df.rename(columns=d)
df['list_of_rules'] = dfr.dot(dfr.columns+'-').str.strip('-').str.split('-')
df.reset_index()

Output:

   id  rule1  rule2  rule3                     list_of_rules
0   1   True   True  False              [Rule one, Rule two]
1   2   True   True   True  [Rule one, Rule two, Rule three]
2   3  False  False  False                                []
3   4  False   True  False                        [Rule two]
4   5   True  False   True            [Rule one, Rule three]

Let’s examine timings impacts on dataframe size:

from timeit import timeit

def bitflip_apply(df):
    d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
    return df.apply(lambda row: [d[rule] for rule in df.columns if row[rule]], axis=1)

def bitflip_mul(df):
    d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
    return df.mul(d).apply(lambda row: list(filter(None, row)), axis=1)

def scottb_dot(df):
    d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}
    dfr = df.rename(columns=d)
    return dfr.dot(dfr.columns+'-').str.strip('-').str.split('-')


res = pd.DataFrame(
    index=[1, 10, 30, 100, 300, 1000],
    columns='bitflip_apply bitflip_mul scottb_dot'.split(),
    dtype=float
)

for i in res.index:
    data_dict = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
             'rule1': {0: True, 1: True, 2: False, 3: False, 4: True},
             'rule2': {0: True, 1: True, 2: False, 3: True, 4: False},
             'rule3': {0: False, 1: True, 2: False, 3: False, 4: True}}
    df = pd.DataFrame(data_dict)
    df = df.set_index('id')
    d = pd.concat([df]*i)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

res.plot(loglog=True)

Output:

enter image description here

Answered By: Scott Boston

Given:

df = pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'rule1': {0: True, 1: True, 2: False, 3: False, 4: True}, 'rule2': {0: True, 1: True, 2: False, 3: True, 4: False}, 'rule3': {0: False, 1: True, 2: False, 3: False, 4: True}})
df.set_index("id", inplace=True)
d = {'rule1': 'Rule one', 'rule2': 'Rule two', 'rule3': 'Rule three'}

You can use this:

df.apply(lambda row: [d[rule] for rule in df.columns if row[rule]], axis=1)

Or can use the fact that True evaluates to 1 and False to 0.

df.mul(d).apply(lambda row: list(filter(None, row)), axis=1)

Both gives you the desired output without having to deal with join and strip and split.

Answered By: bitflip
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.