Pandas: Assigning multiple *new* columns simultaneously
Question:
I have a DataFrame df
with a column containing labels for each row (in addition to some relevant data for each row). I have a dictionary labeldict
with keys equal to the possible labels and values equal to 2-tuples of information related to that label. I’d like to tack two new columns onto my frame, one for each part of the 2-tuple corresponding to the label for each row.
Here is the setup:
import pandas as pd
import numpy as np
np.random.seed(1)
n = 10
labels = list('abcdef')
colors = ['red', 'green', 'blue']
sizes = ['small', 'medium', 'large']
labeldict = {c: (np.random.choice(colors), np.random.choice(sizes)) for c in labels}
df = pd.DataFrame({'label': np.random.choice(labels, n),
'somedata': np.random.randn(n)})
I can get what I want by running:
df['color'], df['size'] = zip(*df['label'].map(labeldict))
print df
label somedata color size
0 b 0.196643 red medium
1 c -1.545214 green small
2 a -0.088104 green small
3 c 0.852239 green small
4 b 0.677234 red medium
5 c -0.106878 green small
6 a 0.725274 green small
7 d 0.934889 red medium
8 a 1.118297 green small
9 c 0.055613 green small
But how can I do this if I don’t want to manually type out the two columns on the left side of the assignment? I.e. how can I create multiple new columns on the fly. For example, if I had 10-tuples in labeldict
instead of 2-tuples, this would be a real pain as currently written. Here are a couple things that don’t work:
# set up attrlist for later use
attrlist = ['color', 'size']
# non-working idea 1)
df[attrlist] = zip(*df['label'].map(labeldict))
# non-working idea 2)
df.loc[:, attrlist] = zip(*df['label'].map(labeldict))
This does work, but seems like a hack:
for a in attrlist:
df[a] = 0
df[attrlist] = zip(*df['label'].map(labeldict))
Better solutions?
Answers:
Instead of doing what you’re doing with labeldict, you could make that information into a DataFrame and then join it with your original one:
>>> labeldf = pandas.DataFrame([(np.random.choice(colors), np.random.choice(sizes)) for c in labels], columns=['color', 'size'], index=labels)
>>> df.join(labeldf, on='label')
label somedata color size
0 a -1.709973 red medium
1 b 0.099109 blue medium
2 a -0.427323 red medium
3 b 0.474995 blue medium
4 b -2.819208 blue medium
5 d -0.998888 red small
6 b 0.713357 blue medium
7 d 0.331989 red small
8 e -0.906240 green large
9 c -0.501916 blue large
You can use merge instead:
>>> ld = pd.DataFrame(labeldict).T
>>> ld.columns = ['color', 'size']
>>> ld.index.name = 'label'
>>> df.merge(ld.reset_index(), on='label')
label somedata color size
0 b 1.462108 red medium
1 c -2.060141 green small
2 c 1.133769 green small
3 c 0.042214 green small
4 e -0.322417 red medium
5 e -1.099891 red medium
6 e -0.877858 red medium
7 e 0.582815 red medium
8 f -0.384054 red large
9 d -0.172428 red medium
If you want to add multiple columns to a DataFrame
as part of a method chain, you can use apply
. The first step is to create a function that will transform a row represented as a Series
into the form you want. Then you can call apply
to use this function on each row.
def append_label_attributes(row: pd.Series, labelmap: dict) -> pd.Series:
result = row.copy()
result['color'] = labelmap[result['label']][0]
result['size'] = labelmap[result['label']][1]
return result
df = (
pd.DataFrame(
{
'label': np.random.choice(labels, n),
'somedata': np.random.randn(n)}
)
.apply(append_label_attributes, axis='columns', labelmap=labeldict)
)
Just use result_type='expand'
in pandas apply
df
Out[78]:
a b
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
df[['mean', 'std', 'max']]=df[['a','b']].apply(mathOperationsTuple, axis=1, result_type='expand')
df
Out[80]:
a b mean std max
0 0 1 0.5 0.5 1.0
1 2 3 2.5 0.5 3.0
2 4 5 4.5 0.5 5.0
3 6 7 6.5 0.5 7.0
4 8 9 8.5 0.5 9.0
and here some copy paste code
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(10).reshape(5,2), columns=['a','b'])
print('df',df, sep='n')
print()
def mathOperationsTuple(arr):
return np.mean(arr), np.std(arr), np.amax(arr)
df[['mean', 'std', 'max']]=df[['a','b']].apply(mathOperationsTuple, axis=1, result_type='expand')
print('df',df, sep='n')
This should work:
df[['color','size']] = list(df['label'].apply(labeldict))
I have a DataFrame df
with a column containing labels for each row (in addition to some relevant data for each row). I have a dictionary labeldict
with keys equal to the possible labels and values equal to 2-tuples of information related to that label. I’d like to tack two new columns onto my frame, one for each part of the 2-tuple corresponding to the label for each row.
Here is the setup:
import pandas as pd
import numpy as np
np.random.seed(1)
n = 10
labels = list('abcdef')
colors = ['red', 'green', 'blue']
sizes = ['small', 'medium', 'large']
labeldict = {c: (np.random.choice(colors), np.random.choice(sizes)) for c in labels}
df = pd.DataFrame({'label': np.random.choice(labels, n),
'somedata': np.random.randn(n)})
I can get what I want by running:
df['color'], df['size'] = zip(*df['label'].map(labeldict))
print df
label somedata color size
0 b 0.196643 red medium
1 c -1.545214 green small
2 a -0.088104 green small
3 c 0.852239 green small
4 b 0.677234 red medium
5 c -0.106878 green small
6 a 0.725274 green small
7 d 0.934889 red medium
8 a 1.118297 green small
9 c 0.055613 green small
But how can I do this if I don’t want to manually type out the two columns on the left side of the assignment? I.e. how can I create multiple new columns on the fly. For example, if I had 10-tuples in labeldict
instead of 2-tuples, this would be a real pain as currently written. Here are a couple things that don’t work:
# set up attrlist for later use
attrlist = ['color', 'size']
# non-working idea 1)
df[attrlist] = zip(*df['label'].map(labeldict))
# non-working idea 2)
df.loc[:, attrlist] = zip(*df['label'].map(labeldict))
This does work, but seems like a hack:
for a in attrlist:
df[a] = 0
df[attrlist] = zip(*df['label'].map(labeldict))
Better solutions?
Instead of doing what you’re doing with labeldict, you could make that information into a DataFrame and then join it with your original one:
>>> labeldf = pandas.DataFrame([(np.random.choice(colors), np.random.choice(sizes)) for c in labels], columns=['color', 'size'], index=labels)
>>> df.join(labeldf, on='label')
label somedata color size
0 a -1.709973 red medium
1 b 0.099109 blue medium
2 a -0.427323 red medium
3 b 0.474995 blue medium
4 b -2.819208 blue medium
5 d -0.998888 red small
6 b 0.713357 blue medium
7 d 0.331989 red small
8 e -0.906240 green large
9 c -0.501916 blue large
You can use merge instead:
>>> ld = pd.DataFrame(labeldict).T
>>> ld.columns = ['color', 'size']
>>> ld.index.name = 'label'
>>> df.merge(ld.reset_index(), on='label')
label somedata color size
0 b 1.462108 red medium
1 c -2.060141 green small
2 c 1.133769 green small
3 c 0.042214 green small
4 e -0.322417 red medium
5 e -1.099891 red medium
6 e -0.877858 red medium
7 e 0.582815 red medium
8 f -0.384054 red large
9 d -0.172428 red medium
If you want to add multiple columns to a DataFrame
as part of a method chain, you can use apply
. The first step is to create a function that will transform a row represented as a Series
into the form you want. Then you can call apply
to use this function on each row.
def append_label_attributes(row: pd.Series, labelmap: dict) -> pd.Series:
result = row.copy()
result['color'] = labelmap[result['label']][0]
result['size'] = labelmap[result['label']][1]
return result
df = (
pd.DataFrame(
{
'label': np.random.choice(labels, n),
'somedata': np.random.randn(n)}
)
.apply(append_label_attributes, axis='columns', labelmap=labeldict)
)
Just use result_type='expand'
in pandas apply
df
Out[78]:
a b
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
df[['mean', 'std', 'max']]=df[['a','b']].apply(mathOperationsTuple, axis=1, result_type='expand')
df
Out[80]:
a b mean std max
0 0 1 0.5 0.5 1.0
1 2 3 2.5 0.5 3.0
2 4 5 4.5 0.5 5.0
3 6 7 6.5 0.5 7.0
4 8 9 8.5 0.5 9.0
and here some copy paste code
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(10).reshape(5,2), columns=['a','b'])
print('df',df, sep='n')
print()
def mathOperationsTuple(arr):
return np.mean(arr), np.std(arr), np.amax(arr)
df[['mean', 'std', 'max']]=df[['a','b']].apply(mathOperationsTuple, axis=1, result_type='expand')
print('df',df, sep='n')
This should work:
df[['color','size']] = list(df['label'].apply(labeldict))