# Remap values in pandas column with a dict, preserve NaNs

## Question:

I have a dictionary which looks like this: `di = {1: "A", 2: "B"}`

I would like to apply it to the `col1`

column of a dataframe similar to:

```
col1 col2
0 w a
1 1 2
2 2 NaN
```

to get:

```
col1 col2
0 w a
1 A 2
2 B NaN
```

How can I best do this?

## Answers:

There is a bit of ambiguity in your question. There are at least ~~three~~ two interpretations:

- the keys in
`di`

refer to index values - the keys in
`di`

refer to`df['col1']`

values - the keys in
`di`

refer to index locations (not the OP’s question, but thrown in for fun.)

Below is a solution for each case.

**Case 1:**

If the keys of `di`

are meant to refer to index values, then you could use the `update`

method:

```
df['col1'].update(pd.Series(di))
```

For example,

```
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {0: "A", 2: "B"}
# The value at the 0-index is mapped to 'A', the value at the 2-index is mapped to 'B'
df['col1'].update(pd.Series(di))
print(df)
```

yields

```
col1 col2
1 w a
2 B 30
0 A NaN
```

I’ve modified the values from your original post so it is clearer what `update`

is doing.

Note how the keys in `di`

are associated with index values. The order of the index values — that is, the index *locations* — does not matter.

**Case 2:**

If the keys in `di`

refer to `df['col1']`

values, then @DanAllan and @DSM show how to achieve this with `replace`

:

```
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
print(df)
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {10: "A", 20: "B"}
# The values 10 and 20 are replaced by 'A' and 'B'
df['col1'].replace(di, inplace=True)
print(df)
```

yields

```
col1 col2
1 w a
2 A 30
0 B NaN
```

Note how in this case the keys in `di`

were changed to match *values* in `df['col1']`

.

**Case 3:**

If the keys in `di`

refer to index locations, then you could use

```
df['col1'].put(di.keys(), di.values())
```

since

```
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
di = {0: "A", 2: "B"}
# The values at the 0 and 2 index locations are replaced by 'A' and 'B'
df['col1'].put(di.keys(), di.values())
print(df)
```

yields

```
col1 col2
1 A a
2 10 30
0 B NaN
```

Here, the first and third rows were altered, because the keys in `di`

are `0`

and `2`

, which with Python’s 0-based indexing refer to the first and third locations.

You can use `.replace`

. For example:

```
>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
col1 col2
0 w a
1 1 2
2 2 NaN
>>> df.replace({"col1": di})
col1 col2
0 w a
1 A 2
2 B NaN
```

or directly on the `Series`

, i.e. `df["col1"].replace(di, inplace=True)`

.

`map`

can be much faster than `replace`

If your dictionary has more than a couple of keys, using `map`

can be much faster than `replace`

. There are two versions of this approach, depending on whether your dictionary exhaustively maps all possible values (and also whether you want non-matches to keep their values or be converted to NaNs):

### Exhaustive Mapping

In this case, the form is very simple:

```
df['col1'].map(di) # note: if the dictionary does not exhaustively map all
# entries then non-matched entries are changed to NaNs
```

Although `map`

most commonly takes a function as its argument, it can alternatively take a dictionary or series: Documentation for Pandas.series.map

### Non-Exhaustive Mapping

If you have a non-exhaustive mapping and wish to retain the existing variables for non-matches, you can add `fillna`

:

```
df['col1'].map(di).fillna(df['col1'])
```

as in @jpp’s answer here: Replace values in a pandas series via dictionary efficiently

### Benchmarks

Using the following data with pandas version 0.23.1:

```
di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H" }
df = pd.DataFrame({ 'col1': np.random.choice( range(1,9), 100000 ) })
```

and testing with `%timeit`

, it appears that `map`

is approximately 10x faster than `replace`

.

Note that your speedup with `map`

will vary with your data. The largest speedup appears to be with large dictionaries and exhaustive replaces. See @jpp answer (linked above) for more extensive benchmarks and discussion.

Adding to this question if you ever have more than one columns to remap in a data dataframe:

```
def remap(data,dict_labels):
"""
This function take in a dictionnary of labels : dict_labels
and replace the values (previously labelencode) into the string.
ex: dict_labels = {{'col1':{1:'A',2:'B'}}
"""
for field,values in dict_labels.items():
print("I am remapping %s"%field)
data.replace({field:values},inplace=True)
print("DONE")
return data
```

Hope it can be useful to someone.

Cheers

A more native pandas approach is to apply a replace function as below:

```
def multiple_replace(dict, text):
# Create a regular expression from the dictionary keys
regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))
# For each match, look-up corresponding value in dictionary
return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)
```

Once you defined the function, you can apply it to your dataframe.

```
di = {1: "A", 2: "B"}
df['col1'] = df.apply(lambda row: multiple_replace(di, row['col1']), axis=1)
```

DSM has the accepted answer, but the coding doesn’t seem to work for everyone. Here is one that works with the current version of pandas (0.23.4 as of 8/2018):

```
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 2, 3, 1],
'col2': ['negative', 'positive', 'neutral', 'neutral', 'positive']})
conversion_dict = {'negative': -1, 'neutral': 0, 'positive': 1}
df['converted_column'] = df['col2'].replace(conversion_dict)
print(df.head())
```

You’ll see it looks like:

```
col1 col2 converted_column
0 1 negative -1
1 2 positive 1
2 2 neutral 0
3 3 neutral 0
4 1 positive 1
```

The docs for pandas.DataFrame.replace are here.

Or do `apply`

:

```
df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))
```

Demo:

```
>>> df['col1']=df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))
>>> df
col1 col2
0 w a
1 1 2
2 2 NaN
>>>
```

A nice complete solution that keeps a map of your class labels:

```
labels = features['col1'].unique()
labels_dict = dict(zip(labels, range(len(labels))))
features = features.replace({"col1": labels_dict})
```

This way, you can at any point refer to the original class label from labels_dict.

As an extension to what have been proposed by Nico Coallier (apply to multiple columns) and U10-Forward(using apply style of methods), and summarising it into a one-liner I propose:

```
df.loc[:,['col1','col2']].transform(lambda x: x.map(lambda x: {1: "A", 2: "B"}.get(x,x))
```

The `.transform()`

processes each column as a series. Contrary to `.apply()`

which passes the columns aggregated in a DataFrame.

Consequently you can apply the Series method `map()`

.

Finally, and I discovered this behaviour thanks to U10, you can use the whole Series in the .get() expression. Unless I have misunderstood its behaviour and it processes sequentially the series instead of bitwisely.

The `.get(x,x)`

accounts for the values you did not mention in your mapping dictionary which would be considered as Nan otherwise by the `.map()`

method

Given `map`

is faster than replace (@JohnE’s solution) you need to be careful **with Non-Exhaustive mappings where you intend to map specific values to NaN**. The proper method in this case requires that you

`mask`

the Series when you `.fillna`

, else you undo the mapping to `NaN`

.```
import pandas as pd
import numpy as np
d = {'m': 'Male', 'f': 'Female', 'missing': np.NaN}
df = pd.DataFrame({'gender': ['m', 'f', 'missing', 'Male', 'U']})
```

```
keep_nan = [k for k,v in d.items() if pd.isnull(v)]
s = df['gender']
df['mapped'] = s.map(d).fillna(s.mask(s.isin(keep_nan)))
```

```
gender mapped
0 m Male
1 f Female
2 missing NaN
3 Male Male
4 U U
```

You can update your mapping dictionary with missing pairs from the dataframe. For example:

```
df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd', np.nan]})
map_ = {'a': 'A', 'b': 'B', 'd': np.nan}
# Get mapping from df
uniques = df['col1'].unique()
map_new = dict(zip(uniques, uniques))
# {'a': 'a', 'b': 'b', 'c': 'c', 'd': 'd', nan: nan}
# Update mapping
map_new.update(map_)
# {'a': 'A', 'b': 'B', 'c': 'c', 'd': nan, nan: nan}
df['col2'] = df['col1'].map(dct_map_new)
```

Result:

```
col1 col2
0 a A
1 b B
2 c c
3 d NaN
4 NaN NaN
```

## TL;DR: Use `map`

+`fillna`

for large `di`

and use `replace`

for small `di`

#### 1. An alternative: `np.select()`

If the remapping dictionary is not too large, another option is `numpy.select`

. The syntax of `np.select`

requires separate arrays/lists of conditions and replacement values, so the keys and values of `di`

should be separated.

```
import numpy as np
df['col1'] = np.select((df[['col1']].values == list(di)).T, di.values(), df['col1'])
```

N.B. If the remapping dictionary `di`

is very large, this may run into memory issues because as you can see from the line of code above, a boolean array of shape `(len(df), len(di))`

is required to evaluate the conditions.

#### 2. `map`

+`fillna`

vs `replace`

. Which is better?

If we look at the source code, if a dictionary is passed to it, `map`

is an optimized method that calls a Cython-optimized `take_nd()`

function to make replacements and `fillna()`

calls `where()`

(another optimized method) to fill values. On the other hand, `replace()`

is implemented in Python and uses a loop over the dictionary. So if the dictionary is large, `replace`

can potentially be *thousands of times* slower than `map`

+`fillna`

. Let’s illustrate the difference by the following example where a single value (`0`

) is replaced in the column (one using a dictionary of length 1000 (`di1`

) and another using a dictionary of length 1 (`di2`

)).

```
df = pd.DataFrame({'col1': range(1000)})
di1 = {k: k+1 for k in range(-1000, 1)}
di2 = {0: 1}
%timeit df['col1'].map(di1).fillna(df['col1'])
# 1.19 ms ± 6.77 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df['col1'].replace(di1)
# 41.4 ms ± 400 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['col1'].map(di2).fillna(df['col1'])
# 691 µs ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df['col1'].replace(di2)
# 157 µs ± 3.34 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

As you can see, if `len(di)==1000`

, `replace`

is 35 times slower, but if `len(di)==1`

, it’s 4.5 times faster. This gap gets worse as the size of the remapping dictionary `di`

increases.

In fact, if we look at the performance plots, we can make the following observations. The plots were drawn with particular parameters fixed in each graph. You can use the code down below to change the size of the dataframe to see for different parameters but it will produce very similar plots.

- For a given dataframe,
`map`

+`fillna`

makes replacements in almost constant time regardless of the size of the remapping dictionary whereas`replace`

does worse as the size of the remapping dictionary increases (top-left plot). - The percentage of values being replaced in the dataframe has very little impact on the runtime difference. The impact of the length of
`di`

completely trumps whatever impact it has (top-right plot). - For a given remapping dictionary,
`map`

+`fillna`

performs better than`replace`

as the size of the dataframe increases (bottom-left plot). - Again, if
`di`

is large, the size of the dataframe does not matter;`map`

+`fillna`

is much faster than`replace`

(bottom-right plot).

Code used to produce the plots:

```
import numpy as np
import pandas as pd
from perfplot import plot
import matplotlib.pyplot as plt
kernels = [lambda df,di: df['col1'].replace(di),
lambda df,di: df['col1'].map(di).fillna(df['col1'])]
labels = ["replace", "map+fillna"]
# first plot
N, m = 100000, 20
plot(
setup=lambda n: (pd.DataFrame({'col1': np.resize(np.arange(m*n), N)}),
{k: (k+1)/2 for k in range(n)}),
kernels=kernels, labels=labels,
n_range=range(1, 21),
xlabel='Length of replacement dictionary',
title=f'Remapping values in a column (len(df)={N:,}, {100//m}% replaced)',
equality_check=pd.Series.equals)
_, xmax = plt.xlim()
plt.xlim((0.5, xmax+1))
plt.xticks(np.arange(1, xmax+1, 2));
# second plot
N, m = 100000, 1000
di = {k: (k+1)/2 for k in range(m)}
plot(
setup=lambda n: pd.DataFrame({'col1': np.resize(np.arange((n-100)*m//100, n*m//100), N)}),
kernels=kernels, labels=labels,
n_range=[1, 5, 10, 15, 25, 40, 55, 75, 100],
xlabel='Percentage of values replaced',
title=f'Remapping values in a column (len(df)={N:,}, len(di)={m})',
equality_check=pd.Series.equals);
# third plot
m, n = 10, 0.01
di = {k: (k+1)/2 for k in range(m)}
plot(
setup=lambda N: pd.DataFrame({'col1': np.resize(np.arange((n-1)*m, n*m), N)}),
kernels=kernels, labels=labels,
n_range=[2**k for k in range(6, 21)],
xlabel='Length of dataframe',
logy=False,
title=f'Remapping values in a column (len(di)={m}, {int(n*100)}% replaced)',
equality_check=pd.Series.equals);
# fourth plot
m, n = 100, 0.01
di = {k: (k+1)/2 for k in range(m)}
plot(
setup=lambda N: pd.DataFrame({'col1': np.resize(np.arange((n-1)*m, n*m), N)}),
kernels=kernels, labels=labels,
n_range=[2**k for k in range(6, 21)],
xlabel='Length of dataframe',
title=f'Remapping values in a column (len(di)={m}, {int(n*100)}% replaced)',
equality_check=pd.Series.equals);
```