Pandas set_levels on MultiIndex: Level values must be unique
Question:
Given a DataFrame df
                     Value
Category Pool Class
A        1.0  1.0        1
              9.0        2
B        1.0  1.0        3
C        1.0  1.0        4
              5.0        5
I want to convert the levels Pool and Class to integers without reset_index (see below).
I tried using a combination of get_level_values and set_levels like so:
for c in ['Pool', 'Class']:
    df.index.set_levels(df.index.get_level_values(c).astype(int), level=c, inplace=True)
However, this raises
ValueError: Level values must be unique: [1, 1, 1, 1, 1] on level 1
To understand what happens, I also tried using verify_integrity=False. Then
df.index.set_levels(df.index.get_level_values('Class').astype(int),
                    level='Class', verify_integrity=False, inplace=True)
produces
                     Value
Category Pool Class
A        1.0  1          1
              1          2
B        1.0  1          3
C        1.0  1          4
              9          5
whereas my goal is to obtain
                     Value
Category Pool Class
A        1.0  1          1
              9          2
B        1.0  1          3
C        1.0  1          4
              5          5
How can I achieve this properly? Is chaining get_level_values and set_levels the correct way to do it? Why is pandas not able to properly set the level after it has been transformed with astype?
I guess you could work with reset_index and set_index, but what is the benefit then of having the method set_levels?
d = {'Category': str, 'Pool': int, 'Class': int}
df.reset_index(drop=False, inplace=True)
for k, v in d.items():
    df[k] = df[k].astype(v)
df.set_index(list(d.keys()), inplace=True)
Answers:
You can access the index levels directly via pd.MultiIndex.levels and feed them to pd.MultiIndex.set_levels:
df.index = df.index.set_levels(df.index.levels[2].astype(int), level=2)
print(df)
                     Value
Category Pool Class
A        1.0  1          1
              9          2
B        1.0  1          3
C        1.0  1          4
              5          5
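As to why the attempt in the question fails: a MultiIndex stores each level as a set of unique values (levels) plus per-row integer positions into those values (codes). set_levels replaces only the unique values, whereas get_level_values returns one value per row. The following is a minimal sketch of that difference, rebuilding the index from the question; the variable names are illustrative only.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("A", 1.0, 1.0), ("A", 1.0, 9.0), ("B", 1.0, 1.0),
     ("C", 1.0, 1.0), ("C", 1.0, 5.0)],
    names=["Category", "Pool", "Class"],
)

# Unique values stored for the "Class" level (what set_levels replaces).
class_levels = idx.levels[2].tolist()
# Per-row positions into those unique values.
class_codes = idx.codes[2].tolist()
# One value per row (what get_level_values returns).
class_values = idx.get_level_values("Class").tolist()

print(class_levels)  # [1.0, 5.0, 9.0]
print(class_codes)   # [0, 2, 0, 0, 1]
print(class_values)  # [1.0, 9.0, 1.0, 1.0, 5.0]

# Passing the per-row values as if they were levels makes the codes
# look up the wrong entries, which reproduces the scrambled output
# seen in the question with verify_integrity=False.
wrong_levels = [int(v) for v in class_values]
scrambled = [wrong_levels[c] for c in class_codes]
print(scrambled)     # [1, 1, 1, 1, 9]
```

So the per-row values are both non-unique (hence the ValueError) and in the wrong order for the existing codes, which is why converting df.index.levels[...] instead gives the expected result.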
The following function can be used as a complement to get_level_values:
def set_level_values(midx, level, values):
    # One tuple of per-row values for each level of the MultiIndex.
    full_levels = list(zip(*midx.values))
    names = midx.names
    # Accept a level name as well as an integer position.
    if isinstance(level, str):
        if level not in names:
            raise ValueError(f'No level {level} in MultiIndex')
        level = names.index(level)
    if len(full_levels[level]) != len(values):
        raise ValueError('Values must be of the same size as original level')
    # Swap in the new per-row values and rebuild the index.
    full_levels[level] = values
    return pd.MultiIndex.from_arrays(full_levels, names=names)
Using this function, the solution for the original question would be:
for c in ['Pool', 'Class']:
    df.index = set_level_values(df.index, level=c, values=df.index.get_level_values(c).astype(int))
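The same rebuild-from-per-row-values idea can also be written without a helper, by collecting get_level_values for every level and passing the arrays to pd.MultiIndex.from_arrays. This is a rough sketch rather than a drop-in replacement; to_int and new_idx are names made up for the illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("A", 1.0, 1.0), ("A", 1.0, 9.0), ("B", 1.0, 1.0),
     ("C", 1.0, 1.0), ("C", 1.0, 5.0)],
    names=["Category", "Pool", "Class"],
)

# Hypothetical set of level names that should become integers.
to_int = {"Pool", "Class"}

# Per-row values for every level, cast only where requested.
arrays = [
    idx.get_level_values(name).astype(int)
    if name in to_int
    else idx.get_level_values(name)
    for name in idx.names
]

# Rebuilding from per-row arrays recomputes levels and codes together.
new_idx = pd.MultiIndex.from_arrays(arrays, names=idx.names)
print(new_idx.get_level_values("Class").tolist())  # [1, 9, 1, 1, 5]
```

Because the index is rebuilt from scratch, levels and codes stay consistent, at the cost of re-factorizing every level.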
To get the integer position that corresponds to a level name stored in variable k, you can use:
df.index.names.index(k)
So if, like OP, you have a dict of level names and types, simply do:
levels = [df.index.levels[df.index.names.index(k)].astype(v)
          for k, v in d.items()]
df.index = df.index.set_levels(levels=levels, level=d.keys())
Or, the same thing in a method chain:
df.set_index(
    df.index.set_levels(
        [df.index.levels[df.index.names.index(k)].astype(v)
         for k, v in d.items()],
        level=d.keys())
)...
Setup for OP’s DataFrame and dict:
import pandas as pd
df = pd.DataFrame(
    range(1, 6),
    index=pd.MultiIndex.from_tuples(
        [
            ('A', 1., 1.),
            ('A', 1., 9.),
            ('B', 1., 1.),
            ('C', 1., 1.),
            ('C', 1., 5.)
        ],
        names=['Category', 'Pool', 'Class']
    ),
    columns=['Value']
)
d = {'Category': str, 'Pool': int, 'Class': int}
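Putting this setup together with the dict-based set_levels call, a quick check along the following lines should confirm that the levels end up as integers. It is only a sketch: new_levels and the variables used for checking are illustrative names, and list(d) is used here in place of d.keys() simply to pass a plain list.

```python
import pandas as pd

df = pd.DataFrame(
    {"Value": list(range(1, 6))},
    index=pd.MultiIndex.from_tuples(
        [("A", 1.0, 1.0), ("A", 1.0, 9.0), ("B", 1.0, 1.0),
         ("C", 1.0, 1.0), ("C", 1.0, 5.0)],
        names=["Category", "Pool", "Class"],
    ),
)
d = {"Category": str, "Pool": int, "Class": int}

# Cast the unique values of each named level; the codes are left untouched.
new_levels = [
    df.index.levels[df.index.names.index(k)].astype(v)
    for k, v in d.items()
]
df.index = df.index.set_levels(new_levels, level=list(d))

# Inspect the result: integer dtypes and unchanged row order.
pool_dtype = df.index.levels[1].dtype
class_dtype = df.index.levels[2].dtype
class_values = df.index.get_level_values("Class").tolist()
print(pool_dtype, class_dtype)
print(class_values)  # [1, 9, 1, 1, 5]
```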