Why does a column remain in a DataFrame's index even after it is dropped?
Question:
Consider the following piece of code:
>>> import pandas
>>> data = pandas.DataFrame({ 'user': [1, 5, 3, 10], 'week': [1, 1, 3, 4], 'value1': [5, 4, 3, 2], 'value2': [1, 1, 1, 2] })
>>> data = data.pivot_table(index='user', columns='week', fill_value=0)
>>> data['target'] = [True, True, False, True]
>>> data
     value1       value2       target
week      1  3  4      1  3  4
user
1         5  0  0      1  0  0   True
3         0  3  0      0  1  0   True
5         4  0  0      1  0  0  False
10        0  0  2      0  0  2   True
Now if I call this:
>>> 'target' in data.columns
True
It returns True as expected. However, why does this return True as well?
>>> 'target' in data.drop('target', axis=1).columns
True
How can I drop a column from the table so it’s no longer in the index and the above statement returns False?
Answers:
As of now (pandas 0.19.2), a MultiIndex retains every label it has ever used in its levels. Dropping a column doesn’t remove its label from the MultiIndex levels, so the label is still referenced there even though no column uses it. See long GH item here.
Thus, you have to work around the issue and make assumptions. If you are sure the labels you’re checking are on a specific index level (level 0 in your example), then one way is to do this:
'target' in data.drop('target', axis=1).columns.get_level_values(0)
Out[145]: False
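To see where the stale label lives, it can help to compare the level definitions with the labels actually in use. This is a minimal sketch that rebuilds the frame from the question; the names `dropped`, `in_levels` and `in_use` are introduced here only for illustration. The result of `'target' in dropped.columns` itself may differ between pandas versions, so the sketch does not rely on it.

```python
import pandas as pd

# Rebuild the frame from the question.
data = pd.DataFrame({'user': [1, 5, 3, 10], 'week': [1, 1, 3, 4],
                     'value1': [5, 4, 3, 2], 'value2': [1, 1, 1, 2]})
data = data.pivot_table(index='user', columns='week', fill_value=0)
data['target'] = [True, True, False, True]

dropped = data.drop('target', axis=1)

# The level definitions still list the dropped label ...
in_levels = 'target' in dropped.columns.levels[0]
# ... but no remaining column actually uses it.
in_use = 'target' in dropped.columns.get_level_values(0)

print(in_levels, in_use)
```

Here `levels` holds the set of labels the MultiIndex knows about, while `get_level_values` reports only the labels attached to existing columns, which is why the two checks can disagree.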
If it can be any level, you can use get_values() and look up in the entire list:
import itertools as it
list(it.chain.from_iterable(data.drop('target', axis=1).columns.get_values()))
Out[150]: ['value1', 1, 'value1', 3, 'value1', 4, 'value2', 1, 'value2', 3, 'value2', 4]
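The same any-level lookup can be written without `get_values()`, which I believe was deprecated and then removed in later pandas releases. The sketch below checks each level with `get_level_values` instead; `label_in_any_level` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def label_in_any_level(index, label):
    """Return True if `label` is used by at least one entry on any level.

    Hypothetical helper: it inspects the labels actually in use,
    not the (possibly stale) level definitions.
    """
    return any(label in index.get_level_values(i)
               for i in range(index.nlevels))


# Rebuild the frame from the question.
data = pd.DataFrame({'user': [1, 5, 3, 10], 'week': [1, 1, 3, 4],
                     'value1': [5, 4, 3, 2], 'value2': [1, 1, 1, 2]})
data = data.pivot_table(index='user', columns='week', fill_value=0)
data['target'] = [True, True, False, True]
dropped = data.drop('target', axis=1)

print(label_in_any_level(dropped.columns, 'target'))  # dropped label
print(label_in_any_level(dropped.columns, 'value1'))  # level 0 label
print(label_in_any_level(dropped.columns, 3))         # level 1 label
```

Because `nlevels` is 1 for a plain `Index`, the helper should also work on non-hierarchical columns.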
I propose @Jeff’s comment as a new Answer.
data = data.drop('target', axis=1)
data.columns = data.columns.remove_unused_levels()
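A short sketch of what those two lines change, again using the frame from the question. As far as I know, `remove_unused_levels()` only exists from pandas 0.20 onward, so it would not be available in the 0.19.2 release mentioned above; `before` and `after` are illustrative variable names.

```python
import pandas as pd

# Rebuild the frame from the question.
data = pd.DataFrame({'user': [1, 5, 3, 10], 'week': [1, 1, 3, 4],
                     'value1': [5, 4, 3, 2], 'value2': [1, 1, 1, 2]})
data = data.pivot_table(index='user', columns='week', fill_value=0)
data['target'] = [True, True, False, True]

data = data.drop('target', axis=1)
# After the drop, the label is still listed in the level definitions.
before = 'target' in data.columns.levels[0]

# Rebuild the levels so they contain only labels that are in use.
data.columns = data.columns.remove_unused_levels()
after = 'target' in data.columns.levels[0]

print(before, after)
print('target' in data.columns)
```

Once the unused label is gone from the levels, the plain membership test on `data.columns` has nothing left to match against.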