What is the best way to remove columns in pandas?

Question:

I am asking this question for my own learning. As far as I know, the following are the different ways to remove columns from a pandas DataFrame.

Option – 1:

import pandas as pd
df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
del df['a']

Option – 2:

df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
df=df.drop('a', axis=1)

Option – 3:

df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
df=df[['b','c']]

  1. What is the best approach among these?
  2. Are there any other approaches that achieve the same result?

Asked By: Mohamed Thasin ah

Answers:

In my opinion options 2 and 3 are the best, because the first has limits: you can remove only one column at a time, and you cannot use dot notation (del df.a).

Option 3 is not deleting but selecting, and piRSquared wrote a nice answer showing several possible solutions based on the same idea.
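As a rough illustration of both points, here is a small sketch. The frame and the variable names (only_c, keep_mask, keep_diff) are made up for the example, and the two selection idioms are just common ones, not necessarily the ones from piRSquared's answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# drop accepts a list of labels, so several columns can be removed in one call
only_c = df.drop(columns=['a', 'b'])

# Selection-based alternatives: keep everything except 'a'
keep_mask = df.loc[:, df.columns != 'a']      # boolean mask over the column labels
keep_diff = df[df.columns.difference(['a'])]  # set difference on the column Index (sorted)
```

None of these calls modifies df itself; each one returns a new DataFrame.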

Answered By: jezrael

Follow the doc:

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

And pandas.DataFrame.drop:

Drop specified labels from rows or columns.

So, I think we should stick with df.drop. Why? I think the pros are:

  1. It gives us more control over the removal:

    # Returns a NEW DataFrame object and leaves the original `df` untouched.
    df.drop('a', axis=1)
    # Modifies `df` in place **and returns `None`**.
    df.drop('a', axis=1, inplace=True)
    
  2. It can handle more complicated cases through its arguments. For example, with level we can delete labels from one level of a MultiIndex, and with errors we can choose whether a missing label raises an exception or is ignored.

  3. It’s a more unified and object-oriented way.
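To make point 2 more concrete, here is a minimal sketch of the errors and level arguments. The frames and names (same, raised, mi, no_a) are invented for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# errors='ignore' silently skips labels that do not exist
same = df.drop(columns=['missing'], errors='ignore')

# The default, errors='raise', raises KeyError for an unknown label
raised = False
try:
    df.drop(columns=['missing'])
except KeyError:
    raised = True

# With MultiIndex columns, level selects which level the label is matched against
mi = pd.DataFrame(
    [[1, 2, 3]],
    columns=pd.MultiIndex.from_tuples([('x', 'a'), ('x', 'b'), ('y', 'a')]),
)
no_a = mi.drop(columns='a', level=1)  # removes every column whose second-level label is 'a'
```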


And just like @jezrael noted in his answer:

Option 1: Using the del keyword is a limited way.

Option 3: df=df[['b','c']] isn’t even a deletion in essence. It first selects data by indexing with the [] syntax, then unbinds the name df from the original DataFrame and binds it to the new one (i.e. df[['b','c']]).
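A short sketch of that rebinding, using a second name (original, introduced only for this example) to show that the first DataFrame is not changed:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
original = df            # a second name bound to the same object

df = df[['b', 'c']]      # selection creates a new DataFrame; the name df is rebound to it

still_has_a = 'a' in original.columns   # the first object still carries column 'a'
is_new_object = df is not original      # df now refers to a different object
```

If nothing else references the original object, it simply becomes eligible for garbage collection.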

Answered By: YaOzI

The recommended way to delete a column or row from a pandas DataFrame is to use drop.

To delete a column,

df.drop('column_name', axis=1, inplace=True)

To delete a row,

df.drop('row_index', axis=0, inplace=True)
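As a concrete sketch of the two calls above (the frame, its row labels r1..r3, and the name result are assumptions made for the example):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['r1', 'r2', 'r3'])

# With inplace=True the frame is modified and the call returns None
result = df.drop('a', axis=1, inplace=True)

# Rows are dropped by index label, not by position
df.drop('r2', axis=0, inplace=True)
```

Because the in-place call returns None, writing df = df.drop(..., inplace=True) would replace df with None.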

You can refer to this post for a detailed discussion of the different column deletion approaches.

Answered By: razmik

From a speed perspective, option 1 seems to be the best. Obviously, based on the other answers, that doesn’t mean it’s actually the best option.

In [52]: import timeit

In [53]: s1 = """
    ...: import pandas as pd
    ...: df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
    ...: del df['a']
    ...: """

In [54]: s2 = """
    ...: import pandas as pd
    ...: df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
    ...: df=df.drop('a', axis=1)
    ...: """

In [55]: s3 = """
    ...: import pandas as pd
    ...: df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
    ...: df=df[['b','c']]
    ...: """

In [56]: timeit.timeit(stmt=s1, number=100000)
Out[56]: 53.37321400642395

In [57]: timeit.timeit(stmt=s2, number=100000)
Out[57]: 79.68139410018921

In [58]: timeit.timeit(stmt=s3, number=100000)
Out[58]: 76.25269913673401
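
Note that each statement above also times the import and the DataFrame construction. A rough sketch of one way to separate those costs is to import once, time the construction alone as a baseline, and subtract it. The helper names (build, opt_del, opt_drop, opt_select) and the repeat count n are assumptions for this example, and on a frame this small the numbers mostly reflect per-call overhead.

```python
import timeit
import pandas as pd

data = {'a': [1, 2, 3, 4, 5], 'b': [6, 7, 8, 9, 10], 'c': [11, 12, 13, 14, 15]}

def build():
    # A fresh frame is needed for every run because del mutates it
    return pd.DataFrame(data)

def opt_del():
    df = build()
    del df['a']
    return df

def opt_drop():
    return build().drop('a', axis=1)

def opt_select():
    return build()[['b', 'c']]

n = 200  # small repeat count, only for illustration
baseline = timeit.timeit(build, number=n)  # cost of construction alone
timings = {
    name: timeit.timeit(fn, number=n) - baseline
    for name, fn in [('del', opt_del), ('drop', opt_drop), ('select', opt_select)]
}
```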
Answered By: aydow