python pandas remove duplicate columns

Question:

What is the easiest way to remove duplicate columns from a dataframe?

I am reading a text file that has duplicate columns via:

import pandas as pd

df=pd.read_table(fname)

The column names are:

Time, Time Relative, N2, Time, Time Relative, H2, etc...

All the Time and Time Relative columns contain the same data. I want:

Time, Time Relative, N2, H2

All my attempts at dropping, deleting, etc., such as:

df=df.T.drop_duplicates().T

Result in uniquely valued index errors:

Reindexing only valid with uniquely valued index objects

Sorry for being a Pandas noob. Any suggestions would be appreciated.


Additional Details

Pandas version: 0.9.0
Python Version: 2.7.3
Windows 7
(installed via Pythonxy 2.7.3.0)

data file (note: in the real file, columns are separated by tabs, here they are separated by 4 spaces):

Time    Time Relative [s]    N2[%]    Time    Time Relative [s]    H2[ppm]
2/12/2013 9:20:55 AM    6.177    9.99268e+001    2/12/2013 9:20:55 AM    6.177    3.216293e-005    
2/12/2013 9:21:06 AM    17.689    9.99296e+001    2/12/2013 9:21:06 AM    17.689    3.841667e-005    
2/12/2013 9:21:18 AM    29.186    9.992954e+001    2/12/2013 9:21:18 AM    29.186    3.880365e-005    
... etc ...
2/12/2013 2:12:44 PM    17515.269    9.991756e+001    2/12/2013 2:12:44 PM    17515.269    2.800279e-005
2/12/2013 2:12:55 PM    17526.769    9.991754e+001    2/12/2013 2:12:55 PM    17526.769    2.880386e-005
2/12/2013 2:13:07 PM    17538.273    9.991797e+001    2/12/2013 2:13:07 PM    17538.273    3.131447e-005
Asked By: Onlyjus


Answers:

It sounds like you already know the unique column names. If that’s the case, then df = df[['Time', 'Time Relative', 'N2']] would work.

If not, your solution should work:

In [101]: vals = np.random.randint(0,20, (4,3))
          vals
Out[101]:
array([[ 3, 13,  0],
       [ 1, 15, 14],
       [14, 19, 14],
       [19,  5,  1]])

In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'] )
          df
Out[106]:
   Time  H1  N2  Time Relative  N2  Time
0     3  13   0              3  13     0
1     1  15  14              1  15    14
2    14  19  14             14  19    14
3    19   5   1             19   5     1

In [107]: df.T.drop_duplicates().T
Out[107]:
   Time  H1  N2
0     3  13   0
1     1  15  14
2    14  19  14
3    19   5   1

You probably have something specific to your data that’s messing it up. We could give more help if you could share more details about the data.

Edit:
Like Andy said, the problem is probably with the duplicate column titles.

For a sample table file ‘dummy.csv’ I made up:

Time    H1  N2  Time    N2  Time Relative
3   13  13  3   13  0
1   15  15  1   15  14
14  19  19  14  19  14
19  5   5   19  5   1

using read_table gives unique columns and works properly:

In [151]: df2 = pd.read_table('dummy.csv')
          df2
Out[151]:
         Time  H1  N2  Time.1  N2.1  Time Relative
      0     3  13  13       3    13              0
      1     1  15  15       1    15             14
      2    14  19  19      14    19             14
      3    19   5   5      19     5              1
In [152]: df2.T.drop_duplicates().T
Out[152]:
             Time  H1  Time Relative
          0     3  13              0
          1     1  15             14
          2    14  19             14
          3    19   5              1  

If your version doesn’t do that for you, you can hack together a solution to make the names unique:

In [169]: df2 = pd.read_table('dummy.csv', header=None)
          df2
Out[169]:
              0   1   2     3   4              5
        0  Time  H1  N2  Time  N2  Time Relative
        1     3  13  13     3  13              0
        2     1  15  15     1  15             14
        3    14  19  19    14  19             14
        4    19   5   5    19   5              1
In [171]: from collections import defaultdict
          col_counts = defaultdict(int)
          col_ix = df2.first_valid_index()
In [172]: cols = []
          for col in df2.loc[col_ix]:
              cnt = col_counts[col]
              col_counts[col] += 1
              suf = '_' + str(cnt) if cnt else ''
              cols.append(col + suf)
          cols
Out[172]:
          ['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']
In [174]: df2.columns = cols
          df2 = df2.drop([col_ix])
In [177]: df2
Out[177]:
          Time  H1  N2 Time_1 N2_1 Time Relative
        1    3  13  13      3   13             0
        2    1  15  15      1   15            14
        3   14  19  19     14   19            14
        4   19   5   5     19    5             1
In [178]: df2.T.drop_duplicates().T
Out[178]:
          Time  H1 Time Relative
        1    3  13             0
        2    1  15            14
        3   14  19            14
        4   19   5             1 
Answered By: beardc

Transposing is inefficient for large DataFrames. Here is an alternative:

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        dcols = frame[v].to_dict(orient="list")

        # wrap in list() so the dict views can be indexed (needed on Python 3)
        vs = list(dcols.values())
        ks = list(dcols.keys())
        lvs = len(vs)

        for i in range(lvs):
            for j in range(i+1,lvs):
                if vs[i] == vs[j]: 
                    dups.append(ks[i])
                    break

    return dups       

Use it like this:

dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)
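
For instance, on a toy frame where column 'a' duplicates column 'c' by value (a sketch; the data and names are made up):

import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [1, 2, 3]})
print(duplicate_columns(frame))   # ['a'], since 'a' and 'c' hold the same values
frame = frame.drop(duplicate_columns(frame), axis=1)   # keeps 'b' and 'c'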

Edit

A memory-efficient version that treats NaNs like any other value:

# note: in newer pandas versions this helper lives in pandas.core.dtypes.missing
from pandas.core.common import array_equivalent

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []

    for t, v in groups.items():

        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)

        for i in range(lcs):
            ia = vs.iloc[:,i].values
            for j in range(i+1, lcs):
                ja = vs.iloc[:,j].values
                if array_equivalent(ia, ja):
                    dups.append(cs[i])
                    break

    return dups
Answered By: kalu

If I’m not mistaken, the following does what was asked without the memory problems of the transpose solution and with fewer lines than @kalu’s function, keeping the first of any similarly named columns.

Cols = list(df.columns)
for i, item in enumerate(df.columns):
    if item in df.columns[:i]:
        Cols[i] = "toDROP"  # sentinel name; assumes no real column is called this
df.columns = Cols
df = df.drop("toDROP", axis=1)
Answered By: Elliott Collins

Here’s a one line solution to remove columns based on duplicate column names:

df = df.loc[:, ~df.columns.duplicated()].copy()

How it works:

Suppose the columns of the data frame are ['alpha','beta','alpha']

df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False then the column name is unique up to that point, if it is True then the column name is duplicated earlier. For the columns above, the returned value would be [False, False, True].

Pandas allows one to index using boolean values, whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (i.e. [True, True, False] = ~[False, False, True])

Finally, df.loc[:, [True, True, False]] selects only the non-duplicated columns using the aforementioned indexing capability.

The final .copy() is there to copy the dataframe and (mostly) avoid getting errors about trying to modify an existing dataframe later down the line.

Note: the above only checks column names, not column values.
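
To see the pieces in isolation, here is a minimal sketch of the above on a toy frame with the ['alpha','beta','alpha'] columns (the data is made up):

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['alpha', 'beta', 'alpha'])
print(df.columns.duplicated())            # [False False  True]
df = df.loc[:, ~df.columns.duplicated()].copy()
print(df.columns.tolist())                # ['alpha', 'beta']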

To remove duplicated indexes

Since it is similar enough, do the same thing on the index:

df = df.loc[~df.index.duplicated(), :].copy()
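
A quick sketch of the index variant on toy data with a repeated label:

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3]}, index=['a', 'b', 'a'])
df = df.loc[~df.index.duplicated(), :].copy()
print(df.index.tolist())   # ['a', 'b'], keeping the first 'a'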

To remove duplicates by checking values without transposing

Update and caveat: please be careful in applying this. Per the counter-example provided by DrWhat in the comments, this solution may not have the desired outcome in all cases.

df = df.loc[:, ~df.apply(lambda x: x.duplicated(), axis=1).all()].copy()

This avoids the issue of transposing. Is it fast? No. Does it work? In some cases. Here, try it on this:

# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0, 100, size=(736334, 1312)))

# to see the size in gigs
# ldf.memory_usage().sum()/1e9  # it's about 3 gigs

# duplicate a column
ldf.loc[:, 'dup'] = ldf.loc[:, 101]

# take out duplicated columns by values
ldf = ldf.loc[:, ~ldf.apply(lambda x: x.duplicated(), axis=1).all()].copy()
Answered By: Gene Burinsky

First step: read only the first row, i.e. the column names, and drop the duplicate names.

Second step: read the file again, keeping only those columns.

cols = pd.read_csv("file.csv", header=None, nrows=1).iloc[0].drop_duplicates()
df = pd.read_csv("file.csv", usecols=cols)
Answered By: kamran kausar

I ran into this problem where the one-liner provided by the first answer worked well. However, I had the extra complication where the second copy of the column had all of the data and the first copy did not.

The solution was to create two data frames by splitting the one data frame, toggling the negation operator on the duplicated-columns mask. Once I had the two data frames, I ran a join statement using the lsuffix. This way, I could then reference and delete the column without the data.
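
A minimal sketch of that split-and-join idea (the variable names and the '_empty' suffix are illustrative, not from the original post; df is the frame with the duplicated column names):

mask = df.columns.duplicated()
first_copies = df.loc[:, ~mask]    # first occurrence of each name (empty in my case)
later_copies = df.loc[:, mask]     # later occurrences (these held the real data)

# join suffixes the overlapping names on the left, so both copies can be referenced
merged = first_copies.join(later_copies, lsuffix='_empty')
merged = merged.drop(columns=[c for c in merged.columns if c.endswith('_empty')])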

– E

March 2021 update

The subsequent post by @CircArgs may have provided a succinct one-liner to accomplish what I described here.

Answered By: Edmund's Echo

It looks like you were on the right path. Here is the one-liner you were looking for:

df.reset_index().T.drop_duplicates().T

But since there is no example data frame that produces the referenced error message Reindexing only valid with uniquely valued index objects, it is tough to say exactly what would solve the problem. If restoring the original index is important to you, do this:

original_index = df.index.names
df.reset_index().T.drop_duplicates().T.set_index(original_index)  # assumes the index levels are named
Answered By: Tony B

The approach below identifies duplicate column names, which is useful for reviewing what went wrong when the dataframe was originally built.

dupes = pd.DataFrame(df.columns)
dupes[dupes.duplicated()]
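
For example, on the asker’s header this surfaces the repeated names (a sketch with an empty frame):

import pandas as pd

df = pd.DataFrame(columns=['Time', 'Time Relative', 'N2', 'Time', 'Time Relative', 'H2'])
dupes = pd.DataFrame(df.columns)
print(dupes[dupes.duplicated()])   # rows 3 and 4: the second 'Time' and 'Time Relative'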
Answered By: Joe

A fast and easy way to drop duplicated columns by their values:

df = df.T.drop_duplicates().T

More info: the Pandas DataFrame drop_duplicates manual.

Answered By: jaqm

Note that Gene Burinsky’s answer (the accepted answer at the time of writing) keeps the first of each duplicated column. To keep the last:

df = df.loc[:, ~df.columns[::-1].duplicated()[::-1]]
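
The reversal works because Index.duplicated() marks every occurrence after the first, so scanning the reversed columns marks every occurrence before the last. A tiny sketch:

import pandas as pd

cols = pd.Index(['alpha', 'beta', 'alpha'])
print(cols.duplicated())                 # [False False  True] -> keep the first
print(cols[::-1].duplicated()[::-1])     # [ True False False] -> keep the last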
Answered By: CircArgs

An update to @kalu’s answer that works with more recent pandas:

def find_duplicated_columns(df):
    dupes = []

    columns = df.columns

    for i in range(len(columns)):
        col1 = df.iloc[:, i]
        for j in range(i + 1, len(columns)):
            col2 = df.iloc[:, j]
            # skip this pair if the dtypes aren't the same (helps deal with
            # categorical dtypes)
            if col1.dtype != col2.dtype:
                continue
            # otherwise compare values
            if col1.equals(col2):
                dupes.append(columns[i])
                break

    return dupes
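
Usage mirrors the earlier duplicate_columns helper (a sketch, assuming df is your frame and its column names are unique):

df = df.drop(columns=find_duplicated_columns(df))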
Answered By: RoyalTS

Just in case somebody is still looking for a way to find duplicated column values in a pandas DataFrame in Python, I came up with this solution:

def get_dup_columns(m):
    '''
    This will check every column in the data frame
    and verify if you have duplicated columns.
    It can help whenever you are cleaning big data sets of 50+ columns
    and clean up a little bit for you.
    The result will be a list of tuples showing which columns are duplicates,
    for example
    ('column A', 'column C')
    means that column A is duplicated with column C.
    For more info go to https://wanatux.com
    '''
    headers_list = [x for x in m.columns]
    duplicate_col2 = []
    while len(headers_list) > 1:
        # compare the first remaining column against all the others
        for x in range(1, len(headers_list)):
            if m[headers_list[0]].equals(m[headers_list[x]]):
                duplicate_col2.append((headers_list[0], headers_list[x]))
        headers_list.pop(0)
    return duplicate_col2

And you can call the function like this:

duplicate_col = get_dup_columns(pd_excel)

It will show a result like the following:

 [('column a', 'column k'),
 ('column a', 'column r'),
 ('column h', 'column m'),
 ('column k', 'column r')]
Answered By: Baffometo

Although @Gene Burinsky’s answer is great, it has a potential problem in that the reassigned df may be either a copy or a view of the original df.
This means that subsequent assignments like df['newcol'] = 1 can generate a SettingWithCopy warning and may fail (https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing). The following solution prevents that issue:

# note: df.drop(columns=...) matches by label, so dropping the duplicated
# names would also remove the first occurrence we want to keep; select the
# surviving columns by position instead
keep = [i for i, dup in enumerate(df.columns.duplicated()) if not dup]
df = df.iloc[:, keep]  # returns a fresh frame, so a later df['newcol'] = 1 is safe
Answered By: emso

I am not sure why Gene Burinsky’s answer did not work for me; I was getting back the same dataframe with the duplicated columns. My workaround was to force the selection on the underlying ndarray and rebuild the dataframe.

df = pd.DataFrame(df.values[:,~df.columns.duplicated()], columns=df.columns[~df.columns.duplicated()])
Answered By: Matheus Araujo

A simple column-wise comparison is an efficient way, in terms of memory and time, to check for duplicated columns by value. Here is an example:

import numpy as np
import pandas as pd
from itertools import combinations as combi

df = pd.DataFrame(np.random.uniform(0,1, (100,4)), columns=['a','b','c','d'])
df['a'] = df['d'].copy()  # column 'a' is equal to column 'd'

# to keep the first
dupli_cols = [cc[1] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]

# to keep the last
dupli_cols = [cc[0] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
            
df = df.drop(columns=dupli_cols)
Answered By: Marco Cerliani

In case you want to check for duplicate columns, this code can be useful

columns_to_drop = []

for cname in sorted(list(df)):
    for cname2 in sorted(list(df))[::-1]:
        if df[cname].equals(df[cname2]) and cname != cname2 and cname not in columns_to_drop:
            columns_to_drop.append(cname2)
            print(cname, cname2, 'are equal')

df = df.drop(columns_to_drop, axis=1)
Answered By: Catruc Iurie