python pandas remove duplicate columns
Question:
What is the easiest way to remove duplicate columns from a dataframe?
I am reading a text file that has duplicate columns via:
import pandas as pd
df=pd.read_table(fname)
The column names are:
Time, Time Relative, N2, Time, Time Relative, H2, etc...
All the Time and Time Relative columns contain the same data. I want:
Time, Time Relative, N2, H2
All my attempts at dropping, deleting, etc such as:
df=df.T.drop_duplicates().T
Result in uniquely valued index errors:
Reindexing only valid with uniquely valued index objects
Sorry for being a Pandas noob. Any Suggestions would be appreciated.
Additional Details
Pandas version: 0.9.0
Python Version: 2.7.3
Windows 7
(installed via Pythonxy 2.7.3.0)
data file (note: in the real file, columns are separated by tabs, here they are separated by 4 spaces):
Time Time Relative [s] N2[%] Time Time Relative [s] H2[ppm]
2/12/2013 9:20:55 AM 6.177 9.99268e+001 2/12/2013 9:20:55 AM 6.177 3.216293e-005
2/12/2013 9:21:06 AM 17.689 9.99296e+001 2/12/2013 9:21:06 AM 17.689 3.841667e-005
2/12/2013 9:21:18 AM 29.186 9.992954e+001 2/12/2013 9:21:18 AM 29.186 3.880365e-005
... etc ...
2/12/2013 2:12:44 PM 17515.269 9.991756+001 2/12/2013 2:12:44 PM 17515.269 2.800279e-005
2/12/2013 2:12:55 PM 17526.769 9.991754e+001 2/12/2013 2:12:55 PM 17526.769 2.880386e-005
2/12/2013 2:13:07 PM 17538.273 9.991797e+001 2/12/2013 2:13:07 PM 17538.273 3.131447e-005
Answers:
It sounds like you already know the unique column names. If that’s the case, then df = df['Time', 'Time Relative', 'N2']
would work.
If not, your solution should work:
In [101]: vals = np.random.randint(0,20, (4,3))
vals
Out[101]:
array([[ 3, 13, 0],
[ 1, 15, 14],
[14, 19, 14],
[19, 5, 1]])
In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'] )
df
Out[106]:
Time H1 N2 Time Relative N2 Time
0 3 13 0 3 13 0
1 1 15 14 1 15 14
2 14 19 14 14 19 14
3 19 5 1 19 5 1
In [107]: df.T.drop_duplicates().T
Out[107]:
Time H1 N2
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
You probably have something specific to your data that’s messing it up. We could give more help if there’s more details you could give us about the data.
Edit:
Like Andy said, the problem is probably with the duplicate column titles.
For a sample table file ‘dummy.csv’ I made up:
Time H1 N2 Time N2 Time Relative
3 13 13 3 13 0
1 15 15 1 15 14
14 19 19 14 19 14
19 5 5 19 5 1
using read_table
gives unique columns and works properly:
In [151]: df2 = pd.read_table('dummy.csv')
df2
Out[151]:
Time H1 N2 Time.1 N2.1 Time Relative
0 3 13 13 3 13 0
1 1 15 15 1 15 14
2 14 19 19 14 19 14
3 19 5 5 19 5 1
In [152]: df2.T.drop_duplicates().T
Out[152]:
Time H1 Time Relative
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
If your version doesn’t let your, you can hack together a solution to make them unique:
In [169]: df2 = pd.read_table('dummy.csv', header=None)
df2
Out[169]:
0 1 2 3 4 5
0 Time H1 N2 Time N2 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [171]: from collections import defaultdict
col_counts = defaultdict(int)
col_ix = df2.first_valid_index()
In [172]: cols = []
for col in df2.ix[col_ix]:
cnt = col_counts[col]
col_counts[col] += 1
suf = '_' + str(cnt) if cnt else ''
cols.append(col + suf)
cols
Out[172]:
['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']
In [174]: df2.columns = cols
df2 = df2.drop([col_ix])
In [177]: df2
Out[177]:
Time H1 N2 Time_1 N2_1 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [178]: df2.T.drop_duplicates().T
Out[178]:
Time H1 Time Relative
1 3 13 0
2 1 15 14
3 14 19 14
4 19 5 1
Transposing is inefficient for large DataFrames. Here is an alternative:
def duplicate_columns(frame):
groups = frame.columns.to_series().groupby(frame.dtypes).groups
dups = []
for t, v in groups.items():
dcols = frame[v].to_dict(orient="list")
vs = dcols.values()
ks = dcols.keys()
lvs = len(vs)
for i in range(lvs):
for j in range(i+1,lvs):
if vs[i] == vs[j]:
dups.append(ks[i])
break
return dups
Use it like this:
dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)
Edit
A memory efficient version that treats nans like any other value:
from pandas.core.common import array_equivalent
def duplicate_columns(frame):
groups = frame.columns.to_series().groupby(frame.dtypes).groups
dups = []
for t, v in groups.items():
cs = frame[v].columns
vs = frame[v]
lcs = len(cs)
for i in range(lcs):
ia = vs.iloc[:,i].values
for j in range(i+1, lcs):
ja = vs.iloc[:,j].values
if array_equivalent(ia, ja):
dups.append(cs[i])
break
return dups
If I’m not mistaken, the following does what was asked without the memory problems of the transpose solution and with fewer lines than @kalu ‘s function, keeping the first of any similarly named columns.
Cols = list(df.columns)
for i,item in enumerate(df.columns):
if item in df.columns[:i]: Cols[i] = "toDROP"
df.columns = Cols
df = df.drop("toDROP",1)
Here’s a one line solution to remove columns based on duplicate column names:
df = df.loc[:,~df.columns.duplicated()].copy()
How it works:
Suppose the columns of the data frame are ['alpha','beta','alpha']
df.columns.duplicated()
returns a boolean array: a True
or False
for each column. If it is False
then the column name is unique up to that point, if it is True
then the column name is duplicated earlier. For example, using the given example, the returned value would be [False,False,True]
.
Pandas
allows one to index using boolean values whereby it selects only the True
values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (ie [True, True, False] = ~[False,False,True]
)
Finally, df.loc[:,[True,True,False]]
selects only the non-duplicated columns using the aforementioned indexing capability.
The final .copy()
is there to copy the dataframe to (mostly) avoid getting errors about trying to modify an existing dataframe later down the line.
Note: the above only checks columns names, not column values.
To remove duplicated indexes
Since it is similar enough, do the same thing on the index:
df = df.loc[~df.index.duplicated(),:].copy()
To remove duplicates by checking values without transposing
Update and caveat: please be careful in applying this. Per the counter-example provided by DrWhat in the comments, this solution may not have the desired outcome in all cases.
df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()
This avoids the issue of transposing. Is it fast? No. Does it work? In some cases. Here, try it on this:
# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0,100,size= (736334,1312)))
#to see size in gigs
#ldf.memory_usage().sum()/1e9 #it's about 3 gigs
# duplicate a column
ldf.loc[:,'dup'] = ldf.loc[:,101]
# take out duplicated columns by values
ldf = ldf.loc[:,~ldf.apply(lambda x: x.duplicated(),axis=1).all()].copy()
First step:- Read first row i.e all columns the remove all duplicate columns.
Second step:- Finally read only that columns.
cols = pd.read_csv("file.csv", header=None, nrows=1).iloc[0].drop_duplicates()
df = pd.read_csv("file.csv", usecols=cols)
I ran into this problem where the one liner provided by the first answer worked well. However, I had the extra complication where the second copy of the column had all of the data. The first copy did not.
The solution was to create two data frames by splitting the one data frame by toggling the negation operator. Once I had the two data frames, I ran a join statement using the lsuffix
. This way, I could then reference and delete the column without the data.
– E
March 2021 update
The subsequent post by @CircArgs may have provided a succinct one-liner to accomplish what I described here.
It looks like you were on the right path. Here is the one-liner you were looking for:
df.reset_index().T.drop_duplicates().T
But since there is no example data frame that produces the referenced error message Reindexing only valid with uniquely valued index objects
, it is tough to say exactly what would solve the problem. if restoring the original index is important to you do this:
original_index = df.index.names
df.reset_index().T.drop_duplicates().reset_index(original_index).T
The way below will identify dupe columns to review what is going wrong building the dataframe originally.
dupes = pd.DataFrame(df.columns)
dupes[dupes.duplicated()]
Fast and easy way to drop the duplicated columns by their values:
df = df.T.drop_duplicates().T
More info: Pandas DataFrame drop_duplicates manual .
Note that Gene Burinsky’s answer (at the time of writing the selected answer) keeps the first of each duplicated column. To keep the last:
df=df.loc[:, ~df.columns[::-1].duplicated()[::-1]]
An update on @kalu’s answer, which uses the latest pandas:
def find_duplicated_columns(df):
dupes = []
columns = df.columns
for i in range(len(columns)):
col1 = df.iloc[:, i]
for j in range(i + 1, len(columns)):
col2 = df.iloc[:, j]
# break early if dtypes aren't the same (helps deal with
# categorical dtypes)
if col1.dtype is not col2.dtype:
break
# otherwise compare values
if col1.equals(col2):
dupes.append(columns[i])
break
return dupes
Just in case somebody still looking for an answer in how to look for duplicated values in columns for a Pandas Data Frame in Python, I came up with this solution:
def get_dup_columns(m):
'''
This will check every column in data frame
and verify if you have duplicated columns.
can help whenever you are cleaning big data sets of 50+ columns
and clean up a little bit for you
The result will be a list of tuples showing what columns are duplicates
for example
(column A, Column C)
That means that column A is duplicated with column C
more info go to https://wanatux.com
'''
headers_list = [x for x in m.columns]
duplicate_col2 = []
y = 0
while y <= len(headers_list)-1:
for x in range(1,len(headers_list)-1):
if m[headers_list[y]].equals(m[headers_list[x]]) == False:
continue
else:
duplicate_col2.append((headers_list[y],headers_list[x]))
headers_list.pop(0)
return duplicate_col2
And you can cast the definition like this:
duplicate_col = get_dup_columns(pd_excel)
It will show a result like the following:
[('column a', 'column k'),
('column a', 'column r'),
('column h', 'column m'),
('column k', 'column r')]
Although @Gene Burinsky answer is great, it has a potential problem in that the reassigned df may be either a copy or a view of the original df.
This means that subsequent assignments like df['newcol'] = 1
generate a SettingWithCopy
warning and may fail (https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing). The following solution prevents that issue:
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
I am not sure why Gene Burinsky’s answer did not work for me. I was getting the same original dataframes with duplicated columns. My workaround was force the selection over the ndarray and get back the dataframe.
df = pd.DataFrame(df.values[:,~df.columns.duplicated()], columns=df.columns[~df.columns.duplicated()])
A simple column-wise comparison is the most efficient way (in terms of memory and time) to check duplicated columns by values. Here an example:
import numpy as np
import pandas as pd
from itertools import combinations as combi
df = pd.DataFrame(np.random.uniform(0,1, (100,4)), columns=['a','b','c','d'])
df['a'] = df['d'].copy() # column 'a' is equal to column 'd'
# to keep the first
dupli_cols = [cc[1] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
# to keep the last
dupli_cols = [cc[0] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
df = df.drop(columns=dupli_cols)
In case you want to check for duplicate columns, this code can be useful
columns_to_drop= []
for cname in sorted(list(df)):
for cname2 in sorted(list(df))[::-1]:
if df[cname].equals(df[cname2]) and cname!=cname2 and cname not in columns_to_drop:
columns_to_drop.append(cname2)
print(cname,cname2,'Are equal')
df = df.drop(columns_to_drop, axis=1)
What is the easiest way to remove duplicate columns from a dataframe?
I am reading a text file that has duplicate columns via:
import pandas as pd
df=pd.read_table(fname)
The column names are:
Time, Time Relative, N2, Time, Time Relative, H2, etc...
All the Time and Time Relative columns contain the same data. I want:
Time, Time Relative, N2, H2
All my attempts at dropping, deleting, etc such as:
df=df.T.drop_duplicates().T
Result in uniquely valued index errors:
Reindexing only valid with uniquely valued index objects
Sorry for being a Pandas noob. Any Suggestions would be appreciated.
Additional Details
Pandas version: 0.9.0
Python Version: 2.7.3
Windows 7
(installed via Pythonxy 2.7.3.0)
data file (note: in the real file, columns are separated by tabs, here they are separated by 4 spaces):
Time Time Relative [s] N2[%] Time Time Relative [s] H2[ppm]
2/12/2013 9:20:55 AM 6.177 9.99268e+001 2/12/2013 9:20:55 AM 6.177 3.216293e-005
2/12/2013 9:21:06 AM 17.689 9.99296e+001 2/12/2013 9:21:06 AM 17.689 3.841667e-005
2/12/2013 9:21:18 AM 29.186 9.992954e+001 2/12/2013 9:21:18 AM 29.186 3.880365e-005
... etc ...
2/12/2013 2:12:44 PM 17515.269 9.991756+001 2/12/2013 2:12:44 PM 17515.269 2.800279e-005
2/12/2013 2:12:55 PM 17526.769 9.991754e+001 2/12/2013 2:12:55 PM 17526.769 2.880386e-005
2/12/2013 2:13:07 PM 17538.273 9.991797e+001 2/12/2013 2:13:07 PM 17538.273 3.131447e-005
It sounds like you already know the unique column names. If that’s the case, then df = df['Time', 'Time Relative', 'N2']
would work.
If not, your solution should work:
In [101]: vals = np.random.randint(0,20, (4,3))
vals
Out[101]:
array([[ 3, 13, 0],
[ 1, 15, 14],
[14, 19, 14],
[19, 5, 1]])
In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'] )
df
Out[106]:
Time H1 N2 Time Relative N2 Time
0 3 13 0 3 13 0
1 1 15 14 1 15 14
2 14 19 14 14 19 14
3 19 5 1 19 5 1
In [107]: df.T.drop_duplicates().T
Out[107]:
Time H1 N2
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
You probably have something specific to your data that’s messing it up. We could give more help if there’s more details you could give us about the data.
Edit:
Like Andy said, the problem is probably with the duplicate column titles.
For a sample table file ‘dummy.csv’ I made up:
Time H1 N2 Time N2 Time Relative
3 13 13 3 13 0
1 15 15 1 15 14
14 19 19 14 19 14
19 5 5 19 5 1
using read_table
gives unique columns and works properly:
In [151]: df2 = pd.read_table('dummy.csv')
df2
Out[151]:
Time H1 N2 Time.1 N2.1 Time Relative
0 3 13 13 3 13 0
1 1 15 15 1 15 14
2 14 19 19 14 19 14
3 19 5 5 19 5 1
In [152]: df2.T.drop_duplicates().T
Out[152]:
Time H1 Time Relative
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
If your version doesn’t let your, you can hack together a solution to make them unique:
In [169]: df2 = pd.read_table('dummy.csv', header=None)
df2
Out[169]:
0 1 2 3 4 5
0 Time H1 N2 Time N2 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [171]: from collections import defaultdict
col_counts = defaultdict(int)
col_ix = df2.first_valid_index()
In [172]: cols = []
for col in df2.ix[col_ix]:
cnt = col_counts[col]
col_counts[col] += 1
suf = '_' + str(cnt) if cnt else ''
cols.append(col + suf)
cols
Out[172]:
['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']
In [174]: df2.columns = cols
df2 = df2.drop([col_ix])
In [177]: df2
Out[177]:
Time H1 N2 Time_1 N2_1 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [178]: df2.T.drop_duplicates().T
Out[178]:
Time H1 Time Relative
1 3 13 0
2 1 15 14
3 14 19 14
4 19 5 1
Transposing is inefficient for large DataFrames. Here is an alternative:
def duplicate_columns(frame):
groups = frame.columns.to_series().groupby(frame.dtypes).groups
dups = []
for t, v in groups.items():
dcols = frame[v].to_dict(orient="list")
vs = dcols.values()
ks = dcols.keys()
lvs = len(vs)
for i in range(lvs):
for j in range(i+1,lvs):
if vs[i] == vs[j]:
dups.append(ks[i])
break
return dups
Use it like this:
dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)
Edit
A memory efficient version that treats nans like any other value:
from pandas.core.common import array_equivalent
def duplicate_columns(frame):
groups = frame.columns.to_series().groupby(frame.dtypes).groups
dups = []
for t, v in groups.items():
cs = frame[v].columns
vs = frame[v]
lcs = len(cs)
for i in range(lcs):
ia = vs.iloc[:,i].values
for j in range(i+1, lcs):
ja = vs.iloc[:,j].values
if array_equivalent(ia, ja):
dups.append(cs[i])
break
return dups
If I’m not mistaken, the following does what was asked without the memory problems of the transpose solution and with fewer lines than @kalu ‘s function, keeping the first of any similarly named columns.
Cols = list(df.columns)
for i,item in enumerate(df.columns):
if item in df.columns[:i]: Cols[i] = "toDROP"
df.columns = Cols
df = df.drop("toDROP",1)
Here’s a one line solution to remove columns based on duplicate column names:
df = df.loc[:,~df.columns.duplicated()].copy()
How it works:
Suppose the columns of the data frame are ['alpha','beta','alpha']
df.columns.duplicated()
returns a boolean array: a True
or False
for each column. If it is False
then the column name is unique up to that point, if it is True
then the column name is duplicated earlier. For example, using the given example, the returned value would be [False,False,True]
.
Pandas
allows one to index using boolean values whereby it selects only the True
values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (ie [True, True, False] = ~[False,False,True]
)
Finally, df.loc[:,[True,True,False]]
selects only the non-duplicated columns using the aforementioned indexing capability.
The final .copy()
is there to copy the dataframe to (mostly) avoid getting errors about trying to modify an existing dataframe later down the line.
Note: the above only checks columns names, not column values.
To remove duplicated indexes
Since it is similar enough, do the same thing on the index:
df = df.loc[~df.index.duplicated(),:].copy()
To remove duplicates by checking values without transposing
Update and caveat: please be careful in applying this. Per the counter-example provided by DrWhat in the comments, this solution may not have the desired outcome in all cases.
df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()
This avoids the issue of transposing. Is it fast? No. Does it work? In some cases. Here, try it on this:
# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0,100,size= (736334,1312)))
#to see size in gigs
#ldf.memory_usage().sum()/1e9 #it's about 3 gigs
# duplicate a column
ldf.loc[:,'dup'] = ldf.loc[:,101]
# take out duplicated columns by values
ldf = ldf.loc[:,~ldf.apply(lambda x: x.duplicated(),axis=1).all()].copy()
First step:- Read first row i.e all columns the remove all duplicate columns.
Second step:- Finally read only that columns.
cols = pd.read_csv("file.csv", header=None, nrows=1).iloc[0].drop_duplicates()
df = pd.read_csv("file.csv", usecols=cols)
I ran into this problem where the one liner provided by the first answer worked well. However, I had the extra complication where the second copy of the column had all of the data. The first copy did not.
The solution was to create two data frames by splitting the one data frame by toggling the negation operator. Once I had the two data frames, I ran a join statement using the lsuffix
. This way, I could then reference and delete the column without the data.
– E
March 2021 update
The subsequent post by @CircArgs may have provided a succinct one-liner to accomplish what I described here.
It looks like you were on the right path. Here is the one-liner you were looking for:
df.reset_index().T.drop_duplicates().T
But since there is no example data frame that produces the referenced error message Reindexing only valid with uniquely valued index objects
, it is tough to say exactly what would solve the problem. if restoring the original index is important to you do this:
original_index = df.index.names
df.reset_index().T.drop_duplicates().reset_index(original_index).T
The way below will identify dupe columns to review what is going wrong building the dataframe originally.
dupes = pd.DataFrame(df.columns)
dupes[dupes.duplicated()]
Fast and easy way to drop the duplicated columns by their values:
df = df.T.drop_duplicates().T
More info: Pandas DataFrame drop_duplicates manual .
Note that Gene Burinsky’s answer (at the time of writing the selected answer) keeps the first of each duplicated column. To keep the last:
df=df.loc[:, ~df.columns[::-1].duplicated()[::-1]]
An update on @kalu’s answer, which uses the latest pandas:
def find_duplicated_columns(df):
dupes = []
columns = df.columns
for i in range(len(columns)):
col1 = df.iloc[:, i]
for j in range(i + 1, len(columns)):
col2 = df.iloc[:, j]
# break early if dtypes aren't the same (helps deal with
# categorical dtypes)
if col1.dtype is not col2.dtype:
break
# otherwise compare values
if col1.equals(col2):
dupes.append(columns[i])
break
return dupes
Just in case somebody still looking for an answer in how to look for duplicated values in columns for a Pandas Data Frame in Python, I came up with this solution:
def get_dup_columns(m):
'''
This will check every column in data frame
and verify if you have duplicated columns.
can help whenever you are cleaning big data sets of 50+ columns
and clean up a little bit for you
The result will be a list of tuples showing what columns are duplicates
for example
(column A, Column C)
That means that column A is duplicated with column C
more info go to https://wanatux.com
'''
headers_list = [x for x in m.columns]
duplicate_col2 = []
y = 0
while y <= len(headers_list)-1:
for x in range(1,len(headers_list)-1):
if m[headers_list[y]].equals(m[headers_list[x]]) == False:
continue
else:
duplicate_col2.append((headers_list[y],headers_list[x]))
headers_list.pop(0)
return duplicate_col2
And you can cast the definition like this:
duplicate_col = get_dup_columns(pd_excel)
It will show a result like the following:
[('column a', 'column k'),
('column a', 'column r'),
('column h', 'column m'),
('column k', 'column r')]
Although @Gene Burinsky answer is great, it has a potential problem in that the reassigned df may be either a copy or a view of the original df.
This means that subsequent assignments like df['newcol'] = 1
generate a SettingWithCopy
warning and may fail (https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing). The following solution prevents that issue:
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
I am not sure why Gene Burinsky’s answer did not work for me. I was getting the same original dataframes with duplicated columns. My workaround was force the selection over the ndarray and get back the dataframe.
df = pd.DataFrame(df.values[:,~df.columns.duplicated()], columns=df.columns[~df.columns.duplicated()])
A simple column-wise comparison is the most efficient way (in terms of memory and time) to check duplicated columns by values. Here an example:
import numpy as np
import pandas as pd
from itertools import combinations as combi
df = pd.DataFrame(np.random.uniform(0,1, (100,4)), columns=['a','b','c','d'])
df['a'] = df['d'].copy() # column 'a' is equal to column 'd'
# to keep the first
dupli_cols = [cc[1] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
# to keep the last
dupli_cols = [cc[0] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
df = df.drop(columns=dupli_cols)
In case you want to check for duplicate columns, this code can be useful
columns_to_drop= []
for cname in sorted(list(df)):
for cname2 in sorted(list(df))[::-1]:
if df[cname].equals(df[cname2]) and cname!=cname2 and cname not in columns_to_drop:
columns_to_drop.append(cname2)
print(cname,cname2,'Are equal')
df = df.drop(columns_to_drop, axis=1)