Pandas. Check if column names are unique

Question:

I’m importing some data into a pandas DataFrame using the read_excel() method. The problem is that my .xlsx file may have several columns with the same name (for example, ‘gamma’ and ‘gamma’). I don’t want to work with such data; I want to raise an exception instead. But I can’t figure out how to check whether the column names are unique. During the import, pandas renames a duplicated column by appending .digit to its name, so a check like len(list(df.columns)) == len(set(df.columns)) always passes.

Note: a real column name may itself end with .digit, so testing for that suffix would produce false positives. That rules out something like if any(".1" in col for col in df.columns): raise Exception(...)

   alpha  beta  gamma  gamma.1
0      1     9      1       35
1      2     8    543       12
2      3     7      6       45
3      4     6      4       64
4      5     5      2      865
5      6     4     56      235
6      7     3      6      124
7      8     2      2      135
8      9     1     26      767

How can I check for duplicate column names? Thank you.

Asked By: Mykhailo Yurchenko

Answers:

If you just want to check, you could read the header row with openpyxl before building the DataFrame:

EDIT: updated so that the file is read only once, as you asked.

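import openpyxl
import pandas as pd

def load_df(path, sheet_name):
    wb = openpyxl.load_workbook(path)
    ws = wb[sheet_name]
    rows = ws.values            # generator yielding one tuple per row
    header = next(rows)         # the first row holds the column names
    if len(header) != len(set(header)):
        raise ValueError("Duplicate column names in header row")

    # The remaining rows are the data; the header row becomes the column names
    return pd.DataFrame(list(rows), columns=header)

df = load_df("data/test.xlsx", "Sheet1")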

It might be worth keeping an eye on future updates of pandas. The read_excel documentation says the following:

mangle_dupe_cols: bool, default True

Duplicate columns will be specified as ‘X’, ‘X.1’, …, ‘X.N’, rather than ‘X’, …, ‘X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.

Deprecated since version 1.5.0: Not implemented, and a new argument to specify the pattern for the names of duplicated columns will be added instead.

So in future you’ll be able to write simpler code that can handle it.
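
To see the renaming described in the docs without needing an Excel file, here is a minimal sketch that swaps in read_csv on an in-memory string (an assumption made purely for illustration; the sample data is made up to match the question). It shows why a uniqueness check on df.columns passes after the default import, while the raw header row still exposes the duplicate:

```python
import io

import pandas as pd

# Made-up sample with a duplicated "gamma" header, mirroring the question
csv_text = "alpha,beta,gamma,gamma\n1,9,1,35\n2,8,543,12\n"

# Default import: pandas renames the second "gamma" to "gamma.1"
df = pd.read_csv(io.StringIO(csv_text))
print(list(df.columns))      # ['alpha', 'beta', 'gamma', 'gamma.1']
print(df.columns.is_unique)  # True, so the duplicate goes unnoticed

# Reading the header row as plain data keeps the original names
raw_header = pd.read_csv(io.StringIO(csv_text), header=None, nrows=1).iloc[0]
print(raw_header.is_unique)  # False
```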

Answered By: Rykari

Read just your header as normal data and check for duplicates:

import pandas as pd

df = pd.read_excel("file.xlsx", header=None, nrows=1)
if not df.iloc[0].is_unique:
    raise Exception("Duplicates")

This requires opening the file twice, although the first time you only read a single row from it. If you still want to avoid that, you can do this:

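# Read everything, header row included, as plain data
df = pd.read_excel("file.xlsx", header=None)

columns = df.iloc[0]
if not columns.is_unique:
    raise Exception("Duplicates")

# Drop the header row, renumber the index from 0, and promote the header
df = df.drop(0).reset_index(drop=True)
df.columns = columns.tolist()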
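
The single-read approach can also be wrapped in a small helper that reports which names are repeated. This is only a sketch: promote_header is a hypothetical name, the in-memory frame stands in for the result of pd.read_excel("file.xlsx", header=None), and the infer_objects() call is an addition of mine, since columns read together with a text header row typically come back as object dtype.

```python
import pandas as pd


def promote_header(raw: pd.DataFrame) -> pd.DataFrame:
    """Use row 0 of a header=None frame as column names, rejecting duplicates.

    Hypothetical helper, not part of pandas.
    """
    header = raw.iloc[0]
    # duplicated(keep=False) flags every occurrence of a repeated name
    dupes = header[header.duplicated(keep=False)].unique().tolist()
    if dupes:
        raise ValueError(f"Duplicate column names: {dupes}")
    body = raw.iloc[1:].reset_index(drop=True)
    body.columns = header.tolist()
    # Re-infer dtypes, because the header row forced object columns
    return body.infer_objects()


# Stand-in for pd.read_excel("file.xlsx", header=None)
raw = pd.DataFrame(
    [["alpha", "beta", "gamma", "gamma"], [1, 9, 1, 35], [2, 8, 543, 12]]
)
try:
    promote_header(raw)
except ValueError as exc:
    print(exc)  # Duplicate column names: ['gamma']
```

Raising ValueError with the offending names makes the failure easier to act on than a bare Exception("Duplicates").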