Pandas. Check if column names are unique
Question:
I’m importing some data to a pandas DataFrame using the read_excel()
method. The problem is that my .xlsx file may have several columns with the same name (for example, two columns both named ‘gamma’). I’m not going to work with such data; I want to raise an exception in that case. But I can’t figure out how to check whether the columns are unique: during the import, pandas renames duplicate columns by appending .digit to them, so I can’t do something like len(list(df.columns)) == len(set(df.columns)).
Note: an actual column name may itself end with .digit, so checking for that suffix could introduce bugs; I can’t do if any(".1" in col for col in df.columns): raise Exception(...)
alpha beta gamma gamma.1
0 1 9 1 35
1 2 8 543 12
2 3 7 6 45
3 4 6 4 64
4 5 5 2 865
5 6 4 56 235
6 7 3 6 124
7 8 2 2 135
8 9 1 26 767
How can I check for duplicate column names? Thank you.
Answers:
If you just want to check you could do something like:
EDIT: As requested, here is a version that reads the file only once:

import openpyxl
import pandas as pd

def load_df(path, sheet_name):
    wb = openpyxl.load_workbook(path, read_only=True)
    ws = wb[sheet_name]
    rows = ws.values
    header = next(rows)  # the first row holds the column names
    if len(header) != len(set(header)):
        raise Exception("Duplicate column names")
    # Build the frame from the remaining rows so the header row
    # does not end up as a data row
    return pd.DataFrame(rows, columns=header)

df = load_df("data/test.xlsx", "Sheet1")
It might be worth keeping an eye on future pandas releases. The docs state the following:
mangle_dupe_cols: bool, default True
Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
Deprecated since version 1.5.0: Not implemented, and a new argument to
specify the pattern for the names of duplicated columns will be added
instead
So in the future you’ll be able to write simpler code to handle this.
Read just your header as normal data and check for duplicates:
import pandas as pd

df = pd.read_excel("file.xlsx", header=None, nrows=1)
if not df.iloc[0].is_unique:
    raise Exception("Duplicates")
This requires opening the file twice, although the first read only fetches one row. If you still want to avoid that, you can do this:
df = pd.read_excel("file.xlsx", header=None)
columns = df.iloc[0]
if not columns.is_unique:
    raise Exception("Duplicates")
df.drop(0, inplace=True)
df.columns = columns
# Note: with header=None every column is read as object dtype;
# df = df.infer_objects() restores the numeric dtypes afterwards.
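As a middle ground between the two snippets above, pd.ExcelFile opens the workbook once and lets you parse it more than once, so you can validate the header row first and then let pandas do its usual header handling. A sketch (read_checked is a hypothetical helper; the in-memory workbook is only there to make the example self-contained):

```python
import io

import openpyxl
import pandas as pd

def read_checked(source):
    """Open the workbook once; raise if the header row has duplicate names."""
    xl = pd.ExcelFile(source)
    header = xl.parse(header=None, nrows=1).iloc[0]
    if not header.is_unique:
        raise Exception(
            "Duplicate column names: %s" % header[header.duplicated()].tolist()
        )
    return xl.parse()  # full read with pandas' normal header handling

# Illustration only: build a tiny workbook in memory with a duplicated column
wb = openpyxl.Workbook()
ws = wb.active
ws.append(["alpha", "gamma", "gamma"])
ws.append([1, 2, 3])
buf = io.BytesIO()
wb.save(buf)
buf.seek(0)

try:
    read_checked(buf)
except Exception as exc:
    print(exc)  # Duplicate column names: ['gamma']
```

The workbook is loaded once when the ExcelFile is constructed; the second xl.parse() reuses the already loaded book rather than reading the file again.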