Find mixed types in Pandas columns
Question:
Every so often I get this warning when parsing data files:
WARNING:py.warnings:/usr/local/python3/miniconda/lib/python3.4/site-
packages/pandas-0.16.0_12_gdcc7431-py3.4-linux-x86_64.egg/pandas
/io/parsers.py:1164: DtypeWarning: Columns (0,2,14,20) have mixed types.
Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
But if the data is large (I have 50k rows), how can I find WHERE in the data the change of dtype occurs?
Answers:
I’m not entirely sure what you’re after, but it’s easy enough to find the rows which contain elements which don’t share the type of the first row. For example:
>>> df = pd.DataFrame({"A": np.arange(500), "B": np.arange(500.0)})
>>> df.loc[321, "A"] = "Fred"
>>> df.loc[325, "B"] = True
>>> weird = (df.applymap(type) != df.iloc[0].apply(type)).any(axis=1)
>>> df[weird]
A B
321 Fred 321
325 325 True
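Building on the same idea, if you want the exact row WHERE each column first changes type (what the question asks for), you can compare each element's type against the type in row 0 and take the first mismatch. A minimal sketch, using `Series.map` and a DataFrame constructed with a mixed column up front:

```python
import pandas as pd

# Build a frame where column "A" mixes ints and one string (object dtype),
# while "B" stays uniformly float.
values = list(range(500))
values[321] = "Fred"
df = pd.DataFrame({"A": values, "B": [float(i) for i in range(500)]})

# For each column, compare each element's type to the type in row 0 and
# record the first row label where it differs.
first_change = {}
for col in df.columns:
    types = df[col].map(type)
    mismatch = types != types.iloc[0]
    if mismatch.any():
        # idxmax on a boolean Series returns the label of the first True
        first_change[col] = mismatch.idxmax()

print(first_change)  # {'A': 321}
```

This gives you the row label of the first offending value per column, rather than just the set of weird rows.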
In addition to DSM’s answer, with a many-column dataframe it can be helpful to find the columns that change type like so:
for col in df.columns:
    weird = (df[[col]].applymap(type) != df[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df[weird]) > 0:
        print(col)
This approach uses pandas.api.types.infer_dtype to find the columns which have mixed dtypes. It was tested with Pandas 1 under Python 3.8.
Note that this answer has multiple uses of assignment expressions which work only with Python 3.8 or newer. It can however trivially be modified to not use them.
if mixed_dtypes := {c: dtype for c in df.columns if (dtype := pd.api.types.infer_dtype(df[c])).startswith("mixed")}:
    raise TypeError(f"Dataframe has one or more mixed dtypes: {mixed_dtypes}")
This approach doesn’t however find a row with the changed dtype.
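As noted above, the assignment expressions can be removed trivially for Python versions before 3.8. A sketch of that variant, collecting the offending columns in a plain loop (the sample frame here is illustrative):

```python
import pandas as pd

# Sample frame: "A" mixes integers and a string, "B" is uniformly float.
df = pd.DataFrame({"A": [1, "x", 3], "B": [1.0, 2.0, 3.0]})

# Same check as above, without walrus operators (works on Python < 3.8).
mixed_dtypes = {}
for c in df.columns:
    dtype = pd.api.types.infer_dtype(df[c])
    if dtype.startswith("mixed"):
        mixed_dtypes[c] = dtype

print(mixed_dtypes)  # {'A': 'mixed-integer'}
```

From here you can raise, log, or inspect the flagged columns as before.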
Create sample data with a column that has 2 data types
import seaborn
iris = seaborn.load_dataset("iris")
# Change one row to another type
iris.loc[0,"sepal_length"] = iris.loc[0,"sepal_length"].astype(str)
When columns use more than one type, print the column name and the types used:
for col in iris.columns:
    unique_types = iris[col].apply(type).unique()
    if len(unique_types) > 1:
        print(col, unique_types)
To fix the column types you can:
- use
df[col] = df[col].astype(str)
to change the data type,
- or, if the data frame was read from a CSV file, pass the `dtype` argument as a dictionary of columns.
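For the second option, a minimal sketch of passing `dtype` to `pd.read_csv` (the CSV content and column names here are hypothetical, fed from an in-memory buffer so the example is self-contained):

```python
import io
import pandas as pd

# Hypothetical CSV where the "id" column mixes digits and letters; without
# a dtype hint pandas would infer types chunk by chunk and may warn.
csv_data = io.StringIO("id,value\n1,a\n2,b\nX,c\n")

# Forcing "id" to str gives the column a single, predictable element type.
df = pd.read_csv(csv_data, dtype={"id": str})
print(df["id"].map(type).unique())
```

Declaring the dtype up front both silences the `DtypeWarning` and avoids the silent per-chunk type guessing that `low_memory` triggers.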