Pythonic type hints with pandas?
Question:
Let’s take a simple function that takes a str and returns a dataframe:
import pandas as pd
def csv_to_df(path):
return pd.read_csv(path, skiprows=1, sep='t', comment='#')
What is the recommended pythonic way of adding type hints to this function?
If I ask python for the type of a DataFrame it returns pandas.core.frame.DataFrame
.
The following won’t work though, as it’ll tell me that pandas is not defined.
def csv_to_df(path: str) -> pandas.core.frame.DataFrame:
return pd.read_csv(path, skiprows=1, sep='t', comment='#')
Answers:
I’m currently doing the following:
from typing import TypeVar
PandasDataFrame = TypeVar('pandas.core.frame.DataFrame')
def csv_to_df(path: str) -> PandasDataFrame:
return pd.read_csv(path, skiprows=1, sep='t', comment='#')
Which gives:
> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> ~pandas.core.frame.DataFrame
Don’t know how pythonic that is, but it’s understandable enough as a type hint, I find.
Why not just use pd.DataFrame
?
import pandas as pd
def csv_to_df(path: str) -> pd.DataFrame:
return pd.read_csv(path, skiprows=1, sep='t', comment='#')
Result is the same:
> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> pandas.core.frame.DataFrame
This is straying from the original question but building off of @dangom’s answer using TypeVar
and @Georgy’s comment that there is no way to specify datatypes for DataFrame columns in type hints, you could use a simple work-around like this to specify datatypes in a DataFrame:
from typing import TypeVar
DataFrameStr = TypeVar("pandas.core.frame.DataFrame(str)")
def csv_to_df(path: str) -> DataFrameStr:
return pd.read_csv(path, skiprows=1, sep='t', comment='#')
Now there is a pip package that can help with this.
https://github.com/CedricFR/dataenforce
You can install it with pip install dataenforce
and use very pythonic type hints like:
def preprocess(dataset: Dataset["id", "name", "location"]) -> Dataset["location", "count"]:
pass
Check out the answer given here which explains the usage of the package data-science-types
.
pip install data-science-types
Demo
# program.py
import pandas as pd
df: pd.DataFrame = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]}) # OK
df1: pd.DataFrame = pd.Series([1,2,3]) # error: Incompatible types in assignment
Run using mypy the same way:
$ mypy program.py
Take a look at pandera.
pandera provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.
Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical or reproducible research settings.
The advantage of pandera is that you can also specify dtypes of individual DataFrame columns. The following example uses pandera to run-time enforce a DataFrame containing a single column of integers:
import pandas as pd
import pandera
from pandera.typing import DataFrame, Series
class Integers(pandera.SchemaModel):
number: Series[int]
@pandera.check_types
def my_fn(a: DataFrame[Integers]) -> None:
pass
# This works
df = pd.DataFrame({"number": [ 2002, 2003]})
my_fn(df)
# Raises an exception
df = pd.DataFrame({"number": [ 2002.0, 2003]})
my_fn(df)
# Raises an exception
df = pd.DataFrame({"number": [ '2002', 2003]})
my_fn(df)
Let’s take a simple function that takes a str and returns a dataframe:
import pandas as pd
def csv_to_df(path):
return pd.read_csv(path, skiprows=1, sep='t', comment='#')
What is the recommended pythonic way of adding type hints to this function?
If I ask python for the type of a DataFrame it returns pandas.core.frame.DataFrame
.
The following won’t work though, as it’ll tell me that pandas is not defined.
def csv_to_df(path: str) -> pandas.core.frame.DataFrame:
return pd.read_csv(path, skiprows=1, sep='t', comment='#')
I’m currently doing the following:
from typing import TypeVar
PandasDataFrame = TypeVar('pandas.core.frame.DataFrame')
def csv_to_df(path: str) -> PandasDataFrame:
return pd.read_csv(path, skiprows=1, sep='t', comment='#')
Which gives:
> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> ~pandas.core.frame.DataFrame
Don’t know how pythonic that is, but it’s understandable enough as a type hint, I find.
Why not just use pd.DataFrame
?
import pandas as pd
def csv_to_df(path: str) -> pd.DataFrame:
return pd.read_csv(path, skiprows=1, sep='t', comment='#')
Result is the same:
> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> pandas.core.frame.DataFrame
This is straying from the original question but building off of @dangom’s answer using TypeVar
and @Georgy’s comment that there is no way to specify datatypes for DataFrame columns in type hints, you could use a simple work-around like this to specify datatypes in a DataFrame:
from typing import TypeVar
DataFrameStr = TypeVar("pandas.core.frame.DataFrame(str)")
def csv_to_df(path: str) -> DataFrameStr:
return pd.read_csv(path, skiprows=1, sep='t', comment='#')
Now there is a pip package that can help with this.
https://github.com/CedricFR/dataenforce
You can install it with pip install dataenforce
and use very pythonic type hints like:
def preprocess(dataset: Dataset["id", "name", "location"]) -> Dataset["location", "count"]:
pass
Check out the answer given here which explains the usage of the package data-science-types
.
pip install data-science-types
Demo
# program.py
import pandas as pd
df: pd.DataFrame = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]}) # OK
df1: pd.DataFrame = pd.Series([1,2,3]) # error: Incompatible types in assignment
Run using mypy the same way:
$ mypy program.py
Take a look at pandera.
pandera provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.
Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical or reproducible research settings.
The advantage of pandera is that you can also specify dtypes of individual DataFrame columns. The following example uses pandera to run-time enforce a DataFrame containing a single column of integers:
import pandas as pd
import pandera
from pandera.typing import DataFrame, Series
class Integers(pandera.SchemaModel):
number: Series[int]
@pandera.check_types
def my_fn(a: DataFrame[Integers]) -> None:
pass
# This works
df = pd.DataFrame({"number": [ 2002, 2003]})
my_fn(df)
# Raises an exception
df = pd.DataFrame({"number": [ 2002.0, 2003]})
my_fn(df)
# Raises an exception
df = pd.DataFrame({"number": [ '2002', 2003]})
my_fn(df)