Hypothesis Python package for one-hots and longitudinal data

Question:

For context, I work with mixed tabular data. I have complex data pipelines that I’d like to make sure work on any configuration of data.

I see the pandas add-on/extra and have some questions related to that.

  1. How would I generate one-hot columns with this package? Right now I’m just creating a column of integers from 0 to nclasses-1 and then one-hot encoding it afterwards, but having to do that every time adds up.

  2. How would I generate longitudinal data with this package? Say I want a multi index and then to generate a bunch of data for that?

  3. Can I control the missingness more precisely? For example, the integers strategy doesn’t allow missingness. How would that also factor into multi-categorical data? Or should I just do it myself later?

Edit to add:
  4. I would also be interested in trying a mix of columns, rather than always having all columns present.

For example, this is what I have right now for data that mixes a continuous, a binary, and a multicategorical feature and then one-hot encodes the latter:

import unittest

import pandas as pd
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import data_frames, column

class TestTransforms(unittest.TestCase):
    @given(
        data_frames(
            columns=[
                # create continuous var
                column("ctn", dtype=float),
                # create binary var
                column("bin", elements=st.integers(0, 1)),
                # create multicategorical (numerically encoded) var
                column("mult", elements=st.integers(0, 2)),
            ]
        )
    )
    def test_hypothesis(self, df):
        # one-hot encode the multicategorical column
        df = pd.concat(
            [
                df.drop(["mult"], axis=1),
                pd.get_dummies(df["mult"], prefix="mult"),
            ],
            axis=1,
        )

if __name__ == "__main__":
    unittest.main()

Final edit: Here is the final version that works for me as I wanted it to!

import unittest
from typing import Callable

import numpy as np
import pandas as pd
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import data_frames, column


def onehot_multicategorical_column(
    prefix: str,
) -> Callable[[pd.DataFrame], pd.DataFrame]:
    def integrate_onehots(df: pd.DataFrame) -> pd.DataFrame:
        if df[prefix].empty:
            return df
        dummies = pd.get_dummies(df, columns=[prefix], prefix=prefix, dummy_na=True)
        # Retain nans
        dummies.loc[
            dummies[f"{prefix}_nan"].astype(bool),
            dummies.columns.str.startswith(prefix),
        ] = np.nan
        return dummies.drop(f"{prefix}_nan", axis=1)

    return integrate_onehots


def unpack_tuples(nested_tuples):
    """
    We receive a List[Tuple[int, List[int]]].
    The first int is the numerical id, and the second is the "time point".
    We want to flatten this into a List[Tuple[int, int]] with the same
    id for multiple time points.
    E.g. [(0,[0,1,2]), (1,[0,2])] => [(0,0), (0,1), (0,2), (1,0), (1,2)]
    """
    return [
        (pt_id, time_pt) for pt_id, time_pts in nested_tuples for time_pt in time_pts
    ]

class TestTransforms(unittest.TestCase):
    @given(
        data_frames(
            columns=[
                column("ctn", dtype=float),
                column("bin", elements=st.one_of(st.none(), st.integers(0, 1))),
                column(
                    "mult", elements=st.one_of(st.none(), st.sampled_from([0, 1, 2]))
                ),
            ],
            index=st.builds(
                pd.MultiIndex.from_tuples,
                st.lists(
                    st.tuples(
                        st.integers(0), st.lists(st.integers(0), min_size=1, max_size=5)
                    ),
                    min_size=2,
                ).map(unpack_tuples),
            ),
        ).map(onehot_multicategorical_column("mult"))
    )
    def test_hypothesis(self, df):
        # test stuff with df
        pass


if __name__ == "__main__":
    unittest.main()
Asked By: davzaman


Answers:

How would I generate one-hot columns with this package? Right now I’m just creating a column of integers from 0 to nclasses-1 and then one-hot encoding it afterwards, but having to do that every time adds up.

That, or something equivalent like sampled_from(column_names), is exactly how I’d do it. A helper function applied with the .map(categories_to_one_hot_columns) method should make this reasonably easy.
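
A minimal sketch of such a helper, assuming the integer-coded column is called "mult" and has three classes as in the question. N_CLASSES and the body of categories_to_one_hot_columns are illustrative choices, not something Hypothesis provides:

```python
import pandas as pd
from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames

N_CLASSES = 3  # assumption: number of categories in the "mult" column


def categories_to_one_hot_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Replace the integer-coded "mult" column with one indicator column per class."""
    # Casting to a categorical with fixed categories makes get_dummies emit a
    # column for every class, even one that is absent from the generated data.
    dtype = pd.CategoricalDtype(categories=list(range(N_CLASSES)))
    dummies = pd.get_dummies(df["mult"].astype(dtype), prefix="mult")
    return pd.concat([df.drop(columns="mult"), dummies], axis=1)


# Strategy that yields frames which are already one-hot encoded.
one_hot_frames = data_frames(
    columns=[
        column("ctn", dtype=float),
        column("mult", elements=st.integers(0, N_CLASSES - 1)),
    ]
).map(categories_to_one_hot_columns)
```

Passing one_hot_frames to @given then hands each test a frame that is already encoded, so the encoding step does not need to be repeated in every test body.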

How would I generate longitudinal data with this package? Say I want a multi index and then to generate a bunch of data for that?

The pdst.series() and pdst.data_frames() strategies both accept an index= argument, which you could define as, for example:

index = st.builds(
    pd.MultiIndex.from_tuples,
    st.lists(st.tuples(...), min_size=1, max_size=10)
)
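
One possible way to fill in the tuples, as a sketch only: it assumes each row is keyed by a (subject id, time point) pair of small non-negative integers, and the level names "id" and "time" as well as the bounds are made up for illustration.

```python
import pandas as pd
from hypothesis import given, settings, strategies as st
from hypothesis.extra.pandas import column, data_frames

# Assumption: rows are keyed by (subject id, time point).
# unique=True avoids duplicate index entries; sorting groups each subject's rows.
index_tuples = st.lists(
    st.tuples(st.integers(0, 20), st.integers(0, 5)),
    min_size=1,
    max_size=10,
    unique=True,
).map(sorted)

longitudinal_index = st.builds(
    lambda tuples: pd.MultiIndex.from_tuples(tuples, names=["id", "time"]),
    index_tuples,
)

longitudinal_frames = data_frames(
    columns=[column("ctn", dtype=float)],
    index=longitudinal_index,
)


@settings(max_examples=25, deadline=None, database=None)
@given(longitudinal_frames)
def check_longitudinal_frames(df):
    # Every generated frame carries a unique two-level index of 1 to 10 rows.
    assert isinstance(df.index, pd.MultiIndex)
    assert df.index.nlevels == 2
    assert df.index.is_unique
    assert 1 <= len(df) <= 10
```

Calling check_longitudinal_frames() runs the property directly; inside a unittest or pytest suite the same @given decorator would go on a test method instead.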

Can I control the missingness more precisely? For example, the integers strategy doesn’t allow missingness. How would that also factor into multi-categorical data? Or should I just do it myself later?

I’d use st.none() | st.integers() for missingness; more generally, st.one_of(...) can be used to mix strategies together.
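
As a rough illustration of that idea applied to the binary and multicategorical columns from the question (the column names and value ranges are taken from the question; leaving dtype unset so that pandas picks a dtype that can hold the missing values is an assumption of this sketch):

```python
from hypothesis import given, settings, strategies as st
from hypothesis.extra.pandas import column, data_frames

# None marks a missing entry; st.one_of(a, b) is equivalent to a | b.
maybe_binary = st.none() | st.integers(0, 1)
maybe_category = st.one_of(st.none(), st.sampled_from([0, 1, 2]))

frames_with_missing = data_frames(
    columns=[
        column("bin", elements=maybe_binary),
        column("mult", elements=maybe_category),
    ]
)


@settings(max_examples=25, deadline=None, database=None)
@given(frames_with_missing)
def check_missingness(df):
    # Each entry is either missing or one of the allowed codes.
    assert df["bin"].dropna().isin([0, 1]).all()
    assert df["mult"].dropna().isin([0, 1, 2]).all()
```

Hypothesis chooses between the branches of one_of itself, so this does not fix an exact missingness rate; if a specific rate matters, masking values after generation may be the simpler route.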

Answered By: Zac Hatfield-Dodds