How can I get automatic features with dfs, using featuretools, when I have only one dataframe?

Question:

I am trying to figure out how Featuretools works and I am testing it on the Housing Prices dataset on Kaggle. Because the dataset is huge, I’ll work here with only a subset of it.

The dataframe is:

import pandas as pd

train = pd.DataFrame({
'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 
'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60}, 
'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'}, 
'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0}, 
'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260}
})

I set the dataframe properties:

dataframes = {'train': (train, 'Id')}

Then call the dfs method:

import featuretools as ft

train_feature_matrix, train_feature_names = ft.dfs(
    dataframes=dataframes,
    target_dataframe_name='train',
    max_depth=10,
    agg_primitives=["mean", "sum", "mode"]
)

I get the following warning:

UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
agg_primitives: ['mean', 'mode', 'sum']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data. If the DFS call contained multiple instances of a primitive in the list above, none of them were used.
warnings.warn(warning_msg, UnusedPrimitiveWarning)

And train_feature_matrix is exactly the same as the original train dataframe.

At first, I thought this was because I have a small dataframe and nothing useful can be extracted from it. But I get the same behavior with the entire dataframe (80 columns and 1460 rows).

Every example I saw on the Featuretools page had 2+ dataframes, but I only have one.

Can you shed some light here? What am I doing wrong?

Asked By: Bogdan Doicin


Answers:

Aggregation primitives cannot create features on an EntitySet with a single DataFrame.

This is because the aggregation they perform happens over the one-to-many relationship that exists when you have a parent-child relationship between DataFrames in an EntitySet. The Featuretools guide on primitives has a section that explains the difference here. With your data, that might look like a child DataFrame that has a non-unique house_id column. Then, running dfs on your train DataFrame would aggregate the desired information for each Id, using every row where that Id shows up in the child DataFrame.
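To make that concrete, here is a minimal sketch of what a parent-child setup could look like; the sales DataFrame and its house_id / sale_price columns are invented for the example and are not part of the original question:

import pandas as pd
import featuretools as ft

# Hypothetical child DataFrame: several sales rows can point to the same house Id
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4],
    'house_id': [1, 1, 2, 3],
    'sale_price': [200000, 210000, 180000, 250000],
})

es = ft.EntitySet(id='housing')
es = es.add_dataframe(dataframe_name='train', dataframe=train, index='Id')
es = es.add_dataframe(dataframe_name='sales', dataframe=sales, index='sale_id')

# One-to-many relationship: train.Id (parent) -> sales.house_id (child)
es = es.add_relationship('train', 'Id', 'sales', 'house_id')

# Now the aggregation primitives have rows to aggregate over
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='train',
    agg_primitives=['mean', 'sum', 'mode'],
)

With a setup like this, dfs should produce columns such as MEAN(sales.sale_price) for each Id in train.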

To get automated feature generation with a single DataFrame, you should use Transform features. The available Transform Primitives can be found here.
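With the single-DataFrame setup from the question, that could look something like the sketch below; the chosen primitives ('add_numeric', 'multiply_numeric', 'absolute') are just examples of numeric transform primitives, not a required set:

import featuretools as ft

feature_matrix, feature_defs = ft.dfs(
    dataframes={'train': (train, 'Id')},
    target_dataframe_name='train',
    trans_primitives=['add_numeric', 'multiply_numeric', 'absolute'],
)

# Should yield stacked features such as "LotArea + LotFrontage" or "ABSOLUTE(LotFrontage)"
print(feature_defs)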

Answered By: Tamar Grey

If you only have one dataset, the "headjackai" library may fit your situation better than Featuretools. In that library, the feature engineering functions are learned from datasets; technically speaking, it provides an embedding space for exchanging features between domains of tabular data, so you can, for example, apply features from the Titanic domain to improve a house pricing task.

It is an open community, so you can create new feature engineering functions yourself or apply ones other people have published in the public feature model pool. It has more than a hundred feature models now.

For example:

import numpy as np
import lightgbm as lgbm
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

from headjackai.headjackai_hub import headjackai_hub

# headjackai experiment
# X, y are assumed to be the feature DataFrame and target of the pricing task

# host setting
hj_hub = headjackai_hub('http://www.headjackai.com:9000')

#account login
hj_hub.login(username='jimliu_stackoverflow', pwd='jimliu_stackoverflow')

pool_list = hj_hub.knowledgepool_check(True)
score_list = []
task_list = []

# try each feature model
for source in pool_list:
    hj_X = hj_hub.knowledge_transform(data=X, 
                                  target_domain='boston_comparsion', 
                                  source_domain=source,
                                  label='')    

    N_SPLITS = 5
    strat_kf = KFold(n_splits=N_SPLITS, shuffle=True, random_state=8888)
    tr_scores = np.empty(N_SPLITS)
    scores = np.empty(N_SPLITS)

    try:
        # cv-5, lgbm, mae
        for idx, (train_idx, test_idx) in enumerate(strat_kf.split(X, y)):
            X_train, X_test = hj_X.iloc[train_idx], hj_X.iloc[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]

            cb_clf = lgbm.LGBMRegressor()
            cb_clf.fit(X_train, y_train)

            # test-fold error
            preds = cb_clf.predict(X_test)
            scores[idx] = mean_absolute_error(y_test, preds)

            # train-fold error
            preds = cb_clf.predict(X_train)
            tr_scores[idx] = mean_absolute_error(y_train, preds)

        print("-----------------",source,"-----------------")
        print(f"mean score: {tr_scores.mean():.5f}")
        print(f"mean score: {scores.mean():.5f}")
        score_list.append(scores.mean())
        task_list.append(source)
    
    except Exception:
        # skip source domains that fail to transform or fit
        pass


arg_index = score_list.index(min(score_list))
print(task_list[arg_index], min(score_list))
# ames-house 2.1316169625933044

In the code sample above, I try each feature model on the Boston house pricing task and pick the best one as the feature engineering function.

With this library, you can get a lot of automated feature generation, even with only a single dataset.

Answered By: Jim Liu