Splitting data into train and validation such that all unique queries are present in train, if not append from validation

Question:

I’m trying to split my dataset into train and validation sets, and then append to train any words that appear only in validation.
Sample input df:

query           word                    label               tag
polish          ['polish']              ['other']           [10]
angle grinder   ['angle', 'grinder']    ['other', 'other']  [10, 10]
vaccum cleaner  ['vaccum', 'cleaner']   ['other', 'other']  [10, 10]

After splitting, the train split looks like:

query           word                    label               tag
polish          ['polish']              ['other']           [10]
angle grinder   ['angle', 'grinder']    ['other', 'other']  [10, 10]

and the validation split looks like:

query           word                    label               tag
vaccum cleaner  ['vaccum', 'cleaner']   ['other', 'other']  [10, 10]

The query "vaccum cleaner" appears only in validation, and I want to append it to train at the word level, so that my output will be:

query           word                    label               tag
polish          ['polish']              ['other']           [10]
angle grinder   ['angle', 'grinder']    ['other', 'other']  [10, 10]
vaccum          ['vaccum']              ['other']           [10]
cleaner         ['cleaner']             ['other']           [10]

I have tried the following approach:

train_data = df.sample(frac=1 - 0.15, random_state=20)
val_data = df.drop(index=train_data.index)

val_words = set(word for words in val_data['word'] for word in words)
train_words = set(word for words in train_data['word'] for word in words)
new_words = val_words - train_words
new_rows = []
for index, row in val_data.iterrows():
    words = row['word']
    if any(word in new_words for word in words):
        for word, label, tag in zip(words, row['label'], row['tag']):
            new_rows.append((word, [word], [label], [tag]))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job here.
train_data = pd.concat([train_data, pd.DataFrame(new_rows, columns=train_data.columns)],
                       ignore_index=True)

With this approach, only the first word gets appended, and the labels are appended as they are instead of being split per word.
How should I proceed?

Asked By: Prateek Singh

||

Answers:

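import pandas as pd

def check_train_val_split(train_data, val_data):
    # Collect the distinct words on each side of the split.
    val_words = set(val_data.explode('word')['word'])
    train_words = set(train_data.explode('word')['word'])
    new_words = val_words - train_words

    if not new_words:
        return train_data

    # Keep validation rows that contain at least one word missing from train.
    mask = val_data['word'].apply(lambda words: any(word in new_words for word in words))
    # Explode the three list columns together (requires pandas >= 1.3).
    new_rows = val_data[mask].explode(['word', 'label', 'tag'])
    # Deduplicate while the values are still scalars; lists are unhashable,
    # so drop_duplicates would fail once they are wrapped again.
    new_rows = new_rows.drop_duplicates(subset='word', keep='first')
    # Each appended row is a single-word query.
    new_rows['query'] = new_rows['word']
    # Wrap each scalar back into a one-element list to match the train columns.
    for col in ['word', 'label', 'tag']:
        new_rows[col] = new_rows[col].apply(lambda value: [value])

    return pd.concat([train_data, new_rows], ignore_index=True)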
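
To check the idea against the sample data from the question, here is a small self-contained sketch. The helper name append_unseen_words and the constant LIST_COLS are made up for this illustration, and the split is done by position with iloc (not with sample) so that it reproduces the train/validation split shown in the question. Note one deliberate difference: this variant appends only the words that are missing from train, not every word of a query that contains a missing word. It assumes pandas 1.3 or later for the multi-column explode.

```python
import pandas as pd

# Columns that hold one list entry per word (taken from the question's df).
LIST_COLS = ["word", "label", "tag"]


def append_unseen_words(train, val):
    """Return train plus one single-word row per validation word it lacks."""
    train_words = set(train["word"].explode())
    # One row per (word, label, tag) triple from the validation split.
    exploded = val.explode(LIST_COLS)
    # Keep only words that never occur in train, once each.
    unseen = exploded[~exploded["word"].isin(train_words)]
    unseen = unseen.drop_duplicates(subset="word").copy()
    # The new row's query is the word itself.
    unseen["query"] = unseen["word"]
    # Wrap scalars back into one-element lists to match the train columns.
    for col in LIST_COLS:
        unseen[col] = unseen[col].map(lambda value: [value])
    return pd.concat([train, unseen], ignore_index=True)


df = pd.DataFrame({
    "query": ["polish", "angle grinder", "vaccum cleaner"],
    "word": [["polish"], ["angle", "grinder"], ["vaccum", "cleaner"]],
    "label": [["other"], ["other", "other"], ["other", "other"]],
    "tag": [[10], [10, 10], [10, 10]],
})

# Positional split that mirrors the example: first two rows train, last row validation.
train, val = df.iloc[:2], df.iloc[2:]
result = append_unseen_words(train, val)
print(result)
```

With the sample data, result should have four rows: the two original train rows followed by one row each for 'vaccum' and 'cleaner'.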
Answered By: member2