Splitting data into train and validation such that all unique queries are present in train, if not append from validation

Question

I'm trying to split my dataset into train and validation, then append to train any words that appear only in validation.
Sample input df:

query           word                    label               tag
polish          ['polish']              ['other']           [10]
angle grinder   ['angle', 'grinder']    ['other', 'other']  [10, 10]
vaccum cleaner  ['vaccum', 'cleaner']   ['other', 'other']  [10, 10]

After splitting, the train split looks like:

query           word                    label               tag
polish          ['polish']              ['other']           [10]
angle grinder   ['angle', 'grinder']    ['other', 'other']  [10, 10]

and the validation split looks like:

query           word                    label               tag
vaccum cleaner  ['vaccum', 'cleaner']   ['other', 'other']  [10, 10]

The query "vaccum cleaner" appears only in validation, so I want to append it to train, but at the word level, so that train becomes:

query           word                    label               tag
polish          ['polish']              ['other']           [10]
angle grinder   ['angle', 'grinder']    ['other', 'other']  [10, 10]
vaccum          ['vaccum']              ['other']           [10]
cleaner         ['cleaner']             ['other']           [10]

I have tried the following approach:

import pandas as pd

train_data = df.sample(frac=1 - 0.15, random_state=20)
val_data = df.drop(index=train_data.index)

val_words = set(word for words in val_data['word'] for word in words)
train_words = set(word for words in train_data['word'] for word in words)
new_words = val_words - train_words
new_rows = []
for index, row in val_data.iterrows():
    words = row['word']
    if any(word in new_words for word in words):
        for word, label, tag in zip(words, row['label'], row['tag']):
            new_rows.append((word, [word], [label], [tag]))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
train_data = pd.concat(
    [train_data, pd.DataFrame(new_rows, columns=train_data.columns)],
    ignore_index=True,
)

With this approach, only the first word gets appended, and the labels are appended unchanged instead of being split per word.
How should I proceed?

Asked By: Prateek Singh


Answer 1

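import pandas as pd

def check_train_val_split(train_data, val_data):
    # Words that occur in validation but never in train
    val_words = set(val_data.explode('word')['word'])
    train_words = set(train_data.explode('word')['word'])
    new_words = val_words - train_words

    if not new_words:
        return train_data

    # Validation rows that contain at least one unseen word
    new_rows = val_data[val_data['word'].apply(lambda words: any(word in new_words for word in words))]
    # One row per word; exploding several columns at once needs pandas >= 1.3
    new_rows = new_rows.explode(['word', 'label', 'tag'], ignore_index=True)
    # Use the word itself as the query
    new_rows['query'] = new_rows['word']
    # Drop repeated words while the values are still scalars (lists are unhashable)
    new_rows = new_rows.drop_duplicates(subset='word', keep="first")
    # Wrap each scalar back into a one-element list to match the train columns
    for col in ['word', 'label', 'tag']:
        new_rows[col] = new_rows[col].apply(lambda y: [y])

    return pd.concat([train_data, new_rows], ignore_index=True)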

Answered By: member2
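
For comparison, here is a minimal loop-based sketch of the same idea, run on the sample data from the question. It is not part of the answer above: the helper name `append_unseen_words` is hypothetical, the split is fixed by position (rather than `df.sample`) so that it reproduces the question's example, and it assumes the `word`, `label` and `tag` columns hold real Python lists rather than their string representations.

```python
import pandas as pd

# Sample data from the question (the spelling "vaccum" is kept as given).
df = pd.DataFrame({
    "query": ["polish", "angle grinder", "vaccum cleaner"],
    "word": [["polish"], ["angle", "grinder"], ["vaccum", "cleaner"]],
    "label": [["other"], ["other", "other"], ["other", "other"]],
    "tag": [[10], [10, 10], [10, 10]],
})

# Assumption: a fixed positional split that mirrors the question's example.
train_data = df.iloc[:2]
val_data = df.iloc[2:]


def append_unseen_words(train, val):
    """Hypothetical helper: for every validation row that contains a word
    missing from train, add one single-word row per word of that row."""
    train_words = {w for words in train["word"] for w in words}
    rows = []
    seen = set()  # avoids adding the same word twice
    for words, labels, tags in zip(val["word"], val["label"], val["tag"]):
        if all(w in train_words for w in words):
            continue  # nothing new in this query
        for w, lab, tag in zip(words, labels, tags):
            if w not in seen:
                seen.add(w)
                rows.append({"query": w, "word": [w], "label": [lab], "tag": [tag]})
    if not rows:
        return train
    new_rows = pd.DataFrame(rows, columns=train.columns)
    return pd.concat([train, new_rows], ignore_index=True)


result = append_unseen_words(train_data, val_data)
print(result["query"].tolist())
# ['polish', 'angle grinder', 'vaccum', 'cleaner']
```

Like the question's attempt, this adds every word of a qualifying validation row, including words that already occur in train inside a longer query; filter on `train_words` inside the inner loop if only the unseen words should be added.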
