Splitting data into train and validation such that all unique queries are present in train; if not, append them from validation
Question:
I’m trying to split my dataset and append unique words into train if they are not present.
Sample input df:
query           word                   label               tag
polish          ['polish']             ['other']           [10]
angle grinder   ['angle', 'grinder']   ['other', 'other']  [10, 10]
vaccum cleaner  ['vaccum', 'cleaner']  ['other', 'other']  [10, 10]
After splitting, the train split looks like:
query           word                   label               tag
polish          ['polish']             ['other']           [10]
angle grinder   ['angle', 'grinder']   ['other', 'other']  [10, 10]
and the validation split looks like:
query           word                   label               tag
vaccum cleaner  ['vaccum', 'cleaner']  ['other', 'other']  [10, 10]
The query "vaccum cleaner" appears only in the validation split, and I want to append it to train at word level, so that my output will be:
query           word                   label               tag
polish          ['polish']             ['other']           [10]
angle grinder   ['angle', 'grinder']   ['other', 'other']  [10, 10]
vaccum          ['vaccum']             ['other']           [10]
cleaner         ['cleaner']            ['other']           [10]
I have tried the following approach:
train_data = df.sample(frac=1 - 0.15, random_state=20)
val_data = df.drop(index=train_data.index)
val_words = set(word for words in val_data['word'] for word in words)
train_words = set(word for words in train_data['word'] for word in words)
new_words = val_words - train_words
new_rows = []
for index, row in val_data.iterrows():
    words = row['word']
    if any(word in new_words for word in words):
        for word, label, tag in zip(words, row['label'], row['tag']):
            new_rows.append((word, [word], [label], [tag]))
train_data = train_data.append(pd.DataFrame(new_rows, columns=train_data.columns),
                               ignore_index=True)
With this approach, only the first word is appended, and the labels are appended unchanged rather than split per word.
How should I proceed?
Answers:
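import pandas as pd

def check_train_val_split(train_data, val_data):
    # Distinct words on each side of the split.
    val_words = set(val_data.explode('word')['word'])
    train_words = set(train_data.explode('word')['word'])
    new_words = val_words - train_words
    if not new_words:
        return train_data
    # Validation rows containing at least one word that train has not seen.
    new_rows = val_data[val_data['word'].apply(lambda words: any(word in new_words for word in words))]
    # One row per word; multi-column explode requires pandas >= 1.3.
    new_rows = new_rows.explode(['word', 'label', 'tag'], ignore_index=True)
    # A word-level row uses the word itself as its query.
    new_rows['query'] = new_rows['word']
    # Re-wrap each scalar in a one-element list to match the train format.
    for col in ['word', 'label', 'tag']:
        new_rows[col] = new_rows[col].apply(lambda value: [value])
    train_data = pd.concat([train_data, new_rows], ignore_index=True)
    # Lists are unhashable, so deduplicate on a tuple copy of 'word'.
    duplicated = train_data['word'].apply(tuple).duplicated(keep='first')
    return train_data[~duplicated].reset_index(drop=True)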
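To see the idea end to end on the sample data from the question, here is a minimal sketch of a variant that appends only the words train has never seen, rather than every word of a matching row. The helper name append_unseen_words and the fixed iloc split are hypothetical, chosen for illustration only; it assumes pandas >= 1.3 for the multi-column explode.

```python
import pandas as pd


def append_unseen_words(train_data, val_data):
    """Append one word-level row to train for each word seen only in validation."""
    list_cols = ["word", "label", "tag"]
    # One row per word, keeping label and tag aligned (pandas >= 1.3).
    val_long = val_data.explode(list_cols, ignore_index=True)
    train_words = set(train_data["word"].explode())
    # Keep only words train has never seen, one row per distinct word.
    unseen = val_long[~val_long["word"].isin(train_words)]
    unseen = unseen.drop_duplicates(subset="word").copy()
    # A word-level row uses the word itself as its query.
    unseen["query"] = unseen["word"]
    # Re-wrap scalars in one-element lists to match the train layout.
    for col in list_cols:
        unseen[col] = unseen[col].apply(lambda value: [value])
    return pd.concat([train_data, unseen], ignore_index=True)


# Sample data from the question; the split is fixed here for reproducibility.
df = pd.DataFrame({
    "query": ["polish", "angle grinder", "vaccum cleaner"],
    "word": [["polish"], ["angle", "grinder"], ["vaccum", "cleaner"]],
    "label": [["other"], ["other", "other"], ["other", "other"]],
    "tag": [[10], [10, 10], [10, 10]],
})
train, val = df.iloc[:2], df.iloc[2:]
result = append_unseen_words(train, val)
print(result)
```

Deduplicating while the word column still holds plain strings avoids the unhashable-list error that drop_duplicates raises on list-valued columns.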