How to capture any word in TfidfVectorizer with token_pattern
Question:
I’d like TfidfVectorizer to capture every word separated by a space, even words like "0", "a", "x", "0?0" and so on.
I wrote the code below for this purpose.
However, it doesn’t seem to work well.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(smooth_idf=False, token_pattern=r"[^ ]+")
P.S.
I could get the right pattern matching by using \b.
Thanks a lot.
Answers:
You may be looking for word boundaries:
\b\S+\b
Explanation:
- First \b matches a word boundary, that is, a position between a word character (letter, digit or underscore) and a non-word character or the start or end of the string. Here it marks the start of the word.
- \S+ matches one or more non-whitespace characters (the word you are looking for).
- Second \b matches the word boundary at the end of the matched word.
Usage:
For the string: Greetings from Spain
it’d match Greetings, from and Spain
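To see how the candidate patterns differ, here is a minimal sketch using only Python's re module. As far as I know, TfidfVectorizer's default tokenizer applies token_pattern with a regex findall, so re.findall should be a reasonable stand-in. The sample text and the tokenize helper are made up for illustration. One thing to watch: because \b needs a word character on one side, \b\S+\b trims leading or trailing punctuation and skips tokens that consist only of punctuation.

```python
import re

# Hypothetical sample: single characters, a token with inner punctuation,
# a token with trailing punctuation, and a punctuation-only token.
text = "0 a x 0?0 a? ?"


def tokenize(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return re.findall(pattern, text)


# Everything between spaces (tabs and newlines would NOT split tokens here).
print(tokenize(r"[^ ]+", text))    # ['0', 'a', 'x', '0?0', 'a?', '?']

# Everything between any kind of whitespace.
print(tokenize(r"\S+", text))      # ['0', 'a', 'x', '0?0', 'a?', '?']

# Word boundaries on both sides: "a?" is trimmed to "a" and "?" is dropped.
print(tokenize(r"\b\S+\b", text))  # ['0', 'a', 'x', '0?0', 'a']
```

If every whitespace-separated token should be kept, passing token_pattern=r"\S+" to TfidfVectorizer is probably the closest match to the question. Also note that TfidfVectorizer lowercases input by default (lowercase=True), so "X" and "x" end up as the same term unless that is turned off.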