How to use marisa-trie in Python for NLP processing
Question:
I'm working on an NLP function that stores tokens in a trie.
This is my working code for tokenization:
import spacy

def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    print(doc)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and str(word) not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)
Then someone else gave me the following code to store the tokens in a trie, but I can't understand how it works and I can't find enough resources to help me with marisa-trie:
import marisa_trie

def create_trie(thesaurus):
    preprocessed_thesaurus = set()
    for keywords_item in thesaurus:
        preprocessed_thesaurus.update(preprocess_text_spacy(keywords_item))
    thesaurus_trie = marisa_trie.Trie(list(preprocessed_thesaurus))
    return thesaurus_trie
In my context, what is the thesaurus argument supposed to be when calling this function?
What is this function supposed to return as thesaurus_trie?
Once the trie is created, is it possible to show the words stored in it?
Any help will be appreciated.
Answers:
I finally used ChatGPT to answer my question about marisa-trie:
MARISA-trie is a compressed, static trie data structure used for fast and memory-efficient string matching. In Python it is available through the marisa_trie package, which wraps the C++ marisa-trie library.
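To make the idea of a trie concrete, here is a minimal sketch built from plain nested dictionaries. It is only an illustration of prefix sharing and is not how MARISA stores data internally; the names build_trie, contains, keys_with_prefix, and END are hypothetical helpers made up for this example.

```python
END = ""  # marker key for "a stored word ends here"; cannot clash with a character


def build_trie(words):
    """Build a nested-dict trie in which words share their common prefixes."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def contains(trie, word):
    """Return True only if `word` was stored as a complete key."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


def keys_with_prefix(trie, prefix):
    """Return every stored word that starts with `prefix`, sorted."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found = []
    stack = [(node, prefix)]
    while stack:
        current, text = stack.pop()
        for key, child in current.items():
            if key == END:
                found.append(text)
            else:
                stack.append((child, text + key))
    return sorted(found)


toy_trie = build_trie(["apple", "apply", "banana"])
print(contains(toy_trie, "apple"))        # True
print(contains(toy_trie, "app"))          # False: only a prefix, not a stored word
print(keys_with_prefix(toy_trie, "app"))  # ['apple', 'apply']
```

The words "apple" and "apply" share the nodes for "appl", which is where a trie saves space compared with storing every string separately.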
In the code you provided, the create_trie function takes a thesaurus parameter, which is a collection of keyword strings. It passes each keyword through preprocess_text_spacy and collects the resulting tokens in a set called preprocessed_thesaurus, so duplicate tokens are dropped. In this case, the preprocessing lowercases tokens and discards punctuation, whitespace, quotes, brackets, the listed stop words, and most single-character tokens.
Once the preprocessing step is complete, the preprocessed_thesaurus set is converted to a list and passed to the marisa_trie.Trie constructor (Trie is a class, not a method). The constructor takes a collection of strings and builds a static trie, here called thesaurus_trie, that stores those strings compactly and supports fast lookups.
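The construction step can be tried on its own, without spaCy. The sketch below assumes the marisa_trie package is installed and uses a hand-written set of tokens (hypothetical data) in place of the preprocessing output.

```python
import marisa_trie

# Hypothetical tokens, standing in for the output of the preprocessing step.
tokens = {"apple", "banana", "orange"}

# Trie is a class; calling it builds the whole trie at once from the given keys.
demo_trie = marisa_trie.Trie(list(tokens))

print(len(demo_trie))         # number of stored keys
print("banana" in demo_trie)  # membership test
```

Because the trie is built in one go from the full key list, collecting the tokens in a set first is a reasonable way to remove duplicates before construction.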
Finally, create_trie returns thesaurus_trie, which can be used to check efficiently whether a token appears among the keywords of the original thesaurus collection.
To check whether a string is present in thesaurus_trie, use the in operator. Note that thesaurus_trie.get() does not return a boolean: it returns the key's integer ID, or None (or a default you supply) when the key is absent. Since a valid ID can be 0, the result of get() should not be used as a truth value. For example:
thesaurus = ['apple', 'banana', 'orange']
thesaurus_trie = create_trie(thesaurus)
print('apple' in thesaurus_trie)  # True
print('kiwi' in thesaurus_trie)   # False
In this example, create_trie builds a trie from the thesaurus list. The in operator then checks whether the strings 'apple' and 'kiwi' are present in the trie, giving True and False respectively.
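As for showing the words stored in the trie, the keys() method lists them. A minimal sketch, assuming the marisa_trie package is installed and using a hypothetical word list rather than real thesaurus data:

```python
import marisa_trie

# Hypothetical word list used only for illustration.
words = ["apple", "applesauce", "banana", "orange"]
demo_trie = marisa_trie.Trie(words)

# All stored keys; sorted here so the display order is predictable.
all_words = sorted(demo_trie.keys())

# Stored keys that start with a given prefix.
app_words = sorted(demo_trie.keys("app"))

# Stored keys that are themselves prefixes of a given string.
prefix_hits = sorted(demo_trie.prefixes("applesauce"))

# Each key has an integer ID, and restore_key maps an ID back to its key.
apple_id = demo_trie["apple"]
round_trip = demo_trie.restore_key(apple_id)

print(all_words)
print(app_words)
print(prefix_hits)
print(round_trip)
```

With the function from the question, the same calls would be made on the object returned by create_trie, for example thesaurus_trie.keys().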