How to use marisa-trie in Python for NLP processing

Question

I’m working on an NLP function that stores tokens in a trie.

This is my working code for tokenization:

import spacy

# Load the model once; reloading it on every call is slow.
nlp = spacy.load('en_core_web_sm')
stop_words = {"a", "the", "is", "are"}

def preprocess_text_spacy(text):
    tokens = set()
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            # Keep currency symbols such as "$".
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            # Of the other single-character tokens, keep only a zero digit.
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and word.lower_ not in stop_words:
            # Compare the lowercased form so "The" is filtered like "the".
            tokens.add(word.lower_)
    return list(tokens)

Then someone else gave me the following code to store the tokens in a trie, but I can’t understand how it works, and I can’t find enough resources to help me with marisa-trie:

import marisa_trie
def create_trie(thesaurus):
    preprocessed_thesaurus = set()
    for keywords_item in thesaurus:
        preprocessed_thesaurus.update(preprocess_text_spacy(keywords_item))
    thesaurus_trie = marisa_trie.Trie(list(preprocessed_thesaurus))
    return thesaurus_trie

In my context, what is the thesaurus argument supposed to be when calling this function?
What is this function supposed to return as thesaurus_trie?
Once the trie is created, is it possible to show the words stored in it?

Any help will be appreciated.

Asked By: Lilly_Co

Answer 1

I finally used ChatGPT to answer my question about marisa-trie:

marisa-trie is a static, memory-efficient trie data structure used for fast string lookup and prefix search. It is a C++ library, and the marisa_trie package provides Python bindings for it.
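
To make the idea of a trie concrete, here is a minimal sketch built from nested dictionaries. It is only an illustration of the general technique and not how marisa-trie works internally, which uses a far more compact representation. The names build_trie, contains, words_with_prefix, and the END marker are made up for this example.

```python
END = "$end"  # hypothetical marker key: a stored word ends at this node


def build_trie(words):
    """Build a nested-dict trie: one dict level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def contains(trie, word):
    """Return True only if the exact word was stored."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


def words_with_prefix(trie, prefix):
    """Return all stored words that start with prefix, sorted."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found = []
    stack = [(node, prefix)]
    while stack:
        current, text = stack.pop()
        for key, child in current.items():
            if key == END:
                found.append(text)
            else:
                stack.append((child, text + key))
    return sorted(found)


demo = build_trie(["apple", "apply", "banana"])
print(contains(demo, "apple"))         # True
print(contains(demo, "app"))           # False: only a prefix, not a stored word
print(words_with_prefix(demo, "app"))  # ['apple', 'apply']
```

Words that share a prefix share the same path from the root, which is why prefix queries are cheap in a trie.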

In the code you provided, the create_trie function takes a thesaurus parameter, which is an iterable of keyword strings (for example, a list of words or phrases). It runs each item through preprocess_text_spacy and collects the resulting tokens in a set called preprocessed_thesaurus, so duplicates are dropped. Here the preprocessing lowercases the tokens and removes punctuation, whitespace, quotes, brackets, the listed stop words, and most single-character tokens.

Once preprocessing is complete, the preprocessed_thesaurus set is converted to a list and passed to the marisa_trie.Trie constructor, which accepts an iterable of string keys and builds a trie that stores and searches those strings efficiently. The result is assigned to thesaurus_trie.

Finally, create_trie returns thesaurus_trie, a marisa_trie.Trie object. Note that it holds the preprocessed tokens rather than the original thesaurus entries, so a multi-word entry is stored as separate tokens.

To check whether a string is stored in thesaurus_trie, use the in operator, which returns a boolean. The thesaurus_trie.get() method is not a membership test: it returns the integer ID the trie assigned to the key, or None if the key is missing. For example:

thesaurus = ['apple', 'banana', 'orange']
thesaurus_trie = create_trie(thesaurus)
print('apple' in thesaurus_trie)  # True
print('kiwi' in thesaurus_trie)   # False

In this example, the create_trie function builds a trie from the thesaurus list. The in operator then checks whether the strings ‘apple’ and ‘kiwi’ are present in the trie, giving True and False respectively.
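
The question also asked whether the stored words can be displayed. The following is a minimal sketch, assuming the marisa_trie package is installed. To keep it runnable without spaCy, the token list is a made-up stand-in for what create_trie would collect.

```python
import marisa_trie

# Stand-in for the preprocessed tokens (an assumption for this example).
tokens = ["apple", "apricot", "banana", "orange"]
trie = marisa_trie.Trie(tokens)

# Membership test: the `in` operator returns a boolean.
print("apple" in trie)   # True
print("kiwi" in trie)    # False

# Show everything stored in the trie (order is decided by the trie).
all_words = sorted(trie.keys())
print(all_words)         # ['apple', 'apricot', 'banana', 'orange']

# Prefix search: stored keys that start with "ap".
ap_words = sorted(trie.keys("ap"))
print(ap_words)          # ['apple', 'apricot']

# Reverse direction: stored keys that are prefixes of a given string.
print(trie.prefixes("apples"))   # ['apple']

# Each key has an integer ID; get() returns that ID, or None if missing.
key_id = trie["apple"]
print(trie.restore_key(key_id))  # apple
print(trie.get("kiwi"))          # None
```

The object returned by create_trie is the same marisa_trie.Trie type, so thesaurus_trie.keys() would list its tokens in the same way.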

Answered By: Lilly_Co
