Apply named-entity recognition on specific dataframe columns with Polars


I would like to apply a specific function to specific columns using polars similar to the following question:
Apply name-entity recognition on specific dataframe columns

The above question works with pandas, but it takes ages to run on my computer. So, I would like to use Polars.
Taking from the above question:

    df = pd.DataFrame({'source': ['Paul', 'Paul'],
                       'target': ['GOOGLE', 'Ferrari'],
                       'edge': ['works at', 'drive']})

    source  target  edge
0   Paul    GOOGLE  works at
1   Paul    Ferrari drive

Expected outcome with polars:

    source  target  edge      Entity
0   Paul    GOOGLE  works at  Person
1   Paul    Ferrari drive     Person
    !python -m spacy download en_core_web_sm

    import spacy
    nlp = spacy.load('en_core_web_sm')
    df['Entities'] = df['Text'].apply(lambda sent: [ent.label_ for ent in nlp(sent).ents])

How can I add a column with the label (Person) to the current dataframe with Polars?
Thank you.

Asked By: Ozioh



You can run the apply in Polars with the following code:

    entities = pl.col('target').apply(
        lambda sent: [ent.label_ for ent in nlp(sent).ents])

    df = df.with_columns(entities.alias('Entity'))

As @jqurious mentioned, this should not be expected to be faster than pandas; in a couple of tests I ran, it took about the same time.

In addition to the comments by @jqurious, you could reduce the number of times the apply function is called if some values are repeated.

You can do that by wrapping the function with lru_cache:

    from functools import lru_cache
    import spacy
    import polars as pl

    nlp = spacy.load('en_core_web_sm')

    @lru_cache(maxsize=None)
    def cached_nlp(text):
        return nlp(text)

    entities = pl.col('target').apply(
        lambda sent: [ent.label_ for ent in cached_nlp(sent).ents])
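How much the cache helps depends on how many values repeat. A minimal stand-in sketch (`fake_nlp` and the call counter are illustrative, not part of spaCy) shows that the wrapped function only runs once per unique input:

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def fake_nlp(text):
    # stand-in for the expensive nlp() call; counts real invocations
    global calls
    calls += 1
    return text.upper()

results = [fake_nlp(t) for t in ["Paul", "Paul", "Ferrari", "Paul"]]
# "Paul" is processed once; the repeats are served from the cache
print(calls)    # 2 unique inputs -> 2 real calls
print(results)  # ['PAUL', 'PAUL', 'FERRARI', 'PAUL']
```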

Answered By: Luca

Building slightly on @Luca’s answer, you can add the caching one level up to avoid the additional list comprehension and jump straight to the list of entity labels:

Polars syntax shown, but equally applicable to pandas:

    from functools import lru_cache

    @lru_cache(2048)  # << size appropriately
    def entity_labels(s: str) -> list:
        return [ent.label_ for ent in nlp(s).ents]

    df = df.with_columns(
        pl.col('target').apply(
            function = entity_labels,
            return_dtype = pl.List(pl.Utf8),
        ).alias('entities')
    )
Answered By: alexander-beedie

Polars will suffer the same issue as pandas in this case.

Using .apply means you're essentially running a Python for loop.

You can attempt to run the UDF (User-defined function) in parallel with a multiprocessing Pool.

Depending on the particular function/dataset it may or may not offer a speedup, as multiprocessing itself has its own cost; it would have to be measured on a case-by-case basis.

In this case, if I expand your example to 10_000 rows, it runs 4x faster.

    import spacy
    import polars as pl
    from functools import partial
    from multiprocessing import cpu_count, get_context

    def recognize(source, nlp):
        return [ent.label_ for ent in nlp(source).ents]

    if __name__ == "__main__":
        nlp = spacy.load("en_core_web_sm")

        df = pl.DataFrame({
            "source": ["Paul", "Paul"],
            "target": ["GOOGLE", "Ferrari"],
            "edge": ["works at", "drive"]
        })

        func = partial(recognize, nlp=nlp)  # used to pass in `nlp`
        n_workers = cpu_count() // 2  # experiment with value

        with get_context("spawn").Pool(n_workers) as pool:
            df = df.with_columns(Entity = pl.Series(pool.map(func, df.get_column("source"))))

        print(df)
shape: (2, 4)
┌────────┬─────────┬──────────┬────────────┐
│ source ┆ target  ┆ edge     ┆ Entity     │
│ ---    ┆ ---     ┆ ---      ┆ ---        │
│ str    ┆ str     ┆ str      ┆ list[str]  │
╞════════╪═════════╪══════════╪════════════╡
│ Paul   ┆ GOOGLE  ┆ works at ┆ ["PERSON"] │
│ Paul   ┆ Ferrari ┆ drive    ┆ ["PERSON"] │
└────────┴─────────┴──────────┴────────────┘
Answered By: jqurious