How to speed up language-tool-python library use case

Question:

I have a pandas dataframe with 3 million rows of social media comments. I’m using the language-tool-python library to find the number of grammatical errors in each comment. As far as I know, the language-tool library by default sets up a local LanguageTool server on your machine and queries responses from that.

Getting the number of grammatical errors just consists of creating an instance of the language tool object and calling its .check() method with the string you want to check as a parameter.

>>> import language_tool_python
>>> tool = language_tool_python.LanguageTool('en-US')
>>> text = 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'
>>> matches = tool.check(text)
>>> len(matches)
2

So the method I used is df['body_num_errors'] = df['body'].apply(lambda row: len(tool.check(row))). Now I am pretty sure this works; it’s quite straightforward. This single line of code has been running for the past hour.

Running the above example took 10-20 seconds, so with 3 million instances it might as well take virtually forever.

Is there any way I can cut my losses and speed this process up? Would iterating over every row and putting the whole thing inside a ThreadPoolExecutor help? Intuitively that makes sense to me, since it’s an I/O-bound task.

I am open to any suggestions on how to speed up this process, and if the above method works, I would appreciate it if someone could show me some sample code.

edit – Correction.

The 10-20 seconds includes instantiation; calling the method itself is almost instantaneous.

Asked By: Fardin Ahsan


Answers:

If you are worried about scaling up with pandas, switch to Dask instead. It integrates with pandas and will use multiple cores of your CPU, which I am assuming you have, instead of the single core that pandas uses. This helps parallelize the 3 million instances and can speed up your execution time. You can read more in the Dask documentation.
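A minimal sketch of what that could look like with dask.dataframe, assuming your frame is df with a 'body' column; the partition count and the per-partition server setup are assumptions you would tune for your machine:

import dask.dataframe as dd
import language_tool_python

def count_errors(partition):
    # assumption: start one local LanguageTool server per partition so each
    # worker talks to its own server rather than sharing a single instance
    tool = language_tool_python.LanguageTool('en-US')
    try:
        return partition['body'].map(lambda text: len(tool.check(text)))
    finally:
        tool.close()

ddf = dd.from_pandas(df, npartitions=16)  # assumption: tune npartitions to your core count
df['body_num_errors'] = ddf.map_partitions(
    count_errors, meta=('body_num_errors', 'int64')
).compute()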

Answered By: Taslim

I’m the creator of language_tool_python. First, none of the comments here make sense. The bottleneck is in tool.check(); there is nothing slow about using pd.DataFrame.map().

LanguageTool is running on a local server on your machine. There are at least two major ways to speed this up:

Method 1: Initialize multiple servers

import language_tool_python

# each LanguageTool() call starts its own local server process
servers = []
for i in range(100):
    servers.append(language_tool_python.LanguageTool('en-US'))

Then call each server from a different thread, or alternatively initialize each server within its own thread; see the sketch below.
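A minimal sketch of the first variant using concurrent.futures.ThreadPoolExecutor; the server count of 8 and the 'body' column name are assumptions to adapt to your own data:

import language_tool_python
from concurrent.futures import ThreadPoolExecutor

N_SERVERS = 8  # assumption: tune to your CPU and RAM; each server is its own process

servers = [language_tool_python.LanguageTool('en-US') for _ in range(N_SERVERS)]

def count_errors(args):
    i, text = args
    # spread requests round-robin across the servers; occasional overlap is
    # fine, since each local LanguageTool server handles concurrent requests
    return len(servers[i % N_SERVERS].check(text))

with ThreadPoolExecutor(max_workers=N_SERVERS) as executor:
    df['body_num_errors'] = list(executor.map(count_errors, enumerate(df['body'])))

for server in servers:
    server.close()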

Method 2: Increase the thread count

LanguageTool takes a maxCheckThreads option – see the LT HTTPServerConfig documentation – so you could also try playing around with that? From a glance at LanguageTool’s source code, it looks like the default number of threads in a single LanguageTool server is 10.
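If your installed version of language_tool_python supports a config argument (recent releases forward it to the server's HTTPServerConfig; treat that as an assumption and check your version), that might look like:

import language_tool_python

# assumption: the installed language_tool_python version accepts `config`
# and forwards maxCheckThreads to the underlying LanguageTool server
tool = language_tool_python.LanguageTool('en-US', config={'maxCheckThreads': 20})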

Answered By: jxmorris12