
Applying Spacy Parser to Pandas DataFrame w/ Multiprocessing


Spacy is highly optimised and does the multiprocessing for you. As a result, I think your best bet is to take the data out of the DataFrame and pass it to the Spacy pipeline as a list, rather than trying to use .apply directly.

You then need to collate the results of the parse and put them back into the DataFrame.
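Concretely, "collating" here just means building one Python list per attribute with exactly one entry per input row, so each list can be assigned as a new column afterwards. A stdlib-only sketch of that alignment logic (the `parse` function below is a hypothetical stand-in for the spaCy pipeline, not part of the original answer):

```python
def parse(text):
    # Hypothetical stand-in for a spaCy parse:
    # returns (tokens, lemmas) on success, or None on failure.
    if not text:
        return None
    tokens = text.split()
    return tokens, [t.lower() for t in tokens]

rows = ["Panthera leo", "", "Canis lupus"]  # one entry per DataFrame row
tokens_col, lemma_col = [], []
for text in rows:
    result = parse(text)
    if result is not None:
        tokens_col.append(result[0])
        lemma_col.append(result[1])
    else:
        # Keep the lists aligned with the rows even when a parse fails.
        tokens_col.append(None)
        lemma_col.append(None)

# Each list now has len(rows) entries, so an assignment like
# df['species_tokens'] = tokens_col lines up row for row.
```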

So, in your example, you could use something like:

tokens = []
lemma = []
pos = []

for doc in nlp.pipe(df['species'].astype('unicode').values, batch_size=50,
                    n_threads=3):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        # We want to make sure that the lists of parsed results have the
        # same number of entries as the original DataFrame, so add some
        # blanks in case the parse fails
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

df['species_tokens'] = tokens
df['species_lemma'] = lemma
df['species_pos'] = pos

This approach will work fine on small datasets, but it keeps every parsed document in memory, so it is not great if you want to process huge amounts of text.
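If memory is a concern, one option (a sketch of my own, not something the answer above does) is to feed the texts to the pipeline in fixed-size chunks and write each chunk's results out before moving on. Only the stdlib batching helper is executed here; the spaCy call is left as a hedged comment, since the exact pipeline arguments depend on your spaCy version:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical usage with spaCy (not executed here):
# for batch in chunked(df['species'].astype('unicode').values, 1000):
#     for doc in nlp.pipe(batch, batch_size=50):
#         ...  # collect tokens/lemmas/POS and flush this chunk to disk
```

This keeps the working set bounded by the chunk size rather than the full corpus.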