Using SpanMarker with spaCy¶

SpanMarker is an accessible yet powerful Python module for training Named Entity Recognition models.

In this short notebook, we’ll have a look at using pretrained SpanMarker models with spaCy.

Setup¶

First of all, both the spacy and span_marker Python modules need to be installed. After that, we also need to download a spaCy model. We’ll choose the smallest English one for now: en_core_web_sm.

[ ]:
%pip install span_marker spacy
!spacy download en_core_web_sm

Using spaCy for Named Entity Recognition¶

We’ll start with plain spaCy NER, which makes it easier to see what has to change to use a SpanMarker model instead.

[2]:
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Feed some text through the model to get a spacy Doc
text = """Cleopatra VII, also known as Cleopatra the Great, was the last active ruler of the \
Ptolemaic Kingdom of Egypt. She was born in 69 BCE and ruled Egypt from 51 BCE until her \
death in 30 BCE."""
doc = nlp(text)

# And look at the entities
doc.ents
[2]:
(Cleopatra the Great,
 the Ptolemaic Kingdom of Egypt,
 69,
 BCE,
 Egypt,
 51,
 BCE,
 30,
 BCE)

The spaCy module comes with a built-in visualizer, displaCy, that makes these entities easier to inspect. Let’s use it.

[3]:
from spacy import displacy

displacy.render(doc, style="ent")
Cleopatra VII, also known as Cleopatra the Great WORK_OF_ART , was the last active ruler of the Ptolemaic Kingdom of Egypt GPE . She was born in 69 CARDINAL BCE ORG and ruled Egypt GPE from 51 CARDINAL BCE ORG until her death in 30 CARDINAL BCE ORG .

Not quite ideal. This spaCy model misses Cleopatra VII, labels Cleopatra the Great as a work of art, and splits every date into a cardinal number and an organisation.
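To make these errors concrete, the predictions can be tallied by label. The sketch below is only illustrative: the (text, label) pairs are transcribed by hand from the displaCy output above rather than read from a live pipeline, and the variable names are made up for this example.

```python
from collections import Counter

# (text, label) pairs transcribed from the displaCy output above.
# With a live pipeline they could be collected with:
#     [(ent.text, ent.label_) for ent in doc.ents]
predicted = [
    ("Cleopatra the Great", "WORK_OF_ART"),
    ("the Ptolemaic Kingdom of Egypt", "GPE"),
    ("69", "CARDINAL"),
    ("BCE", "ORG"),
    ("Egypt", "GPE"),
    ("51", "CARDINAL"),
    ("BCE", "ORG"),
    ("30", "CARDINAL"),
    ("BCE", "ORG"),
]

# Count how often each label was predicted.
label_counts = Counter(label for _, label in predicted)
for label, count in label_counts.most_common():
    print(f"{label:<12} {count}")

# Neither of the labels we would hope for shows up at all.
has_person = "PERSON" in label_counts
has_date = "DATE" in label_counts
print("PERSON predicted:", has_person)
print("DATE predicted:", has_date)
```

The three dates account for six of the nine predictions, and no span is labelled as a person or a date.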

Using SpanMarker models for Named Entity Recognition with spaCy¶

We can easily add a SpanMarker model as a drop-in replacement for the original spaCy NER pipeline. It takes just one line of code.

[ ]:
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})

The model key in the configuration refers to tomaarsen/span-marker-roberta-large-ontonotes5, a model trained on OntoNotes v5.0, the same dataset that is used by the original spaCy NER pipeline. The spaCy integration API reference has more documentation on the configuration options. Let’s try out the updated spaCy pipeline.

[5]:
# All we have to do is process the text using the updated spaCy pipeline
doc = nlp(text)

print(doc.ents)

displacy.render(doc, style="ent")
(Cleopatra VII, Cleopatra the Great, the Ptolemaic Kingdom of Egypt, 69 BCE, Egypt, 51 BCE, 30 BCE)
Cleopatra VII PERSON , also known as Cleopatra the Great PERSON , was the last active ruler of the Ptolemaic Kingdom of Egypt GPE . She was born in 69 BCE DATE and ruled Egypt GPE from 51 BCE DATE until her death in 30 BCE DATE .

Much better!

But what if we want a model with a different set of labels? This integration works with any SpanMarker model on the Hugging Face Hub, so we can simply pick another one. This time we’ll also keep the model on the CPU, just to see how that works, and we’ll overwrite the entities from spaCy’s own NER model. Overwriting is recommended when the SpanMarker model uses a different label scheme than spaCy, whose models use the labels from OntoNotes v5.

[6]:
nlp.remove_pipe("span_marker")
nlp.add_pipe(
    "span_marker",
    config={
        "model": "tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super",
        "device": "cpu",
        "overwrite_entities": True,
    },
)

doc = nlp(text)
print(doc.ents)
displacy.render(doc, style="ent")

SpanMarker model predictions are being computed on the CPU while CUDA is available. Moving the model to CUDA using `model.cuda()` before performing predictions is heavily recommended to significantly boost prediction speeds.
(Cleopatra VII, Cleopatra the Great, Egypt, Egypt)
Cleopatra VII person-politician , also known as Cleopatra the Great person-politician , was the last active ruler of the Ptolemaic Kingdom of Egypt location-GPE . She was born in 69 BCE and ruled Egypt location-GPE from 51 BCE until her death in 30 BCE.
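The warning in the output above appears because the pipeline was pinned to the CPU on a machine where CUDA is available. One way to avoid hard-coding the device is to choose it at runtime. This is a minimal sketch, not part of SpanMarker or spaCy: pick_device is a hypothetical helper, and passing "cuda" for the device option is an assumption based on how "cpu" is passed above.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when a GPU is wanted and usable, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # normally present, since span_marker depends on it
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Same configuration keys as in the cell above; only the device is computed.
# Assumption: the "device" option accepts "cuda" just as it accepts "cpu".
config = {
    "model": "tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super",
    "device": pick_device(),
    "overwrite_entities": True,
}
print(config["device"])

# The config would then be passed on exactly as before:
#     nlp.add_pipe("span_marker", config=config)
```

Calling pick_device(prefer_gpu=False) reproduces the CPU-only behaviour from the cell above.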

Summary¶

To summarize, using SpanMarker with spaCy is as simple as this:

[7]:
import spacy

nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe("span_marker")

text = "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."
doc = nlp(text)

[(entity, entity.label_) for entity in doc.ents]
[7]:
[(Amelia Earhart, 'PERSON'),
 (Lockheed, 'ORG'),
 (Vega 5B, 'PRODUCT'),
 (Atlantic, 'LOC'),
 (Paris, 'GPE')]
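A common next step is to turn the entities into plain records that can be stored or serialized. The sketch below shows one possible shape for that. ents_to_records and fake_ent are hypothetical helpers, and the snippet uses a small stand-in object instead of a real Doc so that it runs without downloading a model; with the pipeline above you would pass the real doc instead. The only attributes it relies on are ents, text, label_, start_char and end_char, which spaCy spans provide.

```python
import json
from types import SimpleNamespace


def ents_to_records(doc):
    """Convert the entities of a Doc-like object into JSON-friendly dicts."""
    return [
        {
            "text": ent.text,
            "label": ent.label_,
            "start_char": ent.start_char,
            "end_char": ent.end_char,
        }
        for ent in doc.ents
    ]


text = "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."


def fake_ent(surface, label):
    """Build a stand-in for a spaCy Span, with offsets computed from the text."""
    start = text.index(surface)
    return SimpleNamespace(
        text=surface,
        label_=label,
        start_char=start,
        end_char=start + len(surface),
    )


# Stand-in for the Doc produced by nlp(text); two entities from the output above.
fake_doc = SimpleNamespace(
    ents=[fake_ent("Amelia Earhart", "PERSON"), fake_ent("Paris", "GPE")]
)

records = ents_to_records(fake_doc)
print(json.dumps(records, indent=2))
```

Because the records carry character offsets, each entity can be recovered from the original text with text[start_char:end_char].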