Cleopatra VII, also known as \n",
"\n",
" Cleopatra the Great\n",
" WORK_OF_ART\n",
"\n",
", was the last active ruler of \n",
"\n",
" the Ptolemaic Kingdom of Egypt\n",
" GPE\n",
"\n",
". She was born in \n",
"\n",
" 69\n",
" CARDINAL\n",
"\n",
" \n",
"\n",
" BCE\n",
" ORG\n",
"\n",
" and ruled \n",
"\n",
" Egypt\n",
" GPE\n",
"\n",
" from \n",
"\n",
" 51\n",
" CARDINAL\n",
"\n",
" \n",
"\n",
" BCE\n",
" ORG\n",
"\n",
" until her death in \n",
"\n",
" 30\n",
" CARDINAL\n",
"\n",
" \n",
"\n",
" BCE\n",
" ORG\n",
"\n",
".
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from spacy import displacy\n",
"\n",
"displacy.render(doc, style=\"ent\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Not quite ideal. This spaCy model misses `Cleopatra VII`, labels `Cleopatra the Great` as a `WORK_OF_ART`, and splits every date into a `CARDINAL` (the year) and an `ORG` (`BCE`) instead of tagging it as a single `DATE`."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using SpanMarker models for Named Entity Recognition with spaCy\n",
"We can add a SpanMarker model as a drop-in replacement for the original spaCy NER pipeline. It takes just one line of code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nlp.add_pipe(\"span_marker\", config={\"model\": \"tomaarsen/span-marker-roberta-large-ontonotes5\"})"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `model` configuration option refers to [tomaarsen/span-marker-roberta-large-ontonotes5](https://huggingface.co/tomaarsen/span-marker-roberta-large-ontonotes5), a SpanMarker model trained on OntoNotes v5.0, the same dataset that is used by the original spaCy NER pipeline. The [spaCy integration API reference](https://tomaarsen.github.io/SpanMarkerNER/api/span_marker.spacy_integration.html) documents the other configuration options. Let's try out the updated spaCy pipeline."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(Cleopatra VII, Cleopatra the Great, the Ptolemaic Kingdom of Egypt, 69 BCE, Egypt, 51 BCE, 30 BCE)\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" Cleopatra VII\n",
" PERSON\n",
"\n",
", also known as \n",
"\n",
" Cleopatra the Great\n",
" PERSON\n",
"\n",
", was the last active ruler of \n",
"\n",
" the Ptolemaic Kingdom of Egypt\n",
" GPE\n",
"\n",
". She was born in \n",
"\n",
" 69 BCE\n",
" DATE\n",
"\n",
" and ruled \n",
"\n",
" Egypt\n",
" GPE\n",
"\n",
" from \n",
"\n",
" 51 BCE\n",
" DATE\n",
"\n",
" until her death in \n",
"\n",
" 30 BCE\n",
" DATE\n",
"\n",
"."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# All we have to do is process the text using the updated spaCy pipeline\n",
"doc = nlp(text)\n",
"\n",
"print(doc.ents)\n",
"\n",
"displacy.render(doc, style=\"ent\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Much better!\n",
"\n",
"But what if we don't want to use a model with these labels? This integration works with any [SpanMarker model on the Hugging Face Hub](https://huggingface.co/models?library=span-marker), so we can just pick another one. This time we also set `device` to `cpu` to keep the model on the CPU, just to see how that works, and set `overwrite_entities` to `True` so that the SpanMarker predictions replace the entities from spaCy's own NER model. Overwriting is recommended when the SpanMarker model uses a different label scheme than spaCy, which uses the labels from OntoNotes v5."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"SpanMarker model predictions are being computed on the CPU while CUDA is available. Moving the model to CUDA using `model.cuda()` before performing predictions is heavily recommended to significantly boost prediction speeds.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(Cleopatra VII, Cleopatra the Great, Egypt, Egypt)\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" Cleopatra VII\n",
" person-politician\n",
"\n",
", also known as \n",
"\n",
" Cleopatra the Great\n",
" person-politician\n",
"\n",
", was the last active ruler of the Ptolemaic Kingdom of \n",
"\n",
" Egypt\n",
" location-GPE\n",
"\n",
". She was born in 69 BCE and ruled \n",
"\n",
" Egypt\n",
" location-GPE\n",
"\n",
" from 51 BCE until her death in 30 BCE."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"nlp.remove_pipe(\"span_marker\")\n",
"nlp.add_pipe(\n",
" \"span_marker\",\n",
" config={\n",
" \"model\": \"tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super\",\n",
" \"device\": \"cpu\",\n",
" \"overwrite_entities\": True,\n",
" },\n",
")\n",
"\n",
"doc = nlp(text)\n",
"print(doc.ents)\n",
"displacy.render(doc, style=\"ent\")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary\n",
"To summarize, using SpanMarker with spaCy is as simple as this:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(Amelia Earhart, 'PERSON'),\n",
" (Lockheed, 'ORG'),\n",
" (Vega 5B, 'PRODUCT'),\n",
" (Atlantic, 'LOC'),\n",
" (Paris, 'GPE')]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import spacy\n",
"\n",
"nlp = spacy.load(\"en_core_web_sm\", exclude=[\"ner\"])\n",
"nlp.add_pipe(\"span_marker\")\n",
"\n",
"text = \"Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.\"\n",
"doc = nlp(text)\n",
"\n",
"[(entity, entity.label_) for entity in doc.ents]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "span-marker-ner",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}