SpanMarker for Named Entity Recognition

SpanMarker is a framework for training powerful Named Entity Recognition models using familiar encoders such as BERT, RoBERTa, and ELECTRA. Because it is tightly implemented on top of the 🤗 Transformers library, SpanMarker can take advantage of that library's valuable functionality.

Based on the PL-Marker paper, SpanMarker breaks the mold through its accessibility and ease of use. Crucially, SpanMarker works out of the box with many common encoders such as bert-base-cased and roberta-large, and automatically handles datasets annotated with the IOB, IOB2, BIOES, or BILOU scheme, as well as datasets with no annotation scheme at all, as sketched below.
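
For example, labels with B-/I- prefixes are detected and collapsed into span-level labels automatically. A minimal sketch; the IOB2 label names below are illustrative rather than taken from a specific dataset:

from span_marker import SpanMarkerModel

# IOB2-style labels: SpanMarker normalizes the B-/I- prefixes away,
# leaving span-level labels such as PER, ORG, and LOC.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
model = SpanMarkerModel.from_pretrained("bert-base-cased", labels=labels)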

Check out all publicly available SpanMarker models on the Hugging Face Hub. Alternatively, pick a model from this list of particularly useful ones:

| Model ID | Domain | Label Count | Language |
|----------|--------|-------------|----------|
| tomaarsen/span-marker-mbert-base-multinerd, lxyuan/span-marker-bert-base-multilingual-uncased-multinerd, lxyuan/span-marker-bert-base-multilingual-cased-multinerd | General | 15 | Multilingual |
| tomaarsen/span-marker-bert-base-fewnerd-fine-super, tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super | General | 66 | English, Multilingual |
| tomaarsen/span-marker-bert-base-cross-ner, tomaarsen/span-marker-bert-base-uncased-cross-ner | General | 39 | English |
| tomaarsen/span-marker-roberta-large-ontonotes5 | General | 18 | English |
| tomaarsen/span-marker-bert-base-uncased-keyphrase-inspec | Keyphrases | 1 | English |
| tomaarsen/span-marker-bert-base-acronyms, tomaarsen/span-marker-bert-base-uncased-acronyms | Acronyms | 2 | English |
| tomaarsen/span-marker-bert-base-ncbi-disease, tomaarsen/span-marker-bert-base-uncased-bionlp | Biomedical | 1, 5 | English |
| stefan-it/span-marker-gelectra-large-germeval14, gwlms/span-marker-teams-germeval14, gwlms/span-marker-token-dropping-bert-germeval14, gwlms/span-marker-bert-germeval14 | General | 12 | German |

Context

I have developed this library as part of my thesis work at Argilla. Feel free to ⭐ star or watch the SpanMarker repository to get notified when my thesis is published.

Quick Reference

How to Train

from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer
from transformers import TrainingArguments

# The dataset labels can use a tagging scheme (IOB, IOB2, BIOES),
# but that is not required. This dataset has no tagging scheme:
dataset = load_dataset("DFKI-SLT/few-nerd", "supervised")
labels = ["O", "art", "building", "event", "location", "organization", "other", "person", "product"]

# Initialize a SpanMarkerModel using an encoder, e.g. BERT, and the labels:
encoder_id = "bert-base-cased"
model = SpanMarkerModel.from_pretrained(encoder_id, labels=labels)

# See the 🤗 TrainingArguments documentation for more details
args = TrainingArguments(
    output_dir="my_span_marker_model",
    learning_rate=5e-5,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=200,
    logging_steps=50,
    warmup_ratio=0.1,
)

# Our Trainer subclasses the 🤗 Trainer, and the usage is very similar
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"].select(range(8000)),
    eval_dataset=dataset["validation"].select(range(2000)),
)

# Training is really simple using our Trainer!
trainer.train()

# ... and so is evaluating!
metrics = trainer.evaluate()
print(metrics)

# Save the model locally
trainer.save_model("my_span_marker_model/checkpoint-final")
# ... or push it to the 🤗 Hugging Face Hub
model.push_to_hub("my_span_marker_model")

See Initializing & Training for more details, or check out the documentation for SpanMarkerModel, Trainer, load_dataset(), or TrainingArguments.
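
Rather than hard-coding the label list, you can often read it directly from the dataset features. A minimal sketch, assuming the labels are stored as a Sequence of ClassLabel values in a column named "ner_tags" (the case for DFKI-SLT/few-nerd, but worth verifying for other datasets):

from datasets import load_dataset

dataset = load_dataset("DFKI-SLT/few-nerd", "supervised")
# The ClassLabel feature stores the string name of every integer tag
labels = dataset["train"].features["ner_tags"].feature.names
print(labels)  # ['O', 'art', 'building', ...]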

How to Predict

from span_marker import SpanMarkerModel

# Load a finetuned SpanMarkerModel from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")

# It is recommended to explicitly move the model to CUDA for faster inference, if possible
model.cuda()

model.predict("A prototype was fitted in the mid-'60s in a one-off DB5 extended 4'' after the doors and driven by Marek personally, and a normally 6-cylinder Aston Martin DB7 was equipped with a V8 unit in 1998.")
[{'span': 'DB5', 'label': 'product-car', 'score': 0.8675689101219177, 'char_start_index': 52, 'char_end_index': 55},
 {'span': 'Marek', 'label': 'person-other', 'score': 0.9100819230079651, 'char_start_index': 99, 'char_end_index': 104},
 {'span': 'Aston Martin DB7', 'label': 'product-car', 'score': 0.9931442737579346, 'char_start_index': 143, 'char_end_index': 159}]

Note

You can also load a locally saved model through SpanMarkerModel.from_pretrained("path/to/model"), much like in 🤗 Transformers.

See Loading & Inferencing for more details, or check out the documentation for SpanMarkerModel or predict().
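
predict() also accepts a list of sentences for batch inference. A brief sketch; the batch_size argument is an assumption on my part, so check the predict() documentation before relying on it:

from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")

sentences = [
    "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.",
    "The new series was filmed on location in Vancouver.",
]
# One list of entity dictionaries is returned per input sentence
for entities in model.predict(sentences, batch_size=4):
    print(entities)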

How to Save a Model

Locally

model.save_pretrained("my_model_dir")

See the documentation for save_pretrained() for more details.
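
Saving locally and loading back round-trips cleanly; a brief sketch combining save_pretrained() with the local-path loading shown in the Note above:

from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
model.save_pretrained("my_model_dir")

# Later, restore the model from the local directory
model = SpanMarkerModel.from_pretrained("my_model_dir")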

To the 🤗 Hub

model_id = "span-marker-bert-base-fewnerd-fine-super"
model.push_to_hub(model_id)

See the documentation for push_to_hub() for more details.