Document-level context¶

SpanMarker is an accessible yet powerful Python module for training Named Entity Recognition models.

In this tutorial, I’ll show you how to train SpanMarker models and run inference with them using document-level context to improve performance.

Many approaches to NER process individual sentences completely independently of one another, even if the sentences originate from the same document. Although this works fine, research has shown that including additional contextual information (i.e. the previous and next sentence(s)) improves the performance of the model. In my own experiments of SpanMarker with CoNLL03, including this document-level contextual information improves the model from a mean F1 of 92.9±0.0 to a mean F1 of 94.1±0.1.

Document-level context in SpanMarker¶

SpanMarker is designed to require only slight changes in the input data to allow for document-level context during training, evaluation and inference. In particular, the only required change is that the input must now be a Dataset with document_id and sentence_id columns.

Training and evaluating¶

For training and evaluation, the dataset must now contain tokens, ner_tags, document_id and sentence_id columns. I’ve prepared two datasets (tomaarsen/conll2003, tomaarsen/conllpp) that I’ve used to train some models. We will have a look at the former to get a feel for how these values are used.

[21]:
from datasets import load_dataset, Dataset

# Load the dataset from the Hub and throw away the non-NER columns
dataset = load_dataset("tomaarsen/conll2003", split="train").remove_columns(["id", "chunk_tags", "pos_tags"])
dataset
[21]:
Dataset({
    features: ['document_id', 'sentence_id', 'tokens', 'ner_tags'],
    num_rows: 14041
})

Let’s have a quick look at the data itself.

[22]:
dataset.select(range(30)).to_pandas()
[22]:
document_id sentence_id tokens ner_tags
0 1 0 [EU, rejects, German, call, to, boycott, Briti... [3, 0, 7, 0, 0, 0, 7, 0, 0]
1 1 1 [Peter, Blackburn] [1, 2]
2 1 2 [BRUSSELS, 1996-08-22] [5, 0]
3 1 3 [The, European, Commission, said, on, Thursday... [0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, ...
4 1 4 [Germany, 's, representative, to, the, Europea... [5, 0, 0, 0, 0, 3, 4, 0, 0, 0, 1, 2, 0, 0, 0, ...
5 1 5 [", We, do, n't, support, any, such, recommend... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
6 1 6 [He, said, further, scientific, study, was, re... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
7 1 7 [He, said, a, proposal, last, month, by, EU, F... [0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 1, 2, 0, 0, 0, ...
8 1 8 [Fischler, proposed, EU-wide, measures, after,... [1, 0, 7, 0, 0, 0, 0, 5, 0, 5, 0, 0, 0, 0, 0, ...
9 1 9 [But, Fischler, agreed, to, review, his, propo... [0, 1, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, ...
10 1 10 [Spanish, Farm, Minister, Loyola, de, Palacio,... [7, 0, 0, 1, 2, 2, 0, 0, 0, 1, 0, 0, 3, 0, 0, ...
11 1 11 [.] [0]
12 1 12 [Only, France, and, Britain, backed, Fischler,... [0, 5, 0, 5, 0, 1, 0, 0, 0]
13 1 13 [The, EU, 's, scientific, veterinary, and, mul... [0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
14 1 14 [Sheep, have, long, been, known, to, contract,... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, ...
15 1 15 [British, farmers, denied, on, Thursday, there... [7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
16 1 16 [", What, we, have, to, be, extremely, careful... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
17 1 17 [Bonn, has, led, efforts, to, protect, public,... [5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
18 1 18 [Germany, imported, 47,600, sheep, from, Brita... [5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0]
19 1 19 [It, brought, in, 4,275, tonnes, of, British, ... [0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0]
20 2 0 [Rare, Hendrix, song, draft, sells, for, almos... [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
21 2 1 [LONDON, 1996-08-22] [5, 0]
22 2 2 [A, rare, early, handwritten, draft, of, a, so... [0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 1, 2, 0, ...
23 2 3 [A, Florida, restaurant, paid, 10,925, pounds,... [0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
24 2 4 [At, the, end, of, a, January, 1967, concert, ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 5, 0, ...
25 2 5 [Buyers, also, snapped, up, 16, other, items, ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
26 2 6 [They, included, a, black, lacquer, and, mothe... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...
27 2 7 [The, guitarist, died, of, a, drugs, overdose,... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
28 3 0 [China, says, Taiwan, spoils, atmosphere, for,... [5, 0, 5, 0, 0, 0, 0, 0]
29 3 1 [BEIJING, 1996-08-22] [5, 0]

As you can see, the document_id and sentence_id columns contain integers. The former identifies which document a sentence belongs to, while the latter indicates the position of the sentence within that document. Internally, SpanMarker will include adjacent sentences originating from the same document as contextual information. In the SpanMarker configuration, you can set max_prev_context and max_next_context to limit the number of previous or next sentences that are included as context. By default, these are set to None, allowing as much context to be included as is available until the maximum token length is reached. In practice, these settings are defined like so:

[ ]:
from span_marker import SpanMarkerModel

# An example encoder and example labels
model = SpanMarkerModel.from_pretrained(
    "prajjwal1/bert-tiny",  # Example encoder
    labels=[  # Example labels
        "O",
        "PER",
        "LOC",
    ],
    max_prev_context=2,
    max_next_context=2,
)

Training using this dataset works exactly as it would if the document_id and sentence_id columns did not exist. See the Model Training tutorial for more information on how to do that. See also the Trainer documentation.

Inference¶

For inference, the inputs to model.predict must also contain document_id and sentence_id columns, alongside a tokens column that includes either string sentences or lists of tokens. Let’s consider some sample data:

[24]:
# For simplicity, this data is already split into sentences.
# You can use various tools to do this, e.g. spaCy senter or NLTK sent_tokenize
document_one = [
    "Cleopatra VII (70/69 BC - 10 August 30 BC) was Queen of the Ptolemaic Kingdom of Egypt from 51 to 30 BC, and its last active ruler.",
    "A member of the Ptolemaic dynasty, she was a descendant of its founder Ptolemy I Soter, a Macedonian Greek general and companion of Alexander the Great.",
    "After the death of Cleopatra, Egypt became a province of the Roman Empire, marking the end of the last Hellenistic state in the Mediterranean and of the age that had lasted since the reign of Alexander (336-323 BC).",
]

document_two = [
    "The 35-year-old led his country to the 2022 World Cup title in Qatar last year, arguably the crowning triumph in one of the greatest football careers.",
    "And on Thursday, Messi enjoyed another landmark moment by scoring his fastest ever goal.",
    "Messi curled home an exquisite left-footed strike from the edge of the box just 79 seconds into Argentina's friendly against Australia in Beijing - the quickest of his professional career, per South American football's governing body, CONMEBOL.",
]

document_three = [
    "UK firms could gain access to US green funding as part of plans to boost UK and US ties announced by Rishi Sunak and Joe Biden.",
    "The pair unveiled the Atlantic Declaration, to strengthen economic ties between the two countries, at a White House press conference.",
    "The PM said the agreement, which falls short of a full trade deal would bring benefits \"as quickly as possible\".",
    "UK electric car firms may get access to US green tax credits and subsidies.",
    "As the pair unveiled their partnership to bolster economic security, Mr Sunak said the UK-US relationship was an \"indispensable alliance\"."
]

documents = [document_one, document_two, document_three]

Now we have to preprocess this data to generate the document_id and sentence_id columns.

[25]:
data_dict = {
    "tokens": [],
    "document_id": [],
    "sentence_id": [],
}
for document_id, document in enumerate(documents):
    for sentence_id, sentence in enumerate(document):
        data_dict["document_id"].append(document_id)
        data_dict["sentence_id"].append(sentence_id)
        data_dict["tokens"].append(sentence)
dataset = Dataset.from_dict(data_dict)
dataset
[25]:
Dataset({
    features: ['tokens', 'document_id', 'sentence_id'],
    num_rows: 11
})
[31]:
dataset.to_pandas()
[31]:
tokens document_id sentence_id
0 Cleopatra VII (70/69 BC - 10 August 30 BC) was... 0 0
1 A member of the Ptolemaic dynasty, she was a d... 0 1
2 After the death of Cleopatra, Egypt became a p... 0 2
3 The 35-year-old led his country to the 2022 Wo... 1 0
4 And on Thursday, Messi enjoyed another landmar... 1 1
5 Messi curled home an exquisite left-footed str... 1 2
6 UK firms could gain access to US green funding... 2 0
7 The pair unveiled the Atlantic Declaration, to... 2 1
8 The PM said the agreement, which falls short o... 2 2
9 UK electric car firms may get access to US gre... 2 3
10 As the pair unveiled their partnership to bols... 2 4

We can immediately pass this dataset to SpanMarkerModel.predict, and SpanMarker will add the document-level context for you under the hood. Note that the dataset does not need to be sorted. See also the SpanMarkerModel.predict documentation.

[27]:
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-xlm-roberta-large-conll03-doc-context").try_cuda()
entities = model.predict(dataset)
len(entities)
[27]:
11
[28]:
entities[0]
[28]:
[{'span': 'Cleopatra VII',
  'label': 'PER',
  'score': 0.7116236686706543,
  'char_start_index': 0,
  'char_end_index': 13},
 {'span': 'BC',
  'label': 'MISC',
  'score': 0.9982840418815613,
  'char_start_index': 21,
  'char_end_index': 23},
 {'span': 'Ptolemaic Kingdom of Egypt',
  'label': 'LOC',
  'score': 0.6176416873931885,
  'char_start_index': 60,
  'char_end_index': 86}]

As you can see, the SpanMarker model returns a list of entity dictionaries for each sentence in the input dataset.
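Since the output is parallel to the input rows, it is straightforward to regroup the sentence-level predictions back into documents using the document_id column. This is a small sketch with hand-written sample predictions shaped like the model.predict output above, so it runs without loading a model:

```python
from collections import defaultdict

# Hypothetical sample predictions: one list of entity dicts per sentence,
# mirroring the structure returned by SpanMarkerModel.predict.
predictions = [
    [{"span": "Cleopatra VII", "label": "PER", "score": 0.71, "char_start_index": 0, "char_end_index": 13}],
    [{"span": "Messi", "label": "PER", "score": 0.99, "char_start_index": 17, "char_end_index": 22}],
]
# Parallel to the predictions, e.g. dataset["document_id"]
document_ids = [0, 1]

# Collect the entities of all sentences belonging to the same document
entities_per_document = defaultdict(list)
for document_id, sentence_entities in zip(document_ids, predictions):
    entities_per_document[document_id].extend(sentence_entities)

print(dict(entities_per_document))
```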