
All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.



  • Fixed integrations with newer dependency versions, like transformers and huggingface_hub.


  • Deprecated Python 3.8.



  • Added support for BILO tagging schemes.


  • Changed the error when an empty sentence is provided to the tokenizer.

  • The spaCy nlp.pipe method now processes texts sentence-wise, just like nlp(...).


  • No longer override language metadata from the dataset if the language was also set manually via SpanMarkerModelCardData.

  • No longer crash on predict with ValueError: Failed to concatenate on axis=1 ... if the first sentence in a list of sentences is just one word.



  • Added SpanMarkerModel.generate_model_card() method to get a model card string.

  • Added SpanMarkerModelCardData that should be passed to SpanMarkerModel.from_pretrained with additional information like

    • language, license, model_name, model_id, encoder_name, encoder_id, dataset_name, dataset_id, dataset_revision.

  • Added transformers pipeline support, e.g. pipeline(task="span-marker", model="tomaarsen/span-marker-mbert-base-multinerd").


  • Heavily improved the automatically generated model card.

  • Evaluating outside of training now returns per-label outputs instead of only “overall” F1, precision and recall.

  • Warn if the used tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space.

    • If so, then inference of that model will require the punctuation to be split from the words.

  • Improve label normalization speed.

  • Allow calling SpanMarkerModel.from_pretrained with a pre-initialized SpanMarkerConfig.


  • Deprecated Python 3.7.


  • Fixed tokenization mismatch between training and inference for XLM-RoBERTa models: allows for normal inference of those models.

  • Resolve niche bug when TrainingArguments are not provided.



  • Added an overwrite_entities parameter to the spaCy pipeline component to allow for overwriting spaCy entities.

  • Added .pipe() method to spaCy integration to allow for batched inference.


  • Stop overwriting spaCy entities by default.



  • Allow for immutable TrainingArguments from newer transformers release.



  • Resolved broken license information.



  • Fix crash in spaCy inference when using subsequent whitespace.



  • Added support for using span_marker spaCy pipeline component without importing SpanMarker.



  • Added support for load_in_8bit=True and device_map="auto".



  • Added trained_with_document_context to the SpanMarkerConfig.

    • Added warnings if a model is trained with document-level context and then evaluated or used for inference without it, or vice versa.

  • Added spaCy integration via nlp.add_pipe("span_marker"). See the SpanMarker with spaCy documentation for information.


  • Heavily improved computational efficiency of sample spreading, resulting in notably faster inference speeds.

  • Disable progress bar for inference by default, and add show_progress_bar parameter to SpanMarkerModel.predict.


  • Fixed evaluation method failing when the testing dataset contains two adjacent and identical sentences.



  • Add missing space in model card template.

  • Return nested list if input is a singular list of sentences or a dataset with one sample.



  • Added support for document-level context in training, evaluation and inference.

    • Use it by supplying document_id and sentence_id columns to the Trainer datasets.

    • Tune it by supplying max_prev_context and max_next_context to the SpanMarkerConfig via SpanMarkerModel.from_pretrained(..., max_prev_context=3).

  • Added batch inference support via SpanMarkerModel.predict(..., batch_size=4).
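
The document_id and sentence_id columns mentioned above could be prepared along the following lines. This is only a minimal sketch in plain Python: the helper name add_context_columns and the "tokens" column name are assumptions made for illustration, not part of the SpanMarker API, and the resulting rows would still need to be converted into whatever dataset object the Trainer expects.

```python
from typing import Dict, List


def add_context_columns(documents: List[List[List[str]]]) -> List[Dict[str, object]]:
    """Flatten documents into rows that carry document_id and sentence_id.

    `documents` is a list of documents, each a list of tokenized sentences.
    The "tokens" key is an illustrative assumption.
    """
    rows = []
    for document_id, sentences in enumerate(documents):
        for sentence_id, tokens in enumerate(sentences):
            rows.append(
                {
                    "tokens": tokens,
                    "document_id": document_id,  # which document the sentence is from
                    "sentence_id": sentence_id,  # position of the sentence in that document
                }
            )
    return rows


documents = [
    [["Paris", "is", "lovely", "."], ["It", "is", "in", "France", "."]],
    [["Amelia", "Earhart", "flew", "solo", "."]],
]
rows = add_context_columns(documents)
```

Each row keeps its sentence order within a document, which is the information that document-level context relies on.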


  • Ensure models are in evaluation mode when using SpanMarkerModel.predict.


  • Removed the allow_overlapping optional keyword from SpanMarkerModel.predict.



  • Fixed critical issue with incorrect predictions at inputs that require multiple samples.



  • Added a warning for entities that are ignored/skipped due to the maximum entity length or maximum model input length.

  • Added info-level logs displaying the detected labeling scheme (IOB/IOB2, BIOES, BILOU, none).

  • Added a warning suggesting to use model.cuda() when predictions are performed on a CPU while CUDA is available.

  • Added try_cuda method to SpanMarkerModel which tries to place the model on CUDA and does nothing if that fails.
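
To illustrate what detecting a labeling scheme can involve, here is a rough sketch that guesses the scheme from label prefixes. The function detect_scheme is hypothetical and is not the detection logic used by SpanMarker. It assumes labels of the form PREFIX-TYPE (e.g. B-PER) and treats anything else as unschemed.

```python
from typing import Iterable


def detect_scheme(labels: Iterable[str]) -> str:
    """Guess the labeling scheme from the prefixes before the first hyphen."""
    # Labels without a hyphen (such as "O" or "PER") contribute no prefix.
    prefixes = {label.split("-", 1)[0] for label in labels if "-" in label}
    if not prefixes:
        return "none"
    if prefixes <= {"B", "I"}:
        return "IOB/IOB2"
    if prefixes <= {"B", "I", "E", "S"}:
        return "BIOES"
    if prefixes <= {"B", "I", "L", "U"}:
        return "BILOU"
    # Hyphenated labels with other prefixes (e.g. "person-actor") are unschemed.
    return "none"


scheme = detect_scheme(["O", "B-PER", "I-PER", "L-PER", "U-PER"])
```

Note that a BIOES or BILOU label set that happens to contain only B- and I- labels would be reported as IOB/IOB2 by this sketch.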


  • Updated where in the input IDs the span markers are stored, resulting in a 40% increase in training and inference speed.

  • Updated default marker_max_length in SpanMarkerConfig from 256 to 128.

  • Updated default entity_max_length in SpanMarkerConfig from 16 to 8.

  • Add support for datasets<2.6.0.

  • Add warning if a <v1.0.0 model is loaded using v1.0.0 or newer.

  • Propagate SpanMarkerModel.from_pretrained kwargs to the encoder's AutoModel.from_pretrained.

  • Ignore UndefinedMetricWarning when evaluation f1 is 0.

  • Improved model card generation.


  • Resolved tricky issue causing models to learn to never predict the last token as an entity (Closes #1).

  • Fixed label normalization for BILOU datasets.

[0.2.2] - 2023-04-13


  • Correctly propagate SpanMarkerModel.from_pretrained kwargs to the config initialisation.

[0.2.1] - 2023-04-07


  • Save span_marker_version in config files from now on.


  • SpanMarkerModel.save_pretrained and SpanMarkerModel.push_to_hub now also include the tokenizer and a simple model card.

[0.2.0] - 2023-04-06


  • Added missing docstrings.


  • Updated how entity span indices are returned for SpanMarkerModel.predict.


  • Prevent incorrect labels when loading a model trained with a schemed (e.g. IOB, BIOES) dataset.

  • Fix several bugs with loading finetuned SpanMarker models.

  • Add missing methods to SpanMarkerTokenizer.

  • Fix endless recursion bug when providing a compute_metrics to the Trainer.

[0.1.1] - 2023-03-31


  • Prevent crash when args not supplied to Trainer.

  • Prevent crash on evaluation when using fp16=True as a Training Argument.

[0.1.0] - 2023-03-30


  • Implement initial working version.