span_marker.modeling module

class span_marker.modeling.SpanMarkerModel(config, encoder=None, model_card_data=None, **kwargs)

Bases: PreTrainedModel

This SpanMarker model performs Named Entity Recognition (NER) using a variety of underlying encoders, such as BERT and RoBERTa. The model should be initialized using from_pretrained(), for example:

>>> # Initialize a SpanMarkerModel using a pretrained encoder
>>> model = SpanMarkerModel.from_pretrained("bert-base-cased", labels=["O", "B-PER", "I-PER", "B-ORG", "I-ORG", ...])
>>> # Load a pretrained SpanMarker model
>>> model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")

After the model is loaded (and finetuned if it wasn’t already), it can be used to predict entities:

>>> model.predict("A prototype was fitted in the mid-'60s in a one-off DB5 extended 4'' after the doors and "
... "driven by Marek personally, and a normally 6-cylinder Aston Martin DB7 was equipped with a V8 unit in 1998.")
[{'span': 'DB5', 'label': 'product-car', 'score': 0.8675689101219177, 'char_start_index': 52, 'char_end_index': 55},
 {'span': 'Marek', 'label': 'person-other', 'score': 0.9100819230079651, 'char_start_index': 99, 'char_end_index': 104},
 {'span': 'Aston Martin DB7', 'label': 'product-car', 'score': 0.9931442737579346, 'char_start_index': 143, 'char_end_index': 159}]
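
The list returned by predict() is plain Python data, so it can be post-processed directly. The sketch below copies the entities from the output above and groups the spans by label after dropping low-confidence predictions. The helper name group_by_label and the 0.9 threshold are illustrative assumptions, not part of SpanMarker.

```python
from collections import defaultdict

# Entities copied from the predict() output shown above.
entities = [
    {"span": "DB5", "label": "product-car", "score": 0.8675689101219177,
     "char_start_index": 52, "char_end_index": 55},
    {"span": "Marek", "label": "person-other", "score": 0.9100819230079651,
     "char_start_index": 99, "char_end_index": 104},
    {"span": "Aston Martin DB7", "label": "product-car", "score": 0.9931442737579346,
     "char_start_index": 143, "char_end_index": 159},
]


def group_by_label(entities, min_score=0.0):
    """Group entity spans by label, keeping entities scoring at least min_score.

    Hypothetical helper for illustration; not part of the SpanMarker API.
    """
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# The 0.9 threshold is an arbitrary choice for this example.
grouped = group_by_label(entities, min_score=0.9)
print(grouped)
```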

forward(input_ids, attention_mask, position_ids, start_marker_indices, num_marker_pairs, labels=None, num_words=None, document_ids=None, sentence_ids=None, **kwargs)

Forward call of the SpanMarkerModel.

Parameters:
  • input_ids (Tensor) – Input IDs including start/end markers.

  • attention_mask (Tensor) – Attention mask matrix including one-directional attention for markers.

  • position_ids (Tensor) – Position IDs including start/end markers.

  • start_marker_indices (Tensor) – The indices where the start markers begin per batch sample.

  • num_marker_pairs (Tensor) – The number of start/end marker pairs per batch sample.

  • labels (Optional[Tensor]) – The labels for each span candidate. Defaults to None.

  • num_words (Optional[Tensor]) – The number of words for each batch sample. Defaults to None.

  • document_ids (Optional[Tensor]) – The document ID of each batch sample. Defaults to None.

  • sentence_ids (Optional[Tensor]) – The index of each sentence in their respective document. Defaults to None.

Returns:

The output dataclass.

Return type:

SpanMarkerOutput

classmethod from_pretrained(pretrained_model_name_or_path, *model_args, labels=None, config=None, model_card_data=None, **kwargs)

Instantiate a pretrained PyTorch model from a pretrained model configuration.

Example

>>> # Initialize a SpanMarkerModel using a pretrained encoder
>>> model = SpanMarkerModel.from_pretrained("bert-base-cased", labels=["O", "B-PER", "I-PER", "B-ORG", "I-ORG", ...])
>>> # Load a pretrained SpanMarker model
>>> model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
Parameters:
  • pretrained_model_name_or_path (Union[str, os.PathLike]) –

    Either a pretrained encoder (e.g. bert-base-cased, roberta-large, etc.), or a pretrained SpanMarkerModel. Can be either:

    • A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.

    • A path to a directory containing model weights saved using SpanMarkerModel.save_pretrained(), e.g., ./my_model_directory/.

    • A path or url to a tensorflow index checkpoint file (e.g, ./tf_model/model.ckpt.index). In this case, from_tf should be set to True and a configuration object should be provided as config argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.

    • A path or url to a model folder containing a flax checkpoint file in .msgpack format (e.g, ./flax_model/ containing flax_model.msgpack). In this case, from_flax should be set to True.

  • labels (List[str], optional) – A list of string labels corresponding to the ner_tags in your datasets. Only necessary when loading a SpanMarker model using a pretrained encoder. Defaults to None.

  • config (SpanMarkerConfig | None) –

  • model_card_data (SpanMarkerModelCardData | None) –

Additional arguments are passed to SpanMarkerConfig and the from_pretrained methods of AutoConfig, AutoModel and SpanMarkerTokenizer.

Returns:

A SpanMarkerModel instance, either ready for training using the Trainer or for inference via SpanMarkerModel.predict().

Return type:

SpanMarkerModel

predict(inputs, batch_size=4, show_progress_bar=False)

Predict named entities from input texts.

Example:

>>> model = SpanMarkerModel.from_pretrained(...)
>>> model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
[{'span': 'Amelia Earhart', 'label': 'person-other', 'score': 0.7629689574241638, 'char_start_index': 0, 'char_end_index': 14},
 {'span': 'Lockheed Vega 5B', 'label': 'product-airplane', 'score': 0.9833564758300781, 'char_start_index': 38, 'char_end_index': 54},
 {'span': 'Atlantic', 'label': 'location-bodiesofwater', 'score': 0.7621214389801025, 'char_start_index': 66, 'char_end_index': 74},
 {'span': 'Paris', 'label': 'location-GPE', 'score': 0.9807717204093933, 'char_start_index': 78, 'char_end_index': 83}]
>>> model.predict(['Caesar', 'led', 'the', 'Roman', 'armies', 'in', 'the', 'Gallic', 'Wars', 'before', 'defeating', 'his', 'political', 'rival', 'Pompey', 'in', 'a', 'civil', 'war'])
[{'span': ['Caesar'], 'label': 'person-politician', 'score': 0.683479905128479, 'word_start_index': 0, 'word_end_index': 1},
 {'span': ['Roman'], 'label': 'location-GPE', 'score': 0.7114525437355042, 'word_start_index': 3, 'word_end_index': 4},
 {'span': ['Gallic', 'Wars'], 'label': 'event-attack/battle/war/militaryconflict', 'score': 0.9015670418739319, 'word_start_index': 7, 'word_end_index': 9},
 {'span': ['Pompey'], 'label': 'person-politician', 'score': 0.9601260423660278, 'word_start_index': 14, 'word_end_index': 15}]
Parameters:
  • inputs (Union[str, List[str], List[List[str]], Dataset]) –

    Input sentences from which to extract entities. Valid datastructures are:

    • str: a string sentence.

    • List[str]: a pre-tokenized string sentence, i.e. a list of words.

    • List[str]: a list of multiple string sentences.

    • List[List[str]]: a list of multiple pre-tokenized string sentences, i.e. a list with lists of words.

    • Dataset: A 🤗 Dataset with a tokens column and optionally document_id and sentence_id columns.

      If the optional columns are provided, they will be used to provide document-level context.

  • batch_size (int) – The number of samples to include in a batch. A higher batch size is faster but requires more memory. Defaults to 4.

  • show_progress_bar (bool) – Whether to show a progress bar, useful for longer inputs. Defaults to False.

Returns:

If the input is a single sentence, then we output a list of dictionaries. Each dictionary represents one predicted entity, and contains the following keys:

  • label: The predicted entity label.

  • span: The text that the model deems an entity.

  • score: The model's confidence in the prediction.

  • word_start_index & word_end_index: The word indices for the start/end of the entity, if the input is pre-tokenized.

  • char_start_index & char_end_index: The character indices for the start/end of the entity, if the input is a string.

If the input is multiple sentences, then we return a list containing multiple of the aforementioned lists.

Return type:

Union[List[Dict[str, Union[str, int, float]]], List[List[Dict[str, Union[str, int, float]]]]]
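
In the example outputs above, the start index is inclusive and the end index behaves as an exclusive bound, so slicing the input with the two indices recovers the span. The sketch below checks this against the documented outputs for both a string input and a pre-tokenized input. The helper matches_source is a hypothetical name used only for illustration.

```python
# String input and the entities predict() returned for it in the example above.
text = "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."
char_entities = [
    {"span": "Amelia Earhart", "char_start_index": 0, "char_end_index": 14},
    {"span": "Lockheed Vega 5B", "char_start_index": 38, "char_end_index": 54},
    {"span": "Atlantic", "char_start_index": 66, "char_end_index": 74},
    {"span": "Paris", "char_start_index": 78, "char_end_index": 83},
]

# Pre-tokenized input and the entities predict() returned for it.
words = ["Caesar", "led", "the", "Roman", "armies", "in", "the", "Gallic", "Wars",
         "before", "defeating", "his", "political", "rival", "Pompey", "in", "a",
         "civil", "war"]
word_entities = [
    {"span": ["Caesar"], "word_start_index": 0, "word_end_index": 1},
    {"span": ["Roman"], "word_start_index": 3, "word_end_index": 4},
    {"span": ["Gallic", "Wars"], "word_start_index": 7, "word_end_index": 9},
    {"span": ["Pompey"], "word_start_index": 14, "word_end_index": 15},
]


def matches_source(source, entity):
    """Return True if slicing `source` with the entity's indices yields its span.

    Hypothetical helper; works for both char-indexed and word-indexed entities.
    """
    if "char_start_index" in entity:
        start, end = entity["char_start_index"], entity["char_end_index"]
    else:
        start, end = entity["word_start_index"], entity["word_end_index"]
    return source[start:end] == entity["span"]


chars_ok = all(matches_source(text, entity) for entity in char_entities)
words_ok = all(matches_source(words, entity) for entity in word_entities)
print(chars_ok, words_ok)
```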

save_pretrained(save_directory, is_main_process=True, state_dict=None, save_function=<function save>, push_to_hub=False, max_shard_size='10GB', safe_serialization=False, variant=None, **kwargs)
Parameters:
  • save_directory (str | PathLike) –

  • is_main_process (bool) –

  • state_dict (dict | None) –

  • save_function (Callable) –

  • push_to_hub (bool) –

  • max_shard_size (int | str) –

  • safe_serialization (bool) –

  • variant (str | None) –

Return type:

None
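
A save-and-reload round trip might look like the sketch below. It assumes the span_marker package is installed and relies only on two behaviours documented above: save_pretrained() writes to a directory, and from_pretrained() accepts a directory written by save_pretrained(). The helper save_and_reload is hypothetical and not part of the library.

```python
from pathlib import Path


def save_and_reload(model, save_directory):
    """Save `model` to `save_directory`, then load a fresh copy from it.

    Hypothetical helper for illustration; `model` is assumed to be a
    SpanMarkerModel instance.
    """
    # Imported lazily so that defining this helper does not require span_marker.
    from span_marker import SpanMarkerModel

    save_directory = Path(save_directory)
    model.save_pretrained(save_directory)
    return SpanMarkerModel.from_pretrained(save_directory)
```

Given a loaded model, a call such as save_and_reload(model, "./my_model_directory/") would return a second instance loaded from disk.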

generate_model_card()

Generate and return a model card string based on the model card data.

Returns:

The model card string.

Return type:

str

try_cuda(device=None)

Try to move all model parameters and buffers to the GPU; do nothing if that fails.

Note

This method modifies the module in-place.

Parameters:

device (int, optional) – If specified, all parameters will be copied to that device.

Returns:

self

Return type:

Module
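
The "try, and fall back silently" behaviour described above can be approximated by the sketch below. It is not the library's implementation. It assumes that a failed move surfaces as an AssertionError or RuntimeError, and it uses a hypothetical CpuOnlyModule stand-in so the example runs without PyTorch or a GPU.

```python
def try_cuda(module, device=None):
    """Return module.cuda(device), or the unchanged module if the move fails."""
    try:
        return module.cuda(device)
    except (AssertionError, RuntimeError):
        # Assumption: these are the errors raised when no usable GPU is present.
        return module


class CpuOnlyModule:
    """Hypothetical stand-in for a module on a machine without CUDA."""

    def cuda(self, device=None):
        raise RuntimeError("No CUDA GPUs are available")


module = CpuOnlyModule()
result = try_cuda(module)
print(result is module)
```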

SpanMarkerModel.push_to_hub(repo_id, use_temp_dir=None, commit_message=None, private=None, token=None, max_shard_size='5GB', create_pr=False, safe_serialization=True, revision=None, commit_description=None, tags=None, **deprecated_kwargs)

Upload the model file to the 🤗 Model Hub.

Parameters:
  • repo_id (str) – The name of the repository you want to push your model to. It should contain your organization name when pushing to a given organization.

  • use_temp_dir (bool, optional) – Whether or not to use a temporary directory to store the files saved before they are pushed to the Hub. Will default to True if there is no directory named like repo_id, False otherwise.

  • commit_message (str, optional) – Message to commit while pushing. Will default to “Upload model”.

  • private (bool, optional) – Whether to make the repo private. If None (default), the repo will be public unless the organization’s default is private. This value is ignored if the repo already exists.

  • token (bool or str, optional) – The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running huggingface-cli login (stored in ~/.huggingface). Will default to True if repo_url is not specified.

  • max_shard_size (int or str, optional, defaults to “5GB”) – Only applicable for models. The maximum size for a checkpoint before being sharded. Each checkpoint shard will then be smaller than this size. If expressed as a string, it needs to be digits followed by a unit (like “5MB”). The default is “5GB” so that users can easily load models on free-tier Google Colab instances without any CPU OOM issues.

  • create_pr (bool, optional, defaults to False) – Whether or not to create a PR with the uploaded files or directly commit.

  • safe_serialization (bool, optional, defaults to True) – Whether or not to convert the model weights in safetensors format for safer serialization.

  • revision (str, optional) – Branch to push the uploaded files to.

  • commit_description (str, optional) – The description of the commit that will be created.

  • tags (List[str], optional) – List of tags to push on the Hub.

Return type:

str

Examples:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("google-bert/bert-base-cased")

# Push the model to your namespace with the name "my-finetuned-bert".
model.push_to_hub("my-finetuned-bert")

# Push the model to an organization with the name "my-finetuned-bert".
model.push_to_hub("huggingface/my-finetuned-bert")
```

property SpanMarkerModel.device: device

The device on which the module is (assuming that all the module parameters are on the same device).

Type:

torch.device

SpanMarkerModel.cuda(device=None)

Move all model parameters and buffers to the GPU.

This also makes the associated parameters and buffers different objects, so it should be called before constructing the optimizer if the module will live on the GPU while being optimized.

Note

This method modifies the module in-place.

Parameters:
  • device (int, optional) – If specified, all parameters will be copied to that device.

Returns:

self

Return type:

Module

SpanMarkerModel.cpu()

Move all model parameters and buffers to the CPU.

Note

This method modifies the module in-place.

Returns:

self

Return type:

Module

SpanMarkerModel.train(mode=True)

Set the module in training mode.

This has an effect only on certain modules. See the documentation of the particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Parameters:
  • mode (bool) – Whether to set training mode (True) or evaluation mode (False). Defaults to True.

Returns:

self

Return type:

Module

SpanMarkerModel.eval()

Set the module in evaluation mode.

This has an effect only on certain modules. See the documentation of the particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See the PyTorch notes on locally disabling gradient computation for a comparison between .eval() and several similar mechanisms that may be confused with it.

Returns:

self

Return type:

Module
