span_marker.modeling module
- class span_marker.modeling.SpanMarkerModel(config, encoder=None, model_card_data=None, **kwargs)
Bases: PreTrainedModel
This SpanMarker model allows for Named Entity Recognition (NER) using a variety of underlying encoders, such as BERT and RoBERTa. The model should be initialized using from_pretrained(), for example:
>>> # Initialize a SpanMarkerModel using a pretrained encoder
>>> model = SpanMarkerModel.from_pretrained("bert-base-cased", labels=["O", "B-PER", "I-PER", "B-ORG", "I-ORG", ...])
>>> # Load a pretrained SpanMarker model
>>> model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
After the model is loaded (and finetuned if it wasn’t already), it can be used to predict entities:
>>> model.predict("A prototype was fitted in the mid-'60s in a one-off DB5 extended 4'' after the doors and "
... "driven by Marek personally, and a normally 6-cylinder Aston Martin DB7 was equipped with a V8 unit in 1998.")
[{'span': 'DB5', 'label': 'product-car', 'score': 0.8675689101219177, 'char_start_index': 52, 'char_end_index': 55},
 {'span': 'Marek', 'label': 'person-other', 'score': 0.9100819230079651, 'char_start_index': 99, 'char_end_index': 104},
 {'span': 'Aston Martin DB7', 'label': 'product-car', 'score': 0.9931442737579346, 'char_start_index': 143, 'char_end_index': 159}]
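The list of dictionaries returned above is plain Python data, so it can be post-processed without the model. As a minimal sketch (the helper name `group_by_label` and the 0.9 threshold are illustrative assumptions, not part of SpanMarker), the following keeps only confident entities and groups their spans by label, using the example output above as literal input with rounded scores:

```python
from collections import defaultdict

# Example output copied from the predict() call above (scores rounded).
entities = [
    {"span": "DB5", "label": "product-car", "score": 0.8676,
     "char_start_index": 52, "char_end_index": 55},
    {"span": "Marek", "label": "person-other", "score": 0.9101,
     "char_start_index": 99, "char_end_index": 104},
    {"span": "Aston Martin DB7", "label": "product-car", "score": 0.9931,
     "char_start_index": 143, "char_end_index": 159},
]


def group_by_label(entities, min_score=0.0):
    """Group entity spans by label, dropping entities scored below min_score.

    Hypothetical helper for illustration; not a SpanMarker function.
    """
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Keep only entities the model scored at 0.9 or higher.
confident = group_by_label(entities, min_score=0.9)
print(confident)
```

A threshold like this trades recall for precision; a suitable value depends on the model and the data.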
- Parameters:
config (SpanMarkerConfig) –
encoder (PreTrainedModel | None) –
model_card_data (SpanMarkerModelCardData | None) –
- forward(input_ids, attention_mask, position_ids, start_marker_indices, num_marker_pairs, labels=None, num_words=None, document_ids=None, sentence_ids=None, **kwargs)
Forward call of the SpanMarkerModel.
- Parameters:
input_ids (Tensor) – Input IDs including start/end markers.
attention_mask (Tensor) – Attention mask matrix including one-directional attention for markers.
position_ids (Tensor) – Position IDs including start/end markers.
start_marker_indices (Tensor) – The indices where the start markers begin per batch sample.
num_marker_pairs (Tensor) – The number of start/end marker pairs per batch sample.
labels (Optional[Tensor]) – The labels for each span candidate. Defaults to None.
num_words (Optional[Tensor]) – The number of words for each batch sample. Defaults to None.
document_ids (Optional[Tensor]) – The document ID of each batch sample. Defaults to None.
sentence_ids (Optional[Tensor]) – The index of each sentence in its respective document. Defaults to None.
- Returns:
The output dataclass.
- Return type:
- classmethod from_pretrained(pretrained_model_name_or_path, *model_args, labels=None, config=None, model_card_data=None, **kwargs)
Instantiate a pretrained PyTorch model from a pretrained model configuration.
Example
>>> # Initialize a SpanMarkerModel using a pretrained encoder
>>> model = SpanMarkerModel.from_pretrained("bert-base-cased", labels=["O", "B-PER", "I-PER", "B-ORG", "I-ORG", ...])
>>> # Load a pretrained SpanMarker model
>>> model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
- Parameters:
pretrained_model_name_or_path (Union[str, os.PathLike]) –
Either a pretrained encoder (e.g. bert-base-cased, roberta-large, etc.), or a pretrained SpanMarkerModel. Can be one of:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
- A path to a directory containing model weights saved using SpanMarkerModel.save_pretrained(), e.g., ./my_model_directory/.
- A path or url to a tensorflow index checkpoint file (e.g., ./tf_model/model.ckpt.index). In this case, from_tf should be set to True and a configuration object should be provided as the config argument. This loading path is slower than converting the TensorFlow checkpoint to a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
- A path or url to a model folder containing a flax checkpoint file in .msgpack format (e.g., ./flax_model/ containing flax_model.msgpack). In this case, from_flax should be set to True.
labels (List[str], optional) – A list of string labels corresponding to the ner_tags in your datasets. Only necessary when loading a SpanMarker model using a pretrained encoder. Defaults to None.
config (SpanMarkerConfig | None) –
model_card_data (SpanMarkerModelCardData | None) –
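To illustrate how a labels list relates to ner_tags, the sketch below assumes the common 🤗 Datasets convention in which ner_tags are integer ids that index into the label list. The label list, tokens, and tag ids are made up for illustration and do not come from a real dataset:

```python
# Hypothetical label list; in practice it should match your dataset's tag scheme.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]

# One made-up pre-tokenized sample with integer ner_tags (assumed convention).
sample = {
    "tokens": ["Ada", "Lovelace", "joined", "Acme", "Corp"],
    "ner_tags": [1, 2, 0, 3, 4],
}

# Map each integer tag back to its string label by indexing into `labels`.
tag_names = [labels[tag] for tag in sample["ner_tags"]]

for token, name in zip(sample["tokens"], tag_names):
    print(f"{token}\t{name}")
```

Under this assumption, every id in ner_tags must be a valid index into labels, so the order of the list matters.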
- Return type:
T
Additional arguments are passed to SpanMarkerConfig and the from_pretrained methods of AutoConfig, AutoModel and SpanMarkerTokenizer.
- Returns:
A SpanMarkerModel instance, either ready for training using the Trainer or for inference via SpanMarkerModel.predict().
- predict(inputs, batch_size=4, show_progress_bar=False)
Predict named entities from input texts.
Example:
>>> model = SpanMarkerModel.from_pretrained(...)
>>> model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
[{'span': 'Amelia Earhart', 'label': 'person-other', 'score': 0.7629689574241638, 'char_start_index': 0, 'char_end_index': 14},
 {'span': 'Lockheed Vega 5B', 'label': 'product-airplane', 'score': 0.9833564758300781, 'char_start_index': 38, 'char_end_index': 54},
 {'span': 'Atlantic', 'label': 'location-bodiesofwater', 'score': 0.7621214389801025, 'char_start_index': 66, 'char_end_index': 74},
 {'span': 'Paris', 'label': 'location-GPE', 'score': 0.9807717204093933, 'char_start_index': 78, 'char_end_index': 83}]
>>> model.predict(['Caesar', 'led', 'the', 'Roman', 'armies', 'in', 'the', 'Gallic', 'Wars', 'before', 'defeating', 'his', 'political', 'rival', 'Pompey', 'in', 'a', 'civil', 'war'])
[{'span': ['Caesar'], 'label': 'person-politician', 'score': 0.683479905128479, 'word_start_index': 0, 'word_end_index': 1},
 {'span': ['Roman'], 'label': 'location-GPE', 'score': 0.7114525437355042, 'word_start_index': 3, 'word_end_index': 4},
 {'span': ['Gallic', 'Wars'], 'label': 'event-attack/battle/war/militaryconflict', 'score': 0.9015670418739319, 'word_start_index': 7, 'word_end_index': 9},
 {'span': ['Pompey'], 'label': 'person-politician', 'score': 0.9601260423660278, 'word_start_index': 14, 'word_end_index': 15}]
- Parameters:
inputs (Union[str, List[str], List[List[str]], Dataset]) –
Input sentences from which to extract entities. Valid data structures are:
- str: a string sentence.
- List[str]: a pre-tokenized string sentence, i.e. a list of words.
- List[str]: a list of multiple string sentences.
- List[List[str]]: a list of multiple pre-tokenized string sentences, i.e. a list with lists of words.
- Dataset: A 🤗 Dataset with a tokens column and optionally document_id and sentence_id columns. If the optional columns are provided, they will be used to provide document-level context.
batch_size (int) – The number of samples to include in a batch. A higher batch size is faster but requires more memory. Defaults to 4.
show_progress_bar (bool) – Whether to show a progress bar, useful for longer inputs. Defaults to False.
- Returns:
If the input is a single sentence, then we output a list of dictionaries. Each dictionary represents one predicted entity, and contains the following keys:
- label: The predicted entity label.
- span: The text that the model deems an entity.
- score: The model's confidence.
- word_start_index & word_end_index: The word indices for the start/end of the entity, if the input is pre-tokenized.
- char_start_index & char_end_index: The character indices for the start/end of the entity, if the input is a string.
If the input is multiple sentences, then we return a list containing one such list per sentence.
- Return type:
Union[List[Dict[str, Union[str, int, float]]], List[List[Dict[str, Union[str, int, float]]]]]
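To make the index keys concrete, here is a small check that uses the example outputs from the predict() examples above as literal data rather than calling the model. It suggests that, at least in those examples, the character indices slice the input string and the word indices slice the input word list, with the end index exclusive:

```python
# Inputs and (abridged) outputs copied from the predict() examples above.
text = "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."
char_entities = [
    {"span": "Amelia Earhart", "char_start_index": 0, "char_end_index": 14},
    {"span": "Lockheed Vega 5B", "char_start_index": 38, "char_end_index": 54},
    {"span": "Atlantic", "char_start_index": 66, "char_end_index": 74},
    {"span": "Paris", "char_start_index": 78, "char_end_index": 83},
]

words = ["Caesar", "led", "the", "Roman", "armies", "in", "the", "Gallic", "Wars",
         "before", "defeating", "his", "political", "rival", "Pompey", "in", "a",
         "civil", "war"]
word_entities = [
    {"span": ["Caesar"], "word_start_index": 0, "word_end_index": 1},
    {"span": ["Roman"], "word_start_index": 3, "word_end_index": 4},
    {"span": ["Gallic", "Wars"], "word_start_index": 7, "word_end_index": 9},
    {"span": ["Pompey"], "word_start_index": 14, "word_end_index": 15},
]

# String input: slicing the text with the char indices recovers each span.
char_ok = all(
    text[e["char_start_index"]:e["char_end_index"]] == e["span"]
    for e in char_entities
)

# Pre-tokenized input: slicing the word list with the word indices recovers each span.
word_ok = all(
    words[e["word_start_index"]:e["word_end_index"]] == e["span"]
    for e in word_entities
)

print(char_ok, word_ok)
```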
- save_pretrained(save_directory, is_main_process=True, state_dict=None, save_function=<function save>, push_to_hub=False, max_shard_size='10GB', safe_serialization=False, variant=None, **kwargs)
- SpanMarkerModel.push_to_hub(repo_id, use_temp_dir=None, commit_message=None, private=None, token=None, max_shard_size='5GB', create_pr=False, safe_serialization=True, revision=None, commit_description=None, tags=None, **deprecated_kwargs)
Upload the model file to the 🤗 Model Hub.
- Parameters:
repo_id (str) – The name of the repository you want to push your model to. It should contain your organization name when pushing to a given organization.
use_temp_dir (bool, optional) – Whether or not to use a temporary directory to store the files saved before they are pushed to the Hub. Will default to True if there is no directory named like repo_id, False otherwise.
commit_message (str, optional) – Message to commit while pushing. Will default to “Upload model”.
private (bool, optional) – Whether or not the repository created should be private.
token (bool or str, optional) – The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running huggingface-cli login (stored in ~/.huggingface). Will default to True if repo_url is not specified.
max_shard_size (int or str, optional, defaults to “5GB”) – Only applicable for models. The maximum size for a checkpoint before being sharded. Each checkpoint shard will then be smaller than this size. If expressed as a string, it needs to be digits followed by a unit (like “5MB”). The default is “5GB” so that users can easily load models on free-tier Google Colab instances without any CPU OOM issues.
create_pr (bool, optional, defaults to False) – Whether to create a PR with the uploaded files or commit directly.
safe_serialization (bool, optional, defaults to True) – Whether to convert the model weights to the safetensors format for safer serialization.
revision (str, optional) – Branch to push the uploaded files to.
commit_description (str, optional) – The description of the commit that will be created.
tags (List[str], optional) – List of tags to push on the Hub.
- Return type:
Examples:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

# Push the model to your namespace with the name "my-finetuned-bert".
model.push_to_hub("my-finetuned-bert")

# Push the model to an organization with the name "my-finetuned-bert".
model.push_to_hub("huggingface/my-finetuned-bert")
```
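The max_shard_size string format described above (digits followed by a unit) can be illustrated with a small parser. This is a hypothetical helper written for illustration, not the function the library uses, and the decimal multipliers (1 GB = 10^9 bytes) are an assumption:

```python
import re

# Assumed decimal multipliers; the library's own conversion may differ.
_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_shard_size(size):
    """Convert an int or a string such as "5GB" into a number of bytes.

    Hypothetical helper: integers are returned unchanged, strings must be
    digits followed by one of the units in _UNITS.
    """
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"(\d+)\s*(KB|MB|GB)", size.strip().upper())
    if match is None:
        raise ValueError(f"Expected digits followed by a unit, got {size!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]


print(parse_shard_size("5GB"))
print(parse_shard_size("5MB"))
```

Rejecting malformed strings early, as this sketch does, surfaces a typo such as "5 gigs" before any upload starts.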
- property SpanMarkerModel.device: device
The device on which the module is (assuming that all the module parameters are on the same device).
- Type:
torch.device
- SpanMarkerModel.cuda(device=None)
Move all model parameters and buffers to the GPU.
This also makes the associated parameters and buffers different objects, so it should be called before constructing the optimizer if the module will live on the GPU while being optimized.
Note
This method modifies the module in-place.
- Parameters:
device (int, optional) – if specified, all parameters will be copied to that device
- Returns:
self
- Return type:
Module
- SpanMarkerModel.cpu()
Move all model parameters and buffers to the CPU.
Note
This method modifies the module in-place.
- Returns:
self
- Return type:
Module
- SpanMarkerModel.train(mode=True)
Set the module in training mode.
This only has an effect on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.
- Parameters:
mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.
- Returns:
self
- Return type:
Module
- SpanMarkerModel.eval()
Set the module in evaluation mode.
This only has an effect on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.
This is equivalent to self.train(False).
See Locally disabling gradient computation for a comparison between .eval() and several similar mechanisms that may be confused with it.
- Returns:
self
- Return type:
Module