span_marker.tokenizer module

class span_marker.tokenizer.EntityTracker(entity_max_length, model_max_length, split='train', total_num_entities=0, skipped_entities=<factory>, enabled=False)

Bases: object

Tracks annotated entities and warns about what percentage of them is ignored or skipped.

Example:

This SpanMarker model won't be able to predict 5.930931% of all annotated entities in the evaluation dataset.
This is caused by the SpanMarkerModel maximum entity length of 6 words and the maximum model input length of 64 tokens.
These are the frequencies of the missed entities due to maximum entity length out of 1332 total entities:
- 7 missed entities with 7 words (0.525526%)
- 2 missed entities with 8 words (0.150150%)
- 2 missed entities with 9 words (0.150150%)
- 2 missed entities with 13 words (0.150150%)
Additionally, a total of 66 (4.954955%) entities were missed due to the maximum input length.
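The percentages in this example follow directly from the counts. A minimal Python sketch that recomputes them is given here; the variable and function names are illustrative and are not part of span_marker:

```python
# Counts copied from the example warning above.
total_entities = 1332
# Entity length in words -> number of entities missed at that length.
missed_by_length = {7: 7, 8: 2, 9: 2, 13: 2}
missed_by_input_length = 66


def as_percentage(count: int, total: int) -> str:
    """Format count / total as a percentage with six decimals."""
    return f"{count / total:%}"


for length, count in missed_by_length.items():
    share = as_percentage(count, total_entities)
    print(f"- {count} missed entities with {length} words ({share})")

# Entities missed for either reason, as reported in the first line of the warning.
total_missed = sum(missed_by_length.values()) + missed_by_input_length
print(as_percentage(missed_by_input_length, total_entities))  # 4.954955%
print(as_percentage(total_missed, total_entities))  # 5.930931%
```

The 5.930931% headline figure is therefore the 13 entities lost to the entity length limit plus the 66 lost to the input length limit, out of 1332.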

Parameters:

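entity_max_length (int) – Maximum entity length in words, as reported in the warning.
model_max_length (int) – Maximum model input length in tokens, as reported in the warning.
split (str) – Dataset split being processed, named in the warning.
total_num_entities (int) – Running total of entities, incremented by add().
skipped_entities (Dict[int, int]) – Counter of missed/ignored/skipped entities, updated by missed().
enabled (bool) – Whether tracking is active; reset() stops tracking.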

entity_max_length: int

model_max_length: int

split: str = 'train'

total_num_entities: int = 0

skipped_entities: Dict[int, int]

enabled: bool = False

add(num_entities)

Add to the counter of total number of entities.

Parameters: num_entities (int) – How many entities to increment by.
Return type: None

missed(length)

Add to the counter of missed/ignored/skipped entities.

Parameters: length (int) – How many entities were missed.
Return type: None

reset()

Reset to defaults and stop tracking.

Return type: None
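To make the three counters concrete, here is a toy stand-in written as a plain dataclass. It is a sketch under assumptions, not the span_marker implementation: the class name ToyEntityTracker is hypothetical, and it assumes that skipped_entities maps an entity length in words to the number of skipped entities of that length, which is what the Dict[int, int] type and the example warning suggest.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyEntityTracker:
    """Illustrative stand-in, not the span_marker implementation."""

    total_num_entities: int = 0
    # Assumption: entity length in words -> number of skipped entities.
    skipped_entities: Dict[int, int] = field(default_factory=dict)
    enabled: bool = False

    def add(self, num_entities: int) -> None:
        """Increase the total entity count."""
        self.total_num_entities += num_entities

    def missed(self, length: int) -> None:
        """Record one skipped entity of the given length."""
        self.skipped_entities[length] = self.skipped_entities.get(length, 0) + 1

    def reset(self) -> None:
        """Restore the defaults and stop tracking."""
        self.total_num_entities = 0
        self.skipped_entities = {}
        self.enabled = False


tracker = ToyEntityTracker(enabled=True)
tracker.add(5)     # a sentence with five annotated entities
tracker.missed(7)  # one of them is seven words long and gets skipped
tracker.missed(7)  # another seven-word entity
tracker.missed(9)
print(tracker.total_num_entities, tracker.skipped_entities)
tracker.reset()
print(tracker.total_num_entities, tracker.skipped_entities, tracker.enabled)
```

Keeping the skipped counts keyed by length is what would allow a per-length breakdown like the one in the example warning.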

class span_marker.tokenizer.SpanMarkerTokenizer(tokenizer, config, **kwargs)

Bases: object

Parameters:

tokenizer (PreTrainedTokenizer) –
config (SpanMarkerConfig) –

get_all_valid_spans(num_words, entity_max_length)

Parameters:

num_words (int) –
entity_max_length (int) –

Return type:

Iterator[Tuple[int, int]]
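The signature suggests an enumeration of every word span that is no longer than entity_max_length. A standalone sketch of such an enumeration follows; the function name all_valid_spans is hypothetical, and the half-open (start, end) convention with an exclusive end index is an assumption, not something this page states.

```python
from typing import Iterator, Tuple


def all_valid_spans(num_words: int, entity_max_length: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) word spans, end exclusive, of at most entity_max_length words."""
    for start in range(num_words):
        # A span may not run past the sentence or exceed the length limit.
        last_end = min(num_words, start + entity_max_length)
        for end in range(start + 1, last_end + 1):
            yield (start, end)


print(list(all_valid_spans(num_words=3, entity_max_length=2)))
# [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
```

Under these assumptions the number of candidate spans grows roughly as num_words times entity_max_length, which is why a small length limit keeps the candidate set manageable.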

get_all_valid_spans_and_labels(num_words, span_to_label, entity_max_length, outside_id)

Parameters:

num_words (int) –
span_to_label (Dict[Tuple[int, int], int]) –
entity_max_length (int) –
outside_id (int) –

Return type:

Iterator[Tuple[Tuple[int, int], int]]
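Judging by the parameter and return types, each candidate span is paired with a label id, and spans without an annotation receive outside_id. A self-contained sketch of that pairing is shown here; the function name is hypothetical, the half-open (start, end) span convention is an assumption, and the lookup uses dict.get so the input mapping is left unmodified, which may differ from the library's behavior.

```python
from typing import Dict, Iterator, Tuple

Span = Tuple[int, int]


def all_valid_spans_and_labels(
    num_words: int,
    span_to_label: Dict[Span, int],
    entity_max_length: int,
    outside_id: int,
) -> Iterator[Tuple[Span, int]]:
    """Yield every candidate span with its label id, or outside_id if unannotated."""
    for start in range(num_words):
        # End index is exclusive and limited by sentence length and span length.
        for end in range(start + 1, min(num_words, start + entity_max_length) + 1):
            span = (start, end)
            yield span, span_to_label.get(span, outside_id)


# One annotated entity covering words 0 and 1, with label id 1; outside id is 0.
pairs = list(all_valid_spans_and_labels(3, {(0, 2): 1}, entity_max_length=2, outside_id=0))
print(pairs)
# [((0, 1), 0), ((0, 2), 1), ((1, 2), 0), ((1, 3), 0), ((2, 3), 0)]
```

An annotated span longer than entity_max_length never appears among the candidates, which corresponds to the entities reported as missed due to maximum entity length.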

classmethod from_pretrained(pretrained_model_name_or_path, *inputs, config=None, **kwargs)

Parameters: pretrained_model_name_or_path (str | PathLike) –