span_marker.tokenizer module

class span_marker.tokenizer.EntityTracker(entity_max_length, model_max_length, split='train', total_num_entities=0, skipped_entities=<factory>, enabled=False)

Bases: object

Tracks the percentage of entities that are ignored or skipped, so that a warning can be issued.

Example:

This SpanMarker model won't be able to predict 5.930931% of all annotated entities in the evaluation dataset.
This is caused by the SpanMarkerModel maximum entity length of 6 words and the maximum model input length of 64 tokens.
These are the frequencies of the missed entities due to maximum entity length out of 1332 total entities:
- 7 missed entities with 7 words (0.525526%)
- 2 missed entities with 8 words (0.150150%)
- 2 missed entities with 9 words (0.150150%)
- 2 missed entities with 13 words (0.150150%)
Additionally, a total of 66 (4.954955%) entities were missed due to the maximum input length.
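
The percentages in the message above are plain ratios over the total entity count. A minimal sketch of that arithmetic, using the counts from the example; the variable names and the `percent` helper are illustrative and not part of span_marker:

```python
# Counts copied from the example warning; all names here are illustrative.
total_entities = 1332
missed_by_length = {7: 7, 8: 2, 9: 2, 13: 2}  # entity length in words -> count
missed_by_input_length = 66


def percent(count: int, total: int) -> str:
    """Format count / total as a percentage with six decimals."""
    return f"{count / total:%}"


# Share of entities missed for each over-long entity length.
per_length = {
    length: percent(count, total_entities)
    for length, count in missed_by_length.items()
}
# Share of entities missed because the input was too long for the model.
input_length_share = percent(missed_by_input_length, total_entities)
# Overall share: both causes combined.
overall = percent(
    sum(missed_by_length.values()) + missed_by_input_length, total_entities
)

print(per_length)
print(input_length_share)
print(overall)
```

The overall figure combines the 13 entities missed because of their length with the 66 missed because of the input length, that is 79 of 1332.
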
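Parameters:
  • entity_max_length (int) – Maximum entity length, in words.

  • model_max_length (int) – Maximum model input length, in tokens.

  • split (str) – Name of the dataset split being tracked.

  • total_num_entities (int) – Running count of all entities, incremented by add().

  • skipped_entities (Dict[int, int]) – Counter of missed entities, updated by missed().

  • enabled (bool) – Whether tracking is active.
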
entity_max_length: int
model_max_length: int
split: str = 'train'
total_num_entities: int = 0
skipped_entities: Dict[int, int]
enabled: bool = False

add(num_entities)

Add to the counter of total number of entities.

Parameters:

num_entities (int) – How many entities to increment by.

Return type:

None

missed(length)

Add to the counter of missed/ignored/skipped entities.

Parameters:

length (int) – The length, in words, of the missed entity.

Return type:

None

reset()

Reset to defaults and stop tracking.

Return type:

None
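
To make the counter semantics concrete, here is a minimal stand-in sketch. The fields mirror the signature documented above, but the method bodies are assumptions: that counters only change while `enabled` is true, and that `length` is the word length of a single missed entity, as the per-length breakdown in the example warning suggests. `TrackerSketch` is a hypothetical name and is not the span_marker implementation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TrackerSketch:
    """Illustrative stand-in; NOT the span_marker implementation."""

    entity_max_length: int
    model_max_length: int
    split: str = "train"
    total_num_entities: int = 0
    skipped_entities: Dict[int, int] = field(default_factory=dict)
    enabled: bool = False

    def add(self, num_entities: int) -> None:
        # Assumption: counters only change while tracking is enabled.
        if self.enabled:
            self.total_num_entities += num_entities

    def missed(self, length: int) -> None:
        # Assumption: `length` is the word length of one missed entity,
        # so the dictionary maps entity length -> number of misses.
        if self.enabled:
            self.skipped_entities[length] = self.skipped_entities.get(length, 0) + 1

    def reset(self) -> None:
        # Back to the defaults from the signature; tracking stops.
        self.total_num_entities = 0
        self.skipped_entities = {}
        self.enabled = False


tracker = TrackerSketch(entity_max_length=6, model_max_length=64, enabled=True)
tracker.add(10)     # ten annotated entities seen
tracker.missed(7)   # one 7-word entity could not be represented
tracker.missed(7)
tracker.missed(9)
snapshot = (tracker.total_num_entities, dict(tracker.skipped_entities))
tracker.reset()
```

Keying the dictionary by entity length is what allows a warning to list misses per length, one line for each length.
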

class span_marker.tokenizer.SpanMarkerTokenizer(tokenizer, config, **kwargs)

Bases: object

Parameters:
get_all_valid_spans(num_words, entity_max_length)
Parameters:
  • num_words (int) –

  • entity_max_length (int) –

Return type:

Iterator[Tuple[int, int]]
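
As an illustration of what enumerating valid spans can look like, the sketch below yields every span of at most `entity_max_length` words in a sentence of `num_words` words. It assumes half-open `(start, end)` word indices and a start-major ordering; neither is confirmed by this page, and `all_valid_spans` is a hypothetical standalone function rather than the library method.

```python
from typing import Iterator, Tuple


def all_valid_spans(num_words: int, entity_max_length: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) spans, end exclusive, of 1..entity_max_length words."""
    for start in range(num_words):
        # A span may not run past the sentence or exceed the maximum length.
        last_end = min(num_words, start + entity_max_length)
        for end in range(start + 1, last_end + 1):
            yield (start, end)


spans = list(all_valid_spans(num_words=3, entity_max_length=2))
print(spans)  # [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
```

With three words and a maximum length of two, the only span left out is `(0, 3)`, which covers all three words.
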

get_all_valid_spans_and_labels(num_words, span_to_label, entity_max_length, outside_id)
Parameters:
  • num_words (int) –

  • span_to_label (Dict[Tuple[int, int], int]) –

  • entity_max_length (int) –

  • outside_id (int) –

Return type:

Iterator[Tuple[Tuple[int, int], int]]
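
A hedged sketch of how spans and labels might be paired: each valid span is looked up in `span_to_label`, and spans without an annotation receive `outside_id`. The half-open span convention, the lookup behaviour, and the function names are assumptions made for illustration, not the library's confirmed implementation.

```python
from typing import Dict, Iterator, Tuple

Span = Tuple[int, int]


def all_valid_spans(num_words: int, entity_max_length: int) -> Iterator[Span]:
    """Yield (start, end) spans, end exclusive, of 1..entity_max_length words."""
    for start in range(num_words):
        for end in range(start + 1, min(num_words, start + entity_max_length) + 1):
            yield (start, end)


def all_valid_spans_and_labels(
    num_words: int,
    span_to_label: Dict[Span, int],
    entity_max_length: int,
    outside_id: int,
) -> Iterator[Tuple[Span, int]]:
    """Pair every valid span with its label, defaulting to outside_id."""
    for span in all_valid_spans(num_words, entity_max_length):
        # Assumption: unannotated spans are labelled as "outside".
        yield span, span_to_label.get(span, outside_id)


# Hypothetical input: words 0..1 form an entity with label id 1.
pairs = list(
    all_valid_spans_and_labels(
        num_words=3, span_to_label={(0, 2): 1}, entity_max_length=2, outside_id=0
    )
)
print(pairs)
```

In this toy input only `(0, 2)` carries an entity label; the other four spans are labelled with the outside id `0`.
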

classmethod from_pretrained(pretrained_model_name_or_path, *inputs, config=None, **kwargs)
Parameters:

pretrained_model_name_or_path (str | PathLike) –