span_marker.tokenizer module
- class span_marker.tokenizer.EntityTracker(entity_max_length, model_max_length, split='train', total_num_entities=0, skipped_entities=<factory>, enabled=False)
Bases: object
Tracks how many entities are ignored or skipped, so that a warning can report what percentage of all annotated entities is affected.
Example:
This SpanMarker model won't be able to predict 5.930931% of all annotated entities in the evaluation dataset. This is caused by the SpanMarkerModel maximum entity length of 6 words and the maximum model input length of 64 tokens.
These are the frequencies of the missed entities due to maximum entity length out of 1332 total entities:
- 7 missed entities with 7 words (0.525526%)
- 2 missed entities with 8 words (0.150150%)
- 2 missed entities with 9 words (0.150150%)
- 2 missed entities with 13 words (0.150150%)
Additionally, a total of 66 (4.954955%) entities were missed due to the maximum input length.
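- Parameters:
entity_max_length (int) –
model_max_length (int) –
split (str) –
total_num_entities (int) –
skipped_entities (Counter) –
enabled (bool) –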
- add(num_entities)
Add to the counter of total number of entities.
- Parameters:
num_entities (int) – How many entities to increment by.
- Return type:
None
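The percentages in the EntityTracker example warning can be reproduced with a small stand-in class. This is a minimal sketch, not the span_marker implementation: the class name MiniEntityTracker, the fields skipped_by_length and skipped_by_input_length, and the methods missed_percentage and length_breakdown are hypothetical names chosen for illustration. Only add(num_entities) and the constructor arguments entity_max_length, model_max_length and total_num_entities mirror names from the signature above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class MiniEntityTracker:
    """Hypothetical stand-in for illustration; not the span_marker class."""

    entity_max_length: int
    model_max_length: int
    total_num_entities: int = 0
    # Assumed layout: entity length in words -> number of entities skipped.
    skipped_by_length: Counter = field(default_factory=Counter)
    # Assumed counter for entities lost to the maximum model input length.
    skipped_by_input_length: int = 0

    def add(self, num_entities: int) -> None:
        """Add to the counter of total number of entities."""
        self.total_num_entities += num_entities

    def missed_percentage(self) -> float:
        """Percentage of all annotated entities that cannot be predicted."""
        missed = sum(self.skipped_by_length.values()) + self.skipped_by_input_length
        return 100 * missed / self.total_num_entities

    def length_breakdown(self) -> List[str]:
        """One line per entity length, formatted like the example warning."""
        lines = []
        for length, count in sorted(self.skipped_by_length.items()):
            share = 100 * count / self.total_num_entities
            lines.append(f"- {count} missed entities with {length} words ({share:.6f}%)")
        return lines


# Numbers taken from the example warning: 1332 entities in total,
# 13 skipped for exceeding 6 words, 66 skipped for exceeding 64 tokens.
tracker = MiniEntityTracker(entity_max_length=6, model_max_length=64)
tracker.add(1332)
tracker.skipped_by_length.update({7: 7, 8: 2, 9: 2, 13: 2})
tracker.skipped_by_input_length = 66

print(f"{tracker.missed_percentage():.6f}%")
print("\n".join(tracker.length_breakdown()))
```

Every percentage is computed against the total number of annotated entities, so the per-length shares and the input-length share add up to the overall figure: (13 + 66) / 1332 of all entities.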
- class span_marker.tokenizer.SpanMarkerTokenizer(tokenizer, config, **kwargs)
Bases: object
- Parameters:
tokenizer (PreTrainedTokenizer) –
config (SpanMarkerConfig) –