acres.stats package
Package with modules to collect statistics from the gold standard (senses), the training corpus (stats), and a fixed sense inventory (dictionary).
Submodules
acres.stats.dictionary module
Module to collect metrics from a sense inventory. This module can be used to debug the sense inventory, e.g. by detecting extreme expansions. It can also be used to debug methods that rely on real data.
- acres.stats.dictionary.analyze_file(filename)
Analyzes a given dictionary file for extreme cases.
- Parameters
filename (str)
- Return type
None
- acres.stats.dictionary.edit_distance_generated_acro(acro, full)
Calculates the edit distance between the original acronym and the acronym generated from the full form.
- Parameters
acro (str)
full (str)
- Return type
Optional[Tuple]
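The idea behind this comparison can be illustrated with a minimal sketch. It is not the module's implementation: the `initials` generator, the plain Levenshtein distance, and the shape of the returned tuple are all assumptions made for illustration, since the documentation only states that the result is an `Optional[Tuple]`.

```python
from typing import Optional, Tuple


def initials(full: str) -> str:
    """Hypothetical generator: upper-cased first letter of each token."""
    return "".join(token[0].upper() for token in full.split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_distance_sketch(acro: str, full: str) -> Optional[Tuple[int, str, str]]:
    """Return (distance, acro, full), or None when the full form has no tokens.

    The tuple layout is an assumption; the real function may return other fields.
    """
    if not full.split():
        return None
    return levenshtein(acro.upper(), initials(full)), acro, full
```

Under these assumptions, a large distance flags pairs where the acronym can hardly be derived from the word initials of its expansion, which is one way to spot extreme entries in a sense inventory.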
- acres.stats.dictionary.expand(acronym, left_context='', right_context='')
- Parameters
acronym (str)
left_context (str)
right_context (str)
- Return type
List[str]
- acres.stats.dictionary.parse(filename)
Parses a tab-separated sense inventory into a Python dictionary.
- Parameters
filename (str)
- Return type
Dict[str, List[str]]
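As a rough illustration of what such a parser does, the sketch below reads a tab-separated file into a `Dict[str, List[str]]`. The column layout (acronym in the first column, expansion in the second) and the name `parse_inventory_sketch` are assumptions; the actual file format used by acres is not described here.

```python
from collections import defaultdict
from typing import Dict, List


def parse_inventory_sketch(filename: str) -> Dict[str, List[str]]:
    """Map each acronym to its list of expansions, in file order."""
    senses: Dict[str, List[str]] = defaultdict(list)
    with open(filename, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip lines without a tab-separated expansion
            acronym, expansion = fields[0].strip(), fields[1].strip()
            if expansion not in senses[acronym]:
                senses[acronym].append(expansion)  # drop exact duplicates
    return dict(senses)
```

Skipping malformed lines and deduplicating expansions are design choices of this sketch, not documented behavior.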
acres.stats.senses module
Module to estimate acronym ambiguity. It can be used to collect common acronym statistics, such as the number of senses per acronym.
- acres.stats.senses.bucketize(acronyms)
Reduce: calculate the number of different acronyms for each degree of ambiguity.
- Parameters
acronyms (Dict[str, Set[str]])
- Return type
Dict[int, int]
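The reduce step can be sketched in a few lines, assuming that the "degree of ambiguity" of an acronym is simply the size of its sense set. The name `bucketize_sketch` is hypothetical and only mirrors the documented signature.

```python
from collections import Counter
from typing import Dict, Set


def bucketize_sketch(acronyms: Dict[str, Set[str]]) -> Dict[int, int]:
    """Map each number of senses to the count of acronyms having that many."""
    return dict(Counter(len(senses) for senses in acronyms.values()))
```

For example, two acronyms with one sense each and one acronym with two senses yield `{1: 2, 2: 1}`.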
- acres.stats.senses.get_sense_buckets(filename)
Parses a reference standard and gets a map of senses per acronym.
- Parameters
filename (str)
- Return type
Dict[str, Set[str]]
- acres.stats.senses.map_senses_acronym(standard, lenient=False)
Map: collect senses for each acronym.
- Parameters
standard (Dict[str, Dict[str, int]])
lenient (bool) – Whether to consider partial matches (1) as a valid sense.
- Return type
Dict[str, Set[str]]
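A minimal sketch of the map step follows. It assumes that `standard` maps each acronym to a dictionary of candidate senses with an integer validity code, and that code 2 marks a full match; only the code 1 for partial matches comes from the parameter description above. The function name and the constant names are hypothetical.

```python
from typing import Dict, Set

FULL_MATCH = 2     # assumed code for a fully valid sense
PARTIAL_MATCH = 1  # code given in the documentation for partial matches


def map_senses_sketch(standard: Dict[str, Dict[str, int]],
                      lenient: bool = False) -> Dict[str, Set[str]]:
    """Collect the accepted senses of each acronym."""
    accepted = {FULL_MATCH, PARTIAL_MATCH} if lenient else {FULL_MATCH}
    senses: Dict[str, Set[str]] = {}
    for acronym, candidates in standard.items():
        valid = {sense for sense, code in candidates.items() if code in accepted}
        if valid:  # acronyms without any accepted sense are left out
            senses[acronym] = valid
    return senses
```

With `lenient=True`, an acronym may gain additional senses and therefore move to a higher ambiguity bucket.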
- acres.stats.senses.print_ambiguous(filename)
Prints ambiguous acronyms, i.e. those with more than one sense according to the reference standard.
- Parameters
filename (str)
- Return type
None
acres.stats.stats module
Module for calculating corpus statistics. It is used to measure the training/test dataset by, e.g., the number of tokens.
- class acres.stats.stats.Stats
Bases: object
Class that generates and holds stats about a given text.
- calc_stats(text)
Calculates statistics for a given text string and sets the results as variables.
- Parameters
text (str)
- Return type
None
- static count_acronyms(text)
Count the number of acronyms in a string.
Acronyms are as defined by the acronym.is_acronym() function.
- Parameters
text (str)
- Return type
int
- static count_acronyms_types(text)
Count the number of unique acronyms in a string.
Acronyms are as defined by the acronym.is_acronym() function.
- Parameters
text (str)
- Return type
int
- static count_chars(text)
Count the number of non-whitespace characters in a string.
- Parameters
text (str)
- Return type
int
- static count_sentences(text)
Count the number of sentences in a string.
A sentence is any string delimited by line_separator.
- Parameters
text (str)
- Return type
int
- static count_tokens(text)
Count the number of all tokens in a string.
- Parameters
text (str)
- Return type
int
- static count_types(text)
Count the number of unique tokens (types) in a string.
- Parameters
text (str)
- Return type
int
- source_line_separator = '\n'
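The counting methods of the class can be approximated with the standalone sketch below. It assumes whitespace tokenization and ignores blank lines when counting sentences; the real class may tokenize differently. All function names are hypothetical, and acronym counting is left out because it depends on acronym.is_acronym().

```python
from typing import List

LINE_SEPARATOR = "\n"  # same value as the documented source_line_separator


def sketch_tokens(text: str) -> List[str]:
    """Whitespace tokenization (an assumption of this sketch)."""
    return text.split()


def sketch_count_tokens(text: str) -> int:
    """Number of all tokens."""
    return len(sketch_tokens(text))


def sketch_count_types(text: str) -> int:
    """Number of unique tokens (types)."""
    return len(set(sketch_tokens(text)))


def sketch_count_chars(text: str) -> int:
    """Number of non-whitespace characters."""
    return sum(1 for char in text if not char.isspace())


def sketch_count_sentences(text: str) -> int:
    """Number of non-empty segments delimited by LINE_SEPARATOR."""
    return sum(1 for line in text.split(LINE_SEPARATOR) if line.strip())
```

The ratio of types to tokens obtained this way is a common quick check of the lexical variety of a corpus.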