acres.stats package
Package with modules to collect statistics from the gold-standard (senses), the training corpus (stats), and a fixed sense inventory (dictionary).
Submodules
acres.stats.dictionary module
Module to collect metrics from a sense inventory. This module can be used to debug the sense inventory, e.g. by detecting extreme expansions. It can also be used to debug methods that rely on real data.
acres.stats.dictionary.analyze_file(filename)
Analyzes a given dictionary file for extreme cases.
- Parameters: filename (str)
- Return type: None
acres.stats.dictionary.edit_distance_generated_acro(acro, full)
Calculates the edit distance between the original acronym and the generated acronym out of the full form.
- Parameters:
  - acro (str)
  - full (str)
- Return type: Optional[Tuple]
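The idea behind this metric can be sketched as follows. This is a minimal illustration, not the package's implementation: it assumes the generated acronym is built from the first letter of each whitespace-separated word, that the comparison is case-insensitive, and that the returned tuple holds the distance plus both acronyms. The names `levenshtein` and `generated_acronym_distance` are hypothetical.

```python
from typing import Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep on match)
            ))
        previous = current
    return previous[-1]


def generated_acronym_distance(acro: str, full: str) -> Optional[Tuple[int, str, str]]:
    """Hypothetical helper: compare `acro` with an acronym generated from `full`.

    Assumption: the generated acronym is the upper-cased initials of the words
    in `full`. Returns None when either input yields nothing to compare.
    """
    words = full.split()
    if not acro or not words:
        return None
    generated = "".join(word[0] for word in words).upper()
    return levenshtein(acro.upper(), generated), acro, generated


result = generated_acronym_distance("CT", "computed tomography scan")
```

A large distance between the two acronyms hints at an expansion that does not fit its acronym, which is the kind of extreme case this module is meant to surface.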
acres.stats.dictionary.expand(acronym, left_context='', right_context='')
- Parameters:
  - acronym (str)
  - left_context (str)
  - right_context (str)
- Return type: List[str]
acres.stats.dictionary.parse(filename)
Parse a tab-separated sense inventory as a Python dictionary.
- Parameters: filename (str)
- Return type: Dict[str, List[str]]
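As a rough sketch of what parsing a tab-separated sense inventory into a `Dict[str, List[str]]` might look like, the example below assumes one acronym and one expansion per line, with the acronym in the first column. The column layout and the name `parse_inventory` are assumptions for illustration; the actual file format is not specified here.

```python
import io
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_inventory(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Hypothetical parser: group expansions by acronym.

    Assumed layout: `<acronym>\t<expansion>` on each line.
    """
    inventory: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or malformed lines instead of failing.
        if len(fields) < 2:
            continue
        acronym, expansion = fields[0].strip(), fields[1].strip()
        inventory[acronym].append(expansion)
    return dict(inventory)


# In-memory stand-in for a dictionary file.
sample = io.StringIO(
    "EKG\tElektrokardiogramm\n"
    "EKG\tEchokardiogramm\n"
    "\n"
    "CT\tComputertomographie\n"
)
inventory = parse_inventory(sample)
```

Passing an open file handle instead of the `io.StringIO` object works the same way, since both yield lines when iterated.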
acres.stats.senses module
Module to estimate acronym ambiguity. It can be used to collect common acronym statistics, such as the number of senses per acronym.
acres.stats.senses.bucketize(acronyms)
Reduce: calculate the number of different acronyms for each degree of ambiguity.
- Parameters: acronyms (Dict[str, Set[str]])
- Return type: Dict[int, int]
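The reduce step can be illustrated with a short sketch. It assumes that the degree of ambiguity of an acronym is simply the size of its sense set, which matches the documented types but is not confirmed by the source; `bucketize_sketch` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Set


def bucketize_sketch(acronyms: Dict[str, Set[str]]) -> Dict[int, int]:
    """Map each degree of ambiguity (number of senses) to the number of acronyms having it."""
    return dict(Counter(len(senses) for senses in acronyms.values()))


senses_per_acronym = {
    "EKG": {"Elektrokardiogramm", "Echokardiogramm"},
    "CT": {"Computertomographie"},
    "MR": {"Magnetresonanz"},
}
buckets = bucketize_sketch(senses_per_acronym)
```

In this toy input, one acronym has two senses and two acronyms have a single sense.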
acres.stats.senses.get_sense_buckets(filename)
Parses a reference standard and gets a map of senses per acronym.
- Parameters: filename (str)
- Return type: Dict[str, Set[str]]
acres.stats.senses.map_senses_acronym(standard, lenient=False)
Map: collect senses for each acronym.
- Parameters:
  - standard (Dict[str, Dict[str, int]])
  - lenient (bool) – Whether to consider partial matches (1) as a valid sense.
- Return type: Dict[str, Set[str]]
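A possible shape of the map step is sketched below. Several details are assumptions made for illustration: that `standard` maps each acronym to its candidate expansions with an integer label, that label 2 marks a full match (only the value 1 for partial matches is documented), and that acronyms without any valid sense are omitted. `map_senses_sketch` is a hypothetical name.

```python
from typing import Dict, Set


def map_senses_sketch(standard: Dict[str, Dict[str, int]],
                      lenient: bool = False) -> Dict[str, Set[str]]:
    """Collect the valid senses of each acronym.

    Assumed labels: 2 = full match, 1 = partial match, anything lower = invalid.
    """
    threshold = 1 if lenient else 2
    senses: Dict[str, Set[str]] = {}
    for acronym, candidates in standard.items():
        valid = {sense for sense, label in candidates.items() if label >= threshold}
        if valid:  # assumption: skip acronyms with no valid sense
            senses[acronym] = valid
    return senses


standard = {
    "EKG": {"Elektrokardiogramm": 2, "Echokardiogramm": 1},
    "CT": {"Computertomographie": 2},
    "XY": {"unrelated": 0},
}
strict = map_senses_sketch(standard)
relaxed = map_senses_sketch(standard, lenient=True)
```

With `lenient=True` the partially matching expansion is kept, so the same acronym can move into a higher ambiguity bucket.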
acres.stats.senses.print_ambiguous(filename)
Prints ambiguous acronyms, i.e. those with more than one sense according to the reference standard.
- Parameters: filename (str)
- Return type: None
acres.stats.stats module
Module for calculating corpus statistics. It is used to measure the training/test dataset according to, e.g., number of tokens.
class acres.stats.stats.Stats
Bases: object
Class that generates and holds stats about a given text.
calc_stats(text)
Calculates statistics for a given text string and sets the results as variables.
- Parameters: text (str)
- Return type: None
static count_acronyms(text)
Count the number of acronyms in a string.
Acronyms are as defined by the acronym.is_acronym() function.
- Parameters: text (str)
- Return type: int
static count_acronyms_types(text)
Count the number of unique acronyms in a string.
Acronyms are as defined by the acronym.is_acronym() function.
- Parameters: text (str)
- Return type: int
static count_chars(text)
Count the number of non-whitespace chars in a string.
- Parameters: text (str)
- Return type: int
static count_sentences(text)
Count the number of sentences in a string.
Sentences are any string separated by line_separator.
- Parameters: text (str)
- Return type: int
static count_tokens(text)
Count the number of all tokens in a string.
- Parameters: text (str)
- Return type: int
static count_types(text)
Count the number of unique tokens (types) in a string.
- Parameters: text (str)
- Return type: int
source_line_separator = '\n'
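The counting methods above can be approximated with a few lines of plain Python. The sketch below is not the class's implementation: it assumes whitespace tokenization, treats `'\n'` as the sentence separator, and ignores blank lines when counting sentences. All function names are hypothetical stand-ins.

```python
def count_tokens_sketch(text: str) -> int:
    """Number of whitespace-separated tokens (tokenization is an assumption)."""
    return len(text.split())


def count_types_sketch(text: str) -> int:
    """Number of distinct whitespace-separated tokens."""
    return len(set(text.split()))


def count_chars_sketch(text: str) -> int:
    """Number of non-whitespace characters."""
    return sum(1 for char in text if not char.isspace())


def count_sentences_sketch(text: str, separator: str = "\n") -> int:
    """Number of non-blank segments between separators (skipping blanks is an assumption)."""
    return sum(1 for segment in text.split(separator) if segment.strip())


text = "ECG was normal.\nCT and ECG pending.\n"
summary = {
    "tokens": count_tokens_sketch(text),
    "types": count_types_sketch(text),
    "chars": count_chars_sketch(text),
    "sentences": count_sentences_sketch(text),
}
```

Because punctuation stays attached to words under whitespace tokenization, `normal.` and `normal` would count as different types; a real tokenizer may behave differently.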