acres.stats package
Package with modules to collect statistics from the gold-standard (senses), the training corpus (stats), and a fixed sense inventory (dictionary).
Submodules
acres.stats.dictionary module
Module to collect metrics from a sense inventory. This module can be used to debug the sense inventory, e.g. by detecting extreme expansions. It can also be used to debug methods that rely on real data.
acres.stats.dictionary.analyze_file(filename)
Analyzes a given dictionary file for extreme cases.
- Parameters
filename (str)
- Return type
None
acres.stats.dictionary.edit_distance_generated_acro(acro, full)
Calculates the edit distance between the original acronym and the acronym generated from the full form.
- Parameters
acro (str)
full (str)
- Return type
Optional[Tuple]
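As a rough illustration of the idea, the sketch below builds an acronym from the initial letters of the full form and compares it with the original acronym using Levenshtein distance. The names `generate_acronym`, `levenshtein`, and `acronym_distance` are hypothetical. The documented function returns an `Optional[Tuple]` whose contents are not described here, so this is not the library's implementation, only one plausible reading of the description.

```python
def generate_acronym(full: str) -> str:
    """Hypothetical helper: join the upper-cased initial of each word."""
    return "".join(word[0].upper() for word in full.split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def acronym_distance(acro: str, full: str) -> int:
    """Assumed behaviour: distance between `acro` and the initials of `full`."""
    return levenshtein(acro.upper(), generate_acronym(full))
```

A large distance under this reading would flag an inventory entry whose expansion does not obviously correspond to its acronym, which matches the module's stated purpose of detecting extreme cases.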
acres.stats.dictionary.expand(acronym, left_context='', right_context='')
- Parameters
acronym (str)
left_context (str)
right_context (str)
- Return type
List[str]
acres.stats.dictionary.parse(filename)
Parse a tab-separated sense inventory as a Python dictionary.
- Parameters
filename (str)
- Return type
Dict[str, List[str]]
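The exact file layout is not specified here. A minimal sketch, assuming each line holds an acronym, a tab, and one full form, and that repeated acronyms accumulate their full forms in order, could look like this. The function name `parse_lines` and the sample entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_lines(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group full forms by acronym from 'ACRONYM<TAB>full form' lines (assumed layout)."""
    senses: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        acronym, separator, expansion = line.partition("\t")
        if not separator:
            continue  # skip lines without a tab
        senses[acronym.strip()].append(expansion.strip())
    return dict(senses)


# Made-up sample lines, for illustration only.
sample = [
    "PE\tpulmonary embolism\n",
    "CT\tcomputed tomography\n",
    "PE\tphysical examination\n",
]
inventory = parse_lines(sample)
```

Working on an iterable of lines keeps the sketch independent of the file system; an open file object can be passed in directly, since iterating over it yields lines.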
acres.stats.senses module
Module to estimate acronym ambiguity. It can be used to collect common acronym statistics, such as the number of senses per acronym.
acres.stats.senses.bucketize(acronyms)
Reduce: calculate the number of different acronyms for each degree of ambiguity.
- Parameters
acronyms (Dict[str, Set[str]])
- Return type
Dict[int, int]
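The reduce step can be pictured as a histogram over sense counts: the key is the number of senses (the degree of ambiguity), the value is how many acronyms have that many senses. The sketch below follows the documented input and output types; the name `bucketize_sketch` and the sample data are hypothetical, and the real function may differ in details.

```python
from collections import Counter
from typing import Dict, Set


def bucketize_sketch(acronyms: Dict[str, Set[str]]) -> Dict[int, int]:
    """Count how many acronyms have 1 sense, 2 senses, and so on."""
    return dict(Counter(len(senses) for senses in acronyms.values()))


# Made-up sample: two acronyms with two senses each, one with a single sense.
sample = {
    "PE": {"pulmonary embolism", "physical examination"},
    "MS": {"multiple sclerosis", "mitral stenosis"},
    "CT": {"computed tomography"},
}
buckets = bucketize_sketch(sample)
```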
acres.stats.senses.get_sense_buckets(filename)
Parses a reference standard and gets a map of senses per acronym.
- Parameters
filename (str)
- Return type
Dict[str, Set[str]]
acres.stats.senses.map_senses_acronym(standard, lenient=False)
Map: collect senses for each acronym.
- Parameters
standard (Dict[str, Dict[str, int]])
lenient (bool) – Whether to consider partial matches (1) as a valid sense.
- Return type
Dict[str, Set[str]]
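The layout of `standard` is not described beyond its type. The sketch below assumes the outer key is the acronym and the inner dictionary maps each candidate expansion to a relevance score, with 2 meaning a full match and 1 a partial match. Only the meaning of 1 is stated above; the value 2 and the name `map_senses_sketch` are assumptions.

```python
from typing import Dict, Set


def map_senses_sketch(
    standard: Dict[str, Dict[str, int]], lenient: bool = False
) -> Dict[str, Set[str]]:
    """Collect the valid senses of each acronym under the assumed scoring."""
    # Assumption: 2 = full match, 1 = partial match, anything lower is invalid.
    threshold = 1 if lenient else 2
    senses: Dict[str, Set[str]] = {}
    for acronym, expansions in standard.items():
        valid = {exp for exp, score in expansions.items() if score >= threshold}
        if valid:
            senses[acronym] = valid
    return senses


# Made-up sample, for illustration only.
sample = {
    "PE": {"pulmonary embolism": 2, "physical exam": 1, "pe": 0},
    "CT": {"cat scan": 1},
}
strict = map_senses_sketch(sample)
relaxed = map_senses_sketch(sample, lenient=True)
```

In this reading, turning on `lenient` can only add senses, so an acronym may look more ambiguous under lenient counting than under strict counting.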
acres.stats.senses.print_ambiguous(filename)
Print ambiguous acronyms, i.e. those with more than one sense according to the reference standard.
- Parameters
filename (str)
- Return type
None
acres.stats.stats module
Module for calculating corpus statistics. It is used to measure the training/test dataset by, e.g., the number of tokens.
class acres.stats.stats.Stats
Bases: object
Class that generates and holds stats about a given text.
calc_stats(text)
Calculates statistics for a given text string and sets the results as variables.
- Parameters
text (str)
- Return type
None
static count_acronyms(text)
Count the number of acronyms in a string.
Acronyms are as defined by the acronym.is_acronym() function.
- Parameters
text (str)
- Return type
int
static count_acronyms_types(text)
Count the number of unique acronyms in a string.
Acronyms are as defined by the acronym.is_acronym() function.
- Parameters
text (str)
- Return type
int
static count_chars(text)
Count the number of non-whitespace chars in a string.
- Parameters
text (str)
- Return type
int
static count_sentences(text)
Count the number of sentences in a string.
A sentence is any substring delimited by line_separator.
- Parameters
text (str)
- Return type
int
static count_tokens(text)
Count the number of all tokens in a string.
- Parameters
text (str)
- Return type
int
static count_types(text)
Count the number of unique tokens (types) in a string.
- Parameters
text (str)
- Return type
int
source_line_separator = '\n'
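To make the counting methods concrete, here is a minimal standalone sketch of four of them. It assumes whitespace tokenization, case-sensitive types, and a newline as the sentence separator (the documented separator value). The functions are free-standing re-implementations written for illustration, not the `Stats` methods themselves, and the real tokenizer may behave differently.

```python
LINE_SEPARATOR = "\n"  # assumed to match the documented separator value '\n'


def count_chars(text: str) -> int:
    """Count characters that are not whitespace."""
    return sum(1 for char in text if not char.isspace())


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens (assumed tokenization)."""
    return len(text.split())


def count_types(text: str) -> int:
    """Count distinct whitespace-separated tokens (case-sensitive)."""
    return len(set(text.split()))


def count_sentences(text: str) -> int:
    """Count the pieces obtained by splitting on the line separator."""
    return len(text.split(LINE_SEPARATOR))


sample = "the cat sat\nthe dog ran"
summary = {
    "chars": count_chars(sample),
    "tokens": count_tokens(sample),
    "types": count_types(sample),
    "sentences": count_sentences(sample),
}
```

Note that splitting on the separator counts a trailing empty line as an extra sentence in this sketch; whether the library does the same is not stated here.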