acres.stats package

Package with modules to collect statistics from the gold-standard (senses), the training corpus (stats), and a fixed sense inventory (dictionary).

Submodules

acres.stats.dictionary module

Module to collect metrics from a sense inventory. This module can be used to debug the sense inventory, e.g. by detecting extreme expansions. It can also be used to debug methods that rely on real data.

acres.stats.dictionary.analyze_file(filename)

Analyzes a given dictionary file for extreme cases.

Parameters: filename (str) –
Return type: None
Returns

acres.stats.dictionary.edit_distance_generated_acro(acro, full)

Calculates the edit distance between the original acronym and the generated acronym out of the full form.

Parameters

acro (str) –
full (str) –

Return type

Optional[Tuple]

Returns
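To make the idea concrete, here is a minimal sketch of comparing an acronym with one generated from a full form. It is not the package's implementation: the helper names `levenshtein` and `edit_distance_sketch`, the choice of word initials as the "generated acronym", the shape of the returned tuple, and the use of None for empty input are all assumptions made for illustration.

```python
from typing import Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_distance_sketch(acro: str, full: str) -> Optional[Tuple[int, str, str]]:
    """Hypothetical stand-in: compare `acro` with the initials of `full`."""
    words = full.split()
    if not acro or not words:
        return None  # assumed meaning of the Optional return type
    # Assumption: the generated acronym is the upper-cased word initials.
    generated = "".join(word[0] for word in words).upper()
    return levenshtein(acro.upper(), generated), acro, full


print(edit_distance_sketch("COPD", "chronic obstructive pulmonary disease"))
print(edit_distance_sketch("EKG", "Elektrokardiogramm"))
```

Under these assumptions, a distance of 0 means the acronym is exactly the initials of the full form, while larger values flag pairs worth inspecting, such as single-word compounds.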

acres.stats.dictionary.expand(acronym, left_context='', right_context='')

Parameters

acronym (str) –
left_context (str) –
right_context (str) –

Return type

List[str]

Returns

acres.stats.dictionary.parse(filename)

Parse a tab-separated sense inventory as a Python dictionary.

Parameters: filename (str) –
Return type: Dict[str, List[str]]
Returns
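As an illustration of the return type, the following sketch turns tab-separated lines into a Dict[str, List[str]]. The column layout (acronym first, expansion second), the skipping of malformed lines, and the de-duplication are assumptions; `parse_lines` is a hypothetical helper that works on lines rather than a filename so that it can be tried without a file.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_lines(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Hypothetical sketch: map each acronym to its list of expansions."""
    inventory: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # assumption: lines without a tab are ignored
        acronym, expansion = fields[0].strip(), fields[1].strip()
        # Keep insertion order and skip repeated expansions.
        if acronym and expansion and expansion not in inventory[acronym]:
            inventory[acronym].append(expansion)
    return dict(inventory)


sample = [
    "MS\tmultiple sclerosis\n",
    "MS\tmitral stenosis\n",
    "not a valid line\n",
    "CT\tcomputed tomography\n",
]
print(parse_lines(sample))
```

Reading from a real file would only require passing an open file handle instead of the list, since a text file iterates line by line.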

acres.stats.dictionary.ratio_acro_words(acro, full)

Calculates the ratio of acronym length to the number of words in the full form.

Parameters

acro (str) –
full (str) –

Return type

Tuple

Returns
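The ratio can be sketched as follows. The contents of the returned tuple are not documented here, so the (ratio, acronym, full form) layout, the whitespace-based word split, and the 0.0 fallback for an empty full form are assumptions; `ratio_sketch` is a hypothetical name.

```python
from typing import Tuple


def ratio_sketch(acro: str, full: str) -> Tuple[float, str, str]:
    """Hypothetical sketch: acronym length divided by word count of `full`."""
    n_words = len(full.split())  # assumption: words are whitespace-separated
    if n_words == 0:
        return 0.0, acro, full   # assumption: avoid division by zero
    return len(acro) / n_words, acro, full


print(ratio_sketch("COPD", "chronic obstructive pulmonary disease"))
print(ratio_sketch("EKG", "Elektrokardiogramm"))
```

A ratio near 1 suggests one acronym letter per word; values far above or below 1 are the kind of extreme cases such a metric can surface.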

acres.stats.dictionary.show_extremes(txt, lst, lowest_n=10, highest_n=10)

Parameters

txt (str) –
lst (List) –
lowest_n (int) –
highest_n (int) –

Return type

None

Returns

acres.stats.senses module

Module to estimate acronym ambiguity. It can be used to collect common acronym statistics, such as senses/acronym.

acres.stats.senses.bucketize(acronyms)

Reduce: calculate the number of different acronyms for each degree of ambiguity.

Parameters: acronyms (Dict[str, Set[str]]) –
Return type: Dict[int, int]
Returns
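The reduce step described above can be sketched in a few lines. This is an illustration based only on the documented types, not the package's code; `bucketize_sketch` is a hypothetical name, and the degree of ambiguity is assumed to be the number of senses of an acronym.

```python
from collections import Counter
from typing import Dict, Set


def bucketize_sketch(acronyms: Dict[str, Set[str]]) -> Dict[int, int]:
    """Hypothetical sketch: map number-of-senses to number-of-acronyms."""
    return dict(Counter(len(senses) for senses in acronyms.values()))


senses_per_acronym = {
    "MS": {"multiple sclerosis", "mitral stenosis"},
    "RA": {"rheumatoid arthritis", "right atrium"},
    "CT": {"computed tomography"},
    "XY": set(),  # an acronym with no valid sense
}
print(bucketize_sketch(senses_per_acronym))
```

For this input the buckets are two acronyms with two senses, one with a single sense, and one with none.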

acres.stats.senses.get_sense_buckets(filename)

Parses a reference standard and gets a map of senses per acronym.

Parameters: filename (str) –
Return type: Dict[str, Set[str]]
Returns

acres.stats.senses.map_senses_acronym(standard, lenient=False)

Map: collect senses for each acronym.

Parameters

standard (Dict[str, Dict[str, int]]) –
lenient (bool) – Whether to consider partial matches (1) as a valid sense.

Return type

Dict[str, Set[str]]

Returns
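A possible reading of the map step is sketched below. Only the label 1 for partial matches comes from the parameter description; the assumption that the inner dictionary maps each candidate sense to an integer label, that 2 marks a full match, and that acronyms without any accepted sense keep an empty set are guesses made for illustration. `map_senses_sketch` and the two constants are hypothetical names.

```python
from typing import Dict, Set

FULL_MATCH = 2     # assumed label for a fully valid sense
PARTIAL_MATCH = 1  # label named in the parameter description


def map_senses_sketch(standard: Dict[str, Dict[str, int]],
                      lenient: bool = False) -> Dict[str, Set[str]]:
    """Hypothetical sketch: collect the accepted senses of each acronym."""
    accepted = {FULL_MATCH, PARTIAL_MATCH} if lenient else {FULL_MATCH}
    return {
        acronym: {sense for sense, label in senses.items() if label in accepted}
        for acronym, senses in standard.items()
    }


standard = {
    "MS": {"multiple sclerosis": 2, "mitral stenosis": 1, "microsoft": 0},
    "XY": {"unknown": 0},
}
print(map_senses_sketch(standard))
print(map_senses_sketch(standard, lenient=True))
```

With lenient=True the partial match is accepted as a second sense of "MS", which is exactly the kind of change that shifts an acronym into a higher ambiguity bucket.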

acres.stats.senses.print_ambiguous(filename)

Print ambiguous acronyms, the ones with more than one sense according to the reference standard.

Parameters: filename (str) –
Return type: None
Returns

acres.stats.senses.print_senses(filename)

Print the distribution of senses per acronym.

Parameters: filename (str) –
Return type: None
Returns

acres.stats.senses.print_undefined(filename)

Print undefined acronyms, the ones with no valid sense according to the reference standard.

Parameters: filename (str) –
Return type: None
Returns

acres.stats.stats module

Module for calculating corpus statistics. It is used to measure the training/test dataset according to, e.g., number of tokens.

class acres.stats.stats.Stats

Bases: object

Class that generates and holds stats about a given text.

calc_stats(text)

Calculates statistics for a given text string and sets the results as variables.

Parameters: text (str) –
Return type: None
Returns

static count_acronyms(text)

Count the number of acronyms in a string.

Acronyms are as defined by the acronym.is_acronym() function.

Parameters: text (str) –
Return type: int
Returns

static count_acronyms_types(text)

Count the number of unique acronyms in a string.

Acronyms are as defined by the acronym.is_acronym() function.

Parameters: text (str) –
Return type: int
Returns

static count_chars(text)

Count the number of non-whitespace chars in a string.

Parameters: text (str) –
Return type: int
Returns

static count_sentences(text)

Count the number of sentences in a string.

Sentences are any string separated by line_separator.

Parameters: text (str) –
Return type: int
Returns

static count_tokens(text)

Count the number of all tokens in a string.

Parameters: text (str) –
Return type: int
Returns

static count_types(text)

Count the number of unique tokens (types) in a string.

Parameters: text (str) –
Return type: int
Returns

source_line_separator = '\n'
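The character, token, type, and sentence counters can be approximated with plain string operations. The sketch below is not the class itself: whitespace tokenization, case-sensitive types, and the skipping of blank lines when counting sentences are assumptions, and the acronym counters are left out because they depend on acronym.is_acronym(), which is not described here.

```python
LINE_SEPARATOR = "\n"  # mirrors the documented source_line_separator value


def count_chars(text: str) -> int:
    """Number of non-whitespace characters."""
    return sum(1 for char in text if not char.isspace())


def count_tokens(text: str) -> int:
    """Number of tokens, assuming whitespace tokenization."""
    return len(text.split())


def count_types(text: str) -> int:
    """Number of unique tokens (case-sensitive by assumption)."""
    return len(set(text.split()))


def count_sentences(text: str) -> int:
    """Number of non-blank segments between line separators (assumption)."""
    return len([line for line in text.split(LINE_SEPARATOR) if line.strip()])


text = "The EKG was normal .\nThe EKG was repeated .\n"
print(count_chars(text), count_tokens(text), count_types(text), count_sentences(text))
```

For the two-line sample this yields 34 characters, 10 tokens, 6 types, and 2 sentences.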

acres.stats.stats.get_stats(corpus_path)

Generates all statistics from a given corpus directory.

Parameters: corpus_path (str) –
Return type: List[Stats]
Returns: A list of statistics objects, one for each file found in the corpus dir, plus an extra one for the full corpus.
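The "one entry per file plus one for the full corpus" layout of the result can be sketched as follows. This is an assumption-laden illustration: plain dictionaries stand in for Stats objects, files are read as UTF-8 in sorted name order, and subdirectories are skipped; `summarize` and `get_stats_sketch` are hypothetical names.

```python
import os
from typing import Dict, List


def summarize(text: str) -> Dict[str, int]:
    """Hypothetical stand-in for a Stats object: just two counters."""
    tokens = text.split()
    return {"tokens": len(tokens), "types": len(set(tokens))}


def get_stats_sketch(corpus_path: str) -> List[Dict[str, int]]:
    """Hypothetical sketch: per-file summaries followed by a corpus summary."""
    per_file = []
    texts = []
    for name in sorted(os.listdir(corpus_path)):
        path = os.path.join(corpus_path, name)
        if not os.path.isfile(path):
            continue  # assumption: subdirectories are not traversed
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
        per_file.append(summarize(text))
        texts.append(text)
    # The extra, last entry covers the concatenation of all files.
    return per_file + [summarize("\n".join(texts))]
```

Computing the corpus entry from the concatenated text, rather than summing the per-file entries, matters for counts such as types: a token that occurs in several files must be counted only once for the full corpus.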

acres.stats.stats.print_stats()

Generates and prints statistics from the default corpus set in config.

Return type: None
Returns: None