acres.stats package

Package with modules to collect statistics from the gold standard (senses), the training corpus (stats), and a fixed sense inventory (dictionary).

Submodules

acres.stats.dictionary module

Module to collect metrics from a sense inventory. This module can be used to debug the sense inventory, e.g., by detecting extreme expansions. It can also be used to debug methods that rely on real data.

acres.stats.dictionary.analyze_file(filename)[source]

Analyzes a given dictionary file for extreme cases.

Parameters

filename (str) –

Return type

None

Returns

acres.stats.dictionary.edit_distance_generated_acro(acro, full)[source]

Calculates the edit distance between the original acronym and the generated acronym out of the full form.

Parameters
  • acro (str) –

  • full (str) –

Return type

Optional[Tuple]

Returns
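Example (illustrative sketch, not the package's source): the description above does not say how the acronym is generated from the full form or what the returned tuple holds. The sketch below assumes the generated acronym is the upper-cased first letter of each whitespace-separated word, that the distance is the Levenshtein distance, and that the result is a hypothetical (distance, acro, full) tuple, with None for an empty full form. The names levenshtein and edit_distance_sketch are invented for this illustration.

```python
from typing import Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_distance_sketch(acro: str, full: str) -> Optional[Tuple[int, str, str]]:
    """Compare an acronym with one generated from the full form.

    Assumption: the generated acronym is built from the first letter of
    each whitespace-separated word. The real function may differ.
    """
    words = full.split()
    if not words:
        return None  # nothing to generate an acronym from
    generated = "".join(word[0] for word in words).upper()
    return levenshtein(acro.upper(), generated), acro, full
```

Under these assumptions, a large distance flags dictionary entries whose expansion does not spell out the acronym letter by letter, which is the kind of extreme case the module is meant to surface.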

acres.stats.dictionary.expand(acronym, left_context='', right_context='')[source]

Parameters
  • acronym (str) –

  • left_context (str) –

  • right_context (str) –

Return type

List[str]

Returns

acres.stats.dictionary.parse(filename)[source]

Parse a tab-separated sense inventory as a Python dictionary.

Parameters

filename (str) –

Return type

Dict[str, List[str]]

Returns
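Example (illustrative sketch, not the package's source): one way a tab-separated sense inventory could be read into a Dict[str, List[str]]. The file layout is an assumption: one `acronym<TAB>expansion` pair per line, with repeated acronyms accumulating their expansions. The name parse_sketch is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def parse_sketch(filename: str) -> Dict[str, List[str]]:
    """Read `acronym<TAB>expansion` lines into a dict of expansion lists.

    Assumption: the first column is the acronym and the second column is
    one expansion. Any further columns are ignored.
    """
    inventory: Dict[str, List[str]] = defaultdict(list)
    with open(filename, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            acronym, expansion = fields[0].strip(), fields[1].strip()
            if expansion not in inventory[acronym]:
                inventory[acronym].append(expansion)  # keep first-seen order
    return dict(inventory)
```

Duplicate expansions are dropped here so that each list holds distinct senses; whether the real parser does the same is not stated in the source.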

acres.stats.dictionary.ratio_acro_words(acro, full)[source]

Calculates the ratio of acronym length to the number of words in the full form.

Parameters
  • acro (str) –

  • full (str) –

Return type

Tuple

Returns
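Example (illustrative sketch, not the package's source): the ratio described above, computed as the number of characters in the acronym divided by the number of whitespace-separated words in the full form. The layout of the returned tuple is not documented, so the (ratio, acro, full) shape and the 0.0 ratio for an empty full form are assumptions, as is the name ratio_sketch.

```python
from typing import Tuple


def ratio_sketch(acro: str, full: str) -> Tuple[float, str, str]:
    """Return (len(acro) / word count of full, acro, full).

    Assumption: words are whitespace-separated, and an empty full form
    yields a ratio of 0.0 instead of raising ZeroDivisionError.
    """
    n_words = len(full.split())
    ratio = len(acro) / n_words if n_words else 0.0
    return ratio, acro, full
```

A ratio near 1.0 suggests one acronym letter per word, while very high or very low values point to entries worth inspecting by hand.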

acres.stats.dictionary.show_extremes(txt, lst, lowest_n=10, highest_n=10)[source]

Parameters
  • txt (str) –

  • lst (List) –

  • lowest_n (int) –

  • highest_n (int) –

Return type

None

Returns

acres.stats.senses module

Module to estimate acronym ambiguity. It can be used to collect common acronym statistics, such as senses/acronym.

acres.stats.senses.bucketize(acronyms)[source]

Reduce: calculate the number of different acronyms for each degree of ambiguity.

Parameters

acronyms (Dict[str, Set[str]]) –

Return type

Dict[int, int]

Returns
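Example (illustrative sketch, not the package's source): the reduce step can be pictured as counting how many acronyms share each degree of ambiguity, where the degree is taken to be the number of senses in an acronym's set. The name bucketize_sketch is hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def bucketize_sketch(acronyms: Dict[str, Set[str]]) -> Dict[int, int]:
    """Map each degree of ambiguity (sense count) to the number of acronyms."""
    return dict(Counter(len(senses) for senses in acronyms.values()))
```

For instance, an input with one acronym carrying two senses and two acronyms carrying one sense each yields a bucket of size 1 for degree 2 and a bucket of size 2 for degree 1.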

acres.stats.senses.get_sense_buckets(filename)[source]

Parses a reference standard and gets a map of senses per acronym.

Parameters

filename (str) –

Return type

Dict[str, Set[str]]

Returns

acres.stats.senses.map_senses_acronym(standard, lenient=False)[source]

Map: collect senses for each acronym.

Parameters
  • standard (Dict[str, Dict[str, int]]) –

  • lenient (bool) – Whether to consider partial matches (1) as a valid sense.

Return type

Dict[str, Set[str]]

Returns
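Example (illustrative sketch, not the package's source): the map step, under assumptions about the shape of `standard`. Here the outer key is taken to be the acronym, the inner key a candidate sense, and the integer an annotation score in which 1 marks a partial match (as stated for `lenient`) and 2 marks a full match (an assumption). Acronyms without any valid sense are kept with an empty set. The name map_senses_sketch is hypothetical.

```python
from typing import Dict, Set


def map_senses_sketch(standard: Dict[str, Dict[str, int]],
                      lenient: bool = False) -> Dict[str, Set[str]]:
    """Collect the valid senses of each acronym.

    Assumption: score 2 = full match, 1 = partial match, 0 = invalid.
    With lenient=True, partial matches also count as valid senses.
    """
    minimum = 1 if lenient else 2
    return {
        acronym: {sense for sense, score in senses.items() if score >= minimum}
        for acronym, senses in standard.items()
    }
```

Keeping acronyms with empty sense sets means the same mapping can serve both an ambiguity count (sets with more than one sense) and a check for undefined acronyms (empty sets).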

acres.stats.senses.print_ambiguous(filename)[source]

Print ambiguous acronyms, the ones with more than one sense according to the reference standard.

Parameters

filename (str) –

Return type

None

Returns

acres.stats.senses.print_senses(filename)[source]

Print the distribution of senses per acronym.

Parameters

filename (str) –

Return type

None

Returns

acres.stats.senses.print_undefined(filename)[source]

Print undefined acronyms, the ones with no valid sense according to the reference standard.

Parameters

filename (str) –

Return type

None

Returns

acres.stats.stats module

Module for calculating corpus statistics. It is used to measure the training/test dataset in terms of, e.g., the number of tokens.

class acres.stats.stats.Stats[source]

Bases: object

Class that generates and holds stats about a given text.

calc_stats(text)[source]

Calculates statistics for a given text string and sets the results as variables.

Parameters

text (str) –

Return type

None

Returns

static count_acronyms(text)[source]

Count the number of acronyms in a string.

Acronyms are as defined by the acronym.is_acronym() function.

Parameters

text (str) –

Return type

int

Returns

static count_acronyms_types(text)[source]

Count the number of unique acronyms in a string.

Acronyms are as defined by the acronym.is_acronym() function.

Parameters

text (str) –

Return type

int

Returns

static count_chars(text)[source]

Count the number of non-whitespace chars in a string.

Parameters

text (str) –

Return type

int

Returns

static count_sentences(text)[source]

Count the number of sentences in a string.

A sentence is any string separated by line_separator.

Parameters

text (str) –

Return type

int

Returns

static count_tokens(text)[source]

Count the number of all tokens in a string.

Parameters

text (str) –

Return type

int

Returns

static count_types(text)[source]

Count the number of unique tokens (types) in a string.

Parameters

text (str) –

Return type

int

Returns
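Example (illustrative sketch, not the package's source): how the character, sentence, token, and type counters could be realized. Tokenization on whitespace, the newline separator, and the skipping of empty sentences are assumptions; the class name StatsSketch is hypothetical, and the acronym counters are omitted because they depend on acronym.is_acronym().

```python
class StatsSketch:
    """Stand-alone approximation of the simple counters described above."""

    line_separator = "\n"  # assumption: one sentence per line

    @staticmethod
    def count_chars(text: str) -> int:
        # Every character that is not whitespace counts.
        return sum(1 for char in text if not char.isspace())

    @staticmethod
    def count_sentences(text: str) -> int:
        # Assumption: empty segments (e.g. after a trailing newline) are skipped.
        parts = text.split(StatsSketch.line_separator)
        return sum(1 for part in parts if part.strip())

    @staticmethod
    def count_tokens(text: str) -> int:
        # Assumption: tokens are whitespace-separated.
        return len(text.split())

    @staticmethod
    def count_types(text: str) -> int:
        # Types are distinct tokens; comparison is case-sensitive here.
        return len(set(text.split()))
```

On a pre-tokenized corpus with one sentence per line, these counters only need str.split(), which keeps the measurements independent of any particular tokenizer.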

source_line_separator = '\n'

acres.stats.stats.get_stats(corpus_path)[source]

Generates all statistics from a given corpus directory.

Parameters

corpus_path (str) –

Return type

List[Stats]

Returns

A list of statistics objects, one for each file found in the corpus dir, plus an extra one for the full corpus.

acres.stats.stats.print_stats()[source]

Generates and prints statistics from the default corpus set in config.

Return type

None

Returns

None