acres.util package

Package with general utility modules.

Submodules

acres.util.acronym module

Utility functions related to acronyms.

class acres.util.acronym.Acronym(acronym, left_context, right_context)

Bases: tuple

property acronym

Alias for field number 0

property left_context

Alias for field number 1

property right_context

Alias for field number 2
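The tuple base and the numbered field aliases above match what `collections.namedtuple` generates, so an equivalent type can be sketched as follows. The sample values are made up for illustration and are not taken from acres.

```python
from collections import namedtuple

# Sketch of an equivalent type; field order follows the aliases listed above.
Acronym = namedtuple("Acronym", ["acronym", "left_context", "right_context"])

# Illustrative values only (hypothetical sentence fragment).
example = Acronym(acronym="EKG", left_context="Das", right_context="zeigt")

print(example.acronym)  # access by name (field 0)
print(example[1])       # access by index, same as example.left_context
```

Because the class is a plain tuple, instances are immutable and can be unpacked or compared like any other tuple.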

acres.util.acronym.create_german_acronym(full)[source]

Creates an acronym out of a given multi-word expression.

@todo Use is_stopword?

Parameters

full (str) – A full form containing whitespaces.

Return type

str

Returns
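As a rough illustration of the idea, one way to build an acronym from a multi-word expression is to join the upper-cased first letter of every word. This is a sketch under that assumption, not the acres implementation; the name `create_german_acronym_sketch` is hypothetical.

```python
def create_german_acronym_sketch(full: str) -> str:
    """Join the upper-cased first letter of each whitespace-separated word.

    Assumption: every word contributes exactly one letter, stopwords included.
    """
    return "".join(word[0].upper() for word in full.split())


print(create_german_acronym_sketch("akutes Nierenversagen"))
```

Filtering stopwords first, as the todo note suggests, would change the result for expressions that contain articles or prepositions.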

acres.util.acronym.is_acronym(str_probe, max_length=7)[source]

Identifies acronyms, restricted by absolute length.

XXX Look for “authoritative” definitions of acronyms.

Parameters
  • str_probe (str) –

  • max_length (int) –

Return type

bool

Returns
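The exact criteria are not documented here beyond the length limit. The following sketch shows one plausible heuristic; the alphanumeric and uppercase checks are assumptions added for illustration, and `is_acronym_sketch` is a hypothetical name.

```python
def is_acronym_sketch(str_probe: str, max_length: int = 7) -> bool:
    """Heuristic acronym test limited by absolute length.

    Assumptions (not confirmed by the documentation): the candidate must be
    alphanumeric and contain at least one uppercase letter.
    """
    if not 1 <= len(str_probe) <= max_length:
        return False
    if not str_probe.isalnum():
        return False
    return any(ch.isupper() for ch in str_probe)


print(is_acronym_sketch("EKG"))
print(is_acronym_sketch("Elektrokardiogramm"))
```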

acres.util.acronym.trim_plural(acronym)[source]

Trim the German plural form out of an acronym.

@todo rewrite as regex

Parameters

acronym (str) –

Return type

str

Returns
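In line with the todo note, a regex-based variant might look like the sketch below. It assumes the plural is marked by a single lowercase "s" directly after an uppercase letter (for example "EKGs"); other plural forms are not handled, and `trim_plural_sketch` is a hypothetical name.

```python
import re

# A trailing lowercase "s" that directly follows an uppercase letter.
_PLURAL_S = re.compile(r"(?<=[A-ZÄÖÜ])s$")


def trim_plural_sketch(acronym: str) -> str:
    """Drop a plural "s" suffix; leave every other input unchanged."""
    return _PLURAL_S.sub("", acronym)


print(trim_plural_sketch("EKGs"))
```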

acres.util.functions module

Module with general functions.

acres.util.functions.create_ngram_statistics(input_string, n_min, n_max)[source]

Creates a dictionary that counts each nGram in an input string. Delimiters are spaces.

Example: for bigrams and trigrams, n_min = 2 and n_max = 3.

PROBE: # print(WordNgramStat(‘a ab aa a a a ba ddd’, 1, 4))

Parameters
  • input_string (str) –

  • n_min (int) –

  • n_max (int) –

Return type

Dict[str, int]

Returns
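The counting described above can be sketched with `collections.Counter`. This is a minimal illustration, assuming n-grams are contiguous word sequences joined by a single space and that empty tokens from repeated spaces are ignored; `create_ngram_statistics_sketch` is a hypothetical name.

```python
from collections import Counter
from typing import Dict


def create_ngram_statistics_sketch(input_string: str, n_min: int, n_max: int) -> Dict[str, int]:
    """Count every word n-gram with n_min <= n <= n_max in a space-delimited string."""
    # Assumption: empty tokens produced by repeated spaces are dropped.
    tokens = [token for token in input_string.split(" ") if token]
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return dict(counts)


stats = create_ngram_statistics_sketch("a b a b", 1, 2)
print(stats)
```

For the input "a b a b" with n from 1 to 2, the unigrams "a" and "b" each occur twice, the bigram "a b" occurs twice, and "b a" occurs once.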

acres.util.functions.import_conf(key)[source]

Parameters

key (str) –

Return type

Optional[str]

Returns

acres.util.functions.is_stopword(str_in)[source]

Tests whether a word is a stopword, according to a list.

For German, source http://snowball.tartarus.org/algorithms/german/stop.txt

Parameters

str_in (str) –

Return type

bool

Returns
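A list-based stopword test reduces to a set lookup. The sketch below uses a small subset of the German Snowball list for illustration; the case-insensitive comparison is an assumption, and `is_stopword_sketch` is a hypothetical name.

```python
# Tiny illustrative subset; the referenced Snowball list is much longer.
_GERMAN_STOPWORDS = frozenset({"aber", "das", "der", "die", "nicht", "und"})


def is_stopword_sketch(str_in: str) -> bool:
    """Return True if the word is in the stopword set (assumed case-insensitive)."""
    return str_in.lower() in _GERMAN_STOPWORDS


print(is_stopword_sketch("und"))
print(is_stopword_sketch("Herz"))
```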

acres.util.functions.partition(word, partitions)[source]

Find a bucket for a given word.

Parameters
  • word (str) –

  • partitions (int) –

Return type

int

Returns
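The bucketing strategy is not documented here. One common approach, shown purely as an assumption, is to hash the word and take the result modulo the number of partitions. The sketch uses `zlib.crc32` rather than the built-in `hash`, because string hashes are randomized per process by default; `partition_sketch` is a hypothetical name.

```python
import zlib


def partition_sketch(word: str, partitions: int) -> int:
    """Map a word to a bucket index in range(partitions) using a stable hash."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    return zlib.crc32(word.encode("utf-8")) % partitions


bucket = partition_sketch("Herz", 4)
print(bucket)
```

The same word always lands in the same bucket across runs, which matters when partitions are written to separate files.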

acres.util.functions.robust_text_import_from_dir(path)[source]

Read the content of valid text files from a path into a list of strings.

Parameters

path (str) – The path to look for documents.

Return type

List[str]

Returns

A list of strings containing the content of each valid file.
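A minimal sketch of such a tolerant import is given below. It assumes that "valid" means "decodes as UTF-8" and that subdirectories are skipped; both are assumptions, and `robust_text_import_sketch` is a hypothetical name. The demo builds a temporary directory with one readable and one undecodable file.

```python
import os
import tempfile
from typing import List


def robust_text_import_sketch(path: str) -> List[str]:
    """Read every file in a directory that decodes as UTF-8; skip the rest."""
    texts = []
    for name in sorted(os.listdir(path)):
        file_path = os.path.join(path, name)
        if not os.path.isfile(file_path):
            continue  # ignore subdirectories
        try:
            with open(file_path, encoding="utf-8") as handle:
                texts.append(handle.read())
        except UnicodeDecodeError:
            continue  # assumption: undecodable files count as invalid
    return texts


with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "a.txt"), "w", encoding="utf-8") as good:
        good.write("hallo")
    with open(os.path.join(demo_dir, "b.bin"), "wb") as bad:
        bad.write(b"\xff\xfe\xfa")  # not valid UTF-8
    demo_texts = robust_text_import_sketch(demo_dir)

print(demo_texts)
```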

acres.util.functions.sample(iterable, chance)[source]

Randomly sample items from an iterable with a given chance.

Parameters
  • iterable (Iterable) –

  • chance (float) –

Return type

Iterable

Returns
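Sampling with a given chance can be sketched as a generator that keeps each item independently with probability `chance`. This is an illustration of the technique rather than the acres code; `sample_sketch` is a hypothetical name.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def sample_sketch(iterable: Iterable[T], chance: float) -> Iterator[T]:
    """Yield each item independently with probability `chance`."""
    for item in iterable:
        # random.random() returns a float in [0.0, 1.0).
        if random.random() < chance:
            yield item


kept = list(sample_sketch(range(1000), 0.1))
print(len(kept))
```

Because the function is lazy, it works on iterables that are too large to hold in memory; the number of items returned varies from run to run.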

acres.util.text module

Utility functions related to text processing.

acres.util.text.clean(text, preserve_linebreaks=False)[source]

Clean a given text to preserve only alphabetic characters, spaces, and, optionally, line breaks.

Parameters
  • text (str) –

  • preserve_linebreaks (bool) –

Return type

str

Returns
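The filtering described above might be sketched as follows. It assumes that disallowed characters are removed rather than replaced by spaces, which the documentation does not specify; `clean_sketch` is a hypothetical name.

```python
def clean_sketch(text: str, preserve_linebreaks: bool = False) -> str:
    """Keep alphabetic characters and spaces, plus line breaks on request."""
    kept = []
    for ch in text:
        if ch.isalpha() or ch == " ":
            kept.append(ch)
        elif ch == "\n" and preserve_linebreaks:
            kept.append(ch)
        # Assumption: everything else is dropped, not replaced.
    return "".join(kept)


print(clean_sketch("Vitamin B12!\nOK"))
print(clean_sketch("Vitamin B12!\nOK", preserve_linebreaks=True))
```

`str.isalpha` is Unicode-aware, so German umlauts and "ß" are preserved.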

acres.util.text.clean_whitespaces(whitespaced)[source]

Remove repeated and trailing whitespaces from an input string.

Parameters

whitespaced (str) –

Return type

str

Returns
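A compact way to express this cleanup is a regular expression followed by a strip. The sketch assumes that any run of whitespace collapses to a single space and that both ends are trimmed; `clean_whitespaces_sketch` is a hypothetical name.

```python
import re


def clean_whitespaces_sketch(whitespaced: str) -> str:
    """Collapse whitespace runs to one space and trim both ends (assumed behavior)."""
    return re.sub(r"\s+", " ", whitespaced).strip()


print(repr(clean_whitespaces_sketch("  a   b \n")))
```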

acres.util.text.clear_digits(str_in, substitute_char)[source]

Substitutes all digits by a character (or string).

Example: ClearDigits(“Vitamin B12”, “°”):

TODO rewrite as regex

Parameters
  • str_in (str) –

  • substitute_char (str) –

Return type

str
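Following the todo note, the substitution can be sketched with a regular expression. The sketch assumes only the ASCII digits 0 to 9 are replaced; `clear_digits_sketch` is a hypothetical name. A function is passed as the replacement so that backslashes in `substitute_char` are not interpreted as escapes.

```python
import re


def clear_digits_sketch(str_in: str, substitute_char: str) -> str:
    """Replace every ASCII digit with `substitute_char`."""
    return re.sub(r"[0-9]", lambda _match: substitute_char, str_in)


print(clear_digits_sketch("Vitamin B12", "°"))
```

With the arguments from the example call in the description, each of the two digits is replaced, giving "Vitamin B°°".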

acres.util.text.reduce_repeated_chars(str_in, char, remaining_chars)[source]

Parameters
  • str_in (str) – text to be cleaned

  • char (str) – character that should not occur more than remaining_chars times in sequence

  • remaining_chars (int) – maximum number of consecutive occurrences of char to keep

Return type

str

Returns
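Based on the parameter descriptions, the reduction can be sketched with a bounded-repetition regex: any run of `char` longer than `remaining_chars` is shortened to exactly `remaining_chars` occurrences. This is an illustration under that reading, and `reduce_repeated_chars_sketch` is a hypothetical name.

```python
import re


def reduce_repeated_chars_sketch(str_in: str, char: str, remaining_chars: int) -> str:
    """Shorten runs of `char` longer than `remaining_chars` to that length."""
    # Match runs of at least remaining_chars + 1 occurrences of char.
    pattern = "(?:%s){%d,}" % (re.escape(char), remaining_chars + 1)
    return re.sub(pattern, lambda _match: char * remaining_chars, str_in)


print(reduce_repeated_chars_sketch("a----b--c", "-", 2))
```

`re.escape` keeps characters such as "." or "*" from being treated as regex metacharacters.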

acres.util.text.remove_duplicated_whitespaces(whitespaced)[source]

Clean up an input string by removing any number of repeated whitespaces.

Parameters

whitespaced (str) –

Return type

str

Returns