acres.preprocess package

Package containing modules for pre-processing the corpus and a resource factory to easily access pre-processed files.

Submodules

acres.preprocess.dumps module

Module to process the corpus training data and create data structures for faster retrieval.

acres.preprocess.dumps.create_corpus_ngramstat_dump(corpus_path, min_freq, min_length=1, max_length=7)

Takes a corpus consisting of text files in a single directory and substitutes digits and line breaks. All documents must be UTF-8 encoded text.

Parameters
  • corpus_path (str) –

  • min_freq (int) –

  • min_length (int) –

  • max_length (int) –

Return type

Dict[str, int]

Returns
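
The return value is not described here. As a rough illustration of the kind of Dict[str, int] mapping the signature suggests, the following is a minimal sketch and not the library's implementation. The helper name count_ngrams, the "#" digit placeholder, whitespace tokenization, and the reading of min_length and max_length as token counts are all assumptions made for this example.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def count_ngrams(texts: Iterable[str], min_freq: int,
                 min_length: int = 1, max_length: int = 7) -> Dict[str, int]:
    """Count word n-grams across documents (hypothetical helper, illustration only)."""
    counts: Counter = Counter()
    for text in texts:
        # Assumed normalization: collapse line breaks and other whitespace
        # into single spaces, then replace every digit with a placeholder.
        text = re.sub(r"\s+", " ", text.strip())
        text = re.sub(r"\d", "#", text)
        tokens = text.split(" ") if text else []
        # Assumed meaning of min_length/max_length: n-gram size in tokens.
        for n in range(min_length, max_length + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    # Keep only n-grams that occur at least min_freq times.
    return {ngram: freq for ngram, freq in counts.items() if freq >= min_freq}
```

In this sketch the documents are passed as strings; reading UTF-8 files from corpus_path is left out to keep the example short.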

acres.preprocess.dumps.create_indexed_ngrams(ngrams)

Create an indexed version of an ngram list. This adds a unique identifier to every (str, int) tuple.

Parameters

ngrams (Dict[str, int]) –

Return type

Dict[int, Tuple[int, str]]

Returns
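
The shape of the returned structure can be pictured with a short sketch based only on the type hints above. The helper name index_ngrams and the choice of consecutive identifiers starting at 1 are assumptions for illustration; the actual numbering scheme is not stated here.

```python
from typing import Dict, Tuple


def index_ngrams(ngrams: Dict[str, int]) -> Dict[int, Tuple[int, str]]:
    """Attach an integer identifier to each (ngram, frequency) pair (hypothetical helper)."""
    indexed: Dict[int, Tuple[int, str]] = {}
    # Assumption: identifiers are consecutive integers starting at 1,
    # assigned in the iteration order of the input dictionary.
    for identifier, (ngram, freq) in enumerate(ngrams.items(), start=1):
        # Values follow the documented Tuple[int, str] order: frequency first.
        indexed[identifier] = (freq, ngram)
    return indexed
```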

acres.preprocess.resource_factory module

Resource factory. This module provides methods for lazily loading resources.

acres.preprocess.resource_factory.get_center_map(partition=0)

Lazy load the fast n-gram center map model.

Return type

CenterMap

Returns

acres.preprocess.resource_factory.get_context_map(partition=0)

Lazy load the fast n-gram context map model.

Return type

ContextMap

Returns

acres.preprocess.resource_factory.get_dictionary()

Lazy load the sense inventory.

Return type

Dict[str, List[str]]

Returns

acres.preprocess.resource_factory.get_ngramstat()

Lazy load an indexed representation of ngrams.

Loading order is as follows: 1. Variable; 2. Pickle file; 3. Generation.

Return type

Dict[int, Tuple[int, str]]

Returns

A dictionary of identifiers mapped to ngrams. Ngrams are tuples with the frequency and the corresponding ngram.
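
The three-step loading order (variable, pickle file, generation) can be sketched as follows. This is a generic illustration of the pattern, not the module's code: the names _CACHE, lazy_load, and reset_cache are hypothetical, and writing the pickle after generation is an assumption.

```python
import os
import pickle
from typing import Callable, Dict, Optional

# Step 1 relies on a module-level variable that holds the loaded resource.
_CACHE: Optional[Dict[str, int]] = None


def lazy_load(pickle_path: str,
              generate: Callable[[], Dict[str, int]]) -> Dict[str, int]:
    """Return the resource from memory, else from a pickle, else generate it."""
    global _CACHE
    # 1. Variable: reuse the in-memory copy if one exists.
    if _CACHE is not None:
        return _CACHE
    # 2. Pickle file: load a previously stored copy from disk.
    if os.path.isfile(pickle_path):
        with open(pickle_path, "rb") as handle:
            _CACHE = pickle.load(handle)
        return _CACHE
    # 3. Generation: build the resource and (by assumption) persist it.
    _CACHE = generate()
    with open(pickle_path, "wb") as handle:
        pickle.dump(_CACHE, handle)
    return _CACHE


def reset_cache() -> None:
    """Clear the in-memory copy so the next call falls through to step 2 or 3."""
    global _CACHE
    _CACHE = None
```

Clearing the variable only skips step 1; an existing pickle file would still be picked up on the next call unless it is removed as well.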

acres.preprocess.resource_factory.get_nn_model(ngram_size=3, min_count=1, net_size=100, alpha=0.025, sg=0, hs=0, negative=5)

Lazy load a word2vec model.

Parameters
  • ngram_size (int) –

  • min_count (int) –

  • net_size (int) –

  • alpha (float) –

  • sg (int) –

  • hs (int) –

  • negative (int) –

Return type

Word2Vec

Returns

acres.preprocess.resource_factory.get_word_ngrams()

Lazy load a non-indexed representation of ngrams.

Loading order is as follows: 1. Variable; 2. Pickle file; 3. Generation.

Return type

Dict[str, int]

Returns

acres.preprocess.resource_factory.reset()

Resets global variables to force model recreation.

Return type

None

Returns

acres.preprocess.resource_factory.warmup_cache()

Warms up the cache of pickle and txt files by calling all the methods.

Return type

None

Returns

acres.preprocess.resource_factory.write_txt(resource, filename)

Writes a tab-separated representation of a dictionary into a file specified by filename.

Parameters
  • resource (Dict[str, int]) –

  • filename (str) –

Return type

int

Returns
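
The meaning of the returned int is not documented here. The sketch below shows one plausible reading, in which the function writes one key/value pair per line and returns the number of lines written. The helper name write_tsv, the column order (key, then value), and the interpretation of the return value are all assumptions.

```python
from typing import Dict


def write_tsv(resource: Dict[str, int], filename: str) -> int:
    """Write one 'key<TAB>value' line per entry (hypothetical helper)."""
    lines_written = 0
    with open(filename, "w", encoding="utf-8") as handle:
        for key, value in resource.items():
            # Assumed column order: the string key first, then its count.
            handle.write(f"{key}\t{value}\n")
            lines_written += 1
    # Assumed return value: the number of lines written to the file.
    return lines_written
```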