acres.preprocess package

Package containing modules for pre-processing the corpus and a resource factory to easily access pre-processed files.

Submodules

acres.preprocess.dumps module

Module to process the corpus training data and create data structures that speed up retrieval.

acres.preprocess.dumps.create_corpus_ngramstat_dump(corpus_path, min_freq, min_length=1, max_length=7)

Takes a corpus consisting of text files in a single directory and substitutes digits and line breaks. All documents must be UTF-8 text.

Parameters

corpus_path (str) –
min_freq (int) –
min_length (int) –
max_length (int) –

Return type

Dict[str, int]

Returns
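The signature suggests that n-grams between min_length and max_length tokens are counted and filtered by min_freq. A minimal sketch of that idea follows; it is not the package's implementation. The name count_ngrams, whitespace tokenization, and an inclusive min_freq threshold are assumptions, and the digit and line-break substitution step is left out.

```python
from collections import Counter
from typing import Dict, Iterable


def count_ngrams(documents: Iterable[str], min_freq: int,
                 min_length: int = 1, max_length: int = 7) -> Dict[str, int]:
    """Hypothetical helper: count word n-grams and keep the frequent ones."""
    counts: Counter = Counter()
    for text in documents:
        # Assumption: tokens are separated by whitespace.
        tokens = text.split()
        for size in range(min_length, max_length + 1):
            for start in range(len(tokens) - size + 1):
                counts[" ".join(tokens[start:start + size])] += 1
    # Assumption: min_freq is an inclusive lower bound.
    return {ngram: freq for ngram, freq in counts.items() if freq >= min_freq}


if __name__ == "__main__":
    print(count_ngrams(["a b a b"], min_freq=2, max_length=2))
```

The result has the same shape as the documented return type, Dict[str, int], mapping each n-gram to its frequency.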

acres.preprocess.dumps.create_indexed_ngrams(ngrams)

Create an indexed version of an n-gram list. This adds a unique identifier to every (str, int) tuple.

Parameters: ngrams (Dict[str, int]) –
Return type: Dict[int, Tuple[int, str]]
Returns
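To illustrate the mapping between the two documented types, here is a small sketch. The name index_ngrams, the sequential identifiers starting at 1, and the use of dictionary iteration order are assumptions; the actual function may assign identifiers differently.

```python
from typing import Dict, Tuple


def index_ngrams(ngrams: Dict[str, int]) -> Dict[int, Tuple[int, str]]:
    """Hypothetical helper: map an identifier to each (frequency, ngram) pair."""
    # Assumption: identifiers are sequential integers starting at 1.
    return {
        identifier: (freq, ngram)
        for identifier, (ngram, freq) in enumerate(ngrams.items(), start=1)
    }


if __name__ == "__main__":
    print(index_ngrams({"a b": 3, "c": 1}))
```

Note that the tuple order follows the documented return type: frequency first, then the n-gram string.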

acres.preprocess.resource_factory module

Resource factory. This module provides methods for lazily loading resources.

acres.preprocess.resource_factory.get_center_map(partition=0)

Lazy load the fast n-gram center map model.

Return type: CenterMap
Returns

acres.preprocess.resource_factory.get_context_map(partition=0)

Lazy load the fast n-gram context map model.

Return type: ContextMap
Returns

acres.preprocess.resource_factory.get_dictionary()

Lazy load the sense inventory.

Return type: Dict[str, List[str]]
Returns

acres.preprocess.resource_factory.get_ngramstat()

Lazy load an indexed representation of n-grams.

Loading order is as follows: 1. Variable; 2. Pickle file; 3. Generation.

Return type: Dict[int, Tuple[int, str]]
Returns: A dictionary mapping identifiers to n-grams. Each value is a tuple of the frequency and the corresponding n-gram.
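The loading order described above (variable, then pickle file, then generation) can be sketched as follows. This is an illustration of the pattern, not the module's code: the names _CACHE, lazy_load, and the generate callback are hypothetical, and writing the pickle after generation is an assumption.

```python
import os
import pickle
from typing import Any, Callable, Optional

# Hypothetical module-level variable holding the loaded resource.
_CACHE: Optional[Any] = None


def lazy_load(pickle_path: str, generate: Callable[[], Any]) -> Any:
    """Return the resource from memory, else from a pickle, else generate it."""
    global _CACHE
    if _CACHE is not None:            # 1. Variable
        return _CACHE
    if os.path.isfile(pickle_path):   # 2. Pickle file
        with open(pickle_path, "rb") as handle:
            _CACHE = pickle.load(handle)
    else:                             # 3. Generation
        _CACHE = generate()
        # Assumption: the generated resource is pickled for later runs.
        with open(pickle_path, "wb") as handle:
            pickle.dump(_CACHE, handle)
    return _CACHE


def reset() -> None:
    """Clear the in-memory variable so the next call reloads the resource."""
    global _CACHE
    _CACHE = None
```

In this sketch, calling reset() only clears the in-memory variable, so a later call falls back to the pickle file if one exists.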

acres.preprocess.resource_factory.get_nn_model(ngram_size=3, min_count=1, net_size=100, alpha=0.025, sg=0, hs=0, negative=5)

Lazy load a word2vec model.

Parameters

ngram_size (int) –
min_count (int) –
net_size (int) –
alpha (float) –
sg (int) –
hs (int) –
negative (int) –

Return type

Word2Vec

Returns

acres.preprocess.resource_factory.get_word_ngrams()

Lazy load a non-indexed representation of n-grams.

Loading order is as follows: 1. Variable; 2. Pickle file; 3. Generation.

Return type: Dict[str, int]
Returns

acres.preprocess.resource_factory.reset()

Resets global variables to force model recreation.

Return type: None
Returns

acres.preprocess.resource_factory.warmup_cache()

Warms up the cache of pickle and txt files by calling all the methods.

Return type: None
Returns

acres.preprocess.resource_factory.write_txt(resource, filename)

Writes a tab-separated representation of a dictionary into the file specified by filename.

Parameters

resource (Dict[str, int]) –
filename (str) –

Return type

int

Returns
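A minimal sketch of writing a dictionary as tab-separated text is shown below. The name write_tab_separated, the column order (key, then value), UTF-8 output, and the meaning of the returned int (number of lines written) are all assumptions, since the documentation does not state them.

```python
from typing import Dict


def write_tab_separated(resource: Dict[str, int], filename: str) -> int:
    """Hypothetical helper: write one 'key<TAB>value' line per dictionary entry."""
    counter = 0
    with open(filename, "w", encoding="utf-8") as handle:
        for key, value in resource.items():
            handle.write(f"{key}\t{value}\n")
            counter += 1
    # Assumption: the return value is the number of lines written.
    return counter
```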