acres.preprocess package
Package containing modules for pre-processing the corpus and a resource factory to easily access pre-processed files.
Submodules
acres.preprocess.dumps module
Module to process the corpus training data and create data structures that speed up retrieval.
acres.preprocess.dumps.create_corpus_ngramstat_dump(corpus_path, min_freq, min_length=1, max_length=7)
Takes a corpus consisting of text files in a single directory and creates an n-gram statistics dump from it. Digits and line breaks are substituted. All documents must be UTF-8 encoded text.
- Parameters
corpus_path (str)
min_freq (int)
min_length (int)
max_length (int)
- Return type
Dict[str, int]
- Returns
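The following is a minimal sketch of how n-gram statistics with this return shape (`Dict[str, int]`) could be computed. It is not the acres implementation: the helper name `count_ngrams`, the `#` placeholder for digits, the replacement of line breaks by spaces, and whitespace tokenisation are all assumptions made for illustration. The parameters `min_freq`, `min_length` and `max_length` are assumed to mean a minimum frequency and the range of n-gram lengths in tokens.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def count_ngrams(texts: Iterable[str], min_freq: int,
                 min_length: int = 1, max_length: int = 7) -> Dict[str, int]:
    """Count word n-grams and keep those seen at least min_freq times."""
    counts = Counter()
    for text in texts:
        # Assumed normalisation: every digit becomes "#", line breaks become spaces.
        normalised = re.sub(r"\d", "#", text).replace("\n", " ")
        tokens = normalised.split()
        # Slide a window of each length from min_length to max_length tokens.
        for n in range(min_length, max_length + 1):
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
    return {ngram: freq for ngram, freq in counts.items() if freq >= min_freq}
```

Calling `count_ngrams(["a b\na b", "a 1"], min_freq=2, max_length=2)` keeps only the n-grams that occur at least twice across both texts.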
acres.preprocess.resource_factory module
Resource factory. This module provides methods for lazily loading resources.
acres.preprocess.resource_factory.get_center_map(partition=0)
Lazily load the fast n-gram center map model.
- Return type
- Returns
acres.preprocess.resource_factory.get_context_map(partition=0)
Lazily load the fast n-gram context map model.
- Return type
- Returns
acres.preprocess.resource_factory.get_dictionary()
Lazily load the sense inventory.
- Return type
Dict[str, List[str]]
- Returns
acres.preprocess.resource_factory.get_ngramstat()
Lazily load an indexed representation of ngrams.
The resource is looked up in the following order: 1. in-memory variable; 2. pickle file; 3. generation.
- Return type
Dict[int, Tuple[int, str]]
- Returns
A dictionary mapping identifiers to ngrams. Each value is a tuple of the frequency and the corresponding ngram.
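The loading order described above (variable, then pickle file, then generation) can be sketched as follows. This is an illustration of the general pattern only, not the acres code: the names `lazy_load`, `_cache`, `pickle_path` and `generate` are hypothetical, and writing the pickle file after generation is an assumption.

```python
import os
import pickle
from typing import Callable, Dict, Optional, Tuple

NgramIndex = Dict[int, Tuple[int, str]]

_cache: Optional[NgramIndex] = None  # hypothetical module-level variable


def lazy_load(pickle_path: str, generate: Callable[[], NgramIndex]) -> NgramIndex:
    """Return the resource from memory, else from disk, else build it."""
    global _cache
    # 1. Variable: already loaded in this process.
    if _cache is not None:
        return _cache
    # 2. Pickle file: a previous run stored the resource on disk.
    if os.path.isfile(pickle_path):
        with open(pickle_path, "rb") as handle:
            _cache = pickle.load(handle)
        return _cache
    # 3. Generation: build the resource and store it for later runs (assumed).
    _cache = generate()
    with open(pickle_path, "wb") as handle:
        pickle.dump(_cache, handle)
    return _cache
```

With this pattern the expensive `generate` callable runs at most once per pickle file, and repeated calls within one process never touch the disk.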
acres.preprocess.resource_factory.get_nn_model(ngram_size=3, min_count=1, net_size=100, alpha=0.025, sg=0, hs=0, negative=5)
Lazily load a word2vec model.
- Parameters
ngram_size (int)
min_count (int)
net_size (int)
alpha (float)
sg (int)
hs (int)
negative (int)
- Return type
Word2Vec
- Returns
acres.preprocess.resource_factory.get_word_ngrams()
Lazily load a non-indexed representation of ngrams.
The resource is looked up in the following order: 1. in-memory variable; 2. pickle file; 3. generation.
- Return type
Dict[str, int]
- Returns
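The two n-gram representations differ only in shape: the non-indexed form is `Dict[str, int]`, the indexed form is `Dict[int, Tuple[int, str]]`. The sketch below shows one way the indexed form could be derived from the non-indexed one. The helper `index_ngrams` is hypothetical, and the choice of consecutive identifiers starting at 1, ordered by descending frequency, is an assumption; the source does not say how acres assigns identifiers.

```python
from typing import Dict, Tuple


def index_ngrams(word_ngrams: Dict[str, int]) -> Dict[int, Tuple[int, str]]:
    """Map integer identifiers to (frequency, ngram) tuples."""
    # Assumed ordering: most frequent first, ties broken alphabetically.
    ordered = sorted(word_ngrams.items(), key=lambda item: (-item[1], item[0]))
    return {identifier: (freq, ngram)
            for identifier, (ngram, freq) in enumerate(ordered, start=1)}
```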
acres.preprocess.resource_factory.reset()
Resets global variables to force model recreation.
- Return type
None
- Returns
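To illustrate why resetting global variables forces recreation, here is a small self-contained sketch. It does not use acres: `get_resource`, `reset_resources`, `build_count` and the stand-in dictionary content are hypothetical names and data chosen for the example.

```python
from typing import Dict, List, Optional

_resource: Optional[Dict[str, List[str]]] = None  # hypothetical cached resource
_builds = 0  # how many times the resource has been (re)built


def get_resource() -> Dict[str, List[str]]:
    """Build the resource on first access, then reuse the cached object."""
    global _resource, _builds
    if _resource is None:
        _builds += 1
        _resource = {"abbr": ["expansion"]}  # stand-in for an expensive build
    return _resource


def reset_resources() -> None:
    """Clear the cached object so the next access rebuilds it."""
    global _resource
    _resource = None


def build_count() -> int:
    """Report how often the resource was built."""
    return _builds
```

Two consecutive calls to `get_resource()` build the resource once; calling `reset_resources()` in between causes the next access to build it again. This can be useful in tests or after the underlying corpus changes.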