acres¶
acres is an acronym expansion module based on word embeddings and filtering rules.
Only auto-generated, module-by-module documentation is provided here.
Module documentation¶
acres package¶
Root package.
Subpackages¶
acres.evaluation package¶
Package containing evaluation modules.
Submodules¶
acres.evaluation.evaluation module¶
Benchmark code. It’s the main entry point for comparing strategies using evaluation metrics such as precision, recall, and F1-score.
-
class
acres.evaluation.evaluation.
Level
(value)[source]¶ Bases:
enum.Enum
Enum that holds acronym-solving levels.
-
TOKEN
= 1¶
-
TYPE
= 2¶
-
-
acres.evaluation.evaluation.
analyze
(contextualized_acronym, true_expansions, strategy, max_tries)[source]¶ Analyze a given row of the gold standard.
-
acres.evaluation.evaluation.
do_analysis
(topics_file, detection_file, expansion_file, strategy, level, max_tries, lenient)[source]¶ Analyze a given expansion standard.
- Parameters
- Return type
- Returns
A tuple with lists containing correct, found, and valid contextualized acronyms
-
acres.evaluation.evaluation.
evaluate
(topics, valid_standard, standard, strategy, level, max_tries, lenient)[source]¶ Analyze a gold standard with text excerpts centered on an acronym, followed by n valid expansions.
- Parameters
- Return type
- Returns
A tuple with lists containing correct, found, and valid contextualized acronyms
-
acres.evaluation.evaluation.
plot_data
(topics_file, detection_file, expansion_file)[source]¶ Run all strategies using different ranks and lenient approaches and generate a TSV file to be used as input for the plots.R script.
- Parameters
topics_file (
str
) –detection_file (
str
) –expansion_file (
str
) –
- Returns
-
acres.evaluation.evaluation.
summary
(topics_file, detection_file, expansion_file, level, max_tries, lenient)[source]¶ Save a summary table in TSV format that can be used to run statistical tests (e.g., McNemar's test).
- Parameters
topics_file (
str
) –detection_file (
str
) –expansion_file (
str
) –level (
Level
) –max_tries (
int
) –lenient (
bool
) –
- Returns
-
acres.evaluation.evaluation.
test_input
(true_expansions, possible_expansions, max_tries=10)[source]¶ Test an acronym + context strings against the model.
- Parameters
true_expansions (
Set
[str
]) –possible_expansions (
List
[str
]) – An ordered list of possible expansions.max_tries (
int
) – Maximum number of tries.
- Return type
bool
- Returns
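As a rough illustration of how a rank-limited check like this can work, the sketch below returns True when any of the first max_tries candidates is among the true expansions. The name hit_within_tries is hypothetical, and the matching rule (exact string equality) is an assumption rather than a description of what acres actually does.

```python
from typing import List, Set


def hit_within_tries(true_expansions: Set[str],
                     possible_expansions: List[str],
                     max_tries: int = 10) -> bool:
    """Return True if one of the first `max_tries` candidates is correct."""
    # Only the top-ranked candidates are inspected, so list order matters.
    for candidate in possible_expansions[:max_tries]:
        if candidate in true_expansions:  # assumption: exact string match
            return True
    return False


gold = {"Elektrokardiogramm"}
ranked = ["Echokardiographie", "Elektrokardiogramm"]
print(hit_within_tries(gold, ranked, max_tries=1))
print(hit_within_tries(gold, ranked, max_tries=2))
```

With a single try the correct expansion at rank 2 is missed; allowing two tries finds it.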
acres.evaluation.metrics module¶
Helper functions to calculate evaluation metrics.
-
acres.evaluation.metrics.
calculate_f1
(precision, recall)[source]¶ Calculates the F1-score.
- Parameters
precision (
float
) –recall (
float
) –
- Return type
float
- Returns
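For reference, the F1-score is the harmonic mean of precision and recall. The sketch below is a minimal stand-alone version, not the acres source. Returning 0.0 when both inputs are zero is an assumption, as is deriving precision and recall from the sizes of the correct, found, and valid lists.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: avoid division by zero
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only, not results from any real evaluation.
correct, found, valid = 30, 40, 60
precision = correct / found   # share of found acronyms that were correct
recall = correct / valid      # share of valid acronyms that were solved
print(round(f1_score(precision, recall), 3))
```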
acres.fastngram package¶
Package containing a full in-memory implementation of n-gram matching.
Submodules¶
acres.fastngram.fastngram module¶
A faster version of n-gram matching that uses dictionaries for speed-up.
-
class
acres.fastngram.fastngram.
CenterMap
[source]¶ Bases:
object
A map of center words to contexts.
-
class
acres.fastngram.fastngram.
ContextMap
[source]¶ Bases:
object
A map of contexts to center words.
-
acres.fastngram.fastngram.
baseline
(acronym, left_context='', right_context='')[source]¶ A baseline method that expands only with unigrams.
- Parameters
acronym (
str
) –left_context (
str
) –right_context (
str
) –
- Return type
Iterator
[str
]- Returns
-
acres.fastngram.fastngram.
create_map
(ngrams, model, partition=0)[source]¶ Create a search-optimized representation of an n-gram list.
- Parameters
ngrams (
Dict
[str
,int
]) –model (
Union
[ContextMap
,CenterMap
]) –partition (
int
) –
- Return type
Union
[ContextMap
,CenterMap
]- Returns
-
acres.fastngram.fastngram.
fastngram
(acronym, left_context='', right_context='', min_freq=2, max_rank=100000)[source]¶ Find an unlimited set of expansion candidates for an acronym given its left and right context. Note that no filtering is done here, except for the acronym initial partitioning.
- Parameters
acronym (
str
) –left_context (
str
) –right_context (
str
) –min_freq (
int
) –max_rank (
int
) –
- Return type
Iterator
[str
]- Returns
-
acres.fastngram.fastngram.
fasttype
(acronym, left_context='', right_context='', min_freq=2, max_rank=100000)[source]¶ Find an unlimited set of expansion candidates given the training contexts of the acronym. Note that no filtering is done here, except for the acronym initial partitioning.
- Parameters
acronym (
str
) –left_context (
str
) – Not used.right_context (
str
) – Not used.min_freq (
int
) –max_rank (
int
) –
- Return type
Iterator
[str
]- Returns
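The dictionary-based lookup that this module describes can be pictured with a small context-to-center index. The class TinyContextMap and its methods are hypothetical and are not the ContextMap API; the sketch only shows why a dictionary keyed by context makes candidate retrieval a constant-time lookup.

```python
from collections import defaultdict
from typing import Dict, Iterator, Tuple


class TinyContextMap:
    """Maps (left, right) contexts to the center phrases seen between them."""

    def __init__(self) -> None:
        self._map: Dict[Tuple[str, str], Dict[str, int]] = defaultdict(dict)

    def add(self, left: str, center: str, right: str, freq: int) -> None:
        """Register a center phrase and its corpus frequency for a context."""
        self._map[(left, right)][center] = freq

    def centers(self, left: str, right: str) -> Iterator[str]:
        """Yield center phrases for a context, most frequent first."""
        found = self._map.get((left, right), {})
        for center, _freq in sorted(found.items(), key=lambda kv: -kv[1]):
            yield center


index = TinyContextMap()
index.add("im", "Elektrokardiogramm", "zeigt", 5)
index.add("im", "EKG", "zeigt", 12)
print(list(index.centers("im", "zeigt")))
```

A center-to-context map would simply invert the key and value roles.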
acres.model package¶
Package containing domain models (from the MVC design pattern).
Submodules¶
acres.model.detection_standard module¶
Model class that represents a detection standard. A detection standard works like an allow/block list to filter out inputs from the topic list that are not proper acronyms (e.g. BEFUND, III). Such inputs are then not considered for evaluation purposes.
It is designed as an append-only list (i.e., entries do not need to be updated with variable inputs).
-
acres.model.detection_standard.
filter_valid
(standard)[source]¶ Filter out invalid entries from a gold standard. Invalid entries are not proper acronyms or repeated types.
- Parameters
standard (
Dict
[str
,bool
]) –- Return type
Set
[str
]- Returns
-
acres.model.detection_standard.
parse
(filename)[source]¶ Parses a .tsv-formatted detection standard into a dictionary.
- Parameters
filename (
str
) –- Return type
Dict
[str
,bool
]- Returns
-
acres.model.detection_standard.
parse_valid
(filename)[source]¶ Wrapper method for both parse and filter_valid.
- Parameters
filename (
str
) –- Return type
Set
[str
]- Returns
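A minimal sketch of the filtering step, assuming the parsed standard maps each entry to True when it is a proper acronym and False otherwise. That boolean encoding and the helper name keep_valid are assumptions made for illustration.

```python
from typing import Dict, Set


def keep_valid(standard: Dict[str, bool]) -> Set[str]:
    """Keep only the entries flagged as proper acronyms."""
    # Using a set also collapses repeated types into a single entry.
    return {entry for entry, is_acronym in standard.items() if is_acronym}


detection = {"EKG": True, "BEFUND": False, "III": False, "LVEF": True}
print(sorted(keep_valid(detection)))
```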
acres.model.expansion_standard module¶
Model class that represents an expansion standard. An expansion standard is the main reference standard containing acronym-expansion pairs and their evaluation following the TREC standard (2/1/0).
It is designed as an append-only list (i.e., entries do not need to be updated with variable inputs).
-
acres.model.expansion_standard.
parse
(filename)[source]¶ Parse a TSV-formatted expansion standard into a dictionary.
- Parameters
filename (
str
) –- Return type
Dict
[str
,Dict
[str
,int
]]- Returns
A dictionary with acronyms pointing to expansions and an assessment value.
-
acres.model.expansion_standard.
write
(filename, previous, valid, topics)[source]¶ Write results in the TREC format, one candidate expansion per line.
- Parameters
filename (
str
) –previous (
Dict
[str
,Dict
[str
,int
]]) – A dictionary of acronyms mapped to their senses and assessments (if any).valid (
Set
[str
]) – A set of valid acronyms, normally parsed from a detection standard.topics (
List
[Acronym
]) – A topic list.
- Return type
None
- Returns
acres.model.ngrams module¶
Module to handle n-gram lists.
-
class
acres.model.ngrams.
FilteredNGramStat
(ngram_size)[source]¶ Bases:
object
Filtered NGramStat generator
This generator yields n-grams of a given size from an ngramstat.txt file, while respecting the frequency of each n-gram.
@todo ngramstat itself should be a generator
-
PRINT_INTERVAL
= 1000000¶
-
TOKEN_SEPARATOR
= ' '¶
-
acres.model.topic_list module¶
Model class that represents a topic list. A topic list is used as main input (a la TREC) and thus can control which acronyms (together with their contexts) are to be considered for evaluation. A topic list can be used, e.g., to quickly switch between different evaluation scenarios such as acronyms collected from either the training or test dataset.
-
acres.model.topic_list.
create
(filename, chance, ngram_size=7)[source]¶ Create a topic list out of random n-grams with a given chance and size.
- Parameters
filename (
str
) –chance (
float
) –ngram_size (
int
) –
- Returns
acres.preprocess package¶
Package containing modules for pre-processing the corpus and a resource factory to easily access pre-processed files.
Submodules¶
acres.preprocess.dumps module¶
Module to process the corpus training data and create data structures that speed up retrieval.
-
acres.preprocess.dumps.
create_corpus_ngramstat_dump
(corpus_path, min_freq, min_length=1, max_length=7)[source]¶ Takes a corpus consisting of text files in a single directory and substitutes digits and line breaks. All documents must be UTF-8 text.
- Parameters
corpus_path (
str
) –min_freq (
int
) –min_length (
int
) –max_length (
int
) –
- Return type
Dict
[str
,int
]- Returns
acres.preprocess.resource_factory module¶
Resource factory. This module provides methods for lazily loading resources.
-
acres.preprocess.resource_factory.
get_center_map
(partition=0)[source]¶ Lazy load the fast n-gram center map model.
- Return type
- Returns
-
acres.preprocess.resource_factory.
get_context_map
(partition=0)[source]¶ Lazy load the fast n-gram context map model.
- Return type
- Returns
-
acres.preprocess.resource_factory.
get_dictionary
()[source]¶ Lazy load the sense inventory.
- Return type
Dict
[str
,List
[str
]]- Returns
-
acres.preprocess.resource_factory.
get_ngramstat
()[source]¶ Lazy load an indexed representation of ngrams.
Loading order is as follows: 1. Variable; 2. Pickle file; 3. Generation.
- Return type
Dict
[int
,Tuple
[int
,str
]]- Returns
A dictionary of identifiers mapped to ngrams. Ngrams are tuples with the frequency and the corresponding ngram.
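The loading order above (variable, then pickle file, then generation) is a common lazy-loading pattern. The sketch below shows one way to implement it. The names lazy_load, reset_cache, and _CACHE are hypothetical, and writing the pickle after generation is an assumption rather than documented acres behavior.

```python
import os
import pickle
from typing import Callable, Dict, Optional

_CACHE: Optional[Dict[str, int]] = None  # step 1: module-level variable


def lazy_load(pickle_path: str,
              generate: Callable[[], Dict[str, int]]) -> Dict[str, int]:
    """Return the resource, loading or generating it only when needed."""
    global _CACHE
    if _CACHE is not None:            # 1. already in memory
        return _CACHE
    if os.path.isfile(pickle_path):   # 2. previously dumped to disk
        with open(pickle_path, "rb") as handle:
            _CACHE = pickle.load(handle)
        return _CACHE
    _CACHE = generate()               # 3. expensive generation
    with open(pickle_path, "wb") as handle:
        pickle.dump(_CACHE, handle)   # assumption: cache for later runs
    return _CACHE


def reset_cache() -> None:
    """Drop the in-memory copy so the next call reloads the resource."""
    global _CACHE
    _CACHE = None
```

Calling reset_cache() only clears the in-memory variable; a pickle file on disk would still be picked up by the next lazy_load call.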
-
acres.preprocess.resource_factory.
get_nn_model
(ngram_size=3, min_count=1, net_size=100, alpha=0.025, sg=0, hs=0, negative=5)[source]¶ Lazy load a word2vec model.
- Parameters
ngram_size (
int
) –min_count (
int
) –net_size (
int
) –alpha (
float
) –sg (
int
) –hs (
int
) –negative (
int
) –
- Return type
Word2Vec
- Returns
-
acres.preprocess.resource_factory.
get_word_ngrams
()[source]¶ Lazy load a non-indexed representation of ngrams.
Loading order is as follows: 1. Variable; 2. Pickle file; 3. Generation.
- Return type
Dict
[str
,int
]- Returns
-
acres.preprocess.resource_factory.
reset
()[source]¶ Resets global variables to force model recreation.
- Return type
None
- Returns
acres.rater package¶
Package with rating modules. Rating modules are used to filter out candidate expansions provided by expansion strategies.
Submodules¶
acres.rater.expansion module¶
Rating submodule for expansion (acronym + full form) checks.
acres.rater.full module¶
Rating submodule for full form checks.
acres.rater.rater module¶
Rating main module.
-
acres.rater.rater.
get_acronym_score
(acro, full)[source]¶ Scores acronym/resolution pairs according to a series of well-formedness criteria.
This scoring function should be used only for cleaned and normalized full forms.
For forms that may contain acronym-definition pairs, see get_acronym_definition_pair_score. For forms that should be checked for variants, see get_acronym_score_variants.
TODO Consider again morphosaurus checks.
TODO Full form should not be an acronym itself.
- Parameters
acro (
str
) – Acronym to be expanded.full (
str
) – Long form to be checked whether it qualifies as an acronym expansion.
- Return type
float
- Returns
A score that rates the likelihood that the full form is a valid expansion of the acronym.
acres.resolution package¶
Package with a facade to the several expansion strategies.
Submodules¶
acres.resolution.resolver module¶
Facade to the several expansion strategies.
-
class
acres.resolution.resolver.
Strategy
(value)[source]¶ Bases:
enum.IntEnum
Enum that holds acronym-solving strategies.
-
BASELINE
= 5¶
-
DICTIONARY
= 3¶
-
FASTNGRAM
= 4¶
-
FASTTYPE
= 6¶
-
WORD2VEC
= 2¶
-
-
acres.resolution.resolver.
filtered_resolve
(acronym, left_context, right_context, strategy)[source]¶ Resolve a given acronym + context using the provided Strategy and filter out invalid expansions.
- Parameters
acronym (
str
) –left_context (
str
) –right_context (
str
) –strategy (
Strategy
) –
- Return type
Iterator
[str
]- Returns
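The facade idea can be sketched as a dispatch table from strategy to candidate generator, followed by a filter. The enum values mirror the ones listed above; everything else (the toy dictionary expander, the first-letter filter, and the expanders argument) is hypothetical and far simpler than the rating rules in acres.rater.

```python
from enum import IntEnum
from typing import Callable, Dict, Iterator


class Strategy(IntEnum):
    WORD2VEC = 2
    DICTIONARY = 3
    FASTNGRAM = 4
    BASELINE = 5
    FASTTYPE = 6


Expander = Callable[[str, str, str], Iterator[str]]


def toy_dictionary(acronym: str, left_context: str,
                   right_context: str) -> Iterator[str]:
    """Hypothetical expander backed by a tiny in-memory sense inventory."""
    inventory = {"EKG": ["Elektrokardiogramm", "EKG", "Kardiogramm"]}
    yield from inventory.get(acronym, [])


def plausible(acronym: str, expansion: str) -> bool:
    """Toy filter: differs from the acronym and shares its first letter."""
    return (expansion != acronym
            and expansion[:1].lower() == acronym[:1].lower())


def filtered_resolve_sketch(acronym: str, left_context: str,
                            right_context: str, strategy: Strategy,
                            expanders: Dict[Strategy, Expander]) -> Iterator[str]:
    """Dispatch to the chosen strategy, then drop implausible candidates."""
    for candidate in expanders[strategy](acronym, left_context, right_context):
        if plausible(acronym, candidate):
            yield candidate


expanders = {Strategy.DICTIONARY: toy_dictionary}
print(list(filtered_resolve_sketch("EKG", "", "", Strategy.DICTIONARY, expanders)))
```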
acres.stats package¶
Package with modules to collect statistics from the gold-standard (senses), the training corpus (stats), and a fixed sense inventory (dictionary).
Submodules¶
acres.stats.dictionary module¶
Module to collect metrics from a sense inventory. This module can be used to debug the sense inventory, e.g. by detecting extreme expansions. It can also be used to debug methods that rely on real data.
-
acres.stats.dictionary.
analyze_file
(filename)[source]¶ Analyzes a given dictionary file for extreme cases.
- Parameters
filename (
str
) –- Return type
None
- Returns
-
acres.stats.dictionary.
edit_distance_generated_acro
(acro, full)[source]¶ Calculates the edit distance between the original acronym and the generated acronym out of the full form.
- Parameters
acro (
str
) –full (
str
) –
- Return type
Optional
[Tuple
]- Returns
-
acres.stats.dictionary.
expand
(acronym, left_context='', right_context='')[source]¶ - Parameters
acronym (
str
) –left_context (
str
) –right_context (
str
) –
- Return type
List
[str
]- Returns
-
acres.stats.dictionary.
parse
(filename)[source]¶ Parse a tab-separated sense inventory as a Python dictionary.
- Parameters
filename (
str
) –- Return type
Dict
[str
,List
[str
]]- Returns
acres.stats.senses module¶
Module to estimate acronym ambiguity. It can be used to collect common acronym statistics, such as the number of senses per acronym.
-
acres.stats.senses.
bucketize
(acronyms)[source]¶ Reduce: calculate the number of different acronyms for each degree of ambiguity.
- Parameters
acronyms (
Dict
[str
,Set
[str
]]) –- Return type
Dict
[int
,int
]- Returns
-
acres.stats.senses.
get_sense_buckets
(filename)[source]¶ Parses a reference standard and gets a map of senses per acronym.
- Parameters
filename (
str
) –- Return type
Dict
[str
,Set
[str
]]- Returns
-
acres.stats.senses.
map_senses_acronym
(standard, lenient=False)[source]¶ Map: collect senses for each acronym.
- Parameters
standard (
Dict
[str
,Dict
[str
,int
]]) –lenient (
bool
) – Whether to consider partial matches (1) as a valid sense.
- Return type
Dict
[str
,Set
[str
]]- Returns
-
acres.stats.senses.
print_ambiguous
(filename)[source]¶ Print ambiguous acronyms, i.e., those with more than one sense according to the reference standard.
- Parameters
filename (
str
) –- Return type
None
- Returns
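The map and reduce steps of this module can be sketched as follows. The function names are hypothetical. Treating an assessment of 2 as a full match, counting 1 only in lenient mode, and skipping acronyms without any valid sense are assumptions based on the parameter descriptions above.

```python
from collections import Counter
from typing import Dict, Set


def map_senses_sketch(standard: Dict[str, Dict[str, int]],
                      lenient: bool = False) -> Dict[str, Set[str]]:
    """Map: collect the valid senses of each acronym."""
    threshold = 1 if lenient else 2  # lenient mode accepts partial matches
    senses: Dict[str, Set[str]] = {}
    for acronym, expansions in standard.items():
        valid = {exp for exp, score in expansions.items() if score >= threshold}
        if valid:  # assumption: acronyms without valid senses are skipped
            senses[acronym] = valid
    return senses


def bucketize_sketch(senses: Dict[str, Set[str]]) -> Dict[int, int]:
    """Reduce: number of acronyms for each degree of ambiguity."""
    return dict(Counter(len(found) for found in senses.values()))


standard = {
    "EKG": {"Elektrokardiogramm": 2, "Echokardiogramm": 0},
    "HF": {"Herzfrequenz": 2, "Herzinsuffizienz": 2, "Hochfrequenz": 1},
}
print(bucketize_sketch(map_senses_sketch(standard)))
print(bucketize_sketch(map_senses_sketch(standard, lenient=True)))
```

In the strict setting "HF" has two senses; in the lenient setting the partial match raises it to three.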
acres.stats.stats module¶
Module for calculating corpus statistics. It is used to measure the training/test dataset according to, e.g., number of tokens.
-
class
acres.stats.stats.
Stats
[source]¶ Bases:
object
Class that generates and holds stats about a given text.
-
calc_stats
(text)[source]¶ Calculates statistics for a given text string and sets the results as variables.
- Parameters
text (
str
) –- Return type
None
- Returns
-
static
count_acronyms
(text)[source]¶ Count the number of acronyms in a string.
Acronyms are as defined by the acronym.is_acronym() function.
- Parameters
text (
str
) –- Return type
int
- Returns
-
static
count_acronyms_types
(text)[source]¶ Count the number of unique acronyms in a string.
Acronyms are as defined by the acronym.is_acronym() function.
- Parameters
text (
str
) –- Return type
int
- Returns
-
static
count_chars
(text)[source]¶ Count the number of non-whitespace chars in a string.
- Parameters
text (
str
) –- Return type
int
- Returns
-
static
count_sentences
(text)[source]¶ Count the number of sentences in a string.
Sentences are any string separated by line_separator.
- Parameters
text (
str
) –- Return type
int
- Returns
-
static
count_tokens
(text)[source]¶ Count the number of all tokens in a string.
- Parameters
text (
str
) –- Return type
int
- Returns
-
static
count_types
(text)[source]¶ Count the number of unique tokens (types) in a string.
- Parameters
text (
str
) –- Return type
int
- Returns
-
source_line_separator
= '\n'¶
-
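The counting helpers of the Stats class can be approximated with a few lines of standard Python. The sketch assumes whitespace tokenization and ignores empty lines when counting sentences; both are assumptions, and the acronym counters are left out because they depend on acronym.is_acronym().

```python
def count_tokens(text: str) -> int:
    """Number of whitespace-separated tokens."""
    return len(text.split())


def count_types(text: str) -> int:
    """Number of unique tokens (types)."""
    return len(set(text.split()))


def count_chars(text: str) -> int:
    """Number of non-whitespace characters."""
    return sum(1 for char in text if not char.isspace())


def count_sentences(text: str, line_separator: str = "\n") -> int:
    """Number of non-empty segments separated by `line_separator`."""
    return len([part for part in text.split(line_separator) if part.strip()])


sample = "Das EKG ist normal\nDas EKG zeigt nichts"
print(count_tokens(sample), count_types(sample),
      count_chars(sample), count_sentences(sample))
```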
acres.util package¶
Package with general utilities modules.
Submodules¶
acres.util.acronym module¶
Utility functions related to acronyms.
-
class
acres.util.acronym.
Acronym
(acronym, left_context, right_context)¶ Bases:
tuple
-
property
acronym
¶ Alias for field number 0
-
property
left_context
¶ Alias for field number 1
-
property
right_context
¶ Alias for field number 2
-
acres.util.acronym.
create_german_acronym
(full)[source]¶ Creates an acronym out of a given multi-word expression.
@todo Use is_stopword?
- Parameters
full (
str
) – A full form containing whitespaces.- Return type
str
- Returns
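The Acronym signature and its "Alias for field number" properties are what a named tuple produces in generated documentation. Assuming that is how the class is defined, it behaves as in this stand-alone sketch, where the example values are invented.

```python
from collections import namedtuple

# Stand-alone re-creation for illustration; not imported from acres.
Acronym = namedtuple("Acronym", ["acronym", "left_context", "right_context"])

topic = Acronym(acronym="EKG", left_context="Im", right_context="zeigt sich")

# Fields are reachable by name, by position, or by unpacking.
print(topic.acronym, topic[1])
acronym, left_context, right_context = topic
print(right_context)
```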
acres.util.functions module¶
Module with general functions.
-
acres.util.functions.
create_ngram_statistics
(input_string, n_min, n_max)[source]¶ Creates a dictionary that counts each nGram in an input string. Delimiters are spaces.
Example: for bigrams and trigrams, use n_min = 2 and n_max = 3. PROBE: # print(WordNgramStat('a ab aa a a a ba ddd', 1, 4))
- Parameters
input_string (
str
) –n_min (
int
) –n_max (
int
) –
- Return type
Dict
[str
,int
]- Returns
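A minimal sketch of n-gram counting over a space-delimited string, under the assumption that n_max is inclusive (as the bigram and trigram example suggests). The function name ngram_counts is hypothetical, and the acres implementation may differ in details such as handling of repeated spaces.

```python
from collections import Counter
from typing import Dict


def ngram_counts(input_string: str, n_min: int, n_max: int) -> Dict[str, int]:
    """Count every n-gram with n_min <= n <= n_max in a string."""
    tokens = input_string.split()
    counts: Counter = Counter()
    for size in range(n_min, n_max + 1):
        # Slide a window of `size` tokens over the token list.
        for start in range(len(tokens) - size + 1):
            counts[" ".join(tokens[start:start + size])] += 1
    return dict(counts)


print(ngram_counts("a b a b", 1, 2))
```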
-
acres.util.functions.
import_conf
(key)[source]¶ - Parameters
key (
str
) –- Return type
Optional
[str
]- Returns
-
acres.util.functions.
is_stopword
(str_in)[source]¶ Tests whether a word is a stopword according to a stopword list.
For German, source http://snowball.tartarus.org/algorithms/german/stop.txt
- Parameters
str_in (
str
) –- Return type
bool
- Returns
-
acres.util.functions.
partition
(word, partitions)[source]¶ Find a bucket for a given word.
- Parameters
word (
str
) –partitions (
int
) –
- Return type
int
- Returns
acres.util.text module¶
Utility functions related to text processing.
-
acres.util.text.
clean
(text, preserve_linebreaks=False)[source]¶ Clean a given text to preserve only alphabetic characters, spaces, and, optionally, line breaks.
- Parameters
text (
str
) –preserve_linebreaks (
bool
) –
- Return type
str
- Returns
-
acres.util.text.
clean_whitespaces
(whitespaced)[source]¶ Clean up repeated and trailing whitespace in an input string.
- Parameters
whitespaced (
str
) –- Return type
str
- Returns
-
acres.util.text.
clear_digits
(str_in, substitute_char)[source]¶ Substitutes all digits with a character (or string).
Example: ClearDigits(“Vitamin B12”, “°”):
TODO rewrite as regex
- Parameters
str_in (
str
) –substitute_char (
str
) –
- Return type
str
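The regex rewrite mentioned in the TODO for clear_digits could look roughly like this. The function name clear_digits_regex is hypothetical. Note that \d also matches non-ASCII Unicode digits in Python 3, which may or may not match the original behavior.

```python
import re


def clear_digits_regex(str_in: str, substitute_char: str) -> str:
    """Replace every digit with `substitute_char` (a character or string)."""
    # A callable replacement avoids backslash escapes in `substitute_char`
    # being interpreted by the regex engine.
    return re.sub(r"\d", lambda _match: substitute_char, str_in)


print(clear_digits_regex("Vitamin B12", "°"))
```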
acres.word2vec package¶
Package grouping modules related to the word2vec expansion strategy.
Submodules¶
acres.word2vec.test module¶
Module to apply/test a given word2vec model.
-
acres.word2vec.test.
find_candidates
(acronym, left_context='', right_context='', min_distance=0.0, max_rank=500)[source]¶ Similar to robust_find_embeddings, this finds possible expansions of a given acronym.
- Parameters
acronym (
str
) –left_context (
str
) –right_context (
str
) –min_distance (
float
) –max_rank (
int
) –
- Return type
Iterator
[str
]- Returns
acres.word2vec.train module¶
Trainer for word2vec embeddings based on an idea originally proposed by Johannes Hellrich (https://github.com/JULIELab/hellrich_dh2016).
Submodules¶
acres.constants module¶
Module with global constants.