public class IncrementalSemanticAnalysis extends Object implements SemanticSpace
ISA is notable in that it builds semantics incrementally, using both the co-occurrence of a word and the semantics of the co-occurring word. Similar to Random Indexing (RI), ISA uses index vectors to reduce the number of dimensions needed to represent the full co-occurrence matrix. In contrast to other semantic space algorithms such as RI, HAL, and BEAGLE, ISA uses the semantics of the co-occurring words to update the semantics of their neighbors. The semantics of a word w_i are updated for each co-occurrence of another word w_j by combining the index vector of w_j with the current semantics of w_j; the impact rate and history decay rate properties described below control how strongly each co-occurrence and each word's history contribute.
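The update described above can be sketched as follows. This is a minimal illustration that assumes the update rule reported by Baroni et al., sem(w_i) += i × (m × sem(w_j) + (1 − m) × IV(w_j)) with m = 1 / exp(freq(w_j) / k), where i is the impact rate, k is the history decay rate, and IV is the index vector. The class and method names are hypothetical and are not part of this class's API.

```java
/** Hypothetical stand-in illustrating an ISA-style update; not the library's implementation. */
final class IsaUpdateSketch {

    /**
     * Weight given to the co-occurring word's semantics (its "history").
     * Assumed form: 1 / exp(frequency / historyDecayRate).
     */
    static double historyWeight(double frequency, double historyDecayRate) {
        return 1.0 / Math.exp(frequency / historyDecayRate);
    }

    /**
     * Adds the contribution of a co-occurring word w_j to the semantics of
     * the focus word w_i, modifying focusSemantics in place.
     */
    static void update(double[] focusSemantics,
                       double[] neighborSemantics,
                       int[] neighborIndexVector,
                       double neighborFrequency,
                       double impactRate,
                       double historyDecayRate) {
        double m = historyWeight(neighborFrequency, historyDecayRate);
        for (int d = 0; d < focusSemantics.length; d++) {
            // Blend the neighbor's semantics with its index vector, then
            // scale the result by the impact rate.
            focusSemantics[d] += impactRate
                * (m * neighborSemantics[d] + (1 - m) * neighborIndexVector[d]);
        }
    }
}
```

Under this assumed rule, a neighbor with a frequency of zero has a weight of 1, so only its semantics contribute; as the neighbor becomes more frequent, the weight shrinks toward 0 and its index vector dominates.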
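This class defines the following configurable properties, which may be set
either through the System properties or by passing them to the
IncrementalSemanticAnalysis(Properties) constructor. The two most important
properties for configuring ISA are
"edu.ucla.sspace.isa.IncrementalSemanticAnalysis.impactRate" and
"edu.ucla.sspace.isa.IncrementalSemanticAnalysis.historyDecayRate".
Their default values are those specified in Baroni et al.
Property: "edu.ucla.sspace.isa.IncrementalSemanticAnalysis.impactRate"
Default: DEFAULT_IMPACT_RATE
The rate at which the co-occurrence of a word affects the semantics.
Property: "edu.ucla.sspace.isa.IncrementalSemanticAnalysis.historyDecayRate"
Default: DEFAULT_HISTORY_DECAY_RATE
The decay rate that determines how much the history (semantics) of a word
affects the semantics of co-occurring words.
Property: "edu.ucla.sspace.isa.IncrementalSemanticAnalysis.windowSize"
Default: 5
The number of words to view before and after each word in focus. With the
default, 5 words are counted before and 5 words are counted after. This
class always uses a symmetric window.
Property: "edu.ucla.sspace.isa.IncrementalSemanticAnalysis.vectorLength"
Default: DEFAULT_VECTOR_LENGTH
The number of dimensions to be used by the index and semantic vectors.
Property: "edu.ucla.sspace.isa.IncrementalSemanticAnalysis.usePermutations"
Default: false
Whether the index vectors for co-occurring words should be permuted based
on their relative position.
Property: "edu.ucla.sspace.isa.IncrementalSemanticAnalysis.permutationFunction"
Default: DefaultPermutationFunction
The fully qualified name of the PermutationFunction class whose instance
will be used to permute index vectors. If
"edu.ucla.sspace.isa.IncrementalSemanticAnalysis.usePermutations" is set to
false, the value of this property has no effect.
Property: "edu.ucla.sspace.isa.IncrementalSemanticAnalysis.sparseSemantics"
Default: false
Whether to use a sparse encoding for each word's semantics, which saves
space when words do not co-occur with many unique tokens but requires more
computation.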
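As a sketch of how these properties might be supplied programmatically, the following builds a java.util.Properties object using the property names listed above. The values are illustrative only, not recommendations, and the sketch assumes that values are given as strings. The constructor call is left in a comment so that the example compiles without the library on the classpath; IsaConfigSketch is a hypothetical name.

```java
import java.util.Properties;

/** Hypothetical helper showing one way to assemble an ISA configuration. */
final class IsaConfigSketch {

    // Common prefix shared by the property names documented for this class.
    static final String PREFIX =
        "edu.ucla.sspace.isa.IncrementalSemanticAnalysis.";

    /** Builds a Properties object with illustrative (not recommended) values. */
    static Properties buildProperties() {
        Properties props = new Properties();
        props.setProperty(PREFIX + "windowSize", "3");
        props.setProperty(PREFIX + "usePermutations", "true");
        props.setProperty(PREFIX + "sparseSemantics", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProperties();
        // With the library available, the properties would be passed to the
        // constructor documented on this page:
        //   IncrementalSemanticAnalysis isa = new IncrementalSemanticAnalysis(props);
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Properties that are not set explicitly would fall back to the defaults listed above.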
Due to the incremental nature of ISA, instances of this class are not designed to be multi-threaded. Documents must be processed sequentially to properly model how the semantics of co-occurring words affect each other. Multi-threading would make the order of co-occurrences ambiguous.
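The sequential requirement can be met by feeding documents one at a time from a single thread. The sketch below shows that pattern with a hypothetical DocumentProcessor interface standing in for processDocument(BufferedReader); with the library on the classpath, a method reference such as isa::processDocument could presumably be passed in its place.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical driver that feeds documents in order on the calling thread. */
final class SequentialFeedSketch {

    /** Stand-in for the processDocument(BufferedReader) method. */
    interface DocumentProcessor {
        void processDocument(BufferedReader document) throws IOException;
    }

    /** Processes each document in list order, never more than one at a time. */
    static void processAll(List<String> documents, DocumentProcessor processor)
            throws IOException {
        for (String text : documents) {
            try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
                processor.processDocument(reader);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> seen = new ArrayList<>();
        // The lambda records the first line of each document to show the order.
        processAll(Arrays.asList("first document", "second document"),
                   reader -> seen.add(reader.readLine()));
        System.out.println(seen);
    }
}
```

Because the loop finishes one document before opening the next, the order in which co-occurrences are observed is fully determined by the order of the input list.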
Modifier and Type | Field and Description |
---|---|
static double | DEFAULT_HISTORY_DECAY_RATE: The default rate at which the history (semantics) decays when affecting the semantics of other co-occurring words. |
static double | DEFAULT_IMPACT_RATE: The default rate at which the co-occurrence of a word affects the semantics. |
static int | DEFAULT_VECTOR_LENGTH: The default number of dimensions to be used by the index and semantic vectors. |
static int | DEFAULT_WINDOW_SIZE: The default number of words to view before and after each word in focus. |
static String | HISTORY_DECAY_RATE_PROPERTY: The property to specify the decay rate for determining how much the history (semantics) of a word will affect the semantics of co-occurring words. |
static String | IMPACT_RATE_PROPERTY: The property to specify the impact rate of word co-occurrence. |
static String | PERMUTATION_FUNCTION_PROPERTY: The property to specify the fully qualified name of a PermutationFunction if using permutations is enabled. |
static String | USE_PERMUTATIONS_PROPERTY: The property to specify whether the index vectors for co-occurring words should be permuted based on their relative position. |
static String | USE_SPARSE_SEMANTICS_PROPERTY: Specifies whether to use a sparse encoding for each word's semantics, which saves space when words do not co-occur with many unique tokens, but requires more computation. |
static String | VECTOR_LENGTH_PROPERTY: The property to specify the number of dimensions to be used by the index and semantic vectors. |
static String | WINDOW_SIZE_PROPERTY: The property to specify the number of words to view before and after each word in focus. |
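To illustrate what "words to view before and after each word in focus" means, the following sketch collects a symmetric context window around a focus position. It is a hypothetical helper, not the class's implementation, and it assumes the window is simply truncated at the start and end of the token list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that extracts a symmetric context window. */
final class SymmetricWindowSketch {

    /**
     * Returns the tokens within windowSize positions before and after the
     * focus position, excluding the focus token itself.
     */
    static List<String> context(List<String> tokens, int focus, int windowSize) {
        List<String> result = new ArrayList<>();
        // Clamp the window to the bounds of the token list (an assumption).
        int start = Math.max(0, focus - windowSize);
        int end = Math.min(tokens.size() - 1, focus + windowSize);
        for (int i = start; i <= end; i++) {
            if (i != focus) {
                result.add(tokens.get(i));
            }
        }
        return result;
    }
}
```

With a window size of 5, a focus word in the middle of a long document would have up to 10 context words, 5 on each side.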
Constructor and Description |
---|
IncrementalSemanticAnalysis(): Creates a new IncrementalSemanticAnalysis instance using the current System properties for configuration. |
IncrementalSemanticAnalysis(Properties properties): Creates a new IncrementalSemanticAnalysis instance using the provided properties for configuration. |
Modifier and Type | Method and Description |
---|---|
void | clearSemantics(): Removes all associations between word and semantics while still retaining the word to index vector mapping. |
String | getSpaceName(): Returns a unique string describing the name and configuration of this algorithm. |
Vector | getVector(String word): Returns the semantic vector for the provided word. |
int | getVectorLength(): Returns the length of vectors in this semantic space. |
Set<String> | getWords(): Returns the set of words that are represented in this semantic space. |
Map<String,TernaryVector> | getWordToIndexVector(): Returns an unmodifiable view on the token to IntegerVector mapping used by this instance. |
void | processDocument(BufferedReader document): Processes the contents of the provided file as a document. |
void | processSpace(Properties properties): Does nothing, as ISA is an incremental algorithm and no final processing needs to be performed on the space. |
void | setWordToIndexVector(Map<String,TernaryVector> m): Assigns the token to IntegerVector mapping to be used by this instance. |
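The difference between the unmodifiable view returned by getWordToIndexVector() and the copy made by setWordToIndexVector(Map) can be illustrated with plain java.util collections. The class below is a hypothetical stand-in that uses int[] in place of the library's vector type, and it assumes that assigning a mapping replaces any existing entries; it is not the library's implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in showing view-versus-copy semantics for an index map. */
final class IndexMapSketch {

    private final Map<String, int[]> wordToIndexVector = new HashMap<>();

    /** Read-only view: later additions by this object stay visible through it. */
    Map<String, int[]> getWordToIndexVector() {
        return Collections.unmodifiableMap(wordToIndexVector);
    }

    /** Copies the entries, so the caller's map and this object's map stay independent. */
    void setWordToIndexVector(Map<String, int[]> m) {
        wordToIndexVector.clear();   // assumption: the new mapping replaces the old one
        wordToIndexVector.putAll(m);
    }

    /** Simulates this object assigning an index vector to a newly seen word. */
    void addWord(String word, int[] indexVector) {
        wordToIndexVector.put(word, indexVector);
    }
}
```

In this sketch, a word added after the view was obtained appears in the view but not in the map that was originally passed to the setter, and any attempt to modify the view throws UnsupportedOperationException.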
public static final String HISTORY_DECAY_RATE_PROPERTY
public static final String IMPACT_RATE_PROPERTY
public static final String PERMUTATION_FUNCTION_PROPERTY
The property to specify the fully qualified name of a PermutationFunction if using permutations is enabled.
public static final String USE_PERMUTATIONS_PROPERTY
public static final String VECTOR_LENGTH_PROPERTY
public static final String WINDOW_SIZE_PROPERTY
public static final String USE_SPARSE_SEMANTICS_PROPERTY
public static final double DEFAULT_HISTORY_DECAY_RATE
public static final double DEFAULT_IMPACT_RATE
public static final int DEFAULT_VECTOR_LENGTH
public static final int DEFAULT_WINDOW_SIZE
public IncrementalSemanticAnalysis()
Creates a new IncrementalSemanticAnalysis instance using the current System properties for configuration.
public IncrementalSemanticAnalysis(Properties properties)
Creates a new IncrementalSemanticAnalysis instance using the provided properties for configuration.
Parameters: properties - the properties that specify the configuration for this instance
public void clearSemantics()
public String getSpaceName()
Specified by: getSpaceName in interface SemanticSpace
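public Vector getVector(String word)
Specified by: getVector in interface SemanticSpace
Parameters: word - a word that may be in the semantic space
Returns: the Vector for the provided word or null if the word was not in the space.
public int getVectorLength()
Returns the length of vectors in this semantic space. Implementations are left free to define whether the returned value is valid before processSpace is called.
Specified by: getVectorLength in interface SemanticSpace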
public Map<String,TernaryVector> getWordToIndexVector()
Returns an unmodifiable view on the token to IntegerVector mapping used by this instance. Any further changes made by this instance to its token to IntegerVector mapping will be reflected in the returned map.
public Set<String> getWords()
Specified by: getWords in interface SemanticSpace
public void processDocument(BufferedReader document) throws IOException
Specified by: processDocument in interface SemanticSpace
Parameters: document - a reader that allows access to the text of the document
Throws: IOException - if any error occurs while reading the document
public void processSpace(Properties properties)
Specified by: processSpace in interface SemanticSpace
Parameters: properties - a set of properties and values that may be used to configure any exposed parameters of the algorithm.
public void setWordToIndexVector(Map<String,TernaryVector> m)
Assigns the token to IntegerVector mapping to be used by this instance. The contents of the map are copied, so any additions of new index words by this instance will not be reflected in the parameter's mapping.
Parameters: m - a mapping from token to the IntegerVector that should be used to represent it when calculating other words' semantics
Copyright © 2012. All Rights Reserved.