public class RandomIndexing extends Object implements SemanticSpace, Filterable
Random Indexing (RI) is an efficient way of capturing word co-occurrence. In most co-occurrence models, a word-by-word matrix is constructed, where each value denotes how many times the column's word occurred in the context of the row's word. RI instead represents co-occurrence through index vectors. Each word is assigned a high-dimensional random vector known as its index vector. These index vectors are very sparse, typically 7 ± 2 non-zero values for a vector of length 2048, which ensures that the chance of any two arbitrary index vectors overlapping (i.e. having a non-zero cosine similarity) is very low. The semantics of each word are calculated by keeping a running sum of the index vectors of all the words that co-occur with it.
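The running-sum idea described above can be illustrated with a small, self-contained sketch. This is not the S-Space implementation: the class name RandomIndexingSketch, the choice of 8 non-zero values, the window of 2, and the fixed seed are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch of basic Random Indexing; not the library's code. */
final class RandomIndexingSketch {
    static final int VECTOR_LENGTH = 2048; // length quoted in the description
    static final int NON_ZERO = 8;         // assumed; inside the "7 ± 2" range
    static final int WINDOW = 2;           // assumed window size

    private final Map<String, int[]> indexVectors = new HashMap<>();
    private final Map<String, int[]> semantics = new HashMap<>();
    private final Random random = new Random(42); // fixed seed for repeatability

    /** Returns the word's sparse ternary index vector, creating it on first use. */
    int[] indexVector(String word) {
        return indexVectors.computeIfAbsent(word, w -> {
            int[] v = new int[VECTOR_LENGTH];
            int placed = 0;
            while (placed < NON_ZERO) {
                int pos = random.nextInt(VECTOR_LENGTH);
                if (v[pos] == 0) {
                    v[pos] = (placed % 2 == 0) ? 1 : -1; // alternate +1 and -1
                    placed++;
                }
            }
            return v;
        });
    }

    /** Adds the index vectors of each token's neighbors to that token's running sum. */
    void processTokens(String[] tokens) {
        for (int i = 0; i < tokens.length; i++) {
            int[] sum = semantics.computeIfAbsent(tokens[i], w -> new int[VECTOR_LENGTH]);
            int start = Math.max(0, i - WINDOW);
            int end = Math.min(tokens.length - 1, i + WINDOW);
            for (int j = start; j <= end; j++) {
                if (j == i) {
                    continue; // a word does not co-occur with itself
                }
                int[] index = indexVector(tokens[j]);
                for (int d = 0; d < VECTOR_LENGTH; d++) {
                    sum[d] += index[d];
                }
            }
        }
    }

    /** Returns the accumulated semantic vector, or null if the word was never seen. */
    int[] semanticVector(String word) {
        return semantics.get(word);
    }

    /** Cosine similarity; returns 0 when either vector is all zeros. */
    static double cosine(int[] a, int[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        RandomIndexingSketch ri = new RandomIndexingSketch();
        ri.processTokens("the quick brown fox jumps over the lazy dog".split(" "));
        System.out.println(cosine(ri.semanticVector("fox"), ri.semanticVector("brown")));
    }
}
```

Dense int arrays are used here only to keep the sketch short; a sparse representation would be the natural choice for vectors with so few non-zero values.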
Sahlgren et al. (2008) introduced a variation on RI in which the semantics also capture word order by using a permutation function. For each occurrence of a word, rather than simply summing the index vectors of the co-occurring words, the permutation function is used to transform each co-occurring word's index vector based on its position relative to the focus word. For example, consider the sentence, "the quick brown fox jumps over the lazy dog." With a window size of 2, the following is added to the semantic vector for "fox": Π^-2(quick_index) + Π^-1(brown_index) + Π^1(jumps_index) + Π^2(over_index), where Π^k denotes the k-th permutation of the specified index vector.
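The "fox" example can be traced with a short sketch. It assumes the permutation is a rotation of the vector by k positions, which is one common choice; nothing here states what DefaultPermutationFunction actually does. The class name PermutationSketch and the tiny length-8 vectors are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of order-aware summation using rotation as the permutation. */
final class PermutationSketch {
    /** Rotates v by k positions (k may be negative); rotating by -k undoes rotating by k. */
    static int[] permute(int[] v, int k) {
        int n = v.length;
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[Math.floorMod(i + k, n)] = v[i];
        }
        return out;
    }

    /** Adds v into sum element by element. */
    static void addInto(int[] sum, int[] v) {
        for (int i = 0; i < sum.length; i++) {
            sum[i] += v[i];
        }
    }

    public static void main(String[] args) {
        // Tiny made-up index vectors; real ones would be much longer and sparser.
        int[] quick = {1, 0, 0, 0, -1, 0, 0, 0};
        int[] brown = {0, 1, 0, 0, 0, -1, 0, 0};
        int[] jumps = {0, 0, 1, 0, 0, 0, -1, 0};
        int[] over  = {0, 0, 0, 1, 0, 0, 0, -1};

        // Contribution to "fox" with a window size of 2, one permutation per offset.
        int[] fox = new int[8];
        addInto(fox, permute(quick, -2));
        addInto(fox, permute(brown, -1));
        addInto(fox, permute(jumps, 1));
        addInto(fox, permute(over, 2));
        System.out.println(Arrays.toString(fox));
    }
}
```

Because each offset uses a different permutation, the same neighbor contributes a different vector depending on whether it appears before or after the focus word, which is how word order ends up encoded in the sum.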
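This class defines the following configurable properties, which may be set using either the System properties or the RandomIndexing(Properties) constructor.
Property: "edu.ucla.sspace.ri.RandomIndexing.windowSize"
Default: DEFAULT_WINDOW_SIZE
The number of words to view before and after each word in focus.
Property: "edu.ucla.sspace.ri.RandomIndexing.vectorLength"
Default: DEFAULT_VECTOR_LENGTH
The number of dimensions to be used by the index and semantic vectors.
Property: "edu.ucla.sspace.ri.RandomIndexing.usePermutations"
Default: false
Whether the index vectors for co-occurring words should be permuted based on their relative position.
Property: "edu.ucla.sspace.ri.RandomIndexing.permutationFunction"
Default: DefaultPermutationFunction
The fully qualified name of the PermutationFunction instance that will be used to permute index vectors. If "edu.ucla.sspace.ri.RandomIndexing.usePermutations" is set to false, the value of this property has no effect.
Property: "edu.ucla.sspace.ri.RandomIndexing.sparseSemantics"
Default: true
Whether to use a sparse encoding for each word's semantics, which saves space but requires more computation.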
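As a rough illustration of the configuration step, the sketch below builds a java.util.Properties object using the property names listed above. The values "2" and "2048" are illustrative assumptions taken from the examples in this description, not documented defaults, and the class name RandomIndexingConfigSketch is hypothetical. The call to the RandomIndexing(Properties) constructor is left as a comment so that the sketch runs without the library on the classpath.

```java
import java.util.Properties;

/** Hypothetical sketch of configuring RandomIndexing through Properties. */
final class RandomIndexingConfigSketch {
    /** Builds a Properties object keyed by the documented property names. */
    static Properties buildProperties() {
        Properties props = new Properties();
        // Illustrative values only; they are not the documented defaults.
        props.setProperty("edu.ucla.sspace.ri.RandomIndexing.windowSize", "2");
        props.setProperty("edu.ucla.sspace.ri.RandomIndexing.vectorLength", "2048");
        props.setProperty("edu.ucla.sspace.ri.RandomIndexing.usePermutations", "true");
        props.setProperty("edu.ucla.sspace.ri.RandomIndexing.sparseSemantics", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProperties();
        // With the library available, the properties would be handed to the
        // documented constructor:
        // RandomIndexing ri = new RandomIndexing(props);
        props.list(System.out);
    }
}
```

The same keys could instead be set with System.setProperty before calling the no-argument constructor, since the description states that System properties are also consulted.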
This class implements Filterable, which allows for fine-grained control of which semantics are retained. The setSemanticFilter(Set) method can be used to specify which words should have their semantics retained. Note that the words that are filtered out will still be used in computing the semantics of other words. This behavior is intended for use with large corpora, where retaining the semantics of all words in memory is infeasible.
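The filtering behavior described above can be sketched independently of the library. In this hypothetical SemanticFilterSketch class, "semantics" are simplified to neighbor counts rather than vectors, and an empty filter is assumed to mean "retain everything"; both are assumptions made to keep the example small.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: only filtered-in words keep semantics, but all words contribute. */
final class SemanticFilterSketch {
    private final Map<String, Map<String, Integer>> semantics = new HashMap<>();
    private final Set<String> retained = new HashSet<>();

    /** Copies the set of words whose semantics should be kept. */
    void setSemanticFilter(Set<String> semanticsToRetain) {
        retained.clear();
        retained.addAll(semanticsToRetain);
    }

    /** Counts neighbors for each retained focus word within the given window. */
    void processTokens(String[] tokens, int window) {
        for (int i = 0; i < tokens.length; i++) {
            String focus = tokens[i];
            if (!retained.isEmpty() && !retained.contains(focus)) {
                continue; // nothing is stored for a filtered-out focus word
            }
            Map<String, Integer> counts =
                semantics.computeIfAbsent(focus, w -> new HashMap<>());
            int start = Math.max(0, i - window);
            int end = Math.min(tokens.length - 1, i + window);
            for (int j = start; j <= end; j++) {
                if (j != i) {
                    // Neighbors contribute even when they are filtered out themselves.
                    counts.merge(tokens[j], 1, Integer::sum);
                }
            }
        }
    }

    Map<String, Map<String, Integer>> getSemantics() {
        return semantics;
    }

    public static void main(String[] args) {
        SemanticFilterSketch sketch = new SemanticFilterSketch();
        sketch.setSemanticFilter(new HashSet<>(Arrays.asList("fox")));
        sketch.processTokens("the quick brown fox jumps over the lazy dog".split(" "), 2);
        System.out.println(sketch.getSemantics());
    }
}
```

Memory use then scales with the number of retained words rather than with the full vocabulary, which is the motivation given above for large corpora.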
This class is thread-safe for concurrent calls of processDocument. At any given point in processing, the getVector method may be used to access the current semantics of a word. This allows callers to track incremental changes to the semantics as the corpus is processed.
The processSpace method does nothing for this class, and calls to it will not affect the results of getVector.
See Also: PermutationFunction, IndexVectorGenerator
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_VECTOR_LENGTH - The default number of dimensions to be used by the index and semantic vectors. |
static int | DEFAULT_WINDOW_SIZE - The default number of words to view before and after each word in focus. |
static String | PERMUTATION_FUNCTION_PROPERTY - The property to specify the fully qualified name of a PermutationFunction if using permutations is enabled. |
static String | RI_SSPACE_NAME |
static String | USE_PERMUTATIONS_PROPERTY - The property to specify whether the index vectors for co-occurring words should be permuted based on their relative position. |
static String | USE_SPARSE_SEMANTICS_PROPERTY - Specifies whether to use a sparse encoding for each word's semantics, which saves space but requires more computation. |
static String | VECTOR_LENGTH_PROPERTY - The property to specify the number of dimensions to be used by the index and semantic vectors. |
static String | WINDOW_SIZE_PROPERTY - The property to specify the number of words to view before and after each word in focus. |
Constructor and Description |
---|
RandomIndexing() - Creates a new RandomIndexing instance using the current System properties for configuration. |
RandomIndexing(Properties properties) - Creates a new RandomIndexing instance using the provided properties for configuration. |
Modifier and Type | Method and Description |
---|---|
void | clearSemantics() - Removes all associations between word and semantics while still retaining the word to index vector mapping. |
String | getSpaceName() - Returns a unique string describing the name and configuration of this algorithm. |
Vector | getVector(String word) - Returns the semantic vector for the provided word. |
int | getVectorLength() - Returns the length of vectors in this semantic space. |
Set<String> | getWords() - Returns the set of words that are represented in this semantic space. |
Map<String,TernaryVector> | getWordToIndexVector() - Returns an unmodifiable view on the token to IntegerVector mapping used by this instance. |
void | processDocument(BufferedReader document) - Updates the semantic vectors based on the words in the document. |
void | processSpace(Properties properties) - Does nothing. |
void | setSemanticFilter(Set<String> semanticsToRetain) - Specifies the set of words that should have their semantics retained; the semantics of all other words are not retained. |
void | setWordToIndexVector(Map<String,TernaryVector> m) - Assigns the token to IntegerVector mapping to be used by this instance. |
public static final String RI_SSPACE_NAME

public static final String VECTOR_LENGTH_PROPERTY

public static final String WINDOW_SIZE_PROPERTY

public static final String USE_PERMUTATIONS_PROPERTY

public static final String PERMUTATION_FUNCTION_PROPERTY
The property to specify the fully qualified name of a PermutationFunction if using permutations is enabled.

public static final String USE_SPARSE_SEMANTICS_PROPERTY

public static final int DEFAULT_WINDOW_SIZE

public static final int DEFAULT_VECTOR_LENGTH

public RandomIndexing()
Creates a new RandomIndexing instance using the current System properties for configuration.

public RandomIndexing(Properties properties)
Creates a new RandomIndexing instance using the provided properties for configuration.

public void clearSemantics()
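Removes all associations between word and semantics while still retaining the word to index vector mapping. This allows the same RandomIndexing instance to be reused on multiple corpora while keeping the same semantic space.

public Vector getVector(String word)
Specified by: getVector in interface SemanticSpace
Parameters: word - a word that may be in the semantic space
Returns: the Vector for the provided word, or null if the word was not in the space.

public String getSpaceName()
Specified by: getSpaceName in interface SemanticSpace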

public int getVectorLength()
Returns the length of vectors in this semantic space. [...] processSpace is called.
Specified by: getVectorLength in interface SemanticSpace

public Set<String> getWords()
Specified by: getWords in interface SemanticSpace

public Map<String,TernaryVector> getWordToIndexVector()
Returns an unmodifiable view on the token to IntegerVector mapping used by this instance. Any further changes made by this instance to its token to IntegerVector mapping will be reflected in the returned map.

public void processDocument(BufferedReader document) throws IOException
Specified by: processDocument in interface SemanticSpace
Parameters: document - a reader that allows access to the text of the document
Throws: IOException - if any error occurs while reading the document

public void processSpace(Properties properties)
Specified by: processSpace in interface SemanticSpace
Parameters: properties - a set of properties and values that may be used to configure any exposed parameters of the algorithm.

public void setWordToIndexVector(Map<String,TernaryVector> m)
Assigns the token to IntegerVector mapping to be used by this instance. The contents of the map are copied, so any additions of new index words by this instance will not be reflected in the parameter's mapping.
Parameters: m - a mapping from token to the IntegerVector that should be used to represent it when calculating other words' semantics

public void setSemanticFilter(Set<String> semanticsToRetain)
Specified by: setSemanticFilter in interface Filterable
Parameters: semanticsToRetain - the set of words for which semantics should be computed.

Copyright © 2012. All Rights Reserved.