public class GenericWordSpace extends Object implements DimensionallyInterpretableSemanticSpace<String>, Filterable, Serializable
SemanticSpace
instances.
This class also provides for a slight variation on the basic model by
differentiating co-occurrences on the basis of their relative position to the
focus word. For example, an occurrence of "red" two positions before
the focus word would be represented by a different dimension than "red" one
position before. This is reminiscent of the RandomIndexing
model with permutations.
However, unlike Random Indexing, the number of dimensions in this model is
not fixed, and it may use up to numWords * windowSize * 2
dimensions. Such a large number of dimensions can slow down
later operations on the semantic space's vectors, e.g., finding the most
similar vectors for a word.
The dimensions of this space are annotated with a description of what they represent. In the basic model, this will be the co-occurring word. In the model that takes into account word order, the description will include the relative position of the word.
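The position-aware variant and its dimension descriptions can be illustrated with a minimal sketch. It is a hypothetical stand-in, not the S-Space implementation: the class name PositionalDimensionSketch, its methods, and the "word@offset" description format are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a basis mapping; not the library's own code. */
class PositionalDimensionSketch {
    private final Map<String, Integer> featureToDimension = new HashMap<>();
    private final List<String> descriptions = new ArrayList<>();
    private final boolean useWordOrder;

    PositionalDimensionSketch(boolean useWordOrder) {
        this.useWordOrder = useWordOrder;
    }

    /**
     * Returns the dimension for a co-occurring word at a relative offset
     * (negative means before the focus word). New features get the next
     * unused dimension, so the space grows as new features are seen.
     */
    int dimensionFor(String word, int offset) {
        // Assumed description format: "word@offset" when word order is used.
        String feature = useWordOrder ? word + "@" + offset : word;
        Integer dimension = featureToDimension.get(feature);
        if (dimension == null) {
            dimension = descriptions.size();
            featureToDimension.put(feature, dimension);
            descriptions.add(feature);
        }
        return dimension;
    }

    /** Returns the description recorded for a dimension. */
    String describe(int dimension) {
        return descriptions.get(dimension);
    }

    /** Upper bound on dimensions when word order is used. */
    static long maxDimensions(long numWords, int windowSize) {
        return numWords * windowSize * 2;
    }

    public static void main(String[] args) {
        PositionalDimensionSketch ordered = new PositionalDimensionSketch(true);
        int twoBefore = ordered.dimensionFor("red", -2);
        int oneBefore = ordered.dimensionFor("red", -1);
        System.out.println(ordered.describe(twoBefore) + " -> " + twoBefore);
        System.out.println(ordered.describe(oneBefore) + " -> " + oneBefore);
        System.out.println("upper bound: " + maxDimensions(50_000, 5));
    }
}
```

With word order disabled, both occurrences of "red" map to the same dimension, so the space grows only with the vocabulary rather than with the vocabulary times the window.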
This class defines the following configurable properties, which may be set
using either the System properties or the GenericWordSpace(Properties)
constructor.
"edu.ucla.sspace.gws.GenericWordSpace.windowSize"
"edu.ucla.sspace.gws.GenericWordSpace.useWordOrder"
Default: false
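As a rough sketch of how these properties might be set, the following builds a java.util.Properties object with the two keys listed above. The helper class GwsPropertiesSketch is hypothetical, and the assumption that the values are given as an integer string and a boolean string is not confirmed by this page. The constructor call is shown only in a comment because it needs the S-Space library on the classpath.

```java
import java.util.Properties;

/** Hypothetical helper for illustration; not part of the library. */
class GwsPropertiesSketch {
    // Property names copied from the list above.
    static final String WINDOW_SIZE =
        "edu.ucla.sspace.gws.GenericWordSpace.windowSize";
    static final String USE_WORD_ORDER =
        "edu.ucla.sspace.gws.GenericWordSpace.useWordOrder";

    /** Builds a Properties object; value formats are assumed. */
    static Properties configure(int windowSize, boolean useWordOrder) {
        Properties props = new Properties();
        props.setProperty(WINDOW_SIZE, Integer.toString(windowSize));
        props.setProperty(USE_WORD_ORDER, Boolean.toString(useWordOrder));
        return props;
    }

    public static void main(String[] args) {
        Properties props = configure(3, true);
        // With the library available, the properties would be passed on:
        // GenericWordSpace space = new GenericWordSpace(props);

        // Alternatively, set System properties before calling the
        // no-argument constructor:
        System.setProperty(USE_WORD_ORDER, "true");
        System.out.println(props);
    }
}
```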
This class implements Filterable, which allows for fine-grained
control of which semantics are retained. The setSemanticFilter(Set)
method can be used to specify which words should have their semantics
retained. Note that the words that are filtered out will still be used in
computing the semantics of other words. This behavior is intended for
use with large corpora where retaining the semantics of all words in memory
is infeasible.
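The filtering behavior described above can be sketched with a small, self-contained co-occurrence counter. This is an illustration under stated assumptions, not the class's actual code: SemanticFilterSketch and its count method are hypothetical, and treating an empty filter as "retain every word" is an assumption.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of semantic filtering; not the library's code. */
class SemanticFilterSketch {
    /**
     * Counts co-occurrences within windowSize tokens on either side of
     * each focus word. Vectors are kept only for words in retain, but
     * filtered-out words still count as context for retained words.
     */
    static Map<String, Map<String, Integer>> count(
            List<String> tokens, int windowSize, Set<String> retain) {
        Map<String, Map<String, Integer>> vectors = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            String focus = tokens.get(i);
            // Assumption: an empty filter means every word is retained.
            if (!retain.isEmpty() && !retain.contains(focus)) {
                continue;
            }
            Map<String, Integer> vector =
                vectors.computeIfAbsent(focus, k -> new HashMap<>());
            int start = Math.max(0, i - windowSize);
            int end = Math.min(tokens.size() - 1, i + windowSize);
            for (int j = start; j <= end; j++) {
                if (j != i) {
                    vector.merge(tokens.get(j), 1, Integer::sum);
                }
            }
        }
        return vectors;
    }
}
```

In this sketch, filtering with the single word "car" over the tokens "the red car stopped" keeps a vector only for "car", yet that vector still records "red" and "stopped" as context.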
This class is thread-safe for concurrent calls of processDocument. At any
given point in processing, the getVector method may be used
to access the current semantics of a word. This allows callers to track
incremental changes to the semantics as the corpus is processed.
The processSpace method does nothing for this class, and calls to it will
not affect the results of getVector.
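The concurrent usage pattern might look like the sketch below, which submits one document per task to a thread pool. To stay self-contained it uses a hypothetical stand-in class, ConcurrentProcessingSketch, that only counts tokens. It mirrors the processDocument(BufferedReader) signature but makes no claim about how GenericWordSpace achieves thread safety internally.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in that counts tokens; not the library's code. */
class ConcurrentProcessingSketch {
    private final ConcurrentHashMap<String, AtomicInteger> counts =
        new ConcurrentHashMap<>();

    /** Safe to call from several threads at once. */
    void processDocument(BufferedReader document) throws IOException {
        String line;
        while ((line = document.readLine()) != null) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.computeIfAbsent(token, k -> new AtomicInteger())
                          .incrementAndGet();
                }
            }
        }
    }

    /** May be called at any time to read the current state. */
    int getCount(String word) {
        AtomicInteger count = counts.get(word);
        return count == null ? 0 : count.get();
    }

    /** Processes each document on a pool thread and waits for completion. */
    static ConcurrentProcessingSketch processAll(List<String> docs, int threads) {
        ConcurrentProcessingSketch space = new ConcurrentProcessingSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : docs) {
            pool.submit(() -> {
                try (BufferedReader reader =
                         new BufferedReader(new StringReader(doc))) {
                    space.processDocument(reader);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return space;
    }
}
```

With the real class, the same pool structure would call processDocument on a shared GenericWordSpace, and getVector could be queried while tasks are still running.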
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_WINDOW_SIZE: The default number of words to view before and after each word in focus. |
static String | GWS_SSPACE_NAME |
static String | USE_WORD_ORDER_PROPERTY: The property to specify whether the relative positions of a word's co-occurrences should be distinguished from each other. |
static String | WINDOW_SIZE_PROPERTY: The property to specify the number of words to view before and after each word in focus. |
Constructor and Description |
---|
GenericWordSpace(): Creates a new GenericWordSpace instance using the current System properties for configuration. |
GenericWordSpace(int windowSize): Creates a new GenericWordSpace with the provided window size that ignores word order. |
GenericWordSpace(int windowSize, BasisMapping<Duple<String,Integer>,String> basis): Creates a new GenericWordSpace with the provided window size that uses the specified basis mapping to map each co-occurrence at a specified position to a dimension. |
GenericWordSpace(int windowSize, boolean useWordOrder): Creates a new GenericWordSpace with the provided window size that optionally includes word order. |
GenericWordSpace(Properties properties): Creates a new GenericWordSpace instance using the provided properties for configuration. |
Modifier and Type | Method and Description |
---|---|
void | clearSemantics(): Removes all associations between words and their semantics while still retaining the words' basis mapping. |
String | getDimensionDescription(int dimension): Returns a description of the features to which the specified dimension corresponds. |
String | getSpaceName(): Returns a unique string describing the name and configuration of this algorithm. |
SparseIntegerVector | getVector(String word): Returns the semantic vector for the provided word. |
int | getVectorLength(): Returns the length of vectors in this semantic space. |
Set<String> | getWords(): Returns the set of words that are represented in this semantic space. |
void | processDocument(BufferedReader document): Updates the semantic vectors based on the words in the document. |
void | processSpace(Properties properties): Does nothing. |
void | setSemanticFilter(Set<String> semanticsToRetain): Specifies the set of words that should have their semantics retained; the semantics of all other words are not retained. |
public static final String GWS_SSPACE_NAME
public static final String WINDOW_SIZE_PROPERTY
public static final String USE_WORD_ORDER_PROPERTY
public static final int DEFAULT_WINDOW_SIZE
public GenericWordSpace()
Creates a new GenericWordSpace instance using the current System properties for configuration.

public GenericWordSpace(Properties properties)
Creates a new GenericWordSpace instance using the provided properties for configuration.

public GenericWordSpace(int windowSize)
Creates a new GenericWordSpace with the provided window size that ignores word order.

public GenericWordSpace(int windowSize, boolean useWordOrder)
Creates a new GenericWordSpace with the provided window size that optionally includes word order.

public GenericWordSpace(int windowSize, BasisMapping<Duple<String,Integer>,String> basis)
Creates a new GenericWordSpace with the provided window size that uses the specified basis mapping to map each co-occurrence at a specified position to a dimension.
Parameters:
basis - a basis mapping from a duple that represents a word and its relative position to a dimension.

public void clearSemantics()
Removes all associations between words and their semantics while still retaining the words' basis mapping. This allows the same GenericWordSpace to be used on multiple corpora while keeping the semantics of the dimensions identical.

public String getDimensionDescription(int dimension)
Specified by:
getDimensionDescription in interface DimensionallyInterpretableSemanticSpace<String>
Parameters:
dimension - a dimension number

public SparseIntegerVector getVector(String word)
Specified by:
getVector in interface SemanticSpace
Parameters:
word - a word that may be in the semantic space
Returns:
the Vector for the provided word or null if the word was not in the space.

public String getSpaceName()
Specified by:
getSpaceName in interface SemanticSpace

public int getVectorLength()
processSpace is called.
Specified by:
getVectorLength in interface SemanticSpace

public Set<String> getWords()
Specified by:
getWords in interface SemanticSpace

public void processDocument(BufferedReader document) throws IOException
Specified by:
processDocument in interface SemanticSpace
Parameters:
document - a reader that allows access to the text of the document
Throws:
IOException - if any error occurs while reading the document

public void processSpace(Properties properties)
Specified by:
processSpace in interface SemanticSpace
Parameters:
properties - a set of properties and values that may be used to configure any exposed parameters of the algorithm.

public void setSemanticFilter(Set<String> semanticsToRetain)
Specified by:
setSemanticFilter in interface Filterable
Parameters:
semanticsToRetain - the set of words for which semantics should be computed.

Copyright © 2012. All Rights Reserved.