public abstract class OrderedTemporalRandomIndexing extends Object implements TemporalSemanticSpace, Filterable
A variant of TemporalRandomIndexing that imposes
restrictions on the document input ordering to improve efficiency at the cost
of functionality. Specifically, this class assumes:
Because each slice is calculated and then discarded, this class provides a
way for users to be notified when a semantic slice has been completed. Users
may add a Runnable via the addPartitionHook(Runnable) method.
When the input stream of documents partitions the current semantic slice from
the next (i.e. the slice is complete), each runnable will be invoked. This
allows users to perform any operations on the slice as necessary, such as
saving it to disk or computing various statistics.
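The hook lifecycle described above can be sketched with a small self-contained stand-in. SliceHookSketch, its fixed-length partitioning rule, and the completedSliceSizes helper are hypothetical names invented for this illustration; they are not part of the S-Space library, and the real class decides when to partition through shouldPartitionSpace(long).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the slice lifecycle; this is NOT the S-Space class.
class SliceHookSketch {
    private final List<Runnable> partitionHooks = new ArrayList<>();
    private final long sliceLength;  // assumed rule: slices span a fixed duration
    private Long startTime = null;   // earliest time stamp in the current slice
    private int docsInSlice = 0;

    SliceHookSketch(long sliceLength) {
        this.sliceLength = sliceLength;
    }

    void addPartitionHook(Runnable hook) {
        partitionHooks.add(hook);
    }

    int docsInSlice() {
        return docsInSlice;
    }

    // Called once per document, in non-decreasing time stamp order.
    void processDocument(long timeStamp) {
        if (startTime != null && timeStamp - startTime >= sliceLength) {
            // The slice is complete: run every hook before its contents are discarded.
            for (Runnable hook : partitionHooks) {
                hook.run();
            }
            docsInSlice = 0;
            startTime = null;
        }
        if (startTime == null) {
            startTime = timeStamp;
        }
        docsInSlice++;
    }

    // Feeds the time stamps through a sketch and records the size of each completed slice.
    static List<Integer> completedSliceSizes(long sliceLength, long... timeStamps) {
        SliceHookSketch space = new SliceHookSketch(sliceLength);
        List<Integer> sizes = new ArrayList<>();
        space.addPartitionHook(() -> sizes.add(space.docsInSlice()));
        for (long t : timeStamps) {
            space.processDocument(t);
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(completedSliceSizes(10, 0, 3, 7, 12, 15, 25)); // prints [3, 2]
    }
}
```

Note that the hook only fires when a later document closes the slice, so in this sketch the final, still-open slice is never reported. A caller that needs the last slice would have to inspect it explicitly once the input is exhausted.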
This class implements Filterable, which allows for fine-grained
control of which semantics are retained. The setSemanticFilter(Set)
method can be used to specify which words should have their semantics
retained. Note that the words that are filtered out will still be used in
computing the semantics of other words. This behavior is intended for
use with large corpora where retaining the semantics of all words in memory
is infeasible.
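To make the filtering rule concrete, here is a minimal sketch that uses plain co-occurrence counts in place of random-indexing vectors. SemanticFilterSketch and its cooccurrences method are hypothetical and only illustrate the idea that filtered-out words keep no semantics of their own yet still contribute as context; the actual S-Space implementation may differ.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the filtering behavior; not the S-Space implementation.
class SemanticFilterSketch {

    // Returns, for each retained focus word, how often each context word
    // appeared within `window` positions of it.
    static Map<String, Map<String, Integer>> cooccurrences(
            String[] tokens, int window, Set<String> semanticsToRetain) {
        Map<String, Map<String, Integer>> semantics = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            String focus = tokens[i];
            // Words outside the filter get no semantics of their own...
            if (!semanticsToRetain.contains(focus)) {
                continue;
            }
            Map<String, Integer> counts =
                semantics.computeIfAbsent(focus, k -> new HashMap<>());
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i) {
                    // ...but they still count as context for retained words.
                    counts.merge(tokens[j], 1, Integer::sum);
                }
            }
        }
        return semantics;
    }

    public static void main(String[] args) {
        String[] doc = {"the", "cat", "sat", "on", "the", "mat"};
        Set<String> filter = new HashSet<>(Arrays.asList("cat", "mat"));
        System.out.println(cooccurrences(doc, 2, filter));
    }
}
```

With the filter set to "cat" and "mat", only those two words receive entries, while "the", "sat" and "on" still show up inside their context counts.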
This base class defines the following configurable properties:
"edu.ucla.sspace.tri.OrderedTemporalRandomIndexing.windowSize"
5
words are counted before and 5 words are counted after.
This class always uses a symmetric window.
"edu.ucla.sspace.tri.OrderedTemporalRandomIndexing.vectorLength"
"edu.ucla.sspace.tri.OrderedTemporalRandomIndexing.sparseSemantics"
true
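As a rough illustration of how these properties might be supplied, the sketch below builds a java.util.Properties object using the three keys listed above. The class name TriPropertiesSketch and the vectorLength value of 2048 are assumptions made for the example; the default vector length is not stated in this section.

```java
import java.util.Properties;

// Hypothetical helper; only the property keys come from the documentation above.
class TriPropertiesSketch {
    static final String PREFIX = "edu.ucla.sspace.tri.OrderedTemporalRandomIndexing.";

    static Properties exampleConfiguration() {
        Properties props = new Properties();
        // Count 5 words before and 5 words after each focus word.
        props.setProperty(PREFIX + "windowSize", "5");
        // Illustrative value only; not a documented default.
        props.setProperty(PREFIX + "vectorLength", "2048");
        props.setProperty(PREFIX + "sparseSemantics", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = exampleConfiguration();
        // A concrete subclass would presumably pass these on to the
        // OrderedTemporalRandomIndexing(Properties) constructor.
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```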
Due to the ordered nature of its processing, great care must be used when
invoking processDocument from multiple threads. Multiple threads may
order the documents such that the time stamps at semantic slice boundaries
overlap. This may cause the shouldPartitionSpace(long) method to
return true for slices with only a single document. Subclasses must make it
clear whether any such multithreading behavior is permissible and how to
correctly invoke it to avoid triggering semantic slice boundary edge cases.
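One conservative way to sidestep the boundary problem is to sort the documents by time stamp and feed them from a single thread. The sketch below shows only that ordering step; OrderedFeedSketch and TimedDocument are hypothetical names, and whether a given subclass permits anything more concurrent is for that subclass to document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OrderedFeedSketch {
    // Hypothetical holder pairing a document's text with its time stamp.
    static final class TimedDocument {
        final long timeStamp;
        final String text;

        TimedDocument(long timeStamp, String text) {
            this.timeStamp = timeStamp;
            this.text = text;
        }
    }

    // Returns a copy of the documents sorted by ascending time stamp.
    static List<TimedDocument> inTimeOrder(List<TimedDocument> docs) {
        List<TimedDocument> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingLong(d -> d.timeStamp));
        return sorted;
    }

    public static void main(String[] args) {
        List<TimedDocument> docs = new ArrayList<>();
        docs.add(new TimedDocument(300L, "third"));
        docs.add(new TimedDocument(100L, "first"));
        docs.add(new TimedDocument(200L, "second"));
        // A single thread then feeds the documents one at a time; with the real
        // class this loop body would call processDocument(reader, timeStamp).
        for (TimedDocument d : inTimeOrder(docs)) {
            System.out.println(d.timeStamp + " " + d.text);
        }
    }
}
```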
In its base behavior, instances of this class do not support the
optional getTimeSteps, getVectorAfter, getVectorBefore and
getVectorBetween methods. However, subclasses may add this functionality.
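Code written against the temporal interface may therefore want to guard these calls. The following sketch uses a cut-down, hypothetical TemporalLookup interface (with double[] standing in for the library's Vector type) to show one possible pattern: attempt the time-restricted query and fall back to the whole-slice vector when the operation is unsupported. Whether such a fallback is acceptable depends on the application.

```java
import java.util.Arrays;

class OptionalMethodSketch {
    // Hypothetical, cut-down stand-in for the temporal lookup methods.
    interface TemporalLookup {
        double[] getVector(String word);
        double[] getVectorAfter(String word, long startTime);
    }

    // Mirrors the base behavior: the time-restricted query is not supported.
    static final class BaseLookup implements TemporalLookup {
        public double[] getVector(String word) {
            return new double[] {1.0, 0.0};
        }

        public double[] getVectorAfter(String word, long startTime) {
            throw new UnsupportedOperationException("getVectorAfter is not supported");
        }
    }

    // Tries the time-restricted query and falls back to the whole-slice vector.
    static double[] vectorAfterOrAll(TemporalLookup space, String word, long startTime) {
        try {
            return space.getVectorAfter(word, startTime);
        } catch (UnsupportedOperationException e) {
            return space.getVector(word);
        }
    }

    public static void main(String[] args) {
        double[] v = vectorAfterOrAll(new BaseLookup(), "cat", 0L);
        System.out.println(Arrays.toString(v)); // prints [1.0, 0.0]
    }
}
```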
See Also: RandomIndexing, TemporalRandomIndexing, TemporalSemanticSpace
Modifier and Type | Field and Description |
---|---|
protected RandomIndexing | currentSlice - The current semantic slice, which is updated as new documents are processed and has its semantics cleared when shouldPartitionSpace(long) returns true. |
static int | DEFAULT_VECTOR_LENGTH - The default number of dimensions to be used by the index and semantic vectors. |
static int | DEFAULT_WINDOW_SIZE - The default number of words to view before and after each word in focus. |
protected Long | endTime - The most recent time stamp seen during the current semantic slice. |
protected Collection<Runnable> | partitionHooks - The collection of hooks that are to be run prior to every time this instance partitions its semantic space. |
static String | PERMUTATION_FUNCTION_PROPERTY - The property to specify the fully qualified name of a edu.ucla.sspace.ri.PermutationFunction if using permutations is enabled. |
protected Long | startTime - The earliest time stamp seen during the current semantic slice. |
static String | USE_PERMUTATIONS_PROPERTY - The property to specify whether the index vectors for co-occurring words should be permuted based on their relative position. |
static String | USE_SPARSE_SEMANTICS_PROPERTY - Specifies whether to use a sparse encoding for each word's semantics, which saves space but requires more computation. |
static String | VECTOR_LENGTH_PROPERTY - The property to specify the number of dimensions to be used by the index and semantic vectors. |
static String | WINDOW_SIZE_PROPERTY - The property to specify the number of words to view before and after each word in focus. |
Constructor and Description |
---|
OrderedTemporalRandomIndexing() - Creates an instance of OrderedTemporalRandomIndexing using the system properties to configure the behavior. |
OrderedTemporalRandomIndexing(Properties props) - Creates an instance of OrderedTemporalRandomIndexing using the provided properties to configure the behavior. |
Modifier and Type | Method and Description |
---|---|
void | addPartitionHook(Runnable hook) - Adds the provided Runnable to the list of hooks that will be invoked immediately prior to the partitioning of this space. |
protected void | clear() - Clears the semantic content of this space as a part of the partitioning process. |
Long | endTime() - Returns the time for the latest semantics contained within this space. |
abstract String | getSpaceName() - Returns a unique string describing the name and configuration of this algorithm. |
SortedSet<Long> | getTimeSteps(String word) - Not supported. |
Vector | getVector(String word) - Returns the provided word's semantic vector based on all temporal occurrences. |
Vector | getVectorAfter(String word, long startTime) - Not supported. |
Vector | getVectorBefore(String word, long endTime) - Not supported. |
Vector | getVectorBetween(String word, long startTime, long endTime) - Not supported. |
int | getVectorLength() - Returns the length of vectors in this semantic space. |
Set<String> | getWords() - Returns the set of words that are represented in this semantic space. |
Map<String,TernaryVector> | getWordToIndexVector() - Returns an unmodifiable view on the token to TernaryVector mapping used by this instance. |
void | processDocument(BufferedReader document) - Processes the contents of the provided reader as a document, using the current time as the timestamp. |
void | processDocument(BufferedReader document, long timeStamp) - Processes the contents of the provided buffer as a document, using the provided timestamp as the date when the document was written. |
void | processSpace(Properties props) - Does nothing. |
void | setSemanticFilter(Set<String> semanticsToRetain) - Sets a filter such that only words that are in the set have their semantics retained by this instance. |
void | setWordToIndexVector(Map<String,TernaryVector> m) - Assigns the token to TernaryVector mapping to be used by this instance. |
protected abstract boolean | shouldPartitionSpace(long nextTimeStamp) - Returns true if the current contents of this semantic space should be partitioned and discarded prior to processing the next document with the specified time stamp. |
Long | startTime() - Returns the time for the earliest semantics contained within this space. |

public static final String PERMUTATION_FUNCTION_PROPERTY
The property to specify the fully qualified name of a edu.ucla.sspace.ri.PermutationFunction if using permutations is enabled.

public static final String USE_PERMUTATIONS_PROPERTY

public static final String USE_SPARSE_SEMANTICS_PROPERTY

public static final String VECTOR_LENGTH_PROPERTY

public static final String WINDOW_SIZE_PROPERTY

public static final int DEFAULT_VECTOR_LENGTH

public static final int DEFAULT_WINDOW_SIZE

protected final Collection<Runnable> partitionHooks

protected final RandomIndexing currentSlice
The current semantic slice, which is updated as new documents are processed and has its semantics cleared when shouldPartitionSpace(long) returns true.

protected Long endTime

protected Long startTime

public OrderedTemporalRandomIndexing()
Creates an instance of OrderedTemporalRandomIndexing using the system properties to configure the behavior.

public OrderedTemporalRandomIndexing(Properties props)
Creates an instance of OrderedTemporalRandomIndexing using the provided properties to configure the behavior.
Parameters:
props - the properties used to configure this instance

public void addPartitionHook(Runnable hook)
Adds the provided Runnable to the list of hooks that will be
invoked immediately prior to the partitioning of this space. This
method provides a mechanism for users to perform additional processing on
the current semantic slice of this space before it is discarded.
Parameters:
hook - a runnable to be invoked

protected void clear()

public void processDocument(BufferedReader document) throws IOException
Specified by:
processDocument in interface SemanticSpace
processDocument in interface TemporalSemanticSpace
Parameters:
document - a reader that allows access to the text of the document
Throws:
IOException - if any error occurs while reading the document

public void processDocument(BufferedReader document, long timeStamp) throws IOException
Specified by:
processDocument in interface TemporalSemanticSpace
Parameters:
document - a reader that allows access to the text of the document
timeStamp - the time at which the document was written
Throws:
IOException - if any error occurs while reading the document

public void setSemanticFilter(Set<String> semanticsToRetain)
Specified by:
setSemanticFilter in interface Filterable
Parameters:
semanticsToRetain - the set of words for which semantics should be computed

protected abstract boolean shouldPartitionSpace(long nextTimeStamp)
Returns true if the current contents of this semantic space
should be partitioned and discarded prior to processing the next
document with the specified time stamp. Subclasses should use this
method to specify the conditions under which the temporal semantics are
to be divided.
Parameters:
nextTimeStamp - the time stamp of the next document that has yet to be processed
Returns:
true if the current contents of this space should be partitioned and discarded before processing the next document

public Long startTime()
Specified by:
startTime in interface TemporalSemanticSpace

public Long endTime()
Specified by:
endTime in interface TemporalSemanticSpace
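The start time of the current slice is the natural input to a subclass's shouldPartitionSpace(long) rule. As a hedged sketch, the code below partitions whenever the next document falls in a different UTC calendar month than the slice's first document. MonthlyPartitionSketch is hypothetical, the rule is written as a static method for testability (a real subclass would override the instance method and read its startTime field), and it assumes time stamps are milliseconds since the UNIX epoch and that startTime is null while a slice is empty.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;

// Hypothetical partitioning rule; not taken from the S-Space library.
class MonthlyPartitionSketch {

    // Assumes time stamps are milliseconds since the UNIX epoch.
    static YearMonth monthOf(long timeStamp) {
        return YearMonth.from(Instant.ofEpochMilli(timeStamp).atZone(ZoneOffset.UTC));
    }

    // startTime is assumed to be null while the current slice is still empty.
    static boolean shouldPartitionSpace(Long startTime, long nextTimeStamp) {
        return startTime != null
            && !monthOf(startTime).equals(monthOf(nextTimeStamp));
    }

    public static void main(String[] args) {
        long jan30 = Instant.parse("2012-01-30T12:00:00Z").toEpochMilli();
        long jan31 = Instant.parse("2012-01-31T23:59:59Z").toEpochMilli();
        long feb01 = Instant.parse("2012-02-01T00:00:00Z").toEpochMilli();
        System.out.println(shouldPartitionSpace(jan30, jan31)); // false: same month
        System.out.println(shouldPartitionSpace(jan30, feb01)); // true: next month
        System.out.println(shouldPartitionSpace(null, feb01));  // false: empty slice
    }
}
```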

public abstract String getSpaceName()
Specified by:
getSpaceName in interface SemanticSpace

public SortedSet<Long> getTimeSteps(String word)
Specified by:
getTimeSteps in interface TemporalSemanticSpace
Parameters:
word
Throws:
UnsupportedOperationException - if called

public Vector getVectorAfter(String word, long startTime)
Specified by:
getVectorAfter in interface TemporalSemanticSpace
Parameters:
word - a word in the semantic space
startTime - a UNIX timestamp that denotes the time after which all occurrences of the provided word should be counted
Returns:
null if the word was not in the space
Throws:
UnsupportedOperationException - if called

public Vector getVectorBefore(String word, long endTime)
Specified by:
getVectorBefore in interface TemporalSemanticSpace
Parameters:
word - a word in the semantic space
endTime - a UNIX timestamp that denotes the time before which all occurrences of the provided word should be counted
Returns:
null if the word was not in the space
Throws:
UnsupportedOperationException - if called

public Vector getVectorBetween(String word, long startTime, long endTime)
Specified by:
getVectorBetween in interface TemporalSemanticSpace
Parameters:
word - a word in the semantic space
startTime - a UNIX timestamp that denotes the time before which no occurrences of the word should be counted
endTime - a UNIX timestamp that denotes the time after which no occurrences of the word should be counted
Returns:
null if the word was not in the space
Throws:
UnsupportedOperationException - if called

public Vector getVector(String word)
Specified by:
getVector in interface SemanticSpace
getVector in interface TemporalSemanticSpace
Parameters:
word - a word that may be in the semantic space
Returns:
the Vector for the provided word, or null if the word was not in the space

public int getVectorLength()
processSpace is called.
Specified by:
getVectorLength in interface SemanticSpace

public Set<String> getWords()
Specified by:
getWords in interface SemanticSpace

public Map<String,TernaryVector> getWordToIndexVector()
Returns an unmodifiable view on the token to TernaryVector
mapping used by this instance. Any further changes made by this instance
to its token to TernaryVector mapping will be reflected in the
returned map.

public void processSpace(Properties props)
Specified by:
processSpace in interface SemanticSpace
Parameters:
props - a set of properties and values that may be used to configure any exposed parameters of the algorithm

public void setWordToIndexVector(Map<String,TernaryVector> m)
Assigns the token to TernaryVector mapping to be used by this
instance. The contents of the map are copied, so any additions of new
index words by this instance will not be reflected in the parameter's
mapping.
Parameters:
m - a mapping from token to the TernaryVector that should be used to represent it when calculating other words' semantics

Copyright © 2012. All Rights Reserved.