public class HyperspaceAnalogueToLanguage extends Object implements SemanticSpace
A SemanticSpace implementation of the Hyperspace Analogue to Language (HAL) algorithm described by Lund and Burgess. This implementation is based on the following paper:
HAL is based on recording the co-occurrence of words in a sparse matrix. HAL also incorporates word order information by treating the co-occurrence of two words x y as different from y x. Each word is assigned a unique index in the co-occurrence matrix. For some word x, when another word y co-occurs before it, matrix entry x,y is updated. Similarly, when y co-occurs after it, matrix entry y,x is updated. Therefore the full semantic vector for any word is its row vector concatenated with its column vector.
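The row and column bookkeeping described above can be illustrated with a small, self-contained sketch. It is not the code of this class: the name `HalMatrixSketch`, the dense `double[][]` storage, and the fixed linear ramp (`windowSize - distance + 1`) are assumptions made for illustration. Entry `[x][y]` accumulates weight when `y` occurs before `x`, so a single backward scan from each focus word covers both update cases.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the documented class.
final class HalMatrixSketch {

    // Assigns each distinct word a unique index in order of first appearance.
    static Map<String, Integer> indexWords(List<String> tokens) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String token : tokens) {
            index.putIfAbsent(token, index.size());
        }
        return index;
    }

    // Builds an N x N matrix in which entry [x][y] accumulates a weight
    // every time y occurs before x within windowSize positions.
    static double[][] cooccurrences(List<String> tokens,
                                    Map<String, Integer> index,
                                    int windowSize) {
        int n = index.size();
        double[][] matrix = new double[n][n];
        for (int i = 0; i < tokens.size(); i++) {
            int x = index.get(tokens.get(i));
            for (int d = 1; d <= windowSize && i - d >= 0; d++) {
                int y = index.get(tokens.get(i - d));
                // Assumed linear ramp: the closest word gets the largest weight.
                matrix[x][y] += windowSize - d + 1;
            }
        }
        return matrix;
    }

    // Concatenates the row and the column of a word into a vector of length 2*N.
    static double[] fullVector(double[][] matrix, int word) {
        int n = matrix.length;
        double[] vector = new double[2 * n];
        for (int j = 0; j < n; j++) {
            vector[j] = matrix[word][j];      // row part
            vector[n + j] = matrix[j][word];  // column part
        }
        return vector;
    }
}
```

For the token sequence `the cat sat` with a window of 2, the row for `cat` records that `the` preceded it, and the column for `cat` records that it preceded `sat`; the two entries differ because word order is kept.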
Typically, the full vectors are used (for an N x N matrix, these are 2*N in length). However, HAL also offers two possibilities for dimensionality reduction. Not all columns provide an equal amount of information for distinguishing the meanings of the words. Specifically, the information-theoretic entropy of each column can be calculated as a way of ordering the columns by their importance. Using this ranking, either a fixed number of columns may be retained, or a threshold may be set to filter out low-entropy columns.
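The entropy ranking can be sketched as follows. How this class computes entropy internally is not stated here, so the sketch assumes one common reading: each column is normalized to a probability distribution and its Shannon entropy (in bits) is taken. The class name `ColumnEntropySketch` and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only; the normalization scheme is an assumption.
final class ColumnEntropySketch {

    // Shannon entropy in bits of one column, treating its values as a distribution.
    static double columnEntropy(double[][] matrix, int col) {
        double sum = 0;
        for (double[] row : matrix) {
            sum += row[col];
        }
        if (sum == 0) {
            return 0; // an empty column carries no information
        }
        double entropy = 0;
        for (double[] row : matrix) {
            double p = row[col] / sum;
            if (p > 0) {
                entropy -= p * (Math.log(p) / Math.log(2));
            }
        }
        return entropy;
    }

    // Returns the column indices ordered from highest to lowest entropy.
    static Integer[] rankColumns(double[][] matrix) {
        int cols = matrix.length == 0 ? 0 : matrix[0].length;
        Integer[] order = new Integer[cols];
        double[] entropy = new double[cols];
        for (int c = 0; c < cols; c++) {
            order[c] = c;
            entropy[c] = columnEntropy(matrix, c);
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer c) -> -entropy[c]));
        return order;
    }
}
```

With such a ranking, keeping a fixed number of columns amounts to taking a prefix of the returned order, and thresholding amounts to dropping every column whose entropy falls below the chosen value.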
This class provides four parameters that may be set:
"edu.ucla.sspace.hal.HyperspaceAnalogueToLanguage.windowSize"
"edu.ucla.sspace.hal.weighting"
edu.ucla.sspace.hal.LinearWeighting
The WeightingFunction class that will be used to determine how to weigh co-occurrences. HAL traditionally uses a ramped, linear weighting in which the words occurring closest receive the most weight, with a linear decrease based on distance.
"edu.ucla.sspace.hal.retainColumns"
"edu.ucla.sspace.hal.HyperspaceAnalogueToLanguage.threshold"
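As a minimal sketch of how these properties might be assembled, the following builds a `java.util.Properties` object using the keys listed above. The values (`"5"` and `"1000"`) and the class name `HalConfigSketch` are hypothetical and chosen only for illustration; only one of the two reduction options is set, since they are alternatives.

```java
import java.util.Properties;

// Hypothetical illustration only; the values shown are not documented defaults.
final class HalConfigSketch {

    static Properties exampleConfig() {
        Properties props = new Properties();
        // Number of words before and after the focus word (assumed value).
        props.setProperty(
            "edu.ucla.sspace.hal.HyperspaceAnalogueToLanguage.windowSize", "5");
        // Class used to weigh co-occurrences by distance.
        props.setProperty(
            "edu.ucla.sspace.hal.weighting",
            "edu.ucla.sspace.hal.LinearWeighting");
        // Keep a fixed number of columns (assumed value).
        props.setProperty("edu.ucla.sspace.hal.retainColumns", "1000");
        return props;
    }
}
```

The resulting object would then be passed to the `HyperspaceAnalogueToLanguage(Properties properties)` constructor listed in the constructor summary.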
Note that the weight function can also be used to create special cases of the HAL model. For example, an asymmetric window could be created by assigning a weight of 0 to all co-occurrences on one side.
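The asymmetric-window idea can be sketched with a stand-in interface. The `Weighting` interface below is hypothetical and is not the signature of the library's `WeightingFunction`; it assumes a signed offset in which negative values mean the neighbor precedes the focus word.

```java
// Hypothetical illustration only; Weighting is a stand-in, not WeightingFunction.
final class AsymmetricWeightSketch {

    interface Weighting {
        // offset < 0: neighbor precedes the focus word; offset > 0: it follows.
        double weight(int offset, int windowSize);
    }

    // Ramped linear weighting: the closest words receive the most weight.
    static final Weighting LINEAR = (offset, windowSize) -> {
        int distance = Math.abs(offset);
        return (distance == 0 || distance > windowSize)
            ? 0
            : windowSize - distance + 1;
    };

    // Wraps a weighting so that every co-occurrence after the focus word gets 0.
    static Weighting precedingOnly(Weighting base) {
        return (offset, windowSize) ->
            offset < 0 ? base.weight(offset, windowSize) : 0;
    }
}
```

Wrapping rather than rewriting the base function keeps the ramp itself unchanged, so the same approach would apply to any other distance-based weighting.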
See Also: SemanticSpace, WeightingFunction
Modifier and Type | Field and Description |
---|---|
static WeightingFunction | DEFAULT_WEIGHTING: The default WeightingFunction to use. |
static int | DEFAULT_WINDOW_SIZE: The default number of words before and after the focus word to include. |
static String | ENTROPY_THRESHOLD_PROPERTY: The property to specify the minimum entropy threshold a word should have to be included in the vector space after processing. |
static String | RETAIN_PROPERTY: The property to specify the number of columns to retain. |
static String | WEIGHTING_FUNCTION_PROPERTY: The property to set the WeightingFunction to be used for weighting the co-occurrence of neighboring words based on their distance. |
static String | WINDOW_SIZE_PROPERTY: The property to specify the number of words to view before and after each word in focus. |
Constructor and Description |
---|
HyperspaceAnalogueToLanguage(): Constructs a new instance using the system properties for configuration. |
HyperspaceAnalogueToLanguage(Properties properties): Constructs a new instance using the provided properties for configuration. |
Modifier and Type | Method and Description |
---|---|
String | getSpaceName(): Returns a unique string describing the name and configuration of this algorithm. |
Vector | getVector(String word): Returns the semantic vector for the provided word. |
int | getVectorLength(): Returns the length of vectors in this semantic space. |
Set<String> | getWords(): Returns the set of words that are represented in this semantic space. |
void | processDocument(BufferedReader document): Processes the contents of the provided file as a document. |
void | processSpace(Properties properties): Once all the documents have been processed, performs any post-processing steps on the data. |
public static final String ENTROPY_THRESHOLD_PROPERTY
public static final String WINDOW_SIZE_PROPERTY
public static final String RETAIN_PROPERTY
public static final String WEIGHTING_FUNCTION_PROPERTY
The property to set the WeightingFunction to be used for weighting the co-occurrence of neighboring words based on their distance.
public static final int DEFAULT_WINDOW_SIZE
public static final WeightingFunction DEFAULT_WEIGHTING
The default WeightingFunction to use.
public HyperspaceAnalogueToLanguage()
public HyperspaceAnalogueToLanguage(Properties properties)
public void processDocument(BufferedReader document) throws IOException
Specified by: processDocument in interface SemanticSpace
Parameters: document - a reader that allows access to the text of the document
Throws: IOException - if any error occurs while reading the document
public Set<String> getWords()
Specified by: getWords in interface SemanticSpace
public Vector getVector(String word)
Specified by: getVector in interface SemanticSpace
Parameters: word - a word that may be in the semantic space
Returns: the Vector for the provided word, or null if the word was not in the space.
public int getVectorLength()
processSpace is called.
Specified by: getVectorLength in interface SemanticSpace
public void processSpace(Properties properties)
properties argument. By general contract, once this method has been called, processDocument will not be called again.
Specified by: processSpace in interface SemanticSpace
Parameters: properties - a set of properties and values that may be used to configure any exposed parameters of the algorithm.
public String getSpaceName()
Specified by: getSpaceName in interface SemanticSpace
Copyright © 2012. All Rights Reserved.