public class GeneralizedOrssSeed extends Object implements KMeansSeed
A generalization of the OrssSeed implementation, this implementation is based on
using a SimilarityFunction to compare points. In contrast, OrssSeed uses the
Euclidean distance to compare points, as in the Ostrovsky et al. (2006)
formulation. The properties defined in the ORSS paper are preserved if the
similarity is defined as the inverse of the squared Euclidean distance, which
produces the same results as the OrssSeed implementation. However, this
implementation generalizes the notion of distance to inverse similarity, which
allows data to be compared using alternate methods, such as CosineSimilarity,
which is frequently used in comparing text documents. Note that the similarity
values returned by any SimilarityFunction used by this class must always be
non-negative.
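The relationship between similarity and distance described above can be sketched in plain Java. This is only an illustration over double[] arrays and does not use the library's SimilarityFunction interface; the class SimilaritySketch and all of its method names are hypothetical. Clamping the cosine at zero is one assumed way to meet the non-negativity requirement when vectors may contain negative components, not something this class is documented to do.

```java
public class SimilaritySketch {

    /**
     * Hypothetical similarity: the inverse of the squared Euclidean distance.
     * Identical points have distance 0 and therefore similarity +Infinity.
     */
    static double inverseSquaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return 1.0 / sum; // 1.0 / 0.0 is Double.POSITIVE_INFINITY in Java
    }

    /**
     * Cosine similarity, clamped at zero so the result is never negative.
     * A zero-length vector is treated as having similarity 0 to everything.
     */
    static double nonNegativeCosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return Math.max(0.0, dot / Math.sqrt(normA * normB));
    }

    /**
     * The generalized "distance" is the inverse of the similarity:
     * a similarity of 0 maps to +Infinity, and +Infinity maps to 0.
     */
    static double inverseSimilarity(double similarity) {
        return 1.0 / similarity;
    }
}
```

With inverseSquaredEuclidean as the similarity, inverseSimilarity recovers the squared Euclidean distance, which is why that choice matches the Euclidean formulation.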
In addition, this class provides an overload of the chooseSeeds
method that allows the input data points to be weighted.
Weighting enables finding seeds where the inputs are representative of
different sample sizes.
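One way to picture how weights can stand in for sample sizes is a weighted draw in which a point with weight w counts as w copies of itself. The sketch below is an assumption made for illustration and is not taken from this class's source: the class WeightedSeedSketch, the method sampleIndex, and the cost array (a per-point inverse similarity to the seeds chosen so far) are all hypothetical.

```java
import java.util.Random;

public class WeightedSeedSketch {

    /**
     * Draws one index with probability proportional to weights[i] * cost[i].
     * cost[i] is assumed to be a finite, non-negative value such as the
     * inverse similarity of point i to its closest already-chosen seed.
     * Returns -1 if every point has zero mass.
     */
    static int sampleIndex(double[] cost, int[] weights, Random rng) {
        double total = 0.0;
        for (int i = 0; i < cost.length; i++) {
            total += weights[i] * cost[i];
        }
        if (total <= 0.0) {
            return -1;
        }

        double r = rng.nextDouble() * total; // uniform in [0, total)
        int lastWithMass = -1;
        for (int i = 0; i < cost.length; i++) {
            double mass = weights[i] * cost[i];
            if (mass > 0.0) {
                lastWithMass = i;
            }
            r -= mass;
            if (r < 0.0) {
                return i;
            }
        }
        // Only reached through floating-point rounding; fall back to the
        // last point that carried any probability mass.
        return lastWithMass;
    }
}
```

Under this scheme a point with weight 0 is never selected, and doubling a point's weight doubles its chance of being drawn relative to the others.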
This implementation is derived in part from Michael Shindler's ORSS seed implementation, which is part of the Fast Streaming K-Means implementation.
See Also: OrssSeed
Constructor and Description |
---|
GeneralizedOrssSeed(SimilarityFunction simFunc) |
Modifier and Type | Method and Description |
---|---|
DoubleVector[] | chooseSeeds(int k, Matrix dataPoints) Selects k rows of dataPoints to be seeds of a k-means instance. |
DoubleVector[] | chooseSeeds(Matrix dataPoints, int k, int[] weights) Selects k rows of dataPoints, weighted by the specified amount, to be seeds of a k-means instance. |
public GeneralizedOrssSeed(SimilarityFunction simFunc)
public DoubleVector[] chooseSeeds(int k, Matrix dataPoints)
Selects k rows of dataPoints to be seeds of a k-means instance. If more seeds are requested than are available, all possible rows are returned.
Specified by: chooseSeeds in interface KMeansSeed
Parameters:
dataPoints - a matrix whose rows are to be evaluated and from which k data points will be selected
k - the number of data points (rows) to select
public DoubleVector[] chooseSeeds(Matrix dataPoints, int k, int[] weights)
Selects k rows of dataPoints, weighted by the specified amount, to be seeds of a k-means instance. If more seeds are requested than are available, all possible rows are returned.
Parameters:
dataPoints - a matrix whose rows are to be evaluated and from which k data points will be selected
k - the number of data points (rows) to select
weights - a set of scalar int weights that reflect the importance of each data point
Copyright © 2012. All Rights Reserved.