public class GeneralizedOrssSeed extends Object implements KMeansSeed
A generalization of the OrssSeed implementation that uses
a SimilarityFunction to compare points. In contrast, the
OrssSeed uses the Euclidean distance to compare points as in the
Ostrovsky et al. (2006) formulation. The properties defined in the ORSS
paper are preserved if the similarity is defined as the inverse of the
squared Euclidean distance, which produces the same results as the OrssSeed implementation. However, this implementation generalizes the notion
of distance to inverse-similarity, which allows data to be compared using
alternate methods, such as CosineSimilarity, which is frequently used in
comparing text documents. Note that the similarity values returned by any
SimilarityFunction used by this class must always be non-negative.
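The relationship between similarity and distance described above can be sketched in plain Java. This is a minimal illustration, not the library's code: the Similarity interface, the SimilaritySketch class, and the use of double[] in place of DoubleVector are all hypothetical stand-ins chosen so the example runs without the library on the classpath.

```java
// Hypothetical stand-in types; these are NOT the library's SimilarityFunction API.
final class SimilaritySketch {

    /** Assumed minimal shape of a similarity function over dense vectors. */
    interface Similarity {
        double sim(double[] a, double[] b);
    }

    /** Inverse of the squared Euclidean distance; infinite for identical points. */
    static final Similarity INVERSE_SQUARED_EUCLIDEAN = (a, b) -> {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return 1.0 / sum;
    };

    /** Cosine of the angle between two vectors; negative when they point apart. */
    static final Similarity COSINE = (a, b) -> {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    };

    /** Inverse similarity, which takes the place of distance in the generalized view. */
    static double inverseSimilarity(Similarity f, double[] a, double[] b) {
        return 1.0 / f.sim(a, b);
    }

    public static void main(String[] args) {
        double[] p = {0, 0};
        double[] q = {3, 4};
        // With the inverse squared Euclidean similarity, the inverse similarity
        // recovers the squared Euclidean distance between p and q.
        System.out.println(inverseSimilarity(INVERSE_SQUARED_EUCLIDEAN, p, q));

        // Term-count vectors have no negative entries, so their cosine is non-negative.
        double[] doc1 = {1, 2, 0};
        double[] doc2 = {0, 1, 3};
        System.out.println(COSINE.sim(doc1, doc2));

        // Vectors with negative components can yield a negative cosine, which
        // would violate the non-negativity requirement stated above.
        System.out.println(COSINE.sim(new double[] {1, 0}, new double[] {-1, 0}));
    }
}
```

The last call shows why the non-negativity requirement matters: cosine similarity is safe for non-negative data such as term counts, but inputs with negative components would need to be shifted or rescaled first. How the class itself treats identical points, whose inverse squared Euclidean similarity is infinite, is not stated here.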
In addition, this class provides an overload of the chooseSeeds method that allows the input data points to be weighted.
Weighting enables finding seeds when each input point represents
a different number of samples.
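One way to read the weights is that a row with weight w stands for w identical samples. The sketch below checks that reading on a toy cost function. It is an assumption-laden illustration: the WeightedCostSketch class and its methods are hypothetical, it uses squared Euclidean distance on double[] rows, and it does not show how this class applies weights internally.

```java
// Hypothetical illustration; not the library's weighting code.
final class WeightedCostSketch {

    /** Squared Euclidean distance between two dense vectors. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Sum over rows of weight times squared distance to a single center. */
    static double weightedCost(double[][] points, int[] weights, double[] center) {
        double cost = 0;
        for (int i = 0; i < points.length; i++) {
            cost += weights[i] * squaredDistance(points[i], center);
        }
        return cost;
    }

    public static void main(String[] args) {
        double[] center = {1, 0};

        // Two summary rows: the first stands for three samples, the second for one.
        double[][] summary = {{0, 0}, {4, 0}};
        int[] weights = {3, 1};

        // The same data written out with one row per sample and unit weights.
        double[][] expanded = {{0, 0}, {0, 0}, {0, 0}, {4, 0}};
        int[] ones = {1, 1, 1, 1};

        System.out.println(weightedCost(summary, weights, center));  // prints 12.0
        System.out.println(weightedCost(expanded, ones, center));    // prints 12.0
    }
}
```

Both calls give the same cost, which is the sense in which integer weights let a compact summary stand in for a larger sample.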
This implementation is in part derived from the ORSS seed implementation of Michael Shindler as a part of the Fast Streaming K-Means implementation available here.
See Also: OrssSeed

| Constructor and Description |
|---|
| GeneralizedOrssSeed(SimilarityFunction simFunc) |

| Modifier and Type | Method and Description |
|---|---|
| DoubleVector[] | chooseSeeds(int k, Matrix dataPoints): Selects k rows of dataPoints to be seeds of a k-means instance. |
| DoubleVector[] | chooseSeeds(Matrix dataPoints, int k, int[] weights): Selects k rows of dataPoints, weighted by the specified amount, to be seeds of a k-means instance. |

public GeneralizedOrssSeed(SimilarityFunction simFunc)

public DoubleVector[] chooseSeeds(int k, Matrix dataPoints)
Selects k rows of dataPoints to be seeds of a k-means instance. If more seeds are requested than are available, all possible rows are returned.
Specified by: chooseSeeds in interface KMeansSeed
Parameters:
dataPoints - a matrix whose rows are to be evaluated and from which k data points will be selected
k - the number of data points (rows) to select

public DoubleVector[] chooseSeeds(Matrix dataPoints, int k, int[] weights)
Selects k rows of dataPoints, weighted by the specified amount, to be seeds of a k-means instance. If more seeds are requested than are available, all possible rows are returned.
Parameters:
dataPoints - a matrix whose rows are to be evaluated and from which k data points will be selected
k - the number of data points (rows) to select
weights - a set of scalar int weights that reflect the importance of each data point

Copyright © 2012. All Rights Reserved.