org.apache.commons.math3.ml.clustering
Class KMeansPlusPlusClusterer<T extends Clusterable>

java.lang.Object
  extended by org.apache.commons.math3.ml.clustering.Clusterer<T>
      extended by org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer<T>
Type Parameters:
T - type of the points to cluster

public class KMeansPlusPlusClusterer<T extends Clusterable>
extends Clusterer<T>

Clustering algorithm based on David Arthur and Sergei Vassilvitski k-means++ algorithm.

Since:
3.2
Version:
$Id: KMeansPlusPlusClusterer.java 1461866 2013-03-27 21:54:36Z tn $
See Also:
K-means++ (wikipedia)

Nested Class Summary
static class KMeansPlusPlusClusterer.EmptyClusterStrategy
          Strategies to use for replacing an empty cluster.
 
Field Summary
private  KMeansPlusPlusClusterer.EmptyClusterStrategy emptyStrategy
          Selected strategy for empty clusters.
private  int k
          The number of clusters.
private  int maxIterations
          The maximum number of iterations.
private  RandomGenerator random
          Random generator for choosing initial centers.
 
Constructor Summary
KMeansPlusPlusClusterer(int k)
          Build a clusterer.
KMeansPlusPlusClusterer(int k, int maxIterations)
          Build a clusterer.
KMeansPlusPlusClusterer(int k, int maxIterations, DistanceMeasure measure)
          Build a clusterer.
KMeansPlusPlusClusterer(int k, int maxIterations, DistanceMeasure measure, RandomGenerator random)
          Build a clusterer.
KMeansPlusPlusClusterer(int k, int maxIterations, DistanceMeasure measure, RandomGenerator random, KMeansPlusPlusClusterer.EmptyClusterStrategy emptyStrategy)
          Build a clusterer.
 
Method Summary
private  int assignPointsToClusters(List<CentroidCluster<T>> clusters, Collection<T> points, int[] assignments)
          Adds the given points to the closest Cluster.
private  Clusterable centroidOf(Collection<T> points, int dimension)
          Computes the centroid for a set of points.
private  List<CentroidCluster<T>> chooseInitialCenters(Collection<T> points)
          Use K-means++ to choose the initial centers.
 List<CentroidCluster<T>> cluster(Collection<T> points)
          Runs the K-means++ clustering algorithm.
 KMeansPlusPlusClusterer.EmptyClusterStrategy getEmptyClusterStrategy()
          Returns the KMeansPlusPlusClusterer.EmptyClusterStrategy used by this instance.
private  T getFarthestPoint(Collection<CentroidCluster<T>> clusters)
          Get the point farthest to its cluster center
 int getK()
          Return the number of clusters this instance will use.
 int getMaxIterations()
          Returns the maximum number of iterations this instance will use.
private  int getNearestCluster(Collection<CentroidCluster<T>> clusters, T point)
          Returns the nearest Cluster to the given point
private  T getPointFromLargestNumberCluster(Collection<? extends Cluster<T>> clusters)
          Get a random point from the Cluster with the largest number of points
private  T getPointFromLargestVarianceCluster(Collection<CentroidCluster<T>> clusters)
          Get a random point from the Cluster with the largest distance variance.
 RandomGenerator getRandomGenerator()
          Returns the random generator this instance will use.
 
Methods inherited from class org.apache.commons.math3.ml.clustering.Clusterer
distance, getDistanceMeasure
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

k

private final int k
The number of clusters.


maxIterations

private final int maxIterations
The maximum number of iterations.


random

private final RandomGenerator random
Random generator for choosing initial centers.


emptyStrategy

private final KMeansPlusPlusClusterer.EmptyClusterStrategy emptyStrategy
Selected strategy for empty clusters.

Constructor Detail

KMeansPlusPlusClusterer

public KMeansPlusPlusClusterer(int k)
Build a clusterer.

The default strategy for handling empty clusters that may appear during algorithm iterations is to split the cluster with largest distance variance.

The euclidean distance will be used as default distance measure.

Parameters:
k - the number of clusters to split the data into

KMeansPlusPlusClusterer

public KMeansPlusPlusClusterer(int k,
                               int maxIterations)
Build a clusterer.

The default strategy for handling empty clusters that may appear during algorithm iterations is to split the cluster with largest distance variance.

The euclidean distance will be used as default distance measure.

Parameters:
k - the number of clusters to split the data into
maxIterations - the maximum number of iterations to run the algorithm for. If negative, no maximum will be used.

KMeansPlusPlusClusterer

public KMeansPlusPlusClusterer(int k,
                               int maxIterations,
                               DistanceMeasure measure)
Build a clusterer.

The default strategy for handling empty clusters that may appear during algorithm iterations is to split the cluster with largest distance variance.

Parameters:
k - the number of clusters to split the data into
maxIterations - the maximum number of iterations to run the algorithm for. If negative, no maximum will be used.
measure - the distance measure to use

KMeansPlusPlusClusterer

public KMeansPlusPlusClusterer(int k,
                               int maxIterations,
                               DistanceMeasure measure,
                               RandomGenerator random)
Build a clusterer.

The default strategy for handling empty clusters that may appear during algorithm iterations is to split the cluster with largest distance variance.

Parameters:
k - the number of clusters to split the data into
maxIterations - the maximum number of iterations to run the algorithm for. If negative, no maximum will be used.
measure - the distance measure to use
random - random generator to use for choosing initial centers

KMeansPlusPlusClusterer

public KMeansPlusPlusClusterer(int k,
                               int maxIterations,
                               DistanceMeasure measure,
                               RandomGenerator random,
                               KMeansPlusPlusClusterer.EmptyClusterStrategy emptyStrategy)
Build a clusterer.

Parameters:
k - the number of clusters to split the data into
maxIterations - the maximum number of iterations to run the algorithm for. If negative, no maximum will be used.
measure - the distance measure to use
random - random generator to use for choosing initial centers
emptyStrategy - strategy to use for handling empty clusters that may appear during algorithm iterations
Method Detail

getK

public int getK()
Return the number of clusters this instance will use.

Returns:
the number of clusters

getMaxIterations

public int getMaxIterations()
Returns the maximum number of iterations this instance will use.

Returns:
the maximum number of iterations, or -1 if no maximum is set

getRandomGenerator

public RandomGenerator getRandomGenerator()
Returns the random generator this instance will use.

Returns:
the random generator

getEmptyClusterStrategy

public KMeansPlusPlusClusterer.EmptyClusterStrategy getEmptyClusterStrategy()
Returns the KMeansPlusPlusClusterer.EmptyClusterStrategy used by this instance.

Returns:
the KMeansPlusPlusClusterer.EmptyClusterStrategy

cluster

public List<CentroidCluster<T>> cluster(Collection<T> points)
                                                     throws MathIllegalArgumentException,
                                                            ConvergenceException
Runs the K-means++ clustering algorithm.

Specified by:
cluster in class Clusterer<T extends Clusterable>
Parameters:
points - the points to cluster
Returns:
a list of clusters containing the points
Throws:
MathIllegalArgumentException - if the data points are null or the number of clusters is larger than the number of data points
ConvergenceException - if an empty cluster is encountered and the emptyStrategy is set to ERROR

assignPointsToClusters

private int assignPointsToClusters(List<CentroidCluster<T>> clusters,
                                   Collection<T> points,
                                   int[] assignments)
Adds the given points to the closest Cluster.

Parameters:
clusters - the Clusters to add the points to
points - the points to add to the given Clusters
assignments - points assignments to clusters
Returns:
the number of points assigned to different clusters as the iteration before

chooseInitialCenters

private List<CentroidCluster<T>> chooseInitialCenters(Collection<T> points)
Use K-means++ to choose the initial centers.

Parameters:
points - the points to choose the initial centers from
Returns:
the initial centers

getPointFromLargestVarianceCluster

private T getPointFromLargestVarianceCluster(Collection<CentroidCluster<T>> clusters)
                                                          throws ConvergenceException
Get a random point from the Cluster with the largest distance variance.

Parameters:
clusters - the Clusters to search
Returns:
a random point from the selected cluster
Throws:
ConvergenceException - if clusters are all empty

getPointFromLargestNumberCluster

private T getPointFromLargestNumberCluster(Collection<? extends Cluster<T>> clusters)
                                                        throws ConvergenceException
Get a random point from the Cluster with the largest number of points

Parameters:
clusters - the Clusters to search
Returns:
a random point from the selected cluster
Throws:
ConvergenceException - if clusters are all empty

getFarthestPoint

private T getFarthestPoint(Collection<CentroidCluster<T>> clusters)
                                        throws ConvergenceException
Get the point farthest to its cluster center

Parameters:
clusters - the Clusters to search
Returns:
point farthest to its cluster center
Throws:
ConvergenceException - if clusters are all empty

getNearestCluster

private int getNearestCluster(Collection<CentroidCluster<T>> clusters,
                              T point)
Returns the nearest Cluster to the given point

Parameters:
clusters - the Clusters to search
point - the point to find the nearest Cluster for
Returns:
the index of the nearest Cluster to the given point

centroidOf

private Clusterable centroidOf(Collection<T> points,
                               int dimension)
Computes the centroid for a set of points.

Parameters:
points - the set of points
dimension - the point dimension
Returns:
the computed centroid for the set of points


Copyright (c) 2003-2013 Apache Software Foundation