This clustering algorithm needs a neighborhood graph on the points, and an estimate of the density at each point.
A few possible graph constructions and density estimators are provided for convenience, but it is perfectly natural
to provide your own.
:Requires: `SciPy <installation.html#scipy>`_, `Scikit-learn <installation.html#scikit-learn>`_ or others
(see :class:`~gudhi.point_cloud.knn.KNearestNeighbors`), depending on the options.
Attributes
----------
n_clusters_: int
    The number of clusters. Writing to it automatically adjusts `labels_`.
merge_threshold_: float
    Minimum prominence of a cluster so it doesn't get merged. Writing to it automatically adjusts `labels_`.
n_leaves_: int
    Number of leaves (unstable clusters) in the hierarchical tree.
leaf_labels_: ndarray of shape (n_samples,)
    Cluster labels for each point, at the very bottom of the hierarchy.
labels_: ndarray of shape (n_samples,)
    Cluster labels for each point, after merging.
diagram_: ndarray of shape (`n_leaves_`, 2)
    Persistence diagram (only the finite points).
max_weight_per_cc_: ndarray of shape (n_connected_components,)
    Maximum of the density function on each connected component. This corresponds to the abscissa of infinite
    points in the diagram.
children_: ndarray of shape (`n_leaves_`-n_connected_components, 2)
    The children of each non-leaf node. Values less than `n_leaves_` correspond to leaves of the tree.
    A node i greater than or equal to `n_leaves_` is a non-leaf node and has children children_[i - `n_leaves_`].
    Alternatively, at the i-th iteration, children_[i][0] and children_[i][1] are merged to form node `n_leaves_` + i.
weights_: ndarray of shape (n_samples,)
    Weights of the points, as computed by the density estimator or provided by the user.
params_: dict
    Parameters such as metric, etc.
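
The encoding of `children_` described above can be illustrated with a small, library-independent sketch. It assumes that a requested number of clusters is obtained by applying the first `n_leaves_` - `n_clusters_` merges in order; that rule, the helper name `merge_labels`, the toy hierarchy, and the numbering of the output labels are all illustrative assumptions, not GUDHI's actual implementation.

```python
def merge_labels(leaf_labels, children, n_leaves, n_clusters):
    """Relabel points after applying the first n_leaves - n_clusters merges."""
    # Union-find over tree nodes: leaves are 0..n_leaves-1, non-leaf nodes follow.
    parent = list(range(n_leaves + len(children)))

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    n_merges = max(0, min(len(children), n_leaves - n_clusters))
    for i in range(n_merges):
        a, b = children[i]
        # Iteration i merges nodes a and b into the new node n_leaves + i.
        parent[find(a)] = n_leaves + i
        parent[find(b)] = n_leaves + i

    # Renumber the surviving roots consecutively, in order of first appearance.
    renumber = {}
    return [renumber.setdefault(find(leaf), len(renumber)) for leaf in leaf_labels]


# Hypothetical hierarchy: 4 leaves in one connected component, hence 3 merges.
children = [(0, 1), (2, 3), (4, 5)]
leaf_labels = [0, 0, 1, 2, 3, 3, 2]

print(merge_labels(leaf_labels, children, n_leaves=4, n_clusters=2))
# [0, 0, 0, 1, 1, 1, 1]
```

Here leaves 0 and 1 form node 4, leaves 2 and 3 form node 5, and the last merge (nodes 4 and 5 into node 6) is skipped because two clusters were requested.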
def gudhi.clustering.tomato.Tomato.__init__(self, graph_type="knn", density_type="logDTM", n_clusters=None, merge_threshold=None, **params)
Args:
    graph_type (str): 'manual', 'knn' or 'radius'. Default is 'knn'.
    density_type (str): 'manual', 'DTM', 'logDTM', 'KDE' or 'logKDE'. When you have many points,
        'KDE' and 'logKDE' tend to be slower. Default is 'logDTM'.
    metric (str|Callable): metric used when calculating the distance between instances in a feature array.
        Defaults to Minkowski of parameter p.
    kde_params (dict): if density_type is 'KDE' or 'logKDE', additional parameters passed directly to
        sklearn.neighbors.KernelDensity.
    k (int): number of neighbors for a knn graph (including the vertex itself). Defaults to 10.
    k_DTM (int): number of neighbors for the DTM density estimation (including the vertex itself).
        Defaults to k.
    r (float): size of a neighborhood if graph_type is 'radius'. Also used as default bandwidth in kde_params.
    eps (float): (1+eps) approximation factor when computing distances (ignored in many cases).
    n_clusters (int): number of clusters requested. Defaults to None, i.e. no merging occurs and we get
        the maximal number of clusters.
    merge_threshold (float): minimum prominence of a cluster so it doesn't get merged.
    symmetrize_graph (bool): whether we should add edges to make the neighborhood graph symmetric.
        This can be useful with k-NN for small k. Defaults to False.
    p (float): norm L^p on input points. Defaults to 2.
    q (float): order used to compute the distance to measure. Defaults to dim.
        Beware that when the dimension is large, this can easily cause overflows.
    dim (float): final exponent in DTM density estimation, representing the dimension. Defaults to the
        dimension, or 2 when the dimension cannot be read from the input (metric is "precomputed").
    n_jobs (int): number of jobs to schedule for parallel processing on the CPU.
        If -1 is given, all processors are used. Default: 1.
    params: extra parameters are passed to :class:`~gudhi.point_cloud.knn.KNearestNeighbors` and
        :class:`~gudhi.point_cloud.dtm.DTMDensity`.
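
To make the roles of `k_DTM`, `q` and `dim` more concrete, here is a rough brute-force sketch of a DTM-style density estimate. It assumes the estimate is the power mean of order `q` of the distances to the `k_DTM` nearest neighbors (the point itself included), raised to the power `-dim`; the function name `dtm_density` is hypothetical, and the library's estimator may normalize differently or use another neighbor search.

```python
import math


def dtm_density(points, k_dtm, q=None, dim=None):
    """Unnormalized DTM-style density estimate at each point (brute force)."""
    dim = len(points[0]) if dim is None else dim  # assumed default: ambient dimension
    q = dim if q is None else q                   # q defaults to dim, as documented
    densities = []
    for x in points:
        # Distances to the k_dtm nearest neighbors; x itself (distance 0) counts.
        dists = sorted(math.dist(x, y) for y in points)[:k_dtm]
        # Power mean of order q of those distances: the distance to measure.
        dtm = (sum(d ** q for d in dists) / k_dtm) ** (1.0 / q)
        # Small distance to measure means high density; needs dtm > 0,
        # i.e. k_dtm >= 2 and no duplicated points.
        densities.append(dtm ** (-dim))
    return densities


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
density = dtm_density(points, k_dtm=3)
# A 'logDTM'-style variant is assumed here to be the logarithm of the same quantity.
log_density = [math.log(d) for d in density]
```

In this sketch the isolated point `(5.0, 5.0)` gets a much lower density than the three clustered points. Because distances are raised to the power `q`, a large `q` (for instance when it defaults to a large dimension) can push intermediate values out of floating-point range, which is presumably the overflow risk noted for `q`.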
def gudhi.clustering.tomato.Tomato.fit(self, X, y=None, weights=None)
Args:
    X ((n,d)-array of float|(n,n)-array of float|Sequence[Iterable[int]]): coordinates of the points,
        or distance matrix (full, not just a triangle) if metric is "precomputed", or list of neighbors
        for each point (points are represented by their index, starting from 0) if graph_type is "manual".
        The number of points is currently limited to about 2 billion.
    weights (ndarray of shape (n_samples,)): if density_type is 'manual', a density estimate at each point.
    y: Not used; present for API consistency with scikit-learn.
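
As an illustration of the "manual" input formats, the following pure-Python sketch builds a list of neighbors per point and a density value per point. The helper `knn_neighbors`, the brute-force search, and the crude density (inverse distance to the farthest listed neighbor) are illustrative assumptions, not what the library computes internally.

```python
import math


def knn_neighbors(points, k, symmetrize=False):
    """Brute-force k-NN lists; each point counts among its own k neighbors."""
    n = len(points)
    neighbors = []
    for i in range(n):
        # Sort indices by distance to point i; on ties, i itself comes first.
        order = sorted(range(n), key=lambda j: (math.dist(points[i], points[j]), j != i))
        neighbors.append(order[:k])
    if symmetrize:
        # Add reverse edges so that j in neighbors[i] implies i in neighbors[j].
        sets = [set(row) for row in neighbors]
        for i, row in enumerate(neighbors):
            for j in row:
                sets[j].add(i)
        neighbors = [sorted(s) for s in sets]
    return neighbors


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
neighbors = knn_neighbors(points, k=2)
symmetric = knn_neighbors(points, k=2, symmetrize=True)

# Crude manual density: inverse distance to the farthest listed neighbor.
weights = [
    1.0 / max(math.dist(points[i], points[j]) for j in row)
    for i, row in enumerate(neighbors)
]

print(neighbors)  # [[0, 1], [1, 0], [2, 1], [3, 2]]
print(symmetric)  # [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]
```

With GUDHI installed, inputs of this form could then be passed along the lines of `Tomato(graph_type="manual", density_type="manual").fit(neighbors, weights=weights)`. The example also shows why symmetrizing can matter for small k: point 1 is a neighbor of point 2, but not the other way around until the reverse edge is added.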