Classify a set of observations into k clusters using the k-means algorithm.
The algorithm attempts to minimize the Euclidean distance between
observations and centroids. Several initialization methods are
included.
Parameters:
data : ndarray
An ‘M’ by ‘N’ array of ‘M’ observations in ‘N’ dimensions or a length
‘M’ array of ‘M’ 1-D observations.
k : int or ndarray
The number of clusters to form as well as the number of
centroids to generate. If the minit initialization string is
‘matrix’, or if an ndarray is given instead, it is
interpreted as the initial centroids to use.
iter : int, optional
Number of iterations of the k-means algorithm to run. Note
that this differs in meaning from the iter parameter of
the kmeans function.
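As a rough illustration of how iter limits the number of update steps, the sketch below starts from deliberately poor initial centroids (values chosen for this example only, supplied through minit='matrix') and compares one iteration with ten.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Six 2-D observations forming two groups along the x-axis.
data = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                 [10.0, 0.0], [11.0, 0.0], [12.0, 0.0]])

# Deliberately poor starting centroids, both inside the left group
# (illustrative values, not a recommended initialization).
init = np.array([[0.0, 0.0], [1.0, 0.0]])

# A single iteration: one assignment step followed by one centroid update.
centroid_1, label_1 = kmeans2(data, init, iter=1, minit='matrix')

# Ten iterations: enough for this small data set to settle.
centroid_10, label_10 = kmeans2(data, init, iter=10, minit='matrix')

print(centroid_1)
print(centroid_10)
```

After one iteration the second centroid has only moved to the mean of the five observations initially assigned to it, while ten iterations place the centroids at the means of the two visible groups.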
thresh : float, optional
(not used yet)
minit : str, optional
Method for initialization. Available methods are ‘random’,
‘points’, ‘++’ and ‘matrix’:
‘random’: generate k centroids from a Gaussian with mean and
variance estimated from the data.
‘points’: choose k observations (rows) at random from data for
the initial centroids.
‘++’: choose k observations according to the kmeans++ method
(careful seeding).
‘matrix’: interpret the k parameter as a k by N (or length k
array for 1-D data) array of initial centroids.
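The initialization methods listed above can be tried side by side. The following is a minimal sketch on made-up data; the array values and variable names are assumptions for illustration. The ‘points’ and ‘++’ calls are left unseeded, so their centroids may differ between runs.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Eight 2-D observations in two loose groups (M = 8, N = 2).
data = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
                 [4.0, 4.1], [4.2, 4.0], [4.1, 4.3], [4.3, 4.2]])

# 'points' and '++' pick their starting centroids from the rows of data,
# so k is given as an integer.
results = {}
for method in ('points', '++'):
    centroid, label = kmeans2(data, 2, minit=method)
    results[method] = (centroid.shape, label.shape)

# 'matrix' takes the starting centroids directly: a k by N array here.
init = np.array([[0.0, 0.0], [4.0, 4.0]])
centroid_m, label_m = kmeans2(data, init, minit='matrix')

# For 1-D data the starting centroids are a length-k array.
data_1d = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
centroid_1d, label_1d = kmeans2(data_1d, np.array([0.0, 10.0]),
                                minit='matrix')
```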
missing : str, optional
Method to deal with empty clusters. Available methods are
‘warn’ and ‘raise’:
‘warn’: give a warning and continue.
‘raise’: raise a ClusterError and terminate the algorithm.
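One way to see the ‘raise’ behavior is to force an empty cluster. In the sketch below, the starting centroids are contrived (an assumption made for this example) so that no observation is closest to the second one.

```python
import numpy as np
from scipy.cluster.vq import ClusterError, kmeans2

# All observations sit near the origin.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])

# The second starting centroid is far from every observation, so no
# observation is assigned to it and its cluster is empty.
init = np.array([[0.0, 0.0], [100.0, 100.0]])

try:
    kmeans2(data, init, minit='matrix', missing='raise')
    raised = False
except ClusterError as exc:
    raised = True
    print(f"ClusterError: {exc}")
```

With missing='warn' the same call would be expected to emit a warning and return a result instead of raising.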
check_finite : bool, optional
Whether to check that the input matrices contain only finite numbers.
Disabling may give a performance gain, but may result in problems
(crashes, non-termination) if the inputs do contain infinities or NaNs.
Default: True
rng : {None, int, numpy.random.Generator}, optional
If rng is passed by keyword, types other than numpy.random.Generator are
passed to numpy.random.default_rng to instantiate a Generator.
If rng is already a Generator instance, then the provided instance is
used. Specify rng for repeatable function behavior.
If this argument is passed by position or seed is passed by keyword,
legacy behavior for the argument seed applies:
If seed is an int, a new RandomState instance is used,
seeded with seed.
If seed is already a Generator or RandomState instance then
that instance is used.
Changed in version 1.15.0: As part of the SPEC-007
transition from use of numpy.random.RandomState to
numpy.random.Generator, this keyword was changed from seed to rng.
For an interim period, both keywords will continue to work, although only one
may be specified at a time. After the interim period, function calls using the
seed keyword will emit warnings. The behavior of both seed and
rng is outlined above, but only the rng keyword should be used in new code.
Returns:
centroid : ndarray
A ‘k’ by ‘N’ array of centroids found at the last iteration of
k-means.
label : ndarray
label[i] is the code or index of the centroid the
ith observation is closest to.
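To make the meaning of the two return values concrete, the sketch below recomputes the nearest centroid for each observation by hand and compares it with label. The data and starting centroids are invented for the example; explicit starting centroids are used only to keep the run deterministic.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Two well-separated groups of 2-D points (M = 6 observations, N = 2).
data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Explicit starting centroids, one near each group.
init = np.array([[0.0, 0.0], [5.0, 5.0]])
centroid, label = kmeans2(data, init, minit='matrix')

# Euclidean distance from every observation to every returned centroid,
# giving an M by k array; argmin picks the closest centroid per row.
dists = np.linalg.norm(data[:, None, :] - centroid[None, :, :], axis=2)
nearest = dists.argmin(axis=1)

print(centroid.shape)                   # one row per centroid
print(np.array_equal(label, nearest))   # label[i] indexes the closest centroid
```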
kmeans2 has experimental support for Python Array API Standard compatible
backends in addition to NumPy. Please consider testing these features
by setting an environment variable SCIPY_ARRAY_API=1 and providing
CuPy, PyTorch, JAX, or Dask arrays as array arguments. The following
combinations of backend and device (or other capability) are supported.
References:
D. Arthur and S. Vassilvitskii, "k-means++: the advantages of
careful seeding", Proceedings of the Eighteenth Annual ACM-SIAM Symposium
on Discrete Algorithms, 2007.