Commit b7cd425

avi09 and trekhleb authored
Added kmeans clustering (trekhleb#595)

* added kmeans

Co-authored-by: Oleksii Trekhleb <trehleb@gmail.com>

1 parent 90ec1b7 commit b7cd425

File tree

4 files changed: +167 −0 lines changed

‎README.md‎

Lines changed: 1 addition & 0 deletions
@@ -147,6 +147,7 @@ a set of rules that precisely define a sequence of operations.
  * **Machine Learning**
    * `B` [NanoNeuron](https://github.com/trekhleb/nano-neuron) - 7 simple JS functions that illustrate how machines can actually learn (forward/backward propagation)
    * `B` [k-NN](src/algorithms/ml/knn) - k-nearest neighbors classification algorithm
+   * `B` [k-Means](src/algorithms/ml/kmeans) - k-Means clustering algorithm
  * **Uncategorized**
    * `B` [Tower of Hanoi](src/algorithms/uncategorized/hanoi-tower)
    * `B` [Square Matrix Rotation](src/algorithms/uncategorized/square-matrix-rotation) - in-place algorithm

‎src/algorithms/ml/kmeans/README.md‎

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
# k-Means Algorithm

The **k-Means algorithm** is an unsupervised Machine Learning algorithm. It is a clustering algorithm that groups the sample data based on the similarity between the dimensions of the vectors.

In k-Means classification, the output is a set of classes assigned to each vector. Each cluster location is continuously optimized in order to get the accurate location of each cluster such that it represents each group clearly.

The idea is to calculate the similarity between each cluster location and each data vector, and reassign clusters based on it. [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) is mostly used for this task.

![Euclidean distance between two points](https://upload.wikimedia.org/wikipedia/commons/5/55/Euclidean_distance_2d.svg)

_Image source: [Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance)_
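As a minimal stand-alone sketch (illustrative, not code from this commit), the Euclidean distance between two equal-length vectors can be computed like this:

```javascript
// Euclidean distance between two equal-length vectors:
// the square root of the sum of squared per-dimension differences.
function euclideanDistance(x1, x2) {
  if (x1.length !== x2.length) {
    throw new Error('Inconsistent vector lengths');
  }
  let squaresTotal = 0;
  for (let i = 0; i < x1.length; i += 1) {
    squaresTotal += (x1[i] - x2[i]) ** 2;
  }
  return Math.sqrt(squaresTotal);
}

console.log(euclideanDistance([0, 0], [3, 4])); // 5
```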
The algorithm is as follows:

1. Check for errors like invalid/inconsistent data
2. Initialize the k cluster locations with k initial/random points
3. Calculate the distance of each data point from each cluster
4. Assign each data point the label of the cluster at its minimum distance
5. Calculate the centroid of each cluster based on the data points it contains
6. Repeat the above steps until the centroid locations stop changing
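The steps above can be sketched end-to-end as a small stand-alone function. This is an illustrative version under simplifying assumptions (centroids seeded with the first k points rather than random ones), not the repository's implementation:

```javascript
// Illustrative k-Means sketch, seeded with the first k points.
function kMeansSketch(points, k) {
  const dist = (a, b) => Math.hypot(...a.map((v, i) => v - b[i]));
  // Step 2: initialize centroids with the first k points.
  let centers = points.slice(0, k).map((p) => [...p]);
  const labels = new Array(points.length).fill(0);
  let changed = true;
  while (changed) {
    changed = false;
    // Steps 3-4: assign each point to its nearest centroid.
    points.forEach((p, i) => {
      let best = 0;
      for (let c = 1; c < k; c += 1) {
        if (dist(p, centers[c]) < dist(p, centers[best])) best = c;
      }
      if (labels[i] !== best) { labels[i] = best; changed = true; }
    });
    // Step 5: move each centroid to the per-dimension mean of its points.
    centers = centers.map((center, c) => {
      const members = points.filter((_, i) => labels[i] === c);
      if (members.length === 0) return center;
      return center.map((_, d) => members.reduce((s, p) => s + p[d], 0) / members.length);
    });
    // Step 6: the loop repeats until no assignment changes.
  }
  return labels;
}

console.log(kMeansSketch([[0, 0], [1, 1], [10, 10], [11, 11]], 2)); // [0, 0, 1, 1]
```

On the sample above the two tight groups end up in separate clusters after a couple of iterations.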
Here is a visualization of k-Means clustering for better understanding:

![k-Means Visualization](https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif)

_Image source: [Wikipedia](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)_

The centroids move continuously in order to create a better distinction between the different sets of data points. As we can see, after a few iterations the difference in centroids between iterations becomes quite low. For example, between iterations `13` and `14` the difference is quite small, because there the optimizer is tuning boundary cases.

## References

- [k-Means clustering on Wikipedia](https://en.wikipedia.org/wiki/K-means_clustering)
Lines changed: 36 additions & 0 deletions

@@ -0,0 +1,36 @@
import kMeans from '../kmeans';

describe('kMeans', () => {
  it('should throw an error on invalid data', () => {
    expect(() => {
      kMeans();
    }).toThrowError('Either dataSet or labels or toClassify were not set');
  });

  it('should throw an error on inconsistent data', () => {
    expect(() => {
      kMeans([[1, 2], [1]], 2);
    }).toThrowError('Inconsistent vector lengths');
  });

  it('should cluster the data points', () => {
    const dataSet = [[1, 1], [6, 2], [3, 3], [4, 5], [9, 2], [2, 4], [8, 7]];
    const k = 2;
    const expectedCluster = [0, 1, 0, 1, 1, 0, 1];
    expect(kMeans(dataSet, k)).toEqual(expectedCluster);
  });

  it('should find the clusters with equal distances', () => {
    const dataSet = [[0, 0], [1, 1], [2, 2]];
    const k = 3;
    const expectedCluster = [0, 1, 2];
    expect(kMeans(dataSet, k)).toEqual(expectedCluster);
  });

  it('should cluster the data points in 3D space', () => {
    const dataSet = [[0, 0, 0], [0, 1, 0], [2, 0, 2]];
    const k = 2;
    const expectedCluster = [1, 1, 0];
    expect(kMeans(dataSet, k)).toEqual(expectedCluster);
  });
});

‎src/algorithms/ml/kmeans/kmeans.js‎

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
/**
 * Calculates the euclidean distance between 2 vectors.
 *
 * @param {number[]} x1
 * @param {number[]} x2
 * @returns {number}
 */
function euclideanDistance(x1, x2) {
  // Checking for errors.
  if (x1.length !== x2.length) {
    throw new Error('Inconsistent vector lengths');
  }
  // Calculate the euclidean distance between the 2 vectors and return it.
  let squaresTotal = 0;
  for (let i = 0; i < x1.length; i += 1) {
    squaresTotal += (x1[i] - x2[i]) ** 2;
  }
  return Number(Math.sqrt(squaresTotal).toFixed(2));
}

/**
 * Clusters the data points using the k-Means algorithm.
 *
 * @param {number[][]} dataSetm - array of data points, i.e. [[0, 1], [3, 4], [5, 7]]
 * @param {number} k - number of clusters
 * @return {number[]} - the cluster number assigned to each data point
 */
export default function kMeans(
  dataSetm,
  k = 1,
) {
  // Alias the parameter before mutating it below.
  const dataSet = dataSetm;
  if (!dataSet) {
    throw new Error('Either dataSet or labels or toClassify were not set');
  }

  // Starting algorithm:
  // assign the k cluster locations equal to the locations of the initial k points.
  const clusterCenters = [];
  const nDim = dataSet[0].length;
  for (let i = 0; i < k; i += 1) {
    clusterCenters[clusterCenters.length] = Array.from(dataSet[i]);
  }

  // Continue optimization until convergence:
  // centroids should not be moving once optimized.
  // Calculate the distance of each data vector from each cluster center,
  // then assign a cluster number to each data vector according to minimum distance.
  let flag = true;
  while (flag) {
    flag = false;
    // Calculate and store the distance of each dataSet point from each cluster.
    for (let i = 0; i < dataSet.length; i += 1) {
      for (let n = 0; n < k; n += 1) {
        dataSet[i][nDim + n] = euclideanDistance(clusterCenters[n], dataSet[i].slice(0, nDim));
      }

      // Assign the cluster number to each dataSet point.
      const sliced = dataSet[i].slice(nDim, nDim + k);
      let minmDistCluster = Math.min(...sliced);
      for (let j = 0; j < sliced.length; j += 1) {
        if (minmDistCluster === sliced[j]) {
          minmDistCluster = j;
          break;
        }
      }

      if (dataSet[i].length !== nDim + k + 1) {
        flag = true;
        dataSet[i][nDim + k] = minmDistCluster;
      } else if (dataSet[i][nDim + k] !== minmDistCluster) {
        flag = true;
        dataSet[i][nDim + k] = minmDistCluster;
      }
    }

    // Recalculate each cluster centroid from all dimensions of the points under it.
    for (let i = 0; i < k; i += 1) {
      clusterCenters[i] = Array(nDim).fill(0);
      let classCount = 0;
      for (let j = 0; j < dataSet.length; j += 1) {
        if (dataSet[j][dataSet[j].length - 1] === i) {
          classCount += 1;
          for (let n = 0; n < nDim; n += 1) {
            clusterCenters[i][n] += dataSet[j][n];
          }
        }
      }
      for (let n = 0; n < nDim; n += 1) {
        clusterCenters[i][n] = Number((clusterCenters[i][n] / classCount).toFixed(2));
      }
    }
  }

  // Return the clusters assigned.
  const soln = [];
  for (let i = 0; i < dataSet.length; i += 1) {
    soln.push(dataSet[i][dataSet[i].length - 1]);
  }
  return soln;
}
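The centroid-recalculation step of the loop can be isolated into a small helper. This is an illustrative sketch, not a function from the commit:

```javascript
// The centroid of a set of points is the per-dimension mean
// of their coordinates.
function centroid(points) {
  const nDim = points[0].length;
  const center = Array(nDim).fill(0);
  for (const p of points) {
    for (let n = 0; n < nDim; n += 1) {
      center[n] += p[n];
    }
  }
  return center.map((total) => total / points.length);
}

console.log(centroid([[1, 1], [3, 3], [2, 5]])); // [2, 3]
```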
