Inspired by this blog I went on implementing my own version as a F# learning challenge. It turned out to be quite different than the original (but somewhat faster for large samples).
The first code part below defines some test types and function delegates:
namespace FSLib
open System
// Test Type: Indexed Point in plane
type Point2D(x: float, y: float, i: int) =
member this.X = x
member this.Y = y
member this.Index = i
override this.ToString() = String.Format("{0}: [{1:F6}; {2:F6}]", this.Index, this.X, this.Y)
// Test Type: Indexed Point in space
type Point3D(x: float, y: float, z: float, i: int) =
member this.X = x
member this.Y = y
member this.Z = z
member this.Index = i
override this.ToString() = String.Format("{0}: [{1:F6}; {2:F6}; {3:F6}]", this.Index, this.X, this.Y, this.Z)
// Function Prototype/delegate for 'a: a1 < a2 => 1 else a1 > a2 => -1 else a1 = a2 => 0
type Comparer<'a> = 'a -> 'a -> int
// Function Prototype/delegate for a function that calculates the 'distance' of some kind between two instances of 'a
type DistanceFunc<'a> = 'a -> 'a -> float
// Function Prototype/delegate for a function calculating a new centroid from a sequence of 'a's - returns a tuple (index, 'a)
type CentroidCalculator<'a> = int -> 'a seq -> int * 'a
Then a generic type/class that runs the optimization on the provided data:
// Type/class definition/implementation of KMeanCluster
type KMeanCluster<'a when 'a : equality>(comparer : Comparer<'a>, distanceFunc : DistanceFunc<'a>, centroidCalculator : CentroidCalculator<'a>) =
let compare = comparer
let distance = distanceFunc
let calculateCentroid = centroidCalculator
// Returns the nearest centroid in argument centroids according to argument point
let nearestCluster point centroids =
centroids |> Seq.sortBy(fun p -> distance point p) |> Seq.head
// Returns a new list of cluster centroids by grouping the argument samples around the argument (old) centroids
let calculateCentroids samples centroids =
samples
|> Seq.groupBy(fun s -> nearestCluster s centroids)
|> Seq.mapi(fun i g -> calculateCentroid i (snd g))
|> Seq.sortBy(fun c -> fst c)
|> Seq.map(fun c -> snd c)
|> Seq.toList
// Checks if two lists of same type is pairwise equal: if not => true else false
let hasChanged list1 list2 =
match List.compareWith compare list1 list2 with
| 0 -> false
| _ -> true
// Runs the input data and returns the optimized cluster centroids
member this.Calculate seedCentroids samples =
let mutable clusterCentroids = seedCentroids |> List.map(fun p -> p)
let mutable newCentroids = calculateCentroids samples clusterCentroids
// This is an iterative process continueing until completed optimization
// ctor argument 'comparer' could have some kind of tolerance build in as it is responsible for
// ending the process
while hasChanged clusterCentroids newCentroids do
clusterCentroids <- newCentroids
newCentroids <- calculateCentroids samples clusterCentroids
newCentroids
Finally the client code and a sample generator function:
open System
open FSLib
let createData count =
let rand = Random(5)
let min = -500
let max = 500
[ for i in 1 .. count -> [| (float)(rand.Next(min, max)); (float)(rand.Next(min, max)); (float)(rand.Next(min, max)) |]]
// Test Case for FSLib.Point2D:
let kmc1_2D data initailCentroids =
// Converts the initialCentroids list of float[3] to list of Point2D
let seedCentroids = initailCentroids |> List.mapi(fun i (c : float[]) -> Point2D(c.[0], c.[1], i))
// Converts the data a sequence of Point2D objects
let samples = data |> Seq.mapi(fun i (d : float[]) -> Point2D(d.[0], d.[1], i))
seedCentroids |> Seq.iter(fun x -> printfn "%A" x)
printfn "\n"
// Compares two points: as our only concern is whether they are equal or not it returns either 1 (unequal) or 0 (equal)
let compare (point1 : Point2D) (point2 : Point2D) = if point1.X <> point2.X || point1.Y <> point2.Y then 1 else 0
// Calculates and returns the geometric squared distance between two points
let squaredDistance(point1 : Point2D) (point2 : Point2D) : float =
let dx = point1.X - point2.X
let dy = point1.Y - point2.Y
dx * dx + dy * dy
// Calculates and returns a tuple of argument index and the geometric average (centroid) of the argument points (index, centroid)
let calculateCentroid index points =
let mutable x = 0.0
let mutable y = 0.0
points |> Seq.iter(fun (p : Point2D) ->
x <- x + p.X
y <- y + p.Y)
let length = (float)(Seq.length points)
(index, Point2D(x / length, y / length, index))
// Instantiate an instance of KMeanCluster, calculates and prints the result
let kmean = KMeanCluster<Point2D>(compare, squaredDistance, calculateCentroid)
let result = kmean.Calculate seedCentroids samples
result |> List.iter(fun x -> printfn "%A" x)
printfn "\nEND 2D"
ignore
[<EntryPoint>]
let main argv =
let centroids = [ [| 0.; 0.; 0. |]; [| 20.; 30.; 40. |]; [| -40.; -50.; -60. |] ]
let data = createData 1000
kmc1_2D data centroids ignore
printfn "\nEND PROGRAM"
let k = Console.ReadLine()
0
I would like any comment on the F# language/functional programming specifics ideom, workflow etc. (don't waste time on error handling and the mathematics). As a OO-programmer I find it rather F#-ish, but as a F# specialist you may have another opinion?
1 Answer 1
1. F# there's such an amazing opportunity as partial application. So you don`t need write:
result |> List.iter(fun x -> printfn "%A" x)
just:
result |> List.iter printfn "%A"
Read more here about it
2.
In function hasChanged
no need to use match
, since only two possible Boolean variant:
let hasChanged list1 list2 =
Seq.compareWith compare list1 list2 <> 0
3.
F# have a strong functional component, it is better to do without the use of mutable variables (in function calculateCentroid
and Calculate
).
static member calculateCentroid index points =
let lng = points |> Seq.length |> float
points
|> Seq.fold
(fun acc v -> {acc with X = acc.X + v.X; Y = acc.Y + v.Y})
{X = 0.0; Y = 0.0; Index = index}
|> fun v -> v.Index, {v with X = v.X / lng; Y = v.Y / lng}
4. You can use methods such as List.init and Array.init, it is more convenient to generate the initial data:
let createData count z =
let rand = Random(5)
let min = -500
let max = 500
List.init count
(fun _ -> Array.init z (fun _ -> rand.Next(min, max) |> float))
5.
Method kmc1_2D
has a lot of features that are logically associated with a Point2D. Also, a side effect in the form of console output. Functions better to make a separate module or make them members of a type.
Edit Removed Index
.
So as calculateCentroid
is average, then if you add a few operators:
static member (+) (point1 : Point2D, point2 : Point2D) =
{X = point1.X + point2.X; Y = point1.Y + point2.Y}
static member Zero = {X = 0.0; Y = 0.0}
static member DivideByInt (point: Point2D, number: int) =
let fnum = float number
{X = point.X / fnum; Y = point.Y / fnum}
you can write:
static member calculateCentroid (points: seq<Point2D>) =
points
|> Seq.average
Given this, your code can be modified as follows:
Module Point2D:
module Point2D
open System
type Point2D =
{X:float; Y:float}
with
override this.ToString() =
String.Format("[{0:F6}; {1:F6}]", this.X, this.Y)
static member (+) (point1 : Point2D, point2 : Point2D) =
{X = point1.X + point2.X; Y = point1.Y + point2.Y}
static member Zero = {X = 0.0; Y = 0.0}
static member DivideByInt (point: Point2D, number: int) =
let fnum = float number
{X = point.X / fnum; Y = point.Y / fnum}
static member compare (point1 : Point2D) (point2 : Point2D) =
if point1.X <> point2.X || point1.Y <> point2.Y
then 1 else 0
// Calculates and returns the geometric squared distance between two points
static member squaredDistance (point1 : Point2D) (point2 : Point2D) : float =
let dx = point1.X - point2.X
let dy = point1.Y - point2.Y
dx * dx + dy * dy
// Calculates and returns a tuple of argument index and the geometric average (centroid) of the argument points (index, centroid)
static member calculateCentroid (points: seq<Point2D>) =
points
|> Seq.average
Module Point3D:
module Point3D
open System
type Point3D =
{X:float; Y:float; Z:float}
with
override this.ToString() =
String.Format("[{0:F6}; {1:F6}; {2:F6}]",
this.X, this.Y, this.Z)
static member (+) (point1 : Point3D, point2 : Point3D) =
{X = point1.X + point2.X; Y = point1.Y + point2.Y ; Z = point1.Z + point2.Z}
static member Zero = {X = 0.0; Y = 0.0; Z = 0.0}
static member DivideByInt (point: Point3D, number: int) =
let fnum = float number
{X = point.X / fnum; Y = point.Y / fnum ; Z = point.Z / fnum}
static member compare (point1 : Point3D) (point2 : Point3D) =
if point1.X <> point2.X || point1.Y <> point2.Y || point1.Z <> point2.Z
then 1 else 0
// Calculates and returns the geometric squared distance between two points
static member squaredDistance (point1 : Point3D) (point2 : Point3D) : float =
let dx = point1.X - point2.X
let dy = point1.Y - point2.Y
let dz = point1.Z - point2.Z
dx * dx + dy * dy + dz * dz
// Calculates and returns a tuple of argument index and the geometric average (centroid) of the argument points (index, centroid)
static member calculateCentroid (points: seq<Point3D>) =
points
|> Seq.average
Module FSLib
module FSLib
// Function Prototype/delegate for 'a: a1 < a2 => 1 else a1 > a2 => -1 else a1 = a2 => 0
type Comparer<'a> = 'a -> 'a -> int
// Function Prototype/delegate for a function that calculates the 'distance' of some kind between two instances of 'a
type DistanceFunc<'a> = 'a -> 'a -> float
// Function Prototype/delegate for a function calculating a new centroid from a sequence of 'a's - returns a tuple (index, 'a)
type CentroidCalculator<'a> = 'a seq -> 'a
// Type/class definition/implementation of KMeanCluster
type KMeanCluster<'a when 'a : equality>(comparer : Comparer<'a>, distanceFunc : DistanceFunc<'a>, centroidCalculator : CentroidCalculator<'a>) =
let compare = comparer
let distance = distanceFunc
let calculateCentroid = centroidCalculator
// Returns the nearest centroid in argument centroids according to argument point
let nearestCluster point centroids =
centroids
|> Seq.sortBy (distance point)
|> Seq.head
// Returns a new list of cluster centroids by grouping the argument samples around the argument (old) centroids
let calculateCentroids samples centroids =
samples
|> Seq.groupBy(fun s -> nearestCluster s centroids)
|> Seq.map(snd >> calculateCentroid)
|> Seq.toList
// Checks if two lists of same type is pairwise equal: if not => true else false
let hasChanged list1 list2 =
Seq.compareWith compare list1 list2 <> 0
// Runs the input data and returns the optimized cluster centroids
member this.Calculate seedCentroids samples =
let rec calculate clusterCentroids newCentroids =
if hasChanged clusterCentroids newCentroids then
calculateCentroids samples newCentroids
|> calculate newCentroids
else
newCentroids
calculateCentroids samples seedCentroids
|> calculate seedCentroids
Test:
open System
open FSLib
open Point2D
open Point3D
let createData count z =
let rand = Random(5)
let min = -500
let max = 500
List.init count
(fun _ -> Array.init z (fun _ -> rand.Next(min, max) |> float))
// Test Case for Point2D:
let kmc1_2D (data: float [] list) (initailCentroids: float [] list) =
let seedCentroids: Point2D list =
initailCentroids
|> List.mapi
(fun i c -> {X = c.[0];Y = c.[1]})
let samples: Point2D list =
data
|> List.mapi
(fun i d -> {X = d.[0]; Y = d.[1]})
let kmean = KMeanCluster(Point2D.compare, Point2D.squaredDistance, Point2D.calculateCentroid)
let result = kmean.Calculate seedCentroids samples
result
// Test Case for Point3D:
let kmc1_3D (data: float [] list) (initailCentroids: float [] list) =
let seedCentroids: Point3D list =
initailCentroids
|> List.mapi
(fun i c -> {X = c.[0];Y = c.[1]; Z = c.[2]})
let samples: Point3D list =
data
|> List.mapi
(fun i d -> {X = d.[0]; Y = d.[1]; Z = d.[2]})
let kmean = KMeanCluster(Point3D.compare, Point3D.squaredDistance, Point3D.calculateCentroid)
let result = kmean.Calculate seedCentroids samples
result
let centroids = [ [| 0.; 0.; 0. |]; [| 20.; 30.; 40. |]; [| -40.; -50.; -60. |] ]
let data2 = createData 1000 3
kmc1_2D data2 centroids
|> Seq.map (string)
|> Seq.iter (printfn "%s")
printfn "\nEND 2D"
let data3 = createData 1000 3
kmc1_3D data3 centroids
|> Seq.map (string)
|> Seq.iter (printfn "%s")
printfn "\nEND 3D"
printfn "\nEND PROGRAM"
Console.ReadKey(true)
|> ignore
-
\$\begingroup\$ @FoggyFiner: Thanks a lot for your thorough answer. Sorry for the late response, but maybe we live in different time zones? :-). I'll study it later in detail and then give some feed back. As for now, I can see you hit me with 3. about mutable variables. I really try to avoid them, but it's hard not to "fall back" to OO-thinking now and then. But I'm learning... \$\endgroup\$user73941– user739412016年07月11日 07:09:10 +00:00Commented Jul 11, 2016 at 7:09
-
\$\begingroup\$ I wrongly made a response to you as an answer below. \$\endgroup\$user73941– user739412016年07月11日 10:47:24 +00:00Commented Jul 11, 2016 at 10:47
-
\$\begingroup\$ Firstly you have some good points about the point types and the placement of point specific function. I would have done the same in my primary "world" (C#). I didn't here because I anticipated that in F# minimal date types is preferrable. (F# seems to be "bendable" towards nearly any paradigm and that's a little confusing for a beginner) \$\endgroup\$user73941– user739412016年07月11日 10:58:35 +00:00Commented Jul 11, 2016 at 10:58
-
\$\begingroup\$ Secondly your list handling is excellent and shows a lot of details that I can learn from. Especially how to avoid mutable variables and how specific functions like Seq.fold works (I really like the List.init in createData). Much of the learning process in F# is apparently to understand sequence handling, and you give some very good examples. \$\endgroup\$user73941– user739412016年07月11日 10:59:17 +00:00Commented Jul 11, 2016 at 10:59
-
1\$\begingroup\$ okay, I done it :) \$\endgroup\$user110704– user1107042016年07月12日 16:00:51 +00:00Commented Jul 12, 2016 at 16:00
Index
in thePoint2D
,Point3D
. Are you sure it is necessary? In my opinion, is a bit of "white crow". \$\endgroup\$