Zobrazují se příspěvky se štítkemMachine Learning. Zobrazit všechny příspěvky

čtvrtek 19. března 2015

Neural Networks F#, XOR classifier and TSP Hopfield solver

Odeslat e-mailem BlogThis! Sdílet na X Sdílet ve službě Facebook Sdílet na Pinterestu

It seems that recently thanks to the buzz around Deep Learning, Neural Networks are getting back the attention that they once had. This post contains just a very short introduction to Neural Networks, just enough to understand the F# code which follows. The aim for me is to learn F# and see how easy it can be to write Machine Learning code in F#.

I present here two examples of NN in F#:

XOR classifier - sort of Hello World of NN. A simple composition of 2 input nodes, 2 hidden nodes and 1 output node which learns the XOR function.
Solution to TSP using Hopfield NN. More complex network which evolves over time in it's configuration space to it's local minima, which shall correspond to optimal TSP tour.

Both working examples are available at my GitHub

Neural Networks

Artificial neural networks are computational models with particular properties such as the ability to adapt or learn, to generalise or to cluster and organise data. These models are inspired by the way that the human brain works. The building blocks of Neural Networks are inspired by Neurons - the building blocks of our brain. Neural Network can be seen as interconnected graph of nodes, where each node takes a role of a neuron. Each neuron can receive multiple signals, modify it and send it to other neurons to which it is connected. In this graph some vertexes are used to set the input, some of them are used to perform the computation and other onces will hold the output values.

Nodes are connected by edges which have different weights. The weight of the edge specifies how the signals is modified when passing from one node to another.

Perceptron

Perceptron is a basic block of linear neural networks. It is composed of multiple input nodes and a decision node. Perceptron is a single layer network. Input values are linked directly to the output node, connected by edges with certain weights.

The ouput node takes the incomming values, sums them and based on the result returns an output. The output could be binary or continous value. For instance a threashold can be used to say whether the output is 1 or 0.

In practice, very often the Sigmoid or also called logistic function is used to output value within the interval [0,1].

Linear separability

Single layer perceptrons can only learn linerearly separable data.

Imagine we want to separate a set of datapoints into multiple clusters. We visualize the data in 2-dimensional euclidian space. If we can separate the cluster by drawing direct line between the data points, the data is linearly seperable. The same can be applied on multi-dimensional data

In a world of logical functions (we are building XOR classifier), this limitation means that a single layer perceptron is able to learn AND or OR function but it won't be able to learn XOR function.

One can easily imagine a line that separates all OR positive results from the negatives once. On the other hand there is no streight line that separates the positive XOR results ([0,1] and [1,0]) from the negatives ([0,0] and [1,1])

Feed forward multilayer networks

Feed forward network can be thought as composition of perceptrons. Such network has one input layer, multiple hidden layers and one output layer. The feed forward network does not contain cycles, unlike the Hopfield network in the next examples (which is recurrent neural network).

Unlike single layer perceptron, multilayer feed forward network is capable of learning linerably non-separable data such as the results of XOR function.

XOR classifier

This is the first example. This network is called clasifier because it learns the XOR function. It can then "classify" the 2 values in the input into single value on the output.

Here is how the network looks like:

The NN diagram was drawn in latex, using the tkz-berge package. Look here if you want to know how.

The network starts with random weights and continously updates the weights so that the result in the output node corresponds to the XOR of input values. In order to update the weights Backpropagation (Backwards propagation of errors) technique is used.

The value of the output is compared to the real value (which is known). The error is determined as the differences between the output and the real value. Then the weights of the edges has to be updated to minimalize the error as shown later.

Implementation

Each layer can be represented as one dimensional array, the weights are stored using 2-dimensional array ([0,1] is the weight between nodes 0 and 1). Here is the function which calculates the sum of values incomming to a single node and passes the value to the activation function. The activation function here can be any function, as it is passed as parameter but here Sigmoid is used.

let pass (input:float[]) (weights:float[,]) activation =
let length = weights |> Array2D.length2
seq {
 for i in 0 .. length-1 do
 let sum = (Array.zip weights.[*,i] input) |> Array.sumBy (fun (v,w) -> v * w)
 yield activation sum
} |> Array.ofSeq

By applying the pass multiple times, the whole network can be composed as sequence of pass functions.

The training procedure

The following function is the main loop of the XOR network and we iterate over until we obtain good results and the network is adapted.

let train network rate input target =
 let n1 = completepass input network
 let delta = deltaOutput n1.output target
 let deltaHidden = passDelta n1.hidden delta n1.hiddenToOutput
 let updatedHiddenToOut = updateWeights n1.hidden delta n1.hiddenToOutput rate
 let updatedInToHidden = updateWeights n1.input deltaHidden n1.inputToHidden rate

Completepass just calls 2 times the pass function value and gets the input values trough the hidden layer to the output. The output is then compared to desired result and error is estimated. From this error an array of "delta" values per each output node is determined which is then used to update the weights.

In the case of XOR network, there is only one output, so the delta will one dimensional array.

The delta has to be propagated lower so that we can also update the weights between the input and hidden layer.

Backpropagating the error

First the error of each layer has to be calculated. In the example bellow the error of the output layer is the value (t-o) or (Target - Ouput). This value is multiplied by value o*(1-o) which is the derivation of the Sigmoid function. The resulting array contains for each node a delta value, which shall be used to adjust the weights.

let deltaOutput (output:array<float>) (target:array<float>) =
 (Array.zip output target) |> Array.map (fun (o,t) -> o * (1.0 - o) * (t - o))

Delta propagation

Calculating the delta value for the output layer is not sufficient to correct all the weights in the network, the delta has to be propagated to the lower layers of the network, so that we can update the weights on the input - hidden connections.

let passDelta (outputs:float[]) (delta:float[]) (weights:float[,]) =
 let length = weights |> Array2D.length1
 seq {
 for i in 0 .. length-1 do
 let error = (Array.zip weights.[i,*] delta) |> Array.sumBy (fun (v,w) -> v * w)
 yield outputs.[i] * (1.0 - outputs.[i]) * error
 } |> Array.ofSeq

The "error" of the hidden layer is just the delta multiplied by the weight associated to the edge that propagates this delta to the lower value.

The last missing piece is the function that woudl update weights between 2 layers.

let updateWeights (layer:float[]) (delta:float[]) (weights:float[,]) learningRate =
 weights |> Array2D.mapi (fun i j w -> w + learningRate * delta.[j] * layer.[i])

Learning rate is a constant that determines how quickly the edges are updated. If the learning rate is too big, the network might miss the optimal weights. However if it is too small it will take longer time to get to the correct solution.

At the beginning the weight are set to random values between 0 and 1, or slided slightly. From the quick tests that I did it seems that for instance using learning rate of 0.3 it took 20000 iterations to get to a neural network that would XOR the values on it's inputs.

Hopfield-Tank network and Travelling Salesman Problem

TSP is one of the well known Combinatorial optimization problems and as such, it has been solved in many different ways (Integer Linear Programming, Genetic or Biologically inspired algorithms and other heuristics). Neural Networks are one of the many approaches to provide a solution to this problem.

Even within Neural Networks several different approaches have been developed to solve TSP (eg. elastic nets,self-organizing map). The approach demonstrated here is the oldest one: Hopfield neural network.

Hopfield neural network

Hopfield NN is a recurrent neural network (connections in the network form cycles). The units (nodes) in a Hopfield network can have only 2 values: 0|1 or -1|1 depending on literature, but in either case values that encode a binary state. The network is fully connected (each node is connected with any other node). The connection matrix is symetric, that is the weights on the edges are the same in both directions.

The initial Hopfied network had only binary values (0/1) but to solve TSP and for other problems continous version of the network is used, where every node has value in range [0,1].

The value of each node depends on the input potential of the node (Ui) and in order to keep it between (0,1) the tanh function is used:

The value of the input potential of each node depends on the values of all nodes that are connected to it and the values of the connections weights. This is actually identical to the way that the nodes in the XOR classifier behaved.

Network energy

Each state of the HNN can be described by a single value called Energy. While iterating and changing the state, the energy of HNN will either stay the same or will get smaller. The energy of the network will evenutelly convert to a local minimum - the state of the HNN that we target to solve the TSP

Hopfiled approach to TSP

The first and original approach to solve TSP was described in 1985 by Tank and Hopfield in their paper "Neural Computation of Decisions in Optimization Problems". Since then many more attempts were made. I have found these few papers or surweys available online quite useful and have used them to deduce my implementation:

If I should pick one, I would say that the first one gave me most information about the problem, event though not enough implementation details. The others once then made it (almost) clear how to implement the solution.

Encoding TSP in Hopfield network

The neural network used to encode the TSP and it's solution is square matrix of nodes having N rows (one for each city) and N columns (one for each position in the tour), where N is the number of cities. When the algorithm finishes and the network converges to it's final state, each node will have value of either 0 or 1. As said the network in fully inter-connected so there is an edges between each node, which is not shown in the image bellow.

If the node in i-th row and j-th column has value 1, the city (i) will be in the tour on position (j).

As described above the network evolves and each node changes it's value over time. The secret of the algorithm is thus to provide such update rule that at the end the matrix of nodes will follow these criteria:

Only one node in each row will have value 1
Only one node in each column will have value 1
There will be exactly N nodes having value 1
The distances of the tour will be minimal

To come up with such update rule, Energy function has to be determine which will attain it's minimum for optimal solution of TSP. The energy function has to take into account the above specified rules. The following definition will satisfy the rules. Note that the following equations were taken from the J.Y Potvin's - paper

A,B,C and D are constants and the 4 sumations here correspond to the 4 above mentioned points.

Describing the dynamic of the network

There are two ways to describe the behaviour and the dynamic of the network:

From the TSP energy function and the standard Hopfield network energy function we determine the weights between the nodes. Then we can let the network evolve over time. Since we have the weights of the connections between the nodes, we could iterate over the nodes, sum all the weighted inputs and set the output of the node. That would be an approach similar to the one taken on the XOR. This has one disadvantage, that is that the connection weights have to be determined from the equality between the standard Hopfield model enery function and the TSP Hopfield model function. There is an easier way without determining the connections weights.
The easier way is to describe the change of the input potential without determining the weights of the edges. Then we can just iterate or randomly choose nodes, determine the new value of the potential and update the values of the connected nodes. This seems maybe ackward, because the network will evolve without having specific weighed connections between nodes. This options is described further.

The change in the node's input potential

One of the definition of the input potential of a node (i) can be described by:

What we want to determine, is how this value will change over time, when the network evolves. The change of the potential over time is it's partial derivation with respect to the time.

When doing a computer simulation we will described the value of Ui in time T+1, or here T+ delta t. Where delta T is some small time interval.

This equation is at the hearth of this algorithm. We now have an exact specification of what the value of potential Ui will be in time T+1. That is our iterating algorithm has to perfrom exactly this updated.

Implementation details

Compute the distance matrix, containing the distance between each 2 cities
Initialize the network with random values
Pick randomly a node, compute the change in the input potential
Update the value of the node from the input potential change
Every n-iterations calculate the energy of the network
Once the energy of the network stabilizes or you have reached fixed number of iterations - stop the process
Translate the network into a city path

The network can be stored in 2-dimensional matrix. One can choose between storing the input potential of each node or the value of the each node, because the value of the node can be calculated from the potential at any time. I have chosen the second option and I store the input potential two dimensional array u.

The following code is the calculation of the input potential change of single node at coordinates (city,position). This is just retranscription of the equation above.

let singlePass (distances:float[,]) (u:float[,]) pms city position = 
 let n = Array2D.length2 u
 
 let values = u |> toValues pms
 let aSum = sumAllBut position (values |> rowi city)
 let bSum = sumAllBut city (values |> coli position)
 
 let cSum = (values |> Seq.cast<float> |> Seq.sum) - float(n+1)
 let dSum = dSumCalc distances city position values
 let dudt = -pms.A*aSum - pms.B*bSum - pms.C*cSum - pms.D*dSum
 //value of input potential in t+ delta_t
 let r = u.[city,position] + pms.dTime*(-u.[city,position] + dudt)
 //Alternatively according to the paper by Jacek Mandziuk one can just use the update value
 //let r = dudt
 r

There are few things to note. This code takes the distances matrix, the current state of the network (the value of input potential of each node), parameters of the network (constants A,B,C,D from equations above) and row (city) and the column (position) of the node that we are updating. The toValues method takes the current value of each node potential and returns a matrix of node values. The rowi and coli method return respecively one row or one column from 2 dimensional array. The sumAllBut method adds all elements of one-dimensional array except an element at position which is passed to the method. The dSumCalc method is the only one with a bit more compexity and it calculates the D value of the equation above (the one that assures the minimalization of the TSP circuit)

let rowi row (network:float[,]) = 
 network.[row,*] |> Array.mapi (fun j e -> (e,j))
let coli col (network:float[,]) = 
 network.[*,col] |> Array.mapi (fun j e -> (e,j))
let sumAllBut (i:int) (values:(float*int)[]) = 
 Array.fold (fun acc (e,j) -> if i<>j then acc + e else acc) 0.0 values
let dSumCalc distances city position (v:float[,]) = 
 let n = v |> Array2D.length1
 (distances |> rowi city) |> Array.sumBy (fun (e,i) -> 
 let index1 = (n+position+1) % n
 let index2 = (n+position-1) % n
 e*(v.[i,index1] + v.[i,index2])
 )
let toValues (pms:HopfieldTspParams) u = 
 u|> Array2D.map (fun (ui) -> v ui pms)
//calculates the value of node from input potential
let v (ui:float) (parameters:HopfieldTspParams) = (1.0 + tanh(ui*parameters.alfa))/2.0

The method which updates the input potential of single node can be called in 2 different ways. Either we pick the nodes randomly multiple times or we loop over all the nodes serially. If the update is serial then the only random element of the algorithm is the initialization of the netwok.

let serialIteration u pms distances = 
 u |> Array2D.mapi (fun i j x -> singlePass distances u pms i j)
let randomIteration u pms distances = 
 let r = new Random(DateTime.Now.Millisecond)
 let n = Array2D.length1 u
 for i in 0 .. 1000*n do
 let city = r.Next(n)
 let position = r.Next(n)
 u.[city, position] <- singlePass distances u pms city position
 u

The random iteration here is repeated 1000*n times where n is the number of cities, which is probably more than enough since the network seems to converge much sooner. Just for the sake of completeness, here is the method that runs 10 times either serial or random iteration.

let initAndRunUntilStable cities pms distances = 
 let u = initialize cities pms
 {1 .. 10} |> Seq.fold (fun uNext i -> 
 match pms.Update with
 | Random -> randomIteration uNext pms distances
 | Serial -> serialIteration uNext pms distances
 ) u

And here high level method, that generates a random example of TSP problem, calculates distances between all cities and runs the algorithm until a stable and correct solution is found. That is until the network returns feasable solution.

let sampleRun (pms:HopfieldTspParams ) (n:int) =
 let cities = generateRandomCities n
 let distances = calculateDistances cities
 let networks = Seq.initInfinite (fun i -> initAndRunUntilStable cities pms distances)
 let paths = networks |> Seq.map (fun v-> currentPath v)
 let validPath = paths |> Seq.find (fun path -> isFeasable path)
 cities, validPath

Results

The results are not overwhelming, on the other hand I have only implemented the most simple version of the algorithm. In the original paper the authors stated that the convergence rate to feasable solutions was about 50%. But depending on the parameters one can get better results. See bellow one of the runs of the algorithm on 6 random nodes.

Here is the complete list of parameters for Hopfield network and the values that I ended up using:

Alpha: the parameters which shapes the Sigmoid decision function applied on each node (500)
A: input potential change function, sum of rows (500)
B: input potential change function, sum of columns (500)
C: input potential change function, max N cities (200)
D: input potential change function, minimalization of tour length (300)
dTime: represents the delta T, the small update in time for each update of the input potential value (0.00001)

I have also used the standard update rule to obtain new value of the input potential which takes into account the current input potential.

 let dudt = -pms.A*aSum - pms.B*bSum - pms.C*cSum - pms.D*dSum
 //value of input potential in t+ delta_t
 u.[city,position] = u.[city,position] + pms.dTime*(-u.[city,position] + dudt)

According to the paper by Jacek Mandziuk one can just use the updated values as the new input potential, so that the update rule would become only:

 let dudt = -pms.A*aSum - pms.B*bSum - pms.C*cSum - pms.D*dSum
 u.[city,position] = dudt;

This rule didn't work for me. The convergance rate wasn't better neither were the tours lengths. Of course for such version, different values for network parameters have to be used.

Note that GitHub repostity and specifialy the Hopfield module contais more code:

Method to determine the correct parameters. The performance of the algorithm greatly depends on the parameters of the Energy function and on the parameter alfa, which amplyfies the effect of the input potential on the value of given node.
Few lines are also available to draw the solution using Windows Forms Charts, through F#

Posted by Jan Fajfr Žádné komentáře:

sobota 31. května 2014

Detecting extreme values in SQL

Odeslat e-mailem BlogThis! Sdílet na X Sdílet ve službě Facebook Sdílet na Pinterestu

In a set of data points, outliers are such values that theoretically should not appear in the dataset. Typically these can be measurement errors or values caused by human mistakes. In some cases outliers are not caused by errors. These values affect the way that the data is treated and any statistics or report based on data containing outliers are erroneaus.

Detecting these values might be very hard or even impossible and a whole field of statistics called Robust Statistics covers this subject. If you are further interested into the subject please read Quantitative Data Cleaning For Large Databases written by Joseph M. Hellerstein from UC Barkeley. Everything that I have implemented here is taken from this paper. The only thing that I have added to that are two aggregates for SQL Server which help efficiently get the outliers and extreme values from the data stored in SQL Server and a simple tool to chart data and distribution of data using JavaScript

Theory

Any dataset can be characterized by the way the data is distributed over the whole range. The probability that a single point has given value in the dataset is defined using the probability distribution function. The Gaussian standard distribution is only one among many distribution functions, I won't go into statistics basics here, but let's consider only the standard distribution for our case.

In the Gaussian distribution the data points are somehow gathered around the "center", and most values fall not far. Rare are the values really far away from the center. Intuitively the ouliers are points very far from the center. Consider the following set of numbers which represent in minutes the length of a popular song:

3.9,3.8,3.9,2.7,2.8,1.9,2.7,3.5, 4.4, 2.8, 3.4, 8.6, 4.5, 3.5, 3.6, 3.8, 4.3, 4.5, 3.5,30,33,31

You have probably spotted the values 30,33 and 31 and you immediately identify them as outliers. Even if the Doors would double the length of their keyboard solo we would not get this far.

The standard distribution can be described using the probability density function. This function defines the probability that the point will have given value. The function is defined with two parameters: the center and the dispersion. The center is the most common value, the one around which all others are gathered. The dispersion describes how far the values are scattered from the center.

The probability that a point has a given value, provided that the data has the Gaussian distribution is given by this equation:

We can visualize both the theoretical and the real distribution of data. The distribution probability density function is continuous and thus can be charted as simple line. The distribution of the real data in turn can be visualized like a histogram.

The following graphics were created using KoExtensions project, which is a small mediator project making Knockout and D3 work nicely together.

In perfect world the center is the mean value. That value is probably not a part of a data set but represents the typical value. The seconds measure which describes how far from the center the data is dispersed is called standard deviation. If we want to obtain the standard deviation from data we take the distance of each point from the center, square these values add them together and take square root. So we actually have all we need to get the distribution parameters from the data.

This approach of course has one main flaw. The mean value is affected by the outliers. And since the dispersion is deduced using the mean and outliers affect the mean value, the dispersion as well will be affected by them. In order to get a description of the dataset not affected by the extreme values one needs to find robust replacements for the mean and the dispersion.

Robust Center

The simplest and very efficient replacement for the mean as the center of the data is the median. Median is such value that half of the points in the dataset are smaller are bellow. That is if the data set consists of even number of samples, we just need to order the values and take mean of the two values in the middle of the ordered array. If the data consist of odd number of values then we take the element exactly in the middle of the ordered data. The before mentioned paper describes two more alternatives: trimmed mean, winsorized mean. Both of these are based on the exclusion of marginal values and I did not used them in my implementation.

Let's take the median of the given dataset and see if the distribution function based on it fits better the data. Even though the center is now in correct place the shape of the function does not fit completely the data from the histogram. That is because the variance is still affected by the outliers.

Robust Dispertion

Standard variance takes into account the the distance of all the numbers from the center. To rule out the extreme values, we can just use the median of distances. The outlier's distance from the center is much bigger than other distance and by taking the median of all distances we can get rid of outlier's influence over the dispersion. Here is the distribution function using this Robust type of dispersion. This characteristic is called MAD - Median Absolute Deviation.

Detecting the outliers

Now that we have the value of "real" center and "real spread" or dispersion we can state that the outlier is a value which differs "too much" from the center, taking into account the dispersion. Typically we could say that the outliers are such values that have a distance from center greater or equal to 10 * dispersion. The question is how to specify the multiplication coefficient. There is a statistics method called Hampel Indetifier which gives a formula to obtain the coefficient. Hampel identifier labels as outliers any points that are more than 5.2 away from the MAD. More details can be found here.

The overall reliability of this method

A question which might arise is up to which kind of messy data this method can be used. A common intuition would say that definitely more than the half of the data has to be "correct", in order to be able to detect the incorrect ones. To be able to measure the robustness of each method of detecting outliers, statisticians have introduced a term called Breakdown point. This point states which percentage of the data can be corrupted in order for given method to work. Using the Median as the center with the MAD (Median Absolute Deviation) has a breakdown point of 1/2. That is this method works if more than half of the data is correct. The standard arithmetic mean has a BP = 0. It is directly affected by all the numbers and one single outlier can completely move the data.

SQL implementation

In order to implement detection of outliers in SQL, one needs to first have the necessary functions to compute the mean, median and dispersion. All these functions are aggregates. Mean (avg) and Dispersion (var) are already implemented in SQL Server. If you are lucky enough to use SQL Server 2012 you can use the built-in median aggregate as well. The robust dispersion however has to be implemented manually even on SQL Server 2012.

Implementing aggregates for SQL Server is very easy, thanks to the predefined Visual Studio template. This templates will create a class for you which implements the IBinarySerializable interface and is decorate with couple attributes defined in the Microsoft.SqlServer namespace.

This class has 4 important methods:

Init - anything needed before starting the aggregation
Accumulate - adding one single value to the aggregate
Merge - merging two aggregates
Terminate - work to be done before returning the result of the aggregate

Here is the example of the Median aggregate

private List ld;
public void Init()
{
 ld = new List();
}
public void Accumulate(SqlDouble value)
{
 if (!value.IsNull)
 {
 ld.Add(value.Value);
 }
}
public void Merge(Median group)
{
 ld.AddRange(group.ld.ToArray());
}
public SqlDouble Terminate()
{
 return Tools.Median(ld);
}

Note that some aggregates can be computed iteratively. In that case all the necessary logic is in the Accumulate method and the Terminate method can be empty. With Median this is not the case (even though some iterative estimation methods exist. For the sake of the completeness, here is the implementation of median that I am using. It is the standard way: sorting the array and taking the middle element or average of the two middle elements. I am returning directly SqlDouble value, which is the result of the aggregate.

public static SqlDouble Median(List ld)
{
 if (ld.Count == 0)
 return SqlDouble.Null;
 ld.Sort();
 int index = ld.Count / 2;
 if (ld.Count % 2 == 0)
 {
 return (ld[index] + ld[index - 1]) / 2;
 }
 return ld[index];
}

Implementing the Robust variance using the MAD method is very similar, everything happens inside the Terminate method.

public SqlDouble Terminate()
{
 var distances = new List();
 var median = Tools.Median(ld);
 foreach (var item in ld)
 {
 var distance = Math.Abs(item - median.Value);
 distances.Add(distance);
 }
 var distMedian = Tools.Median(distances);
 return distMedian;
}

That implementation is directly the one described above: we take the distance of each element from the center (median) and than we take the median of the distances.

Outliers detection with the SQL aggregates

Having implemented both aggregates, detecting the outlier is just a matter of a SQL query - giving all the elements which are further away from the center than the variance multiplied by a coefficient.

select * from tbUsers where Height> ( Median(Height) + c*RobustVar(Height)) or Height < (Median(Height) - c*RobustVar(Height))

You will have to play with the coefficient value c to determine which multiplication gives you the best results.

JavaScript implementation

The same can be implemented in JavaScript. If you are interested in a JavaScript implementation you can check out the histogram chart from KoExtensions. This charting tool draws the histogram and the data distribution function. You can than configure it to use either Median or Mean as the center of the data as well as to use MAD, or standard variance to describe the dispersion.

KoExtensions is based on Knockout.JS and adds several useful bindings and the majority of them to simplify charting. Behind the scenes the data is charted using D3.

To draw a histogram chart with the distribution and detecting the outliers at the same time one needs just few lines of code

<div id="histogram" data-bind="histogram: data, chartOptions : {
 tolerance : 10,
 showProbabilityDistribution: true,min : -20,
 expected: 'median',
 useMAD: true,
 showOutliers: true}">

var exData = [3.9,3.8,3.9,2.7,2.8,1.9,2.7,3.5, 4.4, 2.8, 3.4, 8.6, 4.5, 3.5, 3.6, 3.8, 4.3, 4.5, 3.5,30,33,31];
 
function TestViewModel() {
 var self = this;
 self.data = ko.observableArray(exData);
}
var vm = new TestViewModel();
ko.applyBindings(vm);
initializeCharts();

Knockout.JS is a JavaScript MVVM framework tool which gives you all you need to create bi-directional binding between the view and the view model, where you can encapsulate and unit test all your logic. KoExtensions adds a binding call "histogram", which takes simple array and draws a histogram. In order to show the probability function and the outliers one has to set the options of the chart as shown in the example above.

Posted by Jan Fajfr Žádné komentáře:

sobota 12. listopadu 2011

Universal Naive Bayes Classifier for C#

Odeslat e-mailem BlogThis! Sdílet na X Sdílet ve službě Facebook Sdílet na Pinterestu

This post is dedicated to describe the internal structure and the possible use of Naive Bayes classifier implemented in C#.

I was searching for a machine learning library for C#, something that would be equivalent to what WEKA is to Java. I have found machine.codeplex.com but it did not include the Bayesian classification (the one in which I was interested). So I decided to implement it into the library.

How to use it

One of the aims of machine.codeplex.com is to allow the users to use simple POCO's for the classification. This can be achieved by using the C# attributes. Take a look at the following example which treats categorization of payments, based on two features: Amount and Description.
First this is the Payment POCO object with added attributes:

public class Payment
{
 [StringFeature(SplitType = StringType.Word)]
 public String Description { get; set; }
 [Feature]
 public Decimal Amount { get; set; }
 [Label]
 public String Category { get; set; }
}

And here is how to train the Naive Bayes classifier using a set of payments and than classify new payment.

var data = Payment.GetData(); 
NaiveBayesModel<Payment> model = new NaiveBayesModel<Payment>();
var predictor = model.Generate(data);
var item = predictor.Predict(new Payment { Amount = 110, Description = "SPORT SF - PARIS 18 Rue Fleurus" });

After the execution the item.Category property should be set to a value based on the analysis of the previously supplied payments.

About Naive Bayes classifier

This is just small and simplify introduction, refer to the Wikipedia article for more details about Bayesian classification.

Naive Bayes is a very simple classifier which is based on a simple premise that all the features (or characteristics) of classified items are independent. This is not really true in the real life, that is why the model is called naive.
The total probability of a item having features F1, F2, F3 being of category "C1" can be expressed as:

p(F1,F2,F3|C1) = P(C1)*P(F1|C1)*P(F2|C1)*P(F3|C1)

Where P(C1) is the A priory probability of item being of category C1 and P(F1|C1) is the Posteriori probability of item being of category C1 when it has feature F1.
That is simple for binary features (like "Tall", "Rich"...). For example p(Tall|UngulateAnimal) = 0.8, says that the posteriori probability for an animal to be and ungulate is 0.8, when it is a tall animal.

If we have continuous features (just like the "Amount" in the payment example), the Posteriori probability will be expressed slightly differently. For example P(Amount=123|Household) = 0.4 - can be translated as: the probability of the payment being part of my household payments is 0.4, when the amount was 123$.

When we classify, we compute the total probability for each category (or class if you want) and we select the category with maximal probability. We have to thus iterate over all the categories and all the features of each item and multiply the probabilities to obtain the probability of the item being in each class.

How it works inside

After calling the Generate method on the model a NaiveBayesPredictor class is created. This class contains the Predict method to classify new objects.
My model can work with three types of features (or characteristics, or properties):

String properties. These properties have to be converted to a binary vectors based on the words which they contain. The classifier builds a list of all existing words in the set and then the String feature can be represented as a set of binary features. For example if the bag of all worlds contains four words: (Hello, World, Is, Cool), than the following vector [0,1,0,1] represents text "World Cool".
Binary properties. Simple true or false properties
Continuous properties. By default these are Double or Decimal values, but the list could be extend to other types.

After converting the String features to binary features, we have two types of features:

Binary features
Continuous features

As mentioned in the introduction for each feature in the item we have to compute the A priori and Posteriori probabilities. The following pseudocode shows how to estimate the values of A priori and Posteriori probabilities. I use array-like notation, just because I have used arrays also in the implementation.

Apriori probability

The computation of Apriori probability will be the same for both type of features.

Apriori[i] = #ItemsOfCategory[i] / #Items

Posteriori probability

The Posteriori for binary features will be estimated:

Posteriori[i][j] = #ItemsHavingFeature[j]AndCategory[i] / #ItemsOfCategory[i]

And the Posteriori probability for contiunous features:

Posteriori[i][j] = Normal(Avg[i][j],Variance[i][j],value)

Where Normal references the normal probability distribution. Avg[i][j] is the average value of feature "j" for items of category "i". Variance[i][j] is the variance of feature "j" for items of category "i".
If we want to know the probability of payment with Amount=123 being of category "Food", we have the average of all payments of that category let's say: Avg[Food][Amount] = 80, and we have the Variance[Food][Amount] = 24, then the posteriori probability will be equal: Normal(80, 24, 123).

What does the classifier need?

The response to this question is quite simple, we need at least 4 structures, the meaning should be clear from the previous explication.

public double[][] Posteriori { get; set; }
public double[] Apriori { get; set; }
public double[][] CategoryFeatureAvg { get; set; }
public double[][] CategoryFeatureVariance { get; set; }

And how does it classify?

As said before the classification is a loop for all the categories in the set. For each category we compute the probability by multiplying apriori probability with posteriori probability of each feature. As we have two types of features, the computation differs for both of them. Take a look at this quite simplified code:

public T Predict (T item){
 Vector values; // represents the item as a vector
 foreach (var category in Categories)
 {
 for (var feature in Features)
 {
 if (NaiveBayesModel<t>.ContinuesTypes.Contains(feature.Type))
 {
 var value = values[feature];
 var normalProbability = Helper.Gauss(value, CategoryFeatureAvg[category][j], CategoryFeatureVariance[category][j]);
 probability = probability * normalProbability;
 }
 
 if (feature.Type == typeof(bool)) //String properties are converted also to binary
 {
 var probabilityValue = Posteriori[category][j];
 }
 }
 
 if (probability > maxProbability)
 {
 maxProbability = probability;
 maxCategory = category;
 }
 }
 item.SetValue(maxCategory);
}

That's all there is to it. Once you understand that we need just 4 arrays, it is just a question of how to fill these arrays, that is not hard (it should be clear from the previous explication), but it takes some plumbing and looping over all the items in the learning collection.
If you would like to see the Source Code - check my fork machine.codeplex.com.

Posted by Jan Fajfr Žádné komentáře: