
Saturday, May 31, 2014

Detecting extreme values in SQL

In a set of data points, outliers are values that theoretically should not appear in the dataset. Typically these are measurement errors or values caused by human mistakes, although in some cases outliers have other causes. These values affect the way the data is treated, and any statistics or report based on data containing outliers will be erroneous.

Detecting these values might be very hard or even impossible, and a whole field of statistics called Robust Statistics covers this subject. If you are further interested in the subject, please read Quantitative Data Cleaning For Large Databases written by Joseph M. Hellerstein from UC Berkeley. Everything that I have implemented here is taken from this paper. The only things that I have added are two aggregates for SQL Server, which help efficiently get the outliers and extreme values from data stored in SQL Server, and a simple tool to chart the data and its distribution using JavaScript.

Theory

Any dataset can be characterized by the way the data is distributed over its whole range. The probability that a single point in the dataset has a given value is defined by the probability distribution function. The Gaussian distribution is only one among many distribution functions. I won't go into statistics basics here; let's just consider the Gaussian distribution for our case.

In the Gaussian distribution the data points are gathered around the "center", and most values do not fall far from it. Values really far away from the center are rare. Intuitively, the outliers are points very far from the center. Consider the following set of numbers, which represent the length of a popular song in minutes:

3.9,3.8,3.9,2.7,2.8,1.9,2.7,3.5, 4.4, 2.8, 3.4, 8.6, 4.5, 3.5, 3.6, 3.8, 4.3, 4.5, 3.5,30,33,31

You have probably spotted the values 30, 33 and 31 and immediately identified them as outliers. Even if the Doors doubled the length of their keyboard solo, we would not get this far.

The Gaussian distribution can be described using its probability density function. This function defines the probability that a point will have a given value. The function is defined by two parameters: the center and the dispersion. The center is the most common value, the one around which all others are gathered. The dispersion describes how far the values are scattered from the center.

The probability that a point has a given value, provided that the data has the Gaussian distribution, is given by this equation:
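With center μ and dispersion (standard deviation) σ, it has the familiar bell shape:

f(x) = 1 / (σ·√(2π)) · exp( −(x − μ)2 / (2σ2) )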

We can visualize both the theoretical and the real distribution of the data. The probability density function is continuous and thus can be charted as a simple line. The distribution of the real data, in turn, can be visualized as a histogram.

The following graphics were created using KoExtensions project, which is a small mediator project making Knockout and D3 work nicely together.

In a perfect world the center is the mean value. That value is probably not part of the data set, but it represents the typical value. The second measure, which describes how far from the center the data is dispersed, is called the standard deviation. To obtain the standard deviation from the data, we take the distance of each point from the center, square these values, average them and take the square root. So we actually have all we need to get the distribution parameters from the data.
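Spelled out for n values x_1, ..., x_n:

μ = (x_1 + ... + x_n) / n
σ = √( ((x_1 − μ)2 + ... + (x_n − μ)2) / n )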

This approach of course has one main flaw: the mean value is affected by the outliers. And since the dispersion is deduced using the mean, and the outliers affect the mean value, the dispersion will be affected by them as well. In order to get a description of the dataset that is not affected by the extreme values, one needs to find robust replacements for the mean and the dispersion.

Robust Center

The simplest and very efficient replacement for the mean as the center of the data is the median. The median is the value for which half of the points in the dataset fall below it. If the data set consists of an odd number of values, we order the values and take the element exactly in the middle. If it consists of an even number of samples, we take the mean of the two values in the middle of the ordered array. The previously mentioned paper describes two more alternatives: the trimmed mean and the winsorized mean. Both of these are based on the exclusion of marginal values and I did not use them in my implementation.
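For example, for the ordered values 2.7, 3.5, 3.8, 3.9, 30 the median is 3.8; for 2.7, 3.5, 3.8, 3.9 it is (3.5 + 3.8) / 2 = 3.65.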

Let's take the median of the given dataset and see if the distribution function based on it fits the data better. Even though the center is now in the correct place, the shape of the function still does not completely fit the data from the histogram. That is because the variance is still affected by the outliers.

Robust Dispersion

The standard variance takes into account the distance of all the numbers from the center. To rule out the extreme values, we can instead use the median of the distances. The outlier's distance from the center is much bigger than the other distances, so by taking the median of all distances we get rid of the outliers' influence on the dispersion. Here is the distribution function using this robust type of dispersion. This characteristic is called MAD - Median Absolute Deviation.
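With the same notation as above:

MAD = median_i( |x_i − median_j(x_j)| )

that is, the median of the absolute distances of the points from the median of the data.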

Detecting the outliers

Now that we have the value of the "real" center and the "real" spread or dispersion, we can state that an outlier is a value which differs "too much" from the center, taking into account the dispersion. Typically we could say that the outliers are values whose distance from the center is greater than or equal to 10 * dispersion. The question is how to specify the multiplication coefficient. There is a statistical method called the Hampel identifier which gives a formula to obtain the coefficient; the Hampel identifier labels as outliers any points that are more than 5.2 MADs away from the median. More details can be found here.
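In other words, a value x gets flagged as an outlier when

|x − median| > k · MAD

where k is the multiplication coefficient (k ≈ 5.2 if we follow the Hampel identifier).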

The overall reliability of this method

A question which might arise is with how messy a dataset this method can still be used. Common intuition says that definitely more than half of the data has to be "correct" in order to be able to detect the incorrect values. To measure the robustness of each method of detecting outliers, statisticians have introduced a term called the breakdown point. This point states what percentage of the data can be corrupted while the given method still works. Using the median as the center together with MAD (Median Absolute Deviation) has a breakdown point of 1/2; that is, the method works as long as more than half of the data is correct. The standard arithmetic mean has a breakdown point of 0. It is directly affected by all the numbers, and one single outlier can completely shift it.

SQL implementation

In order to implement detection of outliers in SQL, one first needs the necessary functions to compute the mean, the median and the dispersion. All of these are aggregates. The mean (avg) and the dispersion (var) are already implemented in SQL Server. If you are lucky enough to use SQL Server 2012, you can get the median with a built-in function as well. The robust dispersion however has to be implemented manually even on SQL Server 2012.

Implementing aggregates for SQL Server is very easy, thanks to the predefined Visual Studio template. This template creates a class for you which implements the IBinarySerialize interface and is decorated with a couple of attributes defined in the Microsoft.SqlServer namespace.

This class has 4 important methods:

  • Init - anything needed before starting the aggregation
  • Accumulate - adding one single value to the aggregate
  • Merge - merging two aggregates
  • Terminate - work to be done before returning the result of the aggregate

Here is the example of the Median aggregate

private List<double> ld;

public void Init()
{
    // called before the aggregation starts
    ld = new List<double>();
}

public void Accumulate(SqlDouble value)
{
    // collect every non-null value
    if (!value.IsNull)
    {
        ld.Add(value.Value);
    }
}

public void Merge(Median group)
{
    // merge the values collected by another instance of the aggregate
    ld.AddRange(group.ld);
}

public SqlDouble Terminate()
{
    return Tools.Median(ld);
}

Note that some aggregates can be computed iteratively. In that case all the necessary logic is in the Accumulate method and the Terminate method can be trivial. With the median this is not the case (even though some iterative estimation methods exist). For the sake of completeness, here is the implementation of the median that I am using. It is the standard way: sort the list and take the middle element, or the average of the two middle elements. I return a SqlDouble directly, which is the result of the aggregate.

public static SqlDouble Median(List<double> ld)
{
    if (ld.Count == 0)
        return SqlDouble.Null;

    ld.Sort();
    int index = ld.Count / 2;
    if (ld.Count % 2 == 0)
    {
        return (ld[index] + ld[index - 1]) / 2;
    }
    return ld[index];
}

Implementing the Robust variance using the MAD method is very similar, everything happens inside the Terminate method.

public SqlDouble Terminate()
{
    var distances = new List<double>();
    var median = Tools.Median(ld);
    if (median.IsNull)
        return SqlDouble.Null;

    // distance of each element from the center (the median)
    foreach (var item in ld)
    {
        var distance = Math.Abs(item - median.Value);
        distances.Add(distance);
    }
    // the median of the distances is the MAD
    var distMedian = Tools.Median(distances);
    return distMedian;
}

That implementation is exactly the one described above: we take the distance of each element from the center (the median) and then we take the median of the distances.

Outliers detection with the SQL aggregates

Having implemented both aggregates, detecting the outliers is just a matter of a SQL query - selecting all the elements which are further away from the center than the dispersion multiplied by a coefficient.

select * from tbUsers where Height> ( Median(Height) + c*RobustVar(Height)) or Height < (Median(Height) - c*RobustVar(Height)) 

You will have to play with the coefficient value c to determine which multiplication gives you the best results.
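Note that T-SQL does not accept aggregates directly in a WHERE clause, so to actually run the query above the two aggregates have to be computed first, for example in a subquery. Here is a sketch of how that might look, assuming the CLR aggregates were registered as dbo.Median and dbo.RobustVar:

declare @c float = 5.2;

select u.*
from tbUsers u
cross join (
    select dbo.Median(Height) as med,
           dbo.RobustVar(Height) as mad
    from tbUsers
) s
where abs(u.Height - s.med) > @c * s.mad;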

JavaScript implementation

The same can be implemented in JavaScript. If you are interested in a JavaScript implementation, check out the histogram chart from KoExtensions. This charting tool draws the histogram and the data distribution function. You can then configure it to use either the median or the mean as the center of the data, and either MAD or the standard variance to describe the dispersion.

KoExtensions is based on Knockout.JS and adds several useful bindings and the majority of them to simplify charting. Behind the scenes the data is charted using D3.

To draw a histogram chart with the distribution and detecting the outliers at the same time one needs just few lines of code

<div id="histogram" data-bind="histogram: data, chartOptions: {
    tolerance: 10,
    showProbabilityDistribution: true,
    min: -20,
    expected: 'median',
    useMAD: true,
    showOutliers: true }">
</div>
var exData = [3.9,3.8,3.9,2.7,2.8,1.9,2.7,3.5, 4.4, 2.8, 3.4, 8.6, 4.5, 3.5, 3.6, 3.8, 4.3, 4.5, 3.5,30,33,31];
 
function TestViewModel() {
 var self = this;
 self.data = ko.observableArray(exData);
}
var vm = new TestViewModel();
ko.applyBindings(vm);
initializeCharts();

Knockout.JS is a JavaScript MVVM framework which gives you all you need to create two-way bindings between the view and the view model, where you can encapsulate and unit test all your logic. KoExtensions adds a binding called "histogram", which takes a simple array and draws a histogram. In order to show the probability function and the outliers, one has to set the options of the chart as shown in the example above.


Thursday, September 26, 2013

Introduction to GPU programming through AMP C++

A few months ago I tried to learn a bit about GPU programming, took notes and started to write this post. I am publishing it now even though it's not complete; being too busy, I am not sure whether I will have the time to get back to it later.

Over the past several years CUDA (2007) and OpenCL (2008) have established themselves as the standard frameworks for parallel programming on the GPU. In 2011 a new framework arrived in the GPGPU programming landscape, one which does not compete on the same level but rather represents a new abstraction between the GPU and the developer.

AMP stands for Accelerated Massive Parallelism and it is an open C++ standard. The implementation of AMP provided by Microsoft consists of three main features:

  • C++ language features and compiler
  • Runtime
  • Programming framework containing classes and functions to facilitate parallel programming

Like OpenCL and unlike CUDA, AMP is a standard not only for GPU programming but for any data-parallel hardware. A typical example of such hardware is the vector unit of a standard CPU, capable of executing SIMD (Single Instruction Multiple Data) instructions, or in the near future a cloud based cluster. AMP is an abstraction layer which can be used to target different devices from different vendors.

The following diagram depicts how AMP C++ fits to the world of GPGPU programming.

Microsoft has implemented AMP on top of its DirectCompute technology and it is the only production ready implementation. Currently there are also two proof-of-concept AMP implementations: the Shevlin Park project done by Intel and an open source modification of the LLVM compiler (more information here). Needless to say, the adoption of the standard will depend a lot on whether these implementations are taken further and whether we will get a real implementation based on CUDA.

First look at the code

Without further introduction, let's take a look at the classical "Hello World" of GPGPU programming: the addition of the elements of two vectors (here without the boilerplate code).

parallel_for_each(e,[=](index<2>idx) restrict(amp){
 c[idx] = a[idx] + b[idx];
});

If you have ever tried out OpenCL or CUDA, you should find this piece of code less verbose and quite easy to read. That is the result of the main ideas behind AMP C++:

  • Full C++ - unlike CUDA or OpenCL, AMP is not a C-like dialect.
  • Minimalist API - AMP hides as much as it can from the developer.
  • Future proofed - the API is designed for heterogeneous programming. In the future the GPU should be only one of its uses; it aims to be a standard for programming on distributed systems.

Before going further with more examples, we have to give a basic description of the GPU architecture, as it is crucial to understanding certain parts of AMP.

GPU architecture

Here I will give a high level overview of the GPU architecture. If you are already familiar with the GPU architecture, feel free to skip this section.

Currently there are two main producers of GPU chips: NVidia and ATI (AMD), with Intel a little behind. Each of them builds graphic cards with a different architecture. Moreover, the architectures change significantly with the release of each new version. Nevertheless certain concepts are shared by all of them. I will shortly describe NVidia's Fermi and ATI's Cypress architectures here. The successor of Fermi is called Kepler.

Processor architecture

The GPU is composed of hundreds of pipelined cores. The cores are grouped into computation units; NVidia calls these units Streaming Multiprocessors. Each computation unit is assigned a unit of work.

ATI uses a slightly different abstraction. Similarly to NVidia, the "cores" are grouped into computation units, however each core is called a Thread Processor and is capable of executing VLIW (Very Long Instruction Word) instructions. It therefore has 4 arithmetic logical units and one Special Function Unit. The compiler has to find independent operations and construct the VLIW instruction. If the compiler is not able to find these independent operations, then one or more of the ALUs will not be performing any operation.

Memory access

The GPU has a global memory which can be accessed by all cores. This access is quite costly (hundreds of cycles). Each streaming multiprocessor in turn has a small scratch-pad memory, usually called local memory. This memory can be accessed only by the threads running on the given SM. The access to local memory is much cheaper (around 10 cycles). This memory can either take the role of a cache or can be managed by the developer directly. In addition, each core has its own general purpose registers, used to perform the computations.

GPUs usually run computations on data which is accessed only once, unlike CPUs where the same memory pages are often used repeatedly. For this reason it is better for the developer to control the local memory; that's why historically the local memory did not act as a cache on GPUs. However, this will probably change in the future, and some form of caching will be provided by the GPU directly.

Scheduling

Scheduling is essential because it allows effective usage of the processor. Since memory access is expensive, other operations can be executed while the processor is waiting for the data to arrive. NVidia processors have "costless" context switching, so while one thread is blocked another can take over. The number of threads scheduled therefore affects the computation performance: if not enough threads are available for scheduling, the processor may end up idle, waiting for memory loads.

The programming models comparison:

These architectural concepts are used by both NVidia and ATI. Each of the 3 GPU programming models (CUDA, OpenCL, AMP) has its own dialect and naming. The following table shows the terms used by all three technologies. We will discuss the AMP terms used in this table later in the article.

Term | CUDA | AMP C++ | OpenCL
Basic unit of work | thread | thread | work-item
The code executed on one item | kernel | kernel | kernel
The unit processing one group of work units | streaming multiprocessor | - | compute unit
Cache-style memory accessible by grouped threads | shared memory | tile static memory | local memory
Group of work units sharing local memory | warp | tile | work-group

The tools offered by each framework

Now that the architecture of the GPU has been described, we can define the tools which each framework needs to provide. In the next part of the article I will try to describe how AMP addresses these issues.

  • Tell the compiler which part of the code will be offloaded to the GPU
  • Provide constructs to work with multidimensional vectors
  • Provide a way to transfer the data between the GPU and the CPU
  • Give the developer tools to write efficient programs. That is to address GPU specific problems such as memory access

AMP detailed description and more examples

Here is the first example again, with the little bit of boilerplate code that we have to write to make it work:

void vectorsAddition_Parallel(vector<float> vA, vector<float> vB, vector<float> vC, int M, int N){
    extent<2> e(M, N);
    array_view<float, 2> a(e, vA), b(e, vB);
    array_view<float, 2> c(e, vC);
    // capturing the data through array_view results in a copy of the data into the accelerator's memory
    parallel_for_each(e, [=](index<2> idx) restrict(amp){
        c[idx] = a[idx] + b[idx];
    });
}

The code sent to and executed on the GPU is called the "kernel" (here one line of code). The kernel is passed to the GPU in the form of a lambda expression through the parallel_for_each method. This method is the entry point to AMP and takes two parameters: parallel_for_each(extent, delegate). The extent parameter describes the dimensionality of the data. The delegate encapsulates the logic which will be executed, usually defined as an anonymous function which takes an index<N> as parameter. The computation is expressed using this index on the previously defined arrays. In the above sample, c[idx] = a[idx] + b[idx] simply means that for each index (and the index goes from (0,0) to (M-1,N-1), since it is a two-dimensional index) the elements at these positions of the arrays are added and stored in the array c. Of course this code is not executed sequentially; instead it is defined as a set of vector operations which are scheduled on the GPU.

The extent as well as the index parameter are templated. The index identifies one unique position in N dimensional array. The extent describes the dimensions of the computation domain.

The array_view class takes care of copying the data to and from the GPU. When the computation is finished, the synchronize method is called on the vC array_view. This call will synchronize the C vector with the array_view. To give the complete picture, note also that there is an array class which behaves similarly, with a few inconveniences and advantages. This post gives a good comparison of the two classes.
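To make the difference concrete, here is a small sketch (not from the original post) that doubles a vector using the array class; the explicit copy back at the end is exactly what array_view would do for you on synchronize. It assumes #include <amp.h> and using namespace concurrency, as in the other examples.

std::vector<float> doubleWithArray(const std::vector<float>& input)
{
    int n = (int)input.size();
    array<float, 1> gpuData(n, input.begin());   // the data is copied to the accelerator here
    parallel_for_each(gpuData.extent, [&gpuData](index<1> idx) restrict(amp)
    {
        gpuData[idx] *= 2.0f;                    // runs on the GPU
    });
    std::vector<float> result(n);
    copy(gpuData, result.begin());               // unlike array_view, the copy back is explicit
    return result;
}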

The following example illustrates some of the difficulties which we can run into when writing parallel code. The code simply sums all the elements in an array in a parallel way. The parallelization of the addition of items requires a bit of engineering, even though the sequential solution is evident. Here I am using a technique which is not really efficient but demonstrates the principles of parallelization. Several techniques to parallelize the task are described in this article.

float sumArray_NaiveAMP(std::vector<float> items){
    int size = (int)items.size();
    array_view<float, 1> aV(size, items);

    // in each pass two partial sums that are i apart are added together
    for (int i = 1; i < size; i = 2 * i){
        parallel_for_each(extent<1>(size / 2), [=](index<1> idx) restrict(amp)
        {
            aV[2 * idx * i] = aV[2 * idx * i] + aV[2 * idx * i + i];
        });
    }
    return aV[0];
}

The algorithm adds each two neighbouring items and stores the result in the first of them. This has to be repeated until the sum of the whole array is stored in the first position. As described in the mentioned article, this approach is not memory efficient and optimizations exist. Note also that the synchronize method is not called on the array_view at the end of the computation. That is because we don't want the modified data to be copied back from the GPU to the main memory; we are only interested in the sum of the elements.

Another example is the computation of the standard deviation of the values in an array. The first step is the computation of the average of the array. To obtain the average, we first have to add up the elements of the array (using the previous example). Having the average, we can obtain the distance of each element from it. Once we have the distance of each element, we have to perform another addition before taking the final square root and obtaining the standard deviation.

float standard_deviation(vector<float> vA) {

    int size = (int)vA.size();

    extent<1> e(size);
    vector<float> vDistance(size);

    array_view<float, 1> a(e, vA);
    array_view<float, 1> distance(e, vDistance);

    float total = sumArray_NaiveAMP(vA);
    float avg = total / size;

    // squared distance of each element from the average
    parallel_for_each(
        e,
        [=](index<1> idx) restrict(amp) {
            distance[idx] = (a[idx] - avg) * (a[idx] - avg);
        });
    distance.synchronize();

    float dispersion = sumArray_NaiveAMP(vDistance);
    return sqrt(dispersion / size);
}

This algorithm has 3 parallelized parts: the two sums (at the beginning and at the end), and the calculation of the distance of each element in between.

Looking at both of the preceding examples, you might be wondering why the code is so complex and you might think that the sum of the elements in the array could be just written as:

float sum = 0;
parallel_for_each(
 e, 
 [=](index<1> idx) restrict(amp) {
 sum+=a[idx];
});

However, if you think about it a bit more, you should see that the code in the parallel_for_each essentially runs all at once. All the parallel threads would like to increment the sum variable at the same time. On top of that, this code would not even compile, since modification of variables captured "by value" is not allowed, and here the sum variable is captured by value. If you are not familiar with the different capture types, refer to this page.

Here is one more example which illustrates how index and extent work; it is the second hello world of parallel computing: matrix multiplication. This example comes from this MSDN page.

void matrixMultiplicationWithAMP(vector<float> &vC, vector<float> vA, vector<float> vB, int M, int N) {
 
 extent<2> e(M,N);
 array_view<float, 2> a(e, vA);
 array_view<float, 2> b(e,vB);
 array_view<float, 2> product(e, vC);
 
 parallel_for_each(
 product.extent, 
 [=](index<2> idx) restrict(amp) {
 int row = idx[0];
 int col = idx[1];
 for (int inner = 0; inner < M; inner++) {
 product[idx] += a(row,inner) * b(inner,col);
 }
 }
 );
 product.synchronize();
}

Note that the resulting vector vC is passed to the method by reference, since its content is modified by the synchronize call. Also note that this example assumes that the vectors passed to the function contain two dimensional arrays of size (N,N). Since AMP supports multidimensional indexes, AMP runs over all the columns and all the rows automatically, just by iterating over the two-dimensional index. The inner loop just sums the products of the elements in the current row of the left matrix and the current column of the right matrix.

Moving the data between GPU and CPU

As mentioned before, the array_view and array classes are used to transfer the data between the CPU and the GPU. The array class copies the data to the GPU's global memory directly, however the data then has to be copied back to the CPU's memory manually. The array_view class, on the other hand, works as a wrapper: in the background it copies the data from and to the vector which is passed in as a parameter.

Memory access and manipulation on AMP

As described above, the developer has to adapt the algorithm to the GPU architecture. This basically means minimizing the access to global memory and letting the threads use the local memory. This process is called tiling in AMP's parlance and the local memory is called tile-static memory.

If the developer does not define any tiling, the code will be tiled by default. In order to use the local memory efficiently, the algorithm has to be tiled manually. The parallel_for_each method has a second overload which accepts a tiled_extent as a parameter, and the code then receives a tiled_index in the lambda. Similarly to the extent, the tiled_extent specifies the dimensionality of the computation domain, but it also divides the whole computation domain into several tiles. Each tile is then handled by one streaming multiprocessor, therefore all the threads in the tile can share the local memory and benefit from the fast memory access. If you want to read a bit more about tiling, visit this page.
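Just to give an idea of its shape, here is a minimal sketch of a tiled kernel (not a complete algorithm): it assumes #include <amp.h> and using namespace concurrency as in the previous examples, a 16x16 tile size, an extent divisible by the tile size, and av being an array_view<float, 2> over the data.

static const int TS = 16;
parallel_for_each(av.extent.tile<TS, TS>(), [=](tiled_index<TS, TS> tidx) restrict(amp)
{
    // tile_static data lives in the fast per-tile (local) memory and is shared
    // by all the threads of the tile
    tile_static float cache[TS][TS];
    cache[tidx.local[0]][tidx.local[1]] = av[tidx.global];

    // wait until every thread of the tile has filled its part of the cache
    tidx.barrier.wait();

    // ... from here on the threads of the tile can work with cache[][]
    // instead of re-reading the global memory ...
});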

About the future of AMP

As said at the beginning, AMP is a standard, and like any standard it depends on its implementations. Currently there are two existing implementations of the AMP standard. Microsoft implemented AMP on top of the DirectCompute technology. DirectCompute is a part of Microsoft's DirectX suite, which was originally aimed only at multimedia programming; Microsoft added DirectCompute to the DirectX suite in order to enable GPGPU programming, and with AMP C++ it provides an easy way to use the API. The second implementation is very recent and was developed by Intel under the code name Shevlin Park. This implementation builds on top of OpenCL.

Summary

Clearly the success of the standard depends on whether other implementations targeting CUDA and OpenCL will emerge. Microsoft cooperated with NVidia and AMD during the development of the API. The idea of having a clear C++ standard to define parallel computation is great. The latest C++ is quite a modern language and provides nice constructs, so programming with AMP C++ is actually quite fun and not that much pain.

Links

Introduction and a few samples: Parallel Programming in Native Code - the blog of the AMP team

CUDA architecture evolution

GeForce GTX 680 Kepler Architecture

Comparison between CUDA and OpenCL (though a little outdated, June 22, 2011)

http://msdn.microsoft.com/en-us/library/hh553049.aspx

Introduction to Tiling

Implementation of LLVM & Clang to support AMP C++ on NVidia

Wednesday, June 27, 2012

Programming languages for the age of Cloud

This post talks about the aspects which are influencing computer languages these days. We are in the age when sequential execution is over. Even your laptop has a processor with several cores. The cloud provides us with tons of machines on which we can run our code. We are in the age of distribution, parallelization, asynchronous programming and concurrency. As developers we have to deal with the challenges which arise from this new environment. Computer language scientists have worked on the subject since the seventies, and nowadays concepts which have been studied for a long time are influencing the mainstream languages. This post describes how. The motivation for this post was this panel discussion at last year's Lang.NEXT conference, where some of the greatest language architects of our day discuss what the ideal computer language should look like (Anders Hejlsberg - C#, Martin Odersky - Scala, Gilad Bracha - Newspeak, Java, Dart and Peter Alvaro).

Web and Cloud programming

"Web and cloud programming" was the title of the mentioned panel discussion. Cloud programming is quite noncommittal term. What do we mean by "cloud programming"? Is it programming on the cloud (with the IDE in the cloud)? or programming applications for the cloud (but that can be just any web app right)? It turns out this is just a term to depict the programming in distributed environment.

Programming in distributed environment

Programming in a distributed environment is a much better term - but again it might not be completely clear. The majority of today's applications are not sequential anymore. The code flow of the program is parallel and asynchronous, and the program has to react to external events. The code and the application itself are distributed. They might be distributed over several cores or nodes, or it might just be the server side - client side code separation. You can have code running on the backend, some other bits (maybe in a different language) running on the front end, some code waiting for a response from a web service and some other code waiting for the response of the user on the client side. You as a developer have to handle the synchronization.

We might actually say that today's web programming is distributed and asynchronous. As developers, we have to make the switch from traditional sequential programming to distributed and asynchronous code. The advent of cloud computing is forcing this transition.

Non-sequential, parallel or asynchronous code is hard to write, hard to debug and even harder to maintain. Writing asynchronous code is challenging; writing an asynchronous application in a transparent, maintainable manner might feel impossible. Just think about the global variables which you have to create to hold the information about the 'current situation', so that when a response from a web service arrives you are able to decide and take the right actions. It is this maintenance of the global state which is particularly difficult in asynchronous programming.

What are the tools which will help us with the transition to distributed, asynchronous or parallel coding?

Here is a list of 3 tools which I think might be helpful:

  • Conceptual models - As developers we can follow a conceptual model, for instance the actors model, in order to organize and architect the program.
  • Libraries - To implement one of the models (or design patterns) we can use tested and well documented code, for instance Akka.
  • Computer languages - The biggest helper will be on the lowest level: the computer language itself.

Models, Libraries and languages

Libraries are and will be the principal tools for making developers' lives easier. There are several libraries available to help with asynchronous, event-driven programming in many different languages; Akka, Node.JS or SignalR are just examples. But libraries themselves are built using languages. So the question is: what can languages bring to make life easier for developers in the age of cloud and distribution?

Modern languages have two characteristics worth discussing here: how functional they are, and how they treat types.

Benefits of functional languages

Functional languages might be one of the helpers in the age of distributed computing. Some imperative languages are introducing functional aspects (for instance C# has been heading that way for a long time), while others, designed from scratch, are much closer to a "pure" functional style (Scala, F#, or the purest - Haskell). Let's first define some terms and possible benefits of functional programming. From my point of view (and I admit a quite simplified point of view) there are four aspects of functional programming that are useful in everyday coding:

  • Elimination of the "global state" – the result of a method is guaranteed no matter what the actual state is.
  • The ability to treat functions as first class citizens.
  • Presence of immutable distributable structures - mainly collections.
  • Lazy evaluation – since there is no global state, we can postpone the execution and evaluation of methods until the results are needed.

Eliminating the state

It's hard to keep shared state in parallel systems written in imperative languages. Once we leave the safe sequential execution of the language, we are never sure what the values in the current state are. Callbacks might be executed at just about any time, depending for example on the network connection, and the values in the main state could have changed a lot since the time of the "expected execution". Purely functional programming eliminates the outer "state of the world". The state always has to be passed locally. If we imagine such a language, all the methods defined would need an additional parameter in the signature to pass the state.

int computeTheTaxes (List<income> incomes, StateOfTheWorld state);

That is really hard to imagine. So, as was said in the discussion: pure functional programming is a lie. However we can keep this idea and apply it to some programming concerns. For instance, the immutability of collections might be seen as an application of the "no current state" paradigm.

Since the "current state" does not exist, the result of a function invoked with the same arguments should always be the same. This property is called "referential transparency" and it enables lazy evaluation. So the elimination of the global state might be seen as the precondition for other functional language features such as lazy evaluation.

Function as a first class citizen

Another aspect of functional programming is the fact that functions become first class citizens. That means that they can be passed as arguments to other functions (functions taking functions are called higher order functions). This is an extremely powerful concept and you can do a lot with it. Functions can also be composed and the resulting composition applied to values. So instead of applying several consecutive functions to a collection of values, we can compose the resulting function and apply it at once. In C#, LINQ uses a form of functional composition, which will be discussed later.

Lambdas and Closures

Lambdas are anonymous functions - defined on the fly. Closures add the ability to capture the variables defined around the definition of the function. C# has had closures and lambdas since version 3.0; they should finally arrive in Java with version 8. Talking about JavaScript, it just seems that they have always been there. Any time you define an anonymous function directly in the code, you can use the variables from the current scope in your function. Hence you are creating a closure.

var a = 10;
var input = 5;
request(input, function(x) {
    a = x;
});

Any time you use a variable from the outer scope in the inner anonymous function, we say that the variable is captured.

Closures are also available in Python, and since C++11 they are even available in C++. Let's stop for a while here, because C++ adds the ability to distinguish between variables captured by reference and variables captured by value. The syntax for lambdas in C++ is a bit more complicated, but it allows the developer to specify for each variable how it should be captured. In the following code the v1 variable is captured by value and all the other variables are captured by reference, so the value of v2 will depend on what happened before the lambda actually executes.

int v1 = 10;
int v2 = 5;
for_each(v.begin(), v.end(), [&, v1](int val)
{
    cout << val + v2 - v1;
});

You can see that even such an old school imperative language as C++ has been influenced and modified to embrace functional programming.

Closures add the ability to use the state from the moment the anonymous function was defined as the state used during the function's execution.

Collections in functional programming

In (pure) functional programming languages, collections are immutable. Instead of being modified, a copy of the collection is returned by each operation performed on it. It is up to the designers of the language to make the compiler reuse as much of the existing collection as possible in order to lower the memory consumption. This allows the programmer to write the computation as a series of transformations over collections. Moreover, thanks to lambdas and closures, these transformations may be defined on the fly. Here is a short example:

cars.Where(x => x.Mark == "Citroen").Select(x => x.MaxSpeed);

This transformation will return an iterator of speeds of all Citroens in the collection. Here I am using C#/F# syntax, however almost the same code would compile in Scala.

The selector (“Where”) and the mapper (“Select”) both take as argument a function which takes an item of the collection. In the case of the selector the function is a predicate which returns “true” or “false” in case of the mapper, the function just returns a new object. Thanks to lambdas we can define both of them on the fly.

Language integrated data queries

Lazy evaluation also comes from functional languages. Since the result of a function does not depend on the "state of the world", it does not matter when we execute any given statement or computation. The designers of C# took inspiration from this when creating LINQ. LINQ enables the translation of the above presented chain of transformations into another domain specific language. Since lazy evaluation is used, each computation is not performed separately; rather a form of "functional composition" is used and the result is computed once it is needed. If the 'cars' collection were an abstraction over a relational database table, the result would be translated into "select maxSpeed from cars where mark='Citroen'", instead of two queries (one for each function call).

Internally, LINQ translates the C# query (the dotted pipeline of methods) into an expression tree. The tree is then analyzed and translated into a domain specific language (such as SQL). Expression trees are a way to represent code (computation) as data in C#. So in order to develop and integrate the LINQ magic into a language, the language needs to support functions as first class citizens and also has to be able to treat code as data.
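To make the "code as data" part more tangible, here is a small self-contained sketch (the Car class is just a stand-in for the cars example above, not an API of any library): the same lambda is written once as a delegate and once as an expression tree that a provider could inspect and translate.

using System;
using System.Linq.Expressions;

class Car { public string Mark; public int MaxSpeed; }

class ExpressionTreeDemo
{
    static void Main()
    {
        // the same lambda written twice: once compiled to executable code,
        // once kept as a data structure that a LINQ provider can inspect
        Func<Car, bool> compiled = c => c.Mark == "Citroen";
        Expression<Func<Car, bool>> tree = c => c.Mark == "Citroen";

        var comparison = (BinaryExpression)tree.Body;
        Console.WriteLine(comparison.NodeType);              // Equal - a provider could emit "mark = 'Citroen'"
        Console.WriteLine(compiled(new Car { Mark = "Citroen" }));  // True - this one just runs
    }
}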

Maybe as I wrote it, you are thinking about JavaScript and you are right. In JavaScript you can pass around functions and you can also pass around code and later execute it (the eval function). No wonder that there are several implementations of LINQ for JavaScript.

Similar concept inspired some Scala developers and since Scala posseses the necessary language features, we might see similar concepts in Scala also (ScalaQL).

Dynamic or Typed languages

What are the benefits of a dynamic or a strongly typed language? Let's look for the answer in everyday coding: coding in a dynamic language is, at least at the beginning, much faster than in a typed language. There is no need to declare the structure of an object before using it, no need to declare the type of simple values or objects. The type is just determined on the first assignment.

What works great for small programs might get problematic for larger ones. The biggest advantage of a type system in the language is the fact that it is safer: it won't let you assign apples to oranges. It eliminates many errors, such as looking for a non-existing method of a type.

The other advantage is the tooling which comes with the language. Auto-completion (code completion based on the knowledge of the type system) is an example of one such tool. The editor is able to propose the correct types, methods, or properties. The type structure can be used for analysis or later processing; for instance, documentation might be easily generated from the type system just by adding certain metadata.

Several languages offer compromises between the strongly typed (safe) and dynamic (flexible) worlds. In C# we can use the dynamic keyword to postpone the determination of the object type to runtime. Dart offers an optional type system. Optional type systems let us use the tooling without polluting our lives with too many typing exercises. This comes in handy sometimes.

JavaScript as the omnipresent language


JavaScript is everywhere and lately (with Node.JS and Microsoft investing heavily into it) it is drawing more and more attention. The language has some nice features: it treats functions as first class citizens, supports closures, and it is dynamic, but it has one big drawback: it absolutely lacks any structure. Well, it's not a typed language, so we cannot expect any structure in the type system, but it also lacks any modularization.

Objects are defined as functions or 'just on the fly', and there is always this giant current state which holds all the variables and everything, and which gets propagated everywhere. I still have not learned how to write well structured JS programs, and there are too many concepts of JavaScript which I have not understood completely. As was said in the discussion: you can probably write big programs in JavaScript, but you cannot maintain them. That's why Google is working on Dart. The future version of ECMAScript will try to solve the problems of JavaScript by bringing a module system, classes and static typing. But the big question will of course be the adoption by the browsers.

Summary

  • Libraries will always be the core pieces enabling us to write distributed software.
  • Languages should be designed in a way that minimizes state and controls 'purity' – functional languages are well studied and concepts coming from them will become omnipresent in everyday programming.
  • Type systems should be there when we need them and should get out of our way when we don't.

The future might be interesting. Lately I have been forced to write a lot of Java code and interact with some legacy code (<1.5), and besides the typing exercise it does not provide me any benefits; it just bores me. Well, I am a fan of C#, because the authors of C# seem to look for interesting features in other languages or concepts (closures, expression trees, dynamic typing, or lately incorporating the asynchronous model directly into the language) and introduce them into the well established statically typed, compiled (and for me well known) world.

But whether it is Scala, Python, Dart, JavaScript or C#/F# - I think we should be trying to adopt modern languages as fast as possible, and that for just one reason: to express more with less code.

Sunday, March 11, 2012

Collective Intelligence: Ants colony solving TSP

According to wikipedia:

Collective intelligence is a shared or group intelligence that emerges from the collaboration and competition of many individuals and appears in consensus decision making in bacteria, animals, humans and computer networks”.

This article describes how to make ants find a solution to the TSP problem, implemented in Python. Get the source from GitHub.

stable_2

The algorithms based on the collective intelligence have some “interesting” properties:

  • decentralization
  • parallelism
  • flexibility, adaptability
  • "robustness" (failures)
  • auto-organization

These algorithms are inspired by the nature. Here are some examples of collective intelligence which can be observed in the nature:

  1. The swallows settle on wires before taking off for their next destination. There is no leader in the group; the decision whether to take off is made collectively. The probability that a swallow takes off gets higher when there are more swallows in the air. If the other swallows do not join, the swallow settles back down on the wire. At one point the number of swallows in the air reaches the "break-point", when all the swallows decide to take off.
  2. The bees perform a special "dance" to show their peers where the food source is. This dance gives the information about the angle of the food source position with respect to the sun. All the bees perform the dance when coming back in, which makes the algorithm adaptive.
  3. When an ant finds food, it lays down a "positive pheromone" on its way back. This pheromone evaporates over time. The other ants sniff for this pheromone when choosing their route and prefer to go where the concentration of the pheromone is higher. The shorter the path to the food source, the more pheromone stays on the track before it evaporates; and the more pheromone there is, the more ants take this path. When there is an obstacle in the route, the algorithm adapts easily to the new situation: again the shortest route around the obstacle is found in the shortest time. Some details here.

There exist several applications of collective intelligence in engineering and optimization. The ants example has applications especially in routing. One of the basic problems which can be solved using an ant colony is the Travelling Salesman Problem.

Travelling Salesman Problem

The Travelling Salesman Problem is a classical problem in the field of graph theory. Given n cities, the salesman has to visit all of the nodes, come back to his starting location and minimize the traveled distance. Although the problem is NP-hard, several heuristic algorithms exist which obtain suitable results in polynomial time.

Ant colony algorithm for TSP

The ant colony algorithm was introduced in 1997 by the Italian researcher Marco Dorigo.

At the beginning the ants start to explore the graph. They choose their nodes randomly, until they have visited all of them. At this point the ant starts to copy its path back to its starting point. While it copies the path, on each edge it lays down a certain amount of pheromone, inversely proportional to the length of the tour. When an ant later starts a new route, at each node it computes for every available edge the probability of choosing that edge to continue the route. The probability of choosing edge e in each step is computed as follows.

image
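Written out with the symbols described below, the probability that the ant takes edge e is:

p_e = (π_e^α · η_e^β) / Σ_{e' ∈ J_e} (π_e'^α · η_e'^β)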

  • π_e corresponds to the amount of pheromone on the edge e.
  • η_e corresponds to the quality of the route. We can estimate this value by the length of the edge: η_e = 1/d_e, where d_e is the length of the edge.
  • J_e is the set of all the edges which the ant can use for its next step - it includes all the edges except the one for which we compute the probability.
  • α and β are coefficients which control how strongly the pheromone level and the quality of the route, respectively, affect the ant's decision.
  • The amount of pheromone deposited on each edge of a finished route is l = 1/routeLength^k, where k is a coefficient which amplifies the impact of the route length on the decisions of the other ants.
  • Over time the pheromone evaporates from the edges. The evaporation can be expressed as: π_e = (1-ρ)·π_e.

The implementation in Python

Most of what I have learned and presented here comes from the Collective Intelligence intensive course at Telecom ParisTech given by Jean-Louis Dessalles as part of the Athens program. The implementation was done in Python, as a module for the EvoLife program, which is a custom tool developed by Jean-Louis Dessalles for scientific observations on genetic algorithms and collective intelligence algorithms. I have decided to make a stand-alone Python program by taking the important bits out, just for the ant colony algorithm. I almost never work in Python, so if anyone out there sees any big wrongdoing against the Python culture, naming conventions etc., let me know.

The most important bits are in the Ant class.

self.Action = 'SearchNextNode'
self.Node
self.Path = []
self.PathLength = 0
self.ToVisit = []
self.Visited = []

The Action field remembers the ant's current state. Each action has a corresponding method associated with it. The Node field holds the current node. In ToVisit and Visited the ant stores the nodes that it still needs to visit and the ones it has already visited, in order to achieve TSP completeness. Here is the "moves" method which is called repeatedly for each ant:

def moves(self):
    #here is the ants move - one of the following actions is always selected
    if self.Action == 'GoToNode':
        self.goToNode()
    if self.Action == 'SearchNextNode':
        self.searchNextNode()
    if self.Action == 'ReturnToStart':
        self.returnToStart()
    if self.Action == "GoToFirst":
        self.goToFirst()

The most important of these methods is searchNextNode, where the ant searches for the next edge to explore. This method has to implement the behavior described in the previous paragraph.

def searchNextNode(self):
    nodeindex = self.Node.Id
    #the maximal probability
    pMax = 0
    p = 0
    #try to select the best node by the pheromones in the direction
    #have to iterate over all nodes
    for i in range(NbNodes):
        if i != self.Node.Id and self.Visited[i] == 0:
            d = Land.Distances[self.Node.Id][i]
            #get the value of pheromone on the edge
            pi = Land.Edges[self.Node.Id][i]
            #to prevent division by zero and get some info
            #when d = 0 there would be a problem in the computation below
            if d == 0:
                print i
                print self.Node.Id
            #the quality of the route
            nij = 1/d
            pselected = math.pow(pi, alfa) * math.pow(nij, beta)
            #normalization
            #compute the sum of the other options
            sum = 0
            for j in range(NbNodes):
                if j != self.Node.Id and self.Visited[j] == 0 and j != i:
                    dj = Land.Distances[self.Node.Id][j]
                    pij = Land.Edges[self.Node.Id][j]
                    nj = 1/dj
                    pj = math.pow(pij, alfa) * math.pow(nj, beta)
                    sum += pj
            if sum > 0:
                p = pselected/sum
            #if we have a new best path - then remember the index
            if p > pMax:
                pMax = p
                nodeindex = i

We have to iterate over all the neighbouring nodes in order to choose the right one. For each node, the probability of taking the edge going to that node is computed according to the formula given in the previous paragraph. Some remarks: the value of the pheromone on each edge is stored in a global array, Land.Edges[nodeFrom][nodeTo]. Also the distances between all nodes are pre-calculated in Land.Distances[nodeFrom][nodeTo].
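The pheromone update itself is not shown here. Just to illustrate the two remaining rules (deposit and evaporation), here is a hypothetical sketch in the spirit of the code above - these two functions are not part of the actual source; they only reuse the global names Land.Edges, NbNodes, k and theta from Program.py (math is already imported by the module):

def layPheromone(path, pathLength):
    #an ant that finished a tour of length pathLength deposits 1/pathLength^k
    #on every edge of its path (in both directions, the graph being symmetric)
    amount = 1/math.pow(pathLength, k)
    for start, end in zip(path, path[1:]):
        Land.Edges[start][end] += amount
        Land.Edges[end][start] += amount

def evaporate():
    #called once per round: every edge loses the fraction theta of its pheromone
    for i in range(NbNodes):
        for j in range(NbNodes):
            Land.Edges[i][j] *= (1 - theta)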

There is quite a lot of code regarding the visualisation. The TkInter library was used for drawing, and the PIL library also has to be installed. It should not take too long to figure out the responsibility of each class.

The results

Here is how the resulting program looks like:

image

And here is the evolution of the final decision. At first all edges carry some amount of pheromone, and over time the preferred edges emerge. Because the ants choose the edges randomly at the beginning, the result is never guaranteed to be the same. The following three images show an evolution which resulted in a quite suboptimal solution.

image

image

image

The quality of the solution depends heavily on the values of the coefficients. These values can be changed in the Program.py file:

#level of pheromone to show
PToShow = 0.004
#factor which lowers the value given to a path as a function of the path's length
k = 1
#evaporation factor
theta = 0.07
#parameter which amplifies the value of the pheromone on the edge (pi^alfa)
alfa = 4
#parameter which amplifies the impact of the quality of the route ni^beta; ni=1/de
beta = 2.5

Putting aside the coefficients described above, there is also the PToShow value, which determines the minimal amount of pheromone an edge must carry in order to be drawn in the picture.

Before this implementation I had an earlier one, which was not at all efficient but quite funny. In that implementation the ants could move freely around – there was no notion of edges. The ants simply took a direction towards a certain node, and when they got close enough to it, they considered the node reached and continued to the next one. This was useless, but I saved these funny graphics with the ants moving all around:

10_1

And the ants finding the solution:

10_2

The source code

Download it here.

As said before, the source was created as a module for a larger program and later taken out to be executable in isolation. Therefore there is still quite a lot of refactoring which could be done, but I do not consider it necessary, since the purpose is merely to present the ant colony algorithm.

Thursday, March 8, 2012

Graph theory in Latex

For one of my previous posts, I needed some images of graphs. Initially I thought that I would just draw them in Inkscape or some other tool, but after a while I decided to do something more clever – which might maybe serve me in the future – and draw the graphs in LaTeX. After a quick search I found this blog:

http://graphtheoryinlatex.wordpress.com/ and older posts from the same author: http://graphtheoryinlatex.blogspot.com/

This blog is about drawing graphs in TeX. So what do you need:

TikZ – a graphic system for Tex

Tkz-graph – style with basic graph drawing macros.

Tkz-berge – style with more complex drawing – such as predefined graphs (full graphs, bipartite graphs, grids etc.)

Some TeX editor. I am using Inlage, which is not expensive and takes care of downloading all the necessary packages.

So here are the graphs and variants which I have used in the previous post:

Graph with 4 vertices placed in a circle:

\begin{tikzpicture}
\GraphInit[vstyle=Welsh]
\SetGraphUnit{2}
\Vertices{circle}{A,B,C,D}
\Edges(A,B,C,D,A,C)
\SetVertexNoLabel
\end{tikzpicture}
image

Coloring some of the nodes:

\begin{tikzpicture}
\GraphInit[vstyle=Classic]
\SetGraphUnit{2}
\Vertices{circle}{A,B,C,D}
\Edges(A,B,C,D,A,C)
\SetVertexNoLabel
\AddVertexColor{red}{B,D}
\AddVertexColor{green}{A}
\AddVertexColor{blue}{C}
\end{tikzpicture} 

Graph with labeled edges

\begin{tikzpicture}
\GraphInit[vstyle=Welsh]
\SetGraphUnit{2}
\Vertices{circle}{A,B,C,D,E}
\SetUpEdge[style={->}]
\Edges[label=1](A,B)
\Edges[label=1](B,C)
\Edges[label=1](C,D)
\Edges[label=1ドル$](D,E)
\Edges[label=x1](A,C)
\Edges[label=x2](A,D)
\Edges[label=x3](A,E)
\SetVertexNoLabel
\end{tikzpicture}

Graph with positioned vertices.

Using the EA, NOEA and similar macros (EA = define the second vertex east of the first one, NOEA = northeast of it, ...)

\begin{tikzpicture}
\SetGraphUnit{2}
\GraphInit[vstyle=Normal]
\Vertex{S}
\NOEA(S){A} \SOEA(S){B}
\EA(A){T1} \EA(B){T2}
\SOEA(T1){F}
\Edges[label=1](S,A)
\Edges[label=1](S,B)
\Edges[label=4](A,T1)
\Edges[label=2](B,T2)
\Edges[label=1](T1,F)
\Edges[label=1](T2,F)
\Edges[style={pos=.25},label=3](A,T2)
\Edges[style={pos=.25},label=2](B,T1)
%Could use this for bending%
\tikzset{EdgeStyle/.append style = {bend left}}
\end{tikzpicture}

Bipartite graph using the berge styles:

\begin{tikzpicture}
\grCompleteBipartite[RA=4,RB=3,RS=6]{2}{2}
\end{tikzpicture}

image

For now I am content that I have found this blog, and actually I have to say that drawing graphs in TeX is much easier than I expected.

Saturday, February 25, 2012

Applications of Graph Theory

Graph theory is just a beautiful part of mathematics. Not only is Computer Science heavily based on graph theory; there are also a lot of applications of graph theory in Operational Research, Combinatorial Optimization and Bioinformatics.
For my personal classification I have separated the tasks which you can solve using graph theory into two groups:

  • Obvious applications – I mean that the analogy between the graph and the problem is quite easy to imagine (maps, cities, relations etc.). In this post you can find the following:
    • Vehicle Routing Problem
    • Graph coloring
    • Map coloring

  • Hidden applications – tasks which you would never assume can be solved using graph theory. Then you see one of them, and you think: "Wow, I wonder who came up with that one...". I will cover the following ones in this post:
    • Image or 3D model reconstruction from projections
    • Proof of NP-hardness of Integer Linear Programming
    • Register allocation
    • Approximation of data, data compression
    • Task scheduling

Obvious applications

Here are some examples of problems for which the creation of the graph to model the problem is quite easy and obvious.

Vehicle routing problem and other variants of TSP

There is a whole set of problems which are just variations of the Traveling Salesman Problem. The Vehicle Routing Problem (VRP) can be characterized by the following description:

  • We are provided with a set of trucks, each of a certain capacity
  • A set of customers, each with a certain demand for the goods
  • A central depot (repository) where the goods are stored

The task is to route the trucks so that the goods are distributed to the customers while the total distance travelled is minimized. I have written a blog post and a web utility which can solve the problem using two algorithms:

    • Clarke & Wright Savings Algorithm
    • The Sweep Algorithm

image image

You can find more algorithms for solving VRP here.
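
To make the statement above a bit more formal, one common way to write the VRP down is the set-partitioning formulation sketched here (the symbols are my own: R is the set of all routes that start and end at the depot and respect the truck capacity, c_r is the length of route r, a_{ir} = 1 when customer i lies on route r, and y_r decides whether route r is used):

\[
\min \sum_{r \in R} c_r\, y_r
\qquad \text{s.t.} \qquad
\sum_{r \in R} a_{ir}\, y_r = 1 \ \ \text{for every customer } i,
\qquad
y_r \in \{0,1\}.
\]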

Graph coloring problem

Given a graph, we want to decide whether it is possible to color each vertex of the graph in such a way that no two vertices sharing an edge have the same color. Many real-world problems can be formulated as a graph coloring problem. The first one is map coloring.
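
Written formally (this is just my notation): for a graph G = (V, E) and k colors we are looking for a mapping

\[
c : V \to \{1, \dots, k\}
\quad \text{such that} \quad
c(u) \neq c(v) \ \text{for every edge } \{u, v\} \in E .
\]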

Map coloring

One of the first applications is the map coloring problem. It has been proven that every map can be colored using 4 colors. The problem can be converted to a graph coloring problem by placing a vertex inside each country or region of the map. Two vertices are connected if and only if the two countries share a common border. More on this here.

Hidden applications

There are tasks and problems for which you would not intuitively search for a solution by applying graph theory. Here are some of them:

Image reconstruction from X-rays – computed tomography

Tomography is a technique used to reconstruct an image or 3D model from a series of projections taken successively from different angles. When using technologies such as X-rays, the image taken from one angle gives for each pixel the total thickness of the scanned object along that direction. The question is then how to reconstruct the image from several such images which contain only the thicknesses.

As described in the great book “Network Flows – Theory, Algorithms and Applications”, a concrete example of computed tomography is the “Reconstruction of the Left Ventricle from X-ray Projections”. This problem can be solved by applying network flow theory. The method is applicable only to problems where the scanned object has a uniform structure; as mentioned in the book, this assumes that a well-working ventricle is filled uniformly with a blood and dye mixture.

The following graphic was taken from the book. It explains the method on a two-dimensional image. Using two projections of the object, we obtain vectors which contain, for each pixel (or other unit), the probable mass hidden behind it. Now it is up to us to find out how this mass is distributed – that means where the ‘1’s are in the picture. The more projections we have, the better results we can obtain.

image

The problem is thus simplified to constructing a binary matrix from the projection sums. This is a special case of the feasible flow problem.

The following image shows a similar, very simplified task, which I have taken from the Combinatorial Optimization course offered as part of the Open Informatics program at CTU Prague.

image

The whole problem can be seen as the question of finding a feasible flow in a network (G, b, u, l, c). So what does the network consist of:

  • Graph G
  • s – sources – the nodes which provide the fluid into the network – the nodes with positive values
  • t – sinks (consumers) – the nodes which consume the fluid – the nodes with negative values
  • u – upper bound for the flow of each edge
  • l – lower bound for the flow of each edge
  • c – the actual flow on each edge – the one we are looking for. The task is to find the value of c for each edge so that all sinks are satisfied.

Here is the graph G which corresponds to the projections sumR and sumC from the previous image. Each edge in the graph corresponds to one pixel, connecting the two projections. The sumR nodes are the sources in this network and the sumC nodes are the sinks.

image

For each edge the lower bound is l(e) = 0 and the upper bound is u(e) = 1, and we are looking for the values of c(e) for which the flow is feasible and also minimal. The edges used in such a flow are the pixels which will have the value ‘1’.
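
Spelled out (my own summary of the notation above, with b(v) being the balance of node v: the row sum for a source, minus the column sum for a sink), a flow c is feasible when

\[
l(e) \le c(e) \le u(e) \ \ \text{for every edge } e,
\qquad
\sum_{e \in \delta^{+}(v)} c(e) - \sum_{e \in \delta^{-}(v)} c(e) = b(v) \ \ \text{for every node } v,
\]

where \delta^{+}(v) and \delta^{-}(v) denote the edges leaving and entering v.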

Proving NP-hardness of some problems, such as Integer Linear Programming

The graph coloring problem has already been mentioned above: we are trying to color each node of the graph in such a way that nodes connected by an edge do not share the same color.

Integer Linear Programming (ILP) is an NP-hard problem. This can be proven by a polynomial reduction of the graph coloring problem to the ILP problem. Concretely, for each graph which can be colored using 3 colors we are able to construct an ILP instance which has a solution. From the theoretical point of view, saying “we are able to construct” means that there is a polynomial reduction of the graph coloring problem to the ILP problem. The polynomial reduction proves that:

  1. If graph coloring is an NP-hard problem, then ILP is also an NP-hard problem.

The polynomial reduction has to satisfy two conditions in order to prove NP-hardness:

  1. The reduction algorithm – the construction of one problem from another has to be performed in polynomial time
  2. For each graph which can be colored with 3 colors, an ILP instance which has a solution can be constructed (and, conversely, the constructed instance has a solution only if the graph can be colored with 3 colors)

Here is the reduction algorithm (the algorithm which explains how to define an ILP problem for a given graph):

In the beginning we have a graph which we would like to color using 3 colors. We will create an instance of ILP out of this graph, which means we have to define the variables and the constraints which build the ILP problem. We can do this in 3 steps; a compact summary of the resulting ILP follows the list.

  • Create 3N variables x_{n,color}, one for each node and color, where x_{n,color} = 1 <=> the node n has the color color, and N is the number of nodes.
  • For each node n in the graph add an equation to the ILP system:
    • x_{n,red} + x_{n,blue} + x_{n,green} = 1
  • For each edge e = {n_i, n_j} in the graph add the following three inequalities to the system:
    • x_{n_i,red} + x_{n_j,red} <= 1
    • x_{n_i,blue} + x_{n_j,blue} <= 1
    • x_{n_i,green} + x_{n_j,green} <= 1
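
Putting the three steps together, the whole ILP instance can be summarized as follows (using the same x_{n,c} variables, with c ranging over red, blue and green):

\[
x_{n,\mathrm{red}} + x_{n,\mathrm{blue}} + x_{n,\mathrm{green}} = 1 \quad \text{for every node } n,
\]
\[
x_{n_i,c} + x_{n_j,c} \le 1 \quad \text{for every edge } \{n_i, n_j\} \text{ and every color } c,
\]
\[
x_{n,c} \in \{0,1\} \quad \text{for every node } n \text{ and every color } c.
\]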

Here is an example; we have a simple graph:

image

Now the first set of equations, which states that each node has exactly one color:

image

The following set of inequalities states that two nodes sharing an edge cannot have the same color:

image

Now, because the graph coloring problem has been reduced to this ILP problem, we know that the ILP instance has a solution exactly when the graph can be colored with three colors. Here is the solution:

image

Which corresponds to:

image

Graph coloring is NP-hard, so ILP is NP-hard as well. If you wonder how to prove that graph coloring is NP-hard: there is a polynomial reduction from a special type of the SAT problem.

Register allocation

Register allocation is the process of assigning a possibly unbounded set of variables of an assembly program to the finite set of registers which are available in the processor. Not all variables are used at the same time, so several variables can share a register (otherwise this mapping would not be possible). Even this problem is solved using graph coloring. For each variable a vertex is created; vertices are connected if the variables “live” in the program at the same time. The number of colors available for coloring the graph is equal to the number of registers.
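
As a small illustration (my own toy example, reusing the tkz-graph macros from the first part of this post), here is an interference graph for three variables a, b, c where a and b overlap and b and c overlap, but a and c never live at the same time – two colors, i.e. two registers, are enough:

\begin{tikzpicture}
\GraphInit[vstyle=Classic]
\SetGraphUnit{2}
\Vertices{line}{a,b,c}     % one vertex per variable
\Edges(a,b) \Edges(b,c)    % an edge = the two variables are live at the same time
\AddVertexColor{red}{a,c}  % register 1: a and c can share it
\AddVertexColor{blue}{b}   % register 2
\end{tikzpicture}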

Approximation of the data – data compression

This technique is used to approximate data which has to be stored while minimizing the loss of precision.

Take for example data which represents temperatures measured over time and builds a nice graph. If this data was sampled at a high frequency, there might be too many records. The idea is to minimize the number of stored records while keeping most of the information about the evolution of the temperature.

The shortest path algorithm can be used to solve this problem. For instance, the blue line in the following graphic represents the data to be stored. It is 10 values: the times x and the temperatures Temperature(x) at those times. The green and red lines represent possible approximations obtained by leaving out one or two nodes. Of course there are several nodes which could be left out, and the shortest path algorithm can help us find which ones.

image

We can construct a complete graph which will contain 5 nodes, representing the 5 data points (the times x). Each edge represents the “precision loss” which we pay when we take the direct path between the two nodes of the edge instead of passing through the intermediate points. The following picture shows a partial graph – the skipping edges start only in the first node (A). The edge with value x1 corresponds to the red line in the graph, etc. The graph should also be filled with other edges starting in B and C (the only edge going from D to E is already present and there are no edges starting in E), but I have left them out for simplicity.

image

So without compression we have the simple path: A,B,C,D,E = 1 + 1 + 1 + 1 = 4

Taking the red edge and the rest of the path: A,C,D,E = 1 + 1 + 1 + x1

Taking the green edge and the rest of the path: A,D,E = 1 + 1 + x2

The values of the edges in the standard path should be the lowest (here all of them have value 1). On the other hand, the values of edges which make us lose more precision should be greater. Then of course we can introduce some bonus to motivate the algorithm to take fewer edges (better compression). All these constraints can be modeled using heuristics.

One possible heuristic to evaluate the value of an edge is to measure the distance between the real data and the estimated data. For instance, the value of the second point is 5. If we estimate the value using the red line (leaving out the second point), the corresponding value on the red line is 3. The distance between these two values is 2.

If we use the green line instead, then the distance between the estimated value f’(x) and the real value f(x) at the third point is 1. On the other hand, the green line also estimates the second point, and we see that the distance for the second point will be more or less 1.5. We should add these distances together, so we get:

x1 = 2

x2 = 2.5

This is just a proposal; we could also multiply the distances by some coefficient to obtain reasonable results.
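
One way to write this heuristic down (this is just my formalization of the idea above, assuming that the skipped points are approximated by the straight line between the two kept endpoints) is

\[
w(i, j) \;=\; 1 \;+\; \beta \sum_{i < k < j} \bigl| f(x_k) - \hat f_{i,j}(x_k) \bigr|,
\]

where \hat f_{i,j} is the linear interpolation between the kept points i and j, \beta is the coefficient mentioned above, and the constant 1 is the per-edge cost which rewards paths with fewer edges (better compression). For the ordinary edges of the original path the sum is empty, so they keep their value 1.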

With the graph constructed and evaluated, finding the shortest path from node A to node E will give us the best “size/precision” ratio.

Task scheduling

In this problem we have a group of workers and a group of tasks. Each task can be processed by any worker. However, the workers do not have the same performance on all tasks – the processing time of each task differs for each worker.

Let’s take a look at a very simple example: we have two workers (A, B) and two tasks (T1, T2). The values in the table represent the processing time that the worker needs for the given task.


image

This can be solved by finding the cheapest flow in the following graph.

image

Note that each edge has two values: u/c. The ‘u’ represents the capacity of the edge – it is always one. The ‘c’ represents the cost of the edge. Finding the cheapest flow in this graph from S to F will give us the best assignment of workers to tasks.
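
The cheapest flow that saturates the source (value 2 in this small example) is exactly the classical assignment problem. Written directly as an ILP (my notation: x_{wt} = 1 when worker w gets task t, c_{wt} being the time from the table), it might look like this:

\[
\min \sum_{w, t} c_{wt}\, x_{wt}
\qquad \text{s.t.} \qquad
\sum_{t} x_{wt} = 1 \ \ \text{for every worker } w,
\qquad
\sum_{w} x_{wt} = 1 \ \ \text{for every task } t,
\qquad
x_{wt} \in \{0, 1\}.
\]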

Other interesting applications
