|
1 | 1 | # Hadoop Examples
|
2 | 2 |
|
3 | | -[Apache](http://hadoop.apache.org/) [Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop) is a framework for performing large-scale distributed computations in a cluster. Different from [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) (which we discussed [here](http://github.com/thomasWeise/distributedComputingExamples/tree/master/mpi/)), it is based on Java technologies. It may thus be slower than MPI, but can reap the full benefit of the rich libraries and programming environment of the Java ecosystem (just think about all the things we already did in this course). One of the most well-known ways to use Hadoop is to perform computations following the [MapReduce](https://en.wikipedia.org/wiki/MapReduce) pattern (which is a tiny little bit similar to scatter/gather/reduce in MPI). |
| 3 | +[Apache](http://hadoop.apache.org/) [Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop) is a framework for performing large-scale distributed computations in a cluster. Unlike [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) (which we discussed [here](http://github.com/thomasWeise/distributedComputingExamples/tree/master/mpi/)), it is based on Java technologies. It may thus be slower than MPI, but it can reap the full benefit of the rich libraries and programming environment of the Java ecosystem (just think about all the things we already did in this course). One of the best-known ways to use Hadoop is to perform computations following the [MapReduce](https://en.wikipedia.org/wiki/MapReduce) pattern (which is loosely similar to scatter/gather/reduce in MPI). |
| 4 | + |
| 5 | +Let us now briefly compare the use cases of MPI with those of MapReduce with Hadoop. MPI is the technology of choice if
| 6 | + |
| 7 | +- communication is expensive and the bottleneck of our application, |
| 8 | +- frequent communication is required between processes solving related sub-problems, |
| 9 | +- the available hardware is homogeneous (and we can use an MPI implementation optimized for it),
| 10 | +- processes need to be organized in groups or topological structures to make efficient use of collective communication to achieve high performance, |
| 11 | +- the amount of data that needs to be transmitted is small in comparison to the runtime of the computations, and
| 12 | +- we do not need to worry much about exchanging data with a heterogeneous distributed application environment. |
| 13 | + |
| 14 | +Hadoop, on the other hand, covers use cases where |
| 15 | + |
| 16 | +- communication is not the bottleneck, because computation takes much longer than communication (think Machine Learning),
| 17 | +- the environment is heterogeneous,
| 18 | +- processes do not need to be organized in a special way,
| 19 | +- the division of tasks into sub-problems can be done efficiently by just slicing the input data into equal-sized pieces,
| 20 | +- sub-problems have batch job character,
| 21 | +- data is unstructured (e.g., text) and potentially huge (eating away the advantages of MPI-style communication), or
| 22 | +- data comes from, and results must be pushed back to, other applications in the environment, say HTTP/Java Servlet/Web Service stacks.
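
The slicing and batch-job character described in the list above can be illustrated with a minimal sketch in plain Java. It deliberately does not use the Hadoop API: the class `MapReduceSketch` and its methods `slice`, `map`, `reduce`, and `wordCount` are hypothetical names chosen for this illustration, and word counting is only an assumed sample task. A local parallel stream stands in for the cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration of the MapReduce idea. This is NOT Hadoop code;
// all names here are hypothetical and chosen only for this sketch.
public class MapReduceSketch {

    // Slice the input into at most `pieces` chunks of (nearly) equal size.
    public static <T> List<List<T>> slice(List<T> input, int pieces) {
        if (pieces < 1) {
            throw new IllegalArgumentException("pieces must be >= 1");
        }
        List<List<T>> chunks = new ArrayList<>();
        int size = (input.size() + pieces - 1) / pieces; // ceiling division
        for (int start = 0; start < input.size(); start += size) {
            chunks.add(input.subList(start, Math.min(start + size, input.size())));
        }
        return chunks;
    }

    // Map step: turn one chunk of text lines into partial word counts.
    // (Counting locally per chunk also plays the role of a combiner.)
    public static Map<String, Integer> map(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Reduce step: sum the partial counts of all chunks, word by word.
    public static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    // Each chunk is an independent batch job, so the chunks can be
    // processed in parallel without talking to each other.
    public static Map<String, Integer> wordCount(List<String> lines, int pieces) {
        List<Map<String, Integer>> partials = slice(lines, pieces)
                .parallelStream()
                .map(MapReduceSketch::map)
                .collect(Collectors.toList());
        return reduce(partials);
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "to be or not to be",
                "that is the question",
                "to do or not to do");
        System.out.println(wordCount(lines, 2));
    }
}
```

The point of the sketch is that the map step needs no communication between chunks. Only the partial results travel to the reduce step, which is why slicing the input into equal-sized pieces is enough to distribute the work.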
4 | 23 |
|
5 | 24 | ## 1. Examples
|
6 | 25 |
|
|