Return to Answer

replaced http://stackoverflow.com/ with https://stackoverflow.com/

edited May 23, 2017 at 12:41

See also: - Why not use Double or Float to represent currency? Why not use Double or Float to represent currency? - Effective Java, 2nd Edition, Item 48: Avoid float and double if exact answers are required

See also: - Why not use Double or Float to represent currency? - Effective Java, 2nd Edition, Item 48: Avoid float and double if exact answers are required

Source Link

answered Mar 3, 2014 at 6:08

palacsint

answered Mar 3, 2014 at 6:08

palacsint

30.3k
9
82
157

I'm not familiar with Hadoop, take my advice with care.

Are you sure that the following is precise enough?

 double count = 0d; // should be an int, but anyway...

Consider the following:

 final double original = 123456789123456789.0;
 double d = original;
 d++;
 System.out.println(d == original); // true
 System.out.printf("%f%n", original); // 123456789123456784,000000
 System.out.printf("%f%n", d); // 123456789123456784,000000

I'd use a long there. I guess it's not a problem here but if I'm right the reducer could suffer from this.

See also: - Why not use Double or Float to represent currency? - Effective Java, 2nd Edition, Item 48: Avoid float and double if exact answers are required

tokens = line.split("\\s+");

The implementation of split was simply this in Java 6:

 public String[] split(String regex, int limit) {
 return Pattern.compile(regex).split(this, limit);
 }

Java 7 contains some fast-paths, but at the end it still calls:

 return Pattern.compile(regex).split(this, limit);

In your case (with the \\s+ pattern) it does not use any fast-path, so it might be worth to store the compiled Pattern instance and call split on that. (I guess it would be faster but the JVM might cache that for you.)

 double count = 0d, max = 0d, min = 0d, sum = 0d, sumSquared = 0d;

I'd put the variable declarations to separate lines. From Code Complete 2nd Edition, p759:

With statements on their own lines, the code reads from top to bottom, instead of top to bottom and left to right. When you’re looking for a specific line of code, your eye should be able to follow the left margin of the code. It shouldn’t have to dip into each and every line just because a single line might contain two statements.

Additionally, max and min is not used (they are only written), you could safely remove them (or might want to print them to the output).

 @Override
 public void reduce(Text key, Iterator<DoubleWritable> values,
 OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
 if (key.equals(new Text("count"))) {

If Text is thread-safe/immutable you could store new Text("count") (and the other ones) in fields instead of constructing them on every call for equals as well as for output.collect():

 output.collect(new Text("sumSquared"), new DoubleWritable(sumSquared));

It looks like that sum and sumSquared have the same implementation in the reducer. Is that a bug? If not you could create a method to eliminate the duplicated logic.
sum, sumSquared, min, max should be constants instead of magic numbers. There are used multiple times.
Declaring variables before their usage with a bigger scope looks microoptimization:

String line;
String[] tokens;
double observation;
while (scanner.hasNext()) {
 line = scanner.nextLine();
 tokens = line.split("\\s+");
 observation = Double.parseDouble(tokens[1]);

It would be readable declaring them inside the loop. (Effective Java, Second Edition, Item 45: Minimize the scope of local variables)

String line;
String[] words;
line = br.readLine();
while (line != null) {
 words = line.split("\\s+");

also could be

 String line = br.readLine();
 while (line != null) {
 final String[] words = line.split("\\s+");

The reduce method contains a lot of similar structures. I'd consider using a function interface with some implementations:

 public interface AggregateFunction {
 void addValue(double value);
 double getResult();
 }
 public class MaxFunction implements AggregateFunction {
 private double max = Double.MAX_VALUE;
 @Override
 public void addValue(final double value) {
 max = Math.max(value, max);
 }
 @Override
 public double getResult() {
 return max;
 }
 }
 public class SumFunction implements AggregateFunction {
 private double sum = 0.0;
 @Override
 public void addValue(final double value) {
 sum += value;
 }
 @Override
 public double getResult() {
 return sum;
 }
 }

The will reduce the Reducer:

 public class Reduce extends MapReduceBase 
 implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
 // assuming that Text is thread-safe/immutable
 private final Text countKey = new Text("count");
 private final Text maxKey = new Text("max");
 private final Text minKey = new Text("min");
 private final Text sumKey = new Text("sum");
 private final Text sumSquaredKey = new Text("sumSquared");
 @Override
 public void reduce(Text key, Iterator<DoubleWritable> values,
 OutputCollector<Text, DoubleWritable> output, 
 Reporter reporter) throws IOException {
 aggregate("count", countKey, values, output, new SumFunction());
 aggregate("max", maxKey, values, output, new MaxFunction());
 aggregate("min", minKey, values, output, new MinFunction());
 aggregate("sum", sumKey, values, output, new SumFunction());
 aggregate("sumSquared", sumSquaredKey, values, output, 
 new SumSquaredFunction());
 }
 }

line = br.readLine();
while (line != null) {
 words = line.split("\\s+");
 switch (words[0]) {
 ...
 }
 line = br.readLine();
}

I'd restructure the loop for better readability:

 while (true) {
 final String line = br.readLine();
 if (line == null) {
 break;
 }
 final String[] words = line.split("\\s+");
 ...
 }

String line;
line = "avg\t" + String.valueOf(avg) + System.lineSeparator();
bw.write(line);
line = "variance\t" + String.valueOf(variance) + System.lineSeparator();
bw.write(line);

The line variable is used for multiple purposes. Using separate variable would make the code easier to read.

 final String averageLine = "avg\t" + String.valueOf(avg) + System.lineSeparator();
 bw.write(averageLine);
 final String varianceLine = "variance\t" + String.valueOf(variance) + System.lineSeparator();
 bw.write(varianceLine);

Anyway, using a PrintWriter would be the best:

 bw.printf("avg\t%d%n", avg);
 bw.printf("variance\t%d%n", variance);

Instead of commenting write a unit test for it:
```
 // keep in alphabetical order or KABOOM!
```

It is much safer, especially if you use continuous integration or run unit test automatically on every build.

There are unused imports:

import org.apache.hadoop.util.*;

It's cleaner omit them. Eclipse could do that for you with a keypress (Ctrl+Shift+O).

public class calcAll {

According to the Code Conventions for the Java Programming Language class names should start with uppercase letters.

I'd try to give a more descriptive name, SomethingStatistics or something similar.

Code Conventions for the Java Programming Language, 9 - Naming Conventions
Effective Java, 2nd edition, Item 56: Adhere to generally accepted naming conventions

Calling the mapper Map is confusing (since there is a java.util.Map too). I'd choose something more descriptive. The same is true for Reduce.

Class names should be nouns, not verbs, like Map or Reduce.

lang-java