See also: - Why not use Double or Float to represent currency? Why not use Double or Float to represent currency? - Effective Java, 2nd Edition, Item 48: Avoid float and double if exact answers are required
See also: - Why not use Double or Float to represent currency? - Effective Java, 2nd Edition, Item 48: Avoid float and double if exact answers are required
See also: - Why not use Double or Float to represent currency? - Effective Java, 2nd Edition, Item 48: Avoid float and double if exact answers are required
I'm not familiar with Hadoop, take my advice with care.
Are you sure that the following is precise enough?
double count = 0d; // should be an int, but anyway...
Consider the following:
final double original = 123456789123456789.0;
double d = original;
d++;
System.out.println(d == original); // true
System.out.printf("%f%n", original); // 123456789123456784,000000
System.out.printf("%f%n", d); // 123456789123456784,000000
I'd use a long
there. I guess it's not a problem here but if I'm right the reducer could suffer from this.
See also: - Why not use Double or Float to represent currency? - Effective Java, 2nd Edition, Item 48: Avoid float and double if exact answers are required
tokens = line.split("\\s+");
The implementation of split
was simply this in Java 6:
public String[] split(String regex, int limit) {
return Pattern.compile(regex).split(this, limit);
}
Java 7 contains some fast-paths, but at the end it still calls:
return Pattern.compile(regex).split(this, limit);
In your case (with the \\s+
pattern) it does not use any fast-path, so it might be worth to store the compiled Pattern
instance and call split
on that. (I guess it would be faster but the JVM might cache that for you.)
1.
double count = 0d, max = 0d, min = 0d, sum = 0d, sumSquared = 0d;
I'd put the variable declarations to separate lines. From Code Complete 2nd Edition, p759:
With statements on their own lines, the code reads from top to bottom, instead of top to bottom and left to right. When you’re looking for a specific line of code, your eye should be able to follow the left margin of the code. It shouldn’t have to dip into each and every line just because a single line might contain two statements.
Additionally, max
and min
is not used (they are only written), you could safely remove them (or might want to print them to the output).
1.
@Override
public void reduce(Text key, Iterator<DoubleWritable> values,
OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
if (key.equals(new Text("count"))) {
If Text
is thread-safe/immutable you could store new Text("count")
(and the other ones) in fields instead of constructing them on every call for equals as well as for output.collect()
:
output.collect(new Text("sumSquared"), new DoubleWritable(sumSquared));
It looks like that
sum
andsumSquared
have the same implementation in the reducer. Is that a bug? If not you could create a method to eliminate the duplicated logic.sum
,sumSquared
,min
,max
should be constants instead of magic numbers. There are used multiple times.Declaring variables before their usage with a bigger scope looks microoptimization:
String line; String[] tokens; double observation; while (scanner.hasNext()) { line = scanner.nextLine(); tokens = line.split("\\s+"); observation = Double.parseDouble(tokens[1]);
It would be readable declaring them inside the loop. (Effective Java, Second Edition, Item 45: Minimize the scope of local variables)
String line; String[] words; line = br.readLine(); while (line != null) { words = line.split("\\s+");
also could be
String line = br.readLine();
while (line != null) {
final String[] words = line.split("\\s+");
The reduce method contains a lot of similar structures. I'd consider using a function interface with some implementations:
public interface AggregateFunction { void addValue(double value); double getResult(); } public class MaxFunction implements AggregateFunction { private double max = Double.MAX_VALUE; @Override public void addValue(final double value) { max = Math.max(value, max); } @Override public double getResult() { return max; } } public class SumFunction implements AggregateFunction { private double sum = 0.0; @Override public void addValue(final double value) { sum += value; } @Override public double getResult() { return sum; } }
The will reduce the Reducer
:
public class Reduce extends MapReduceBase
implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
// assuming that Text is thread-safe/immutable
private final Text countKey = new Text("count");
private final Text maxKey = new Text("max");
private final Text minKey = new Text("min");
private final Text sumKey = new Text("sum");
private final Text sumSquaredKey = new Text("sumSquared");
@Override
public void reduce(Text key, Iterator<DoubleWritable> values,
OutputCollector<Text, DoubleWritable> output,
Reporter reporter) throws IOException {
aggregate("count", countKey, values, output, new SumFunction());
aggregate("max", maxKey, values, output, new MaxFunction());
aggregate("min", minKey, values, output, new MinFunction());
aggregate("sum", sumKey, values, output, new SumFunction());
aggregate("sumSquared", sumSquaredKey, values, output,
new SumSquaredFunction());
}
}
line = br.readLine(); while (line != null) { words = line.split("\\s+"); switch (words[0]) { ... } line = br.readLine(); }
I'd restructure the loop for better readability:
while (true) {
final String line = br.readLine();
if (line == null) {
break;
}
final String[] words = line.split("\\s+");
...
}
String line; line = "avg\t" + String.valueOf(avg) + System.lineSeparator(); bw.write(line); line = "variance\t" + String.valueOf(variance) + System.lineSeparator(); bw.write(line);
The line
variable is used for multiple purposes. Using separate variable would make the code easier to read.
final String averageLine = "avg\t" + String.valueOf(avg) + System.lineSeparator();
bw.write(averageLine);
final String varianceLine = "variance\t" + String.valueOf(variance) + System.lineSeparator();
bw.write(varianceLine);
Anyway, using a PrintWriter
would be the best:
bw.printf("avg\t%d%n", avg);
bw.printf("variance\t%d%n", variance);
Instead of commenting write a unit test for it:
// keep in alphabetical order or KABOOM!
It is much safer, especially if you use continuous integration or run unit test automatically on every build.
- There are unused imports:
import org.apache.hadoop.util.*;
It's cleaner omit them. Eclipse could do that for you with a keypress (Ctrl+Shift+O).
1.
public class calcAll {
According to the Code Conventions for the Java Programming Language class names should start with uppercase letters.
I'd try to give a more descriptive name, SomethingStatistics
or something similar.
- Code Conventions for the Java Programming Language, 9 - Naming Conventions
- Effective Java, 2nd edition, Item 56: Adhere to generally accepted naming conventions
- Calling the mapper
Map
is confusing (since there is ajava.util.Map
too). I'd choose something more descriptive. The same is true forReduce
.
Class names should be nouns, not verbs, like Map
or Reduce
.