
Commit 17ed262

Slightly Improved Hadoop Word Count Example and Documentation

1 parent d9b2052 commit 17ed262

3 files changed: +12 −22 lines changed

‎hadoop/README.md

Lines changed: 3 additions & 1 deletion

    @@ -20,9 +20,11 @@ If a word occurs multiple times in a line, one token is emitted for each occurrence
     
         <Text (the word), Iterable<WriteableInteger> (number of occurrences)>
     
    -The reducer also acts as combiner, meaning that a reduction step is also performed locally before the results of each mapper are sent to the central reduction step. It will add up all the occurrences to a single number and thus emit tuples of the form
    +Hadoop has put all the `WriteableInteger` values generated by the mapping step which belong to the same `Text (the word)` key into an `Iterable` list for us. Thus, for each word that the mapper has discovered, we get a list of numbers. All we have to do is add them up and emit tuples of the form:
     
         <Text (the word), WriteableInteger (total number of occurrences)>
    +
    +The reducer here also acts as a combiner, meaning that a reduction step is also performed locally before the results of each mapper are sent to the central reduction step. This way we can already add up some word counts locally, and the amount of data that needs to be sent to the central reducer decreases, as two tuples for the same word are already merged. This is possible in this simple form because the output of the reducer has the same type as the output of the mapper, except that the `WriteableInteger` part will not necessarily have the value `1` afterwards.
     
     After the reduction step, we therefore know how often each word occurred in the text. Furthermore, since the tuples are sorted automatically before reduction, the word/occurrences list is also nicely sorted alphabetically.
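The reduce and combine logic described in the README is simple enough to sketch without Hadoop. The following self-contained Java snippet (the class `WordCountSketch` and its method names are illustrative, not part of this repository) sums one count per token, and a per-key merge step mirrors how the combiner pre-aggregates partial counts before they reach the central reducer:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A Hadoop-free sketch of the reduce/combine logic described above.
// Class and method names are illustrative and not part of this repository.
public class WordCountSketch {

    // The mapper conceptually emits one count of 1 per token; the reducer
    // sums all counts sharing the same word key. Here both collapse into a
    // single pass over the tokens.
    public static Map<String, Integer> sumCounts(final List<String> tokens) {
        final Map<String, Integer> totals = new HashMap<>();
        for (final String token : tokens) {
            totals.merge(token, 1, Integer::sum); // add this occurrence's 1
        }
        return totals;
    }

    // Mirrors the combiner: partial counts from different mappers can be
    // merged by summing per key, because the reducer's output has the same
    // shape as the mapper's output.
    public static Map<String, Integer> merge(final Map<String, Integer> a,
                                             final Map<String, Integer> b) {
        final Map<String, Integer> out = new HashMap<>(a);
        b.forEach((word, count) -> out.merge(word, count, Integer::sum));
        return out;
    }

    public static void main(final String[] args) {
        final Map<String, Integer> local1 = sumCounts(List.of("the", "cat"));
        final Map<String, Integer> local2 = sumCounts(List.of("sat", "the"));
        // merging the two locally combined results yields the same totals
        // as counting everything in one global pass
        System.out.println(merge(local1, local2));
        System.out.println(sumCounts(List.of("the", "cat", "sat", "the")));
    }
}
```

Because merging partial maps this way gives the same totals as one global `sumCounts`, the reducer can safely double as combiner, which is exactly why the commit registers the same class for both roles.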

‎hadoop/wordCount/pom.xml

Lines changed: 1 addition & 1 deletion

    @@ -36,7 +36,7 @@
     <project.build.sourceEncoding>${encoding}</project.build.sourceEncoding>
     <project.reporting.outputEncoding>${encoding}</project.reporting.outputEncoding>
     <jdk.version>1.7</jdk.version>
    -<project.mainClass>wordCount.WordCountDriver</project.mainClass>
    +<project.mainClass>wordCount.WordCountDriver</project.mainClass>
     </properties>
     
     <licenses>

‎hadoop/wordCount/src/main/java/wordCount/WordCountDriver.java

Lines changed: 8 additions & 20 deletions

    @@ -16,21 +16,17 @@
     public class WordCountDriver extends Configured implements Tool {
     
       public static void main(final String[] args) throws Exception {
    -    try {
    -      final int res = ToolRunner.run(new Configuration(),
    -          new WordCountDriver(), args);
    -      System.exit(res);
    -    } catch (final Exception e) {
    -      e.printStackTrace();
    -      System.exit(255);
    -    }
    +    System.exit(ToolRunner.run(new Configuration(), //
    +        new WordCountDriver(), args));
       }
     
       @Override
       public int run(final String[] args) throws Exception {
    +    final Configuration conf;
    +    final Job job;
     
    -    final Configuration conf = new Configuration();
    -    final Job job = Job.getInstance(conf, "Your job name");
    +    conf = new Configuration();
    +    job = Job.getInstance(conf, "Word Count Map-Reduce");
     
         job.setJarByClass(WordCountDriver.class);
    
    @@ -39,25 +35,17 @@ public int run(final String[] args) throws Exception {
         }
     
         job.setMapperClass(WordCountMapper.class);
    -
    -    // job.setMapOutputKeyClass(Text.class);
    -    // job.setMapOutputValueClass(IntWritable.class);
    -
         job.setReducerClass(WordCountReducer.class);
         job.setCombinerClass(WordCountReducer.class);
     
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);
     
         job.setInputFormatClass(TextInputFormat.class);
    -
         job.setOutputFormatClass(TextOutputFormat.class);
     
    -    final Path filePath = new Path(args[0]);
    -    FileInputFormat.setInputPaths(job, filePath);
    -
    -    final Path outputPath = new Path(args[1]);
    -    FileOutputFormat.setOutputPath(job, outputPath);
    +    FileInputFormat.setInputPaths(job, new Path(args[0]));
    +    FileOutputFormat.setOutputPath(job, new Path(args[1]));
     
         job.waitForCompletion(true);
         return 0;

0 commit comments