WordCount

Different programs which count one-word frequency, one-word frequency using one list as a source with Distributed Cache and double-word frequency in a given file.

Name of the programs

OneWordCount.jar OneWordCount.java & class #This is the driver class OneWordMapper.java & class #This is the mapper class SumReducer.java & class #This is the reducer class
DoubleWordCount.jar DoubleWordCount.java & class #This is the driver class DoubleWordMapper.java & class #This is the mapper class SumReducer.java & class #This is the reducer class
CacheWordCount.jar WordCount.java & class #This is the driver class WordMapper.java & class #This is the mapper class SumReducer.java & class #This is the reducer class

Program Description

One Word Count - This program takes a list of input files and generates files containing words and their respective counts in the given input files.

a. OneWordCount.java This is the driver class. In the driver class, a Job instance in initialized and is configured. Various facets of the job, such as the input/output paths (passed via the command line), key/value types, input/output formats etc. are configured in the Job Configuration. It then calls the waitForCompletion to submit the job and monitor its progress.

b. OneWordMapper.java This class is the mapper class. It receives a key and single line as value in input. The line is then tokenized into words and for each word a key/value pair of the form <word, 1> is set in the Job Context.

c. SumReducer.java This class is the reducer class. It receives a key- a single word and a list of counts for that word. It will then sum up the counts to produce the final output key/value pair as <word, count>

Double Word Count - This program takes a list of input files and generates files containing double words and their respective counts in the given input files.

a. DoubleWordCount.java This program is same as OneWordCount.java which is the driver class for the Double Word Count Job.

b. DoubleWordMapper.java This class is the mapper class for Double Word Count Job. It differs from One Word Mapper in that it outputs a key/value pair of the form <previousWord currentWord, 1>. The previous token is initialized before the while loop to the first word for the line. For every next word in the while loop, we form a new key – ‘previous token space current token’. And this key is set with the value 1 in the context. The current token now becomes the previous token.

c. SumReducer.java This class is similar to One Word Count Job reducer class and it just sums up the counts for the double word pairs.

Distributed Cache Word Count

a. WordCount.java This is the main class. It implements Tool interface to handle custom arguments to the program. The run method is called by the ToolRunner with the command-line arguments. The run method loops through the commandline arguments to find the location for the small word list file. It then sets this file for distributed caching in the job context by calling addCacheFile method on the job instance.

b. WordMapper.java This class is the mapper class. In the setup method of the Mapper, the location of the cached file in the local disk of the task node is fetched using the method getLocalCacheFiles on the context object. The file is read from this location by the parseSearchFile method and a list of words present in the file is stored in an array. The map method then uses this list of words to output count of only those words present in this list.

c. SumReducer.java This is the reducer class and is similar to the One Word Count Job class.

Example commands for the jobs on futuregrid

bin/hadoop jar OneWordCount.jar OneWordCount Data/input Output

bin/hadoop jar DoubleWordCount.jar DoubleWordCount Data/input Output

bin/hadoop jar CacheWordCount.jar CacheWordCount –searchList Data/word-patterns.txt Data/input Output

Results

One Word Count Job – The output of the job is n number of part files where n is the number of reduce tasks. Each file contains a list of <key,value> pairs in sorted order. The format of the <key,value> pair is <word,count>. E.g. character 180. The word character appears a total of 180 times in the 10 input files.
Double Word Count Job - The output of the job is n number of part files where n is the number of reduce tasks. Each file contains a list of <key,value> pairs in sorted order. The format of the <key,value> pair is <word1 word2,count>. Thus it gives the count of double word pairs in the input file. E.g. all with 230. The two words ‘all’and ‘with’ appear together a total of 230 times in the 10 input files.
Distributed Cache Word Count Job - The output of the job is n number of part files where n is the number of reduce tasks. Each file contains a list of <key,value> pairs in sorted order. The format of the <key,value> pair is <word,count>. The output contains only those words which are present in the word-patterns.txt E.g. behold 14990. The word behold from the word-patterns file appears a total of 14990 times in the 10 input files.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

WordCount

Name of the programs

Program Description

Example commands for the jobs on futuregrid

Results

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
Distributed Cache		Distributed Cache
Double Word Count		Double Word Count
One Word Count		One Word Count
README.md		README.md

saili512/WordCount

Folders and files

Latest commit

History

Repository files navigation

WordCount

Name of the programs

Program Description

Example commands for the jobs on futuregrid

Results

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages