This project demonstrates how to set up a Hadoop environment and implement a simple WordCount application using Java and Hadoop's MapReduce framework. The purpose of the WordCount application is to count the occurrences of each word in a given text file.
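For example, given a file containing the single line "to be or not to be", the job outputs each distinct word with its count: be 2, not 1, or 1, to 2 (output keys come out sorted). The walkthrough uses the following setup: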
- Xubuntu (or any Ubuntu or Linux-based OS) installed on VirtualBox
- OpenJDK 8
- Hadoop 3.3.6
- Maven
- IntelliJ IDEA
Run the following command to update the package index:
sudo apt update
Proceed to install OpenJDK 8, the open-source implementation of the Java Platform:
sudo apt install openjdk-8-jdk
When prompted for Yes/No, press Y to allow the installation to proceed. Verify it with:
java -version
If the installation was successful, this command displays the OpenJDK version.
Hadoop needs to know where Java is installed, so I need to set the JAVA_HOME environment variable. Find the Java path using the following command:
dirname $(dirname $(readlink -f $(which java)))
Edit .bashrc:
nano ~/.bashrc
Add the following and save it:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Now, reload the settings:
source ~/.bashrc
Next, download Hadoop 3.3.6 from the Apache mirror:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
After the download completed, I unzipped the archive:
tar -xvzf hadoop-3.3.6.tar.gz
To make things easier to navigate, I renamed the folder from hadoop-3.3.6 to just hadoop:
mv hadoop-3.3.6 hadoop
Now let's set up the environment variables.
Edit .bashrc:
nano ~/.bashrc
Add the following and save it:
export HADOOP_HOME=~/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Reload the settings:
source ~/.bashrc
This applied the new environment settings for Hadoop.
Now that Hadoop was installed, I needed to configure several important files in the ~/hadoop/etc/hadoop/ directory so that HDFS, YARN, and MapReduce work properly. Start with core-site.xml:
nano ~/hadoop/etc/hadoop/core-site.xml
Add the following inside the <configuration> tags and save it:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
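fs.defaultFS tells every Hadoop client which filesystem to talk to. As an illustration (not part of the original setup), here is a minimal Java sketch that connects to that same address and lists the HDFS root; it assumes the cluster is already running (covered later in this guide) and that the hadoop-client library is on the classpath. The class name HdfsCheck is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper class, not part of this guide's project
public class HdfsCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the same address as fs.defaultFS in core-site.xml
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        try (FileSystem fs = FileSystem.get(conf)) {
            // Print whatever currently sits in the HDFS root
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```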
Next, configure hdfs-site.xml to set up HDFS (the Hadoop Distributed File System):
nano ~/hadoop/etc/hadoop/hdfs-site.xml
Add these properties inside the <configuration> tags and save it:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/akb/hadoop/data/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/akb/hadoop/data/datanode</value>
</property>
Now create the directories for the NameNode and DataNode:
mkdir -p ~/hadoop/data/namenode
mkdir -p ~/hadoop/data/datanode
Next, configure the MapReduce framework in mapred-site.xml:
nano ~/hadoop/etc/hadoop/mapred-site.xml
Add and save the following properties:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
This tells Hadoop to run MapReduce jobs on YARN. Now configure YARN, Hadoop's resource manager:
nano ~/hadoop/etc/hadoop/yarn-site.xml
Add and save the following:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
Hadoop requires SSH to communicate between nodes, so I needed to set up passwordless SSH on my virtual machine:
ssh-keygen -t rsa
Press Enter at each prompt until the SSH key is generated. Then authorize the key:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys
Test the passwordless login:
ssh localhost
Before starting Hadoop, I needed to format the NameNode:
hdfs namenode -format
Now that everything was set up, I proceeded to start Hadoop:
start-dfs.sh
start-yarn.sh
Or start everything at once:
start-all.sh
To make sure everything was running, I used the jps command:
jps
This command listed running Hadoop services like NameNode, DataNode, ResourceManager, and NodeManager.
Stop all services at once using:
stop-all.sh
Or stop them separately:
stop-dfs.sh
stop-yarn.sh
Finally, to access the Hadoop web interface and check the status of my cluster, I opened a web browser and went to:
http://localhost:9870
This brought up the Hadoop cluster summary page, confirming that everything was working as expected.
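With the cluster running, the WordCount job itself can be written against Hadoop's org.apache.hadoop.mapreduce API. The sketch below follows the classic WordCount from the Hadoop documentation; the class name and file layout are illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Assuming the project is built with Maven against a hadoop-client 3.3.6 dependency and packaged as wordcount.jar (an illustrative name, as is input.txt), a typical run copies a text file into HDFS and launches the job:
hdfs dfs -mkdir -p /input
hdfs dfs -put input.txt /input
hadoop jar wordcount.jar WordCount /input /output
hdfs dfs -cat /output/part-r-00000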
By following this guide, you can successfully set up and configure Java and Hadoop, create a WordCount project, and run it in a pseudo-distributed (single-node) environment. The WordCount example demonstrates the basics of Hadoop's MapReduce framework.
