
    How To Install and Configure Hadoop on CentOS/RHEL 8

By Hitesh Jethva | Published: February 11, 2020 | Updated: February 12, 2020 | 7 min read

Hadoop is a free, open-source, Java-based software framework for storing and processing large datasets on clusters of machines. It stores data in HDFS and processes it with MapReduce, and it sits at the center of an ecosystem of Big Data tools widely used for data mining and machine learning. Hadoop has four major components: Hadoop Common, HDFS, YARN, and MapReduce.


    In this guide, we will explain how to install Apache Hadoop on RHEL/CentOS 8.

    Step 1 – Disable SELinux

Before starting, disable SELinux on your system so that it does not interfere with the Hadoop setup. (On a production system, consider configuring the necessary SELinux policies instead of disabling it outright.)

    To disable SELinux, open the /etc/selinux/config file:

    nano /etc/selinux/config
    

Find the SELINUX directive and change its value to disabled:

    SELINUX=disabled
    

    Save the file when you are finished. Next, restart your system to apply the SELinux changes.
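If you prefer not to open an editor or reboot immediately, the following commands make the same change and put SELinux into permissive mode for the current session (a quick sketch; it assumes the file currently reads SELINUX=enforcing):

sed -i 's/^SELINUX=enforcing$/SELINUX=disabled/' /etc/selinux/config
setenforce 0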

    Step 2 – Install Java

Hadoop is written in Java, and the Hadoop 3.2 release series supports only Java 8. You can install OpenJDK 8 and ant using the dnf command as shown below:

    dnf install java-1.8.0-openjdk ant -y
    

    Once installed, verify the installed version of Java with the following command:

    java -version
    

    You should get the following output:

    openjdk version "1.8.0_232"
    OpenJDK Runtime Environment (build 1.8.0_232-b09)
    OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
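The exact OpenJDK build, and therefore its installation path under /usr/lib/jvm, may differ on your system. One way to locate the path (you will need it later for the JAVA_HOME variable) is:

readlink -f $(which java) | sed 's:/bin/java::'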
    

    Step 3 – Create a Hadoop User

    It is a good idea to create a separate user to run Hadoop for security reasons.

Run the following command to create a new user named hadoop:

    useradd hadoop
    

    Next, set the password for this user with the following command:

    passwd hadoop
    

    Provide and confirm the new password as shown below:

    Changing password for user hadoop.
    New password: 
    Retype new password: 
    passwd: all authentication tokens updated successfully.
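
Optionally, if you want the hadoop user to be able to run administrative commands (such as firewall-cmd later in this guide) through sudo, add it to the wheel group; this is a convenience only and is not required for the steps below:

usermod -aG wheel hadoop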
    

    Step 4 – Configure SSH Key-based Authentication

    Next, you will need to configure passwordless SSH authentication for the local system.

    First, change the user to hadoop with the following command:

    su - hadoop
    

    Next, run the following command to generate Public and Private Key Pairs:

    ssh-keygen -t rsa
    

You will be prompted for a file name and an optional passphrase. Just press Enter to accept the defaults and complete the process:

    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
    Created directory '/home/hadoop/.ssh'.
    Enter passphrase (empty for no passphrase): 
    Enter same passphrase again: 
    Your identification has been saved in /home/hadoop/.ssh/id_rsa.
    Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
    The key fingerprint is:
    SHA256:a/og+N3cNBssyE1ulKK95gys0POOC0dvj+Yh1dfZpf8 [email protected]
    The key's randomart image is:
    +---[RSA 2048]----+
    |                 |
    |                 |
    |              .  |
    |     .   o o o   |
    |  . . o S o o    |
    | o = + O o   .   |
    |o * O = B =   .  |
    | + O.O.O + +   . |
    |  +=*oB.+ o     E|
    +----[SHA256]-----+
    

Next, append the generated public key from id_rsa.pub to authorized_keys and set the proper permissions:

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 640 ~/.ssh/authorized_keys
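
ssh-keygen normally creates ~/.ssh with the correct permissions, but if you copied the keys from another machine you can enforce them explicitly (an optional, defensive step):

chmod 700 ~/.ssh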
    

    Next, verify the passwordless SSH authentication with the following command:

    ssh localhost
    

The first time you connect, you will be asked to confirm the host key and add it to the list of known hosts. Type yes and hit Enter to continue:

    The authenticity of host 'localhost (::1)' can't be established.
    ECDSA key fingerprint is SHA256:0YR1kDGu44AKg43PHn2gEnUzSvRjBBPjAT3Bwrdr3mw.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
    Activate the web console with: systemctl enable --now cockpit.socket
    
    Last login: Sat Feb  1 02:48:55 2020
    [[email protected] ~]$ 
    

    Step 5 – Install Hadoop

    First, change the user to hadoop with the following command:

    su - hadoop
    

Next, download Hadoop 3.2.1 (the latest stable release at the time of writing) using the wget command:

    wget http://apachemirror.wuchna.com/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
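
Mirrors rotate older releases out over time; if the mirror above is unavailable, the same release is kept on the Apache archive at https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz. Optionally, verify the download by computing its checksum and comparing it against the value published alongside the release:

sha512sum hadoop-3.2.1.tar.gz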
    

    Once downloaded, extract the downloaded file:

    tar -xvzf hadoop-3.2.1.tar.gz
    

    Next, rename the extracted directory to hadoop:

    mv hadoop-3.2.1 hadoop
    

    Next, you will need to configure Hadoop and Java Environment Variables on your system.

    Open the ~/.bashrc file in your favorite text editor:

    nano ~/.bashrc
    

    Append the following lines:

    export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk-1.8.0.232.b09-2.el8_1.x86_64/
    export HADOOP_HOME=/home/hadoop/hadoop
    export HADOOP_INSTALL=$HADOOP_HOME
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export HADOOP_YARN_HOME=$HADOOP_HOME
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
    

    Save and close the file. Then, activate the environment variables with the following command:

    source ~/.bashrc
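
A quick way to confirm the variables are loaded is to print HADOOP_HOME and run the hadoop binary, which should now be on your PATH:

echo $HADOOP_HOME
hadoop version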
    

    Next, open the Hadoop environment variable file:

    nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
    

    Update the JAVA_HOME variable as per your Java installation path:

    export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk-1.8.0.232.b09-2.el8_1.x86_64/
    

    Save and close the file when you are finished.

    Step 6 – Configure Hadoop

First, create the namenode and datanode directories inside the hadoop user's home directory. Run the following commands to create both directories:

    mkdir -p ~/hadoopdata/hdfs/namenode
    mkdir -p ~/hadoopdata/hdfs/datanode
    

    Next, edit the core-site.xml file and update with your system hostname:

    nano $HADOOP_HOME/etc/hadoop/core-site.xml
    

Update the fs.defaultFS value with your system hostname:

    <configuration>
            <property>
                    <name>fs.defaultFS</name>
                    <value>hdfs://hadoop.tecadmin.com:9000</value>
            </property>
    </configuration>
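
Whatever hostname you use in fs.defaultFS must resolve to your machine. If it is not in DNS, add it to /etc/hosts (as root); the IP address and hostname below are only examples, substitute your own:

192.168.1.100   hadoop.tecadmin.com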

    Save and close the file. Then, edit the hdfs-site.xml file:

    nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
    

    Change the NameNode and DataNode directory path as shown below:

<configuration>

        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>

        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
        </property>

        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
        </property>
</configuration>

    Save and close the file. Then, edit the mapred-site.xml file:

    nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
    

    Make the following changes:

    <configuration>
            <property>
                    <name>mapreduce.framework.name</name>
                    <value>yarn</value>
            </property>
    </configuration>
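
This minimal mapred-site.xml is enough to start the cluster. If you later submit MapReduce jobs on Hadoop 3.x, you will typically also need to tell MapReduce where its libraries live, for example by adding a property along these lines (adapted from the Apache single-node setup guide; verify it against the documentation for your version):

<property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>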

    Save and close the file. Then, edit the yarn-site.xml file:

    nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
    

    Make the following changes:

    <configuration>
            <property>
                    <name>yarn.nodemanager.aux-services</name>
                    <value>mapreduce_shuffle</value>
            </property>
    </configuration>

    Save and close the file when you are finished.
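Before moving on, you can optionally sanity-check that Hadoop is reading your configuration by querying a couple of keys with the hdfs getconf helper:

hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.replication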

    Step 7 – Start Hadoop Cluster

Before starting the Hadoop cluster, you need to format the NameNode. Do this as the hadoop user.

    Run the following command to format the hadoop Namenode:

    hdfs namenode -format
    

    You should get the following output:

    2020-02-05 03:10:40,380 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
    2020-02-05 03:10:40,389 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
    2020-02-05 03:10:40,389 INFO namenode.NameNode: SHUTDOWN_MSG: 
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at hadoop.tecadmin.com/45.58.38.202
    ************************************************************/
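
If instead the shell reported bash: hdfs: command not found, you are most likely not running as the hadoop user, or the environment variables from Step 5 are not loaded in the current shell; switch user and reload them, then rerun the format command:

su - hadoop
source ~/.bashrc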
    

After formatting the NameNode, run the following command to start the Hadoop cluster:

    start-dfs.sh
    

Once HDFS has started successfully, you should get the following output:

    Starting namenodes on [hadoop.tecadmin.com]
    hadoop.tecadmin.com: Warning: Permanently added 'hadoop.tecadmin.com,fe80::200:2dff:fe3a:26ca%eth0' (ECDSA) to the list of known hosts.
    Starting datanodes
    Starting secondary namenodes [hadoop.tecadmin.com]
    

    Next, start the YARN service as shown below:

    start-yarn.sh
    

    You should get the following output:

    Starting resourcemanager
    Starting nodemanagers
    

    You can now check the status of all Hadoop services using the jps command:

    jps
    

    You should see all the running services in the following output:

    7987 DataNode
    9606 Jps
    8183 SecondaryNameNode
    8570 NodeManager
    8445 ResourceManager
    7870 NameNode
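
If jps itself is not found, note that it ships with the JDK development package rather than the JRE; installing it (as root) provides the command:

dnf install java-1.8.0-openjdk-devel -y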
    

    Step 8 – Configure Firewall

Hadoop is now running and listening on ports 9870 (NameNode web UI) and 8088 (ResourceManager web UI). Next, you will need to allow these ports through the firewall.

    Run the following command to allow Hadoop connections through the firewall:

    firewall-cmd --permanent --add-port=9870/tcp
    firewall-cmd --permanent --add-port=8088/tcp
    

    Next, reload the firewalld service to apply the changes:

    firewall-cmd --reload
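
You can confirm the ports are open with:

firewall-cmd --list-ports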
    

    Step 9 – Access Hadoop Namenode and Resource Manager

To access the NameNode web interface, open your web browser and visit http://your-server-ip:9870. You should see the NameNode status page.

To access the ResourceManager web interface, open your web browser and visit http://your-server-ip:8088. You should see the YARN cluster overview page.

    Step 10 – Verify the Hadoop Cluster

At this point, the Hadoop cluster is installed and configured. Next, we will create some directories in the HDFS filesystem to test it.

Create a couple of directories in HDFS with the following commands:

    hdfs dfs -mkdir /test1
    hdfs dfs -mkdir /test2
    

Next, run the following command to list the directories you just created:

    hdfs dfs -ls /
    

    You should get the following output:

    Found 2 items
    drwxr-xr-x   - hadoop supergroup          0 2020-02-05 03:25 /test1
    drwxr-xr-x   - hadoop supergroup          0 2020-02-05 03:35 /test2
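
As an additional check, you can copy a local file into one of the new directories and read it back (the file used here is just an example):

hdfs dfs -put /etc/hosts /test1/
hdfs dfs -cat /test1/hosts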
    

You can also verify these directories in the Hadoop NameNode web interface: open the NameNode UI, click Utilities => Browse the file system, and you should see the directories you created above.

    Step 11 – Stop Hadoop Cluster

You can stop the Hadoop HDFS and YARN services at any time by running the stop-dfs.sh and stop-yarn.sh scripts as the hadoop user.

To stop the HDFS services (NameNode, DataNode and SecondaryNameNode), run the following command as the hadoop user:

    stop-dfs.sh 
    

To stop the YARN services (ResourceManager and NodeManager), run the following command:

    stop-yarn.sh
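
Alternatively, the stop-all.sh script in $HADOOP_HOME/sbin stops both HDFS and YARN in one go (it is marked as deprecated in recent releases but still works):

stop-all.sh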
    

    Conclusion

In this tutorial, you learned how to set up a single-node Hadoop cluster on CentOS/RHEL 8. You should now have enough background to take the next step toward a multi-node, production-grade deployment.


