Hadoop Installation Document - Standalone Mode

This document describes my experience following the Apache document titled “Hadoop: Setting up a Single Node Cluster” [1], which targets Hadoop version 3.0.0-Alpha2 [2].

A. Prepare the guest environment

  1. Install VirtualBox.
  2. Create a virtual 64-bit Linux machine. Name it “ubuntul_hadoop_master”. Give it 500 MB of memory. Create a VMDK disk which is dynamically allocated up to 30 GB.
  3. In network settings, on the first tab you should see Adapter 1 enabled and attached to “NAT”. On the second tab, enable Adapter 2 and attach it to “Host-only Adapter”. The first adapter is required for internet access; the second lets the host and other machines connect to services running on the guest.
  4. In storage settings, attach a Linux ISO file to the IDE channel. Use any distribution you like; because of its small installation size, I chose the minimal Ubuntu ISO [1]. In the package selection menu, I left only the standard packages selected.
  5. Log in to the system. 
  6. Set up the JDK.

  7.   $ sudo apt-get install openjdk-8-jdk
    

  8. Install ssh and pdsh, if not already installed.

  9.   $ sudo apt-get install ssh
      $ sudo apt-get install pdsh
    

    If you see an “rcmd: socket: Permission denied” error during execution of the commands below, making pdsh default to ssh fixes the problem: 
      $ echo "ssh" | sudo tee /etc/pdsh/rcmd_default
    

B. Install Hadoop

  1. Download Hadoop distribution 3.0.0-Alpha2.

  2.   $ wget ftp://ftp.itu.edu.tr/Mirror/Apache/hadoop/common/hadoop-3.0.0-alpha2/hadoop-3.0.0-alpha2.tar.gz
    

  3. Extract archive file.

  4.   $ tar -xf hadoop-3.0.0-alpha2.tar.gz
    

  5. Now switch to the Hadoop directory and edit the file etc/hadoop/hadoop-env.sh. Find, uncomment and set JAVA_HOME as shown below. This tells Hadoop where the Java installation is. Be careful here: your Java installation folder may be different. 

  6.    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
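
    If you are unsure where the JDK landed, the following command (assuming the openjdk-8-jdk package installed above) resolves the real path of javac; dropping the trailing /bin/javac gives the value for JAVA_HOME.

      $ readlink -f $(which javac)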
    

  7. Try the following command to make sure everything is fine. Help messages should be output.
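
    Per [1], this is simply the hadoop script run with no arguments, which prints its usage documentation:

      $ bin/hadoop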

Per [1], there are three working modes. The first is standalone mode, in which Hadoop runs as a single Java process in a non-distributed fashion, without HDFS or YARN. I won’t try it; you may have a look at [1]. 


C. Pseudo-Distributed Mode


In this mode, each Hadoop daemon runs on a single node. You need to be able to SSH to localhost without entering a password. To check, simply type “ssh localhost”; it should log in without asking for a password. If it does not, configure passphrase-less SSH by running the following commands.


  $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  $ chmod 0600 ~/.ssh/authorized_keys


To make sure the configuration is successful, type “ssh localhost” again; you should log in without being asked for a password.

The following steps will make MapReduce jobs run locally. Since there is a single node in this mode, running jobs locally is appropriate.

  1. Edit the etc/hadoop/core-site.xml file and make sure it has the fs.defaultFS property. Document [1] suggests using localhost, but I use the IP address of the machine. To find the IP address, type “ifconfig”. If you are using VirtualBox with a host-only network, the guest IP will have a form like 192.168.56.*.

    * Using the IP address will ease our job when we clone this virtual machine to set up slave nodes. This way, we won’t need to change the fs.defaultFS value on each node.

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://192.168.56.103:9000</value>
        </property>
    </configuration>


  2. Edit the etc/hadoop/hdfs-site.xml file and make sure it has the dfs.replication property.

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
    

  3. Format the file system. An optional cluster ID can be given as the last argument.

  4.    $ bin/hdfs namenode -format

  5. It is recommended to start the HDFS daemons as a separate user. I create a user named “hdfs” and assign a password to it.

  6.     $ sudo useradd -m hdfs
        $ sudo passwd hdfs
        $ su - hdfs
    

  7. Start the NameNode daemon and the DataNode daemon. The NameNode is the central daemon that manages HDFS metadata; the DataNode stores the actual data blocks.

  8.   $ sbin/start-dfs.sh
    

    To make sure the daemons started successfully, type “jps”; you should see DataNode, NameNode and SecondaryNameNode listed. If your listing is different, check the log files inside the logs directory to learn what went wrong. If you find a file state exception, stop the running daemons with stop-dfs.sh, remove the Hadoop directories under /tmp and re-run the format command from the previous instruction before trying this command again (a recovery sketch is shown after the commands below). Alternatively, instead of the utility script start-dfs.sh, you may start the HDFS processes separately.

      $ bin/hdfs --daemon start namenode
      $ bin/hdfs --daemon start datanode
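
    A minimal recovery sketch for that file state case, assuming the default hadoop.tmp.dir under /tmp (note that formatting erases anything already stored in HDFS):

      $ sbin/stop-dfs.sh
      $ rm -rf /tmp/hadoop-*
      $ bin/hdfs namenode -format
      $ sbin/start-dfs.sh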
    


    The NameNode web interface listens on port 9870 by default.
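
    A quick way to verify it from inside the guest (assuming the default port above; any HTML response means the NameNode web server is up):

      $ curl http://localhost:9870/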


  9. Once HDFS processes are ready, you may run file operations on HDFS.

  10.   $ bin/hdfs dfs -mkdir /tmp
      $ bin/hdfs dfs -ls /
    

  11. You can run MapReduce jobs here, as in the example below.
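
    For example, [1] runs one of the bundled example jobs on files copied into HDFS. The sketch below assumes you are logged in as the hdfs user and that the examples jar matches the 3.0.0-alpha2 distribution; the output directory must not exist beforehand.

      $ bin/hdfs dfs -mkdir -p /user/hdfs/input
      $ bin/hdfs dfs -put etc/hadoop/*.xml input
      $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha2.jar grep input output 'dfs[a-z.]+'
      $ bin/hdfs dfs -cat output/*
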
  12. To stop HDFS, run the following command.

  13.   $ sbin/stop-dfs.sh
    

    Alternatively, you may stop HDFS processes separately.

      $ bin/hdfs --daemon stop namenode
      $ bin/hdfs --daemon stop datanode


D. Running Pseudo-Distributed Mode on YARN


It is possible to run MapReduce jobs on YARN in pseudo-distributed mode. You need to start the ResourceManager and NodeManager daemons after starting the HDFS daemons, as explained above.
Document [1] prescribes some XML configuration, but in my experience it is not mandatory. To keep the configuration minimal, I skip it.
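
For reference, the core of the configuration that [1] prescribes is roughly the two properties below (the exact set may differ slightly between releases): mapreduce.framework.name makes MapReduce submit jobs to YARN, and yarn.nodemanager.aux-services enables the shuffle service. Without the former, jobs submitted with bin/hadoop keep using the local job runner even while YARN is running.

    <!-- etc/hadoop/mapred-site.xml -->
    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>

    <!-- etc/hadoop/yarn-site.xml -->
    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    </configuration>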


  1. It is recommended to start the YARN daemons as a separate user. I create a user named “yarn” and assign a password to it.

  2.   $ sudo useradd -m yarn
      $ sudo passwd yarn
      $ su - yarn
    

  3. Start NodeManager and ResourceManager daemons.

  4.   $ sbin/start-yarn.sh
    

    The output of the jps command should now contain 5 processes: NameNode, SecondaryNameNode, DataNode, ResourceManager and NodeManager. If any process is missing, check the logs and all previous steps.
    Alternatively, instead of the utility script start-yarn.sh, you may start the YARN processes separately.

      $ bin/yarn --daemon start resourcemanager
      $ bin/yarn --daemon start nodemanager
    

    The ResourceManager web interface can be accessed at port 8088.

  5. You can run MapReduce jobs, as in the example below.
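
    As a quick smoke test you can reuse the examples jar from before (the jar name assumes the 3.0.0-alpha2 distribution; note that without the mapred-site.xml setting mentioned above, the job still runs with the local runner rather than on YARN):

      $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha2.jar pi 2 5
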
  6. To stop Hadoop, first shut down the HDFS daemons as explained above, then stop the YARN daemons using the utility script.

  7.   $ sbin/stop-yarn.sh
    

    Alternatively, the YARN daemons can be stopped separately.

      $ bin/yarn --daemon stop resourcemanager
      $ bin/yarn --daemon stop nodemanager
    



    For completeness, the full command sequence is listed below. It assumes HDFS has already been formatted.

      $ su - hdfs
      $ sbin/start-dfs.sh
      $ su - yarn
      $ sbin/start-yarn.sh

      # Run MapReduce jobs here

      $ su - hdfs
      $ sbin/stop-dfs.sh
      $ su - yarn
      $ sbin/stop-yarn.sh
    

E. Links
