This recipe is based on the guide by Jonas Widrikson. He did a fantastic job, but left out some details which I will attempt to fill in. If you want a more in-depth description of what Hadoop is and how it works, be sure to read his blog.
This recipe assumes you have a basic knowledge of Unix/Linux and that you know how to set up your Pi so that it is accessible over SSH (PuTTY).
Throughout this recipe I will refer to Pi's as either the Master or a Slave. The Master Pi is responsible for maintaining the HDFS (Hadoop Distributed File System) and for sending out work to the other Pi's (the Slaves).
We will start by configuring one Pi; you can then clone the SD card and use it on the other Pi's.
First, give the Pi a static IP address. In /etc/network/interfaces, change the line

iface eth0 inet dhcp

to:

iface eth0 inet static
address 192.168.1.XXX
netmask 255.255.255.0
gateway 192.168.1.1
where XXX gives this Pi a unique address across all the computers on your network. I used 123.
Then add an entry in /etc/hosts that allows you to easily reference this address:
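For example, if you used 123 and want to refer to this Pi as node1 (the name used in the Hadoop configuration files later on), the entry would look something like this (the exact address depends on what you chose above):

192.168.1.123    node1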
After this is done, restart the Pi.
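One way to do this from the shell:

sudo reboot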
Check which Java version is installed:

java -version
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) Client VM (build 25.0-b70, mixed mode)
Create a dedicated group and user for Hadoop, and give the new user sudo rights:

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
The 3 Pi's have to be able to talk to each other without using passwords. You do this by creating an SSH RSA key pair with an empty passphrase.
su hduser
mkdir ~/.ssh
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
You can check to see if it worked by doing this:
su hduser
ssh localhost
If you don't get any errors, you are good to go. If it prompts you to trust the key, type 'yes'.
Make sure that you close this shell and that you are back in the pi/root shell.
Download Hadoop 1.2.1, unpack it into /opt and hand it over to the hduser account:

cd ~/
wget http://apache.mirrors.spacedump.net/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
sudo mkdir /opt
sudo tar -xvzf hadoop-1.2.1.tar.gz -C /opt/
cd /opt
sudo mv hadoop-1.2.1 hadoop
sudo chown -R hduser:hadoop hadoop
Add the following lines to the end of the /etc/bash.bashrc file:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
We need to make sure that we can run the hadoop command outside its binary folder (/opt/hadoop/bin):
exit
su hduser
hadoop version

hduser@node1 /home/hduser $ hadoop version
Hadoop 1.2.1
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152
Compiled by mattf on Mon Jul 22 15:23:09 PDT 2013
From source with checksum 6923c86528809c4e7e6f493b6b413a9a
This command was run using /opt/hadoop/hadoop-core-1.2.1.jar
Now that we have Hadoop installed, we need to configure the environment to run on the Pi correctly.
As root (or using sudo), edit /opt/hadoop/conf/hadoop-env.sh so that it contains:
# The java implementation to use. Required.
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=250

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS -client"
Note 1: If you forget to add the -client option to HADOOP_DATANODE_OPTS you will get the following error message in hadoop-hduser-datanode-node1.out:
Error occurred during initialization of VM
Server VM is only supported on ARMv7+ VFP
Note 2: If you run SSH on a different port than 22 then you need to change the following parameter:
# Extra ssh options. Empty by default.
# export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"
export HADOOP_SSH_OPTS="-p <YOUR_PORT>"
Or you will get the error:
connect to host localhost port 22: Address family not supported by protocol
In /opt/hadoop/conf edit the following configuration files:
This file (core-site.xml) is used to configure where Hadoop stores its temporary files and what the address and port of the HDFS namenode are. Note that we use the hostname node1 here, which makes it easy to clone this configuration onto the slaves later.
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hdfs/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node1:54310</value>
  </property>
</configuration>
This file (mapred-site.xml) is used to configure the MapReduce job tracker. This is the part of Hadoop responsible for actually scheduling and running the work.
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>node1:54311</value>
  </property>
</configuration>
Finally, hdfs-site.xml sets the HDFS replication factor; with only a single node for now, 1 is enough:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Now that we have everything setup, it's time to boot up the distributed file system!
sudo mkdir -p /hdfs/tmp
sudo chown hduser:hadoop /hdfs/tmp
sudo chmod 750 /hdfs/tmp
hadoop namenode -format
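Formatting the namenode does not start any of the Hadoop services by itself. Hadoop 1.2.1 ships start scripts in its bin directory, so (as hduser) the cluster can presumably be started with:

/opt/hadoop/bin/start-all.sh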
If you run the jps command, you should see something like this:
jps
16640 JobTracker
16832 Jps
16307 NameNode
16550 SecondaryNameNode
16761 TaskTracker
16426 DataNode
If you cannot see all of the processes above, review the log files in /opt/hadoop/logs to find the source of the problem.
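For example, to inspect the DataNode log (log file names follow the pattern hadoop-<user>-<daemon>-<hostname>, so this assumes the hostname node1):

less /opt/hadoop/logs/hadoop-hduser-datanode-node1.log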
Copy a file into HDFS, run the wordcount example on it, and copy the result back to the local file system:
hadoop dfs -copyFromLocal /opt/hadoop/LICENSE.txt /license.txt
hadoop jar /opt/hadoop/hadoop-examples-1.2.1.jar wordcount /license.txt /license-out.txt
hadoop dfs -copyToLocal /license-out.txt ~/
If you open ~/license-out.txt/part-r-00000, you'll see the word count for the license.txt file.
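For example:

head ~/license-out.txt/part-r-00000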
To be continued!