WikidataEntitySuggester/SoftwareSetup
Create a directory where you'd like to do all the Entity Suggester setup work.
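For example (the directory name here is just an illustration; any location with enough free disk space will do):
mkdir -p ~/entity-suggester   # hypothetical name, use whatever path suits you
cd ~/entity-suggester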
Tomcat Setup
Let's begin by setting up Tomcat.
wget http://archive.apache.org/dist/tomcat/tomcat-7/v7.0.39/bin/apache-tomcat-7.0.39.tar.gz
tar xzf apache-tomcat-7.0.39.tar.gz
cd apache-tomcat-7.0.39
mkdir ROOT_backup
mv webapps/ROOT* ./ROOT_backup/
rm -rf work
Set some JVM parameters for Tomcat (set the heap according to your available resources; at least 4 GB is recommended. You may need 5-6 GB to support both the Myrrix instances and the REST API, not counting the memory you should keep aside for Hadoop):
echo 'export CATALINA_OPTS="-Xmx6g -XX:NewRatio=12 $CATALINA_OPTS"' > bin/setenv.sh
There, it's ready.
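To make sure everything is in order, you can start Tomcat and check that it responds on port 8080. This is just a quick sanity check; since the ROOT webapp was moved aside above, a 404 response from Tomcat is expected here:
bin/startup.sh
curl -I http://localhost:8080/   # any HTTP response from Tomcat (even a 404) means it is up
bin/shutdown.sh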
Hadoop Setup
There is no dearth of info on Hadoop on the internet; it's easy to find helpful tutorials and beginner's guides on what it is, how to set it up, the works. This section is meant to point you to the right places and get you running. You may take a look at this blog post for a more elaborate version of this.
Hadoop consists of four kinds of services -
- NameNode: Stores metadata for the HDFS. This runs on the master node.
- DataNode: These services store and retrieve the actual data in the HDFS. This service is run on the slave nodes.
- JobTracker: This service runs on the master node; it coordinates and schedules jobs and distributes them across the TaskTracker nodes.
- TaskTracker: Runs the tasks, performs computations and reports its progress to the JobTracker.
Assuming there is only one node with multiple cores (say, 4 cores) in this scenario, here's how to proceed:
Disable iptables if it's enabled, or open up the Hadoop-specific ports.
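If you would rather open specific ports than disable the firewall entirely, something along these lines should work. The exact port list is an assumption on my part; the first two match the RPC ports used in the *-site.xml files below, and the rest are Hadoop's default web UI and DataNode/TaskTracker ports:
# HDFS and JobTracker RPC ports used later in this guide
for port in 54310 54311; do
    sudo iptables -I INPUT -p tcp --dport $port -j ACCEPT
done
# Default web UI and data-transfer ports (adjust if you change the defaults)
for port in 50010 50020 50030 50060 50070 50075; do
    sudo iptables -I INPUT -p tcp --dport $port -j ACCEPT
done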
Add a user for hadoop and its own directory:
sudo useradd -m hadoop
sudo mkdir /hadoop
sudo chown -R hadoop:hadoop /hadoop
Download the CDH3 Hadoop tarball from Cloudera and set it up:
sudo su hadoop
cd /hadoop
wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u6.tar.gz
tar xzf hadoop-0.20.2-cdh3u6.tar.gz
mv hadoop*/* ./
rm *.tar.gz
Enable passwordless SSH authentication for the hadoop user for localhost, so that you can ssh into localhost from the shell without requiring a password. Remember to chmod the contents of ~hadoop/.ssh/ to 600.
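A minimal sketch of what that usually looks like, assuming RSA keys and that the hadoop user has no key pair yet:
ssh-keygen -t rsa                                  # accept the defaults; an empty passphrase allows passwordless login
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/*
ssh localhost                                      # should now log in without prompting for a password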
Edit /hadoop/conf/masters and /hadoop/conf/slaves so that both contain the word "localhost" (without quotes). Properly configure the JAVA_HOME variable in /hadoop/conf/hadoop-env.sh.
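For instance (the JAVA_HOME path below is only an example; point it at whichever JDK is actually installed on your machine):
echo localhost > /hadoop/conf/masters
echo localhost > /hadoop/conf/slaves
# Example path only -- adjust to your JDK installation
echo 'export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64' >> /hadoop/conf/hadoop-env.sh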
Next, modify the *-site.xml files in the /hadoop/conf directory and add these:
/hadoop/conf/core-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
    <!-- By default this is /tmp. Change it to wherever you can find enough space. -->
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node1:54310</value>
  </property>
</configuration>
/hadoop/conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <!-- Read up on this in Hadoop's docs. Basically this means how many nodes to replicate the HDFS data to. You may omit this property if you are using a single-node cluster. -->
    <final>true</final>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
    <final>true</final>
  </property>
</configuration>
/hadoop/conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
    <!-- Run 4 simultaneous reduce tasks for a job -->
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
    <!-- Reuse a JVM for further mappers and reducers rather than spawning a new one. -->
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
    <!-- Max number of mappers to run on a node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
    <!-- Max number of reducers to run on a node -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://node1:54311</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
    <!-- Read up on its details in Hadoop's docs. I used this value on the 8G RAM labs instance for the Entity Suggester. -->
  </property>
  <property>
    <name>mapred.child.ulimit</name>
    <value>5012m</value>
    <!-- Read up on its details in Hadoop's docs. I used this value on the 8G RAM labs instance for the Entity Suggester. -->
  </property>
</configuration>
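Two small things the configuration above implies but does not do for you (both are assumptions about this particular single-node setup): the hadoop.tmp.dir directory must exist and be writable by the hadoop user, and the hostname node1 used in fs.default.name and mapred.job.tracker must resolve to this machine, e.g. via /etc/hosts:
# as the hadoop user: create the tmp dir referenced in core-site.xml
mkdir -p /home/hadoop/tmp
# as a sudoer: make "node1" resolve to this machine (skip if you already use a resolvable hostname)
echo '127.0.0.1 node1' | sudo tee -a /etc/hosts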
Configuration done. Let's format the namenode and fire up the cluster:
sudo su hadoop
cd /hadoop
bin/hadoop namenode -format
# Start the namenode and datanodes
bin/start-dfs.sh
# Check if the DFS has been properly started:
bin/hadoop dfsadmin -report
bin/hadoop dfs -df
# Start the jobtracker and tasktrackers
bin/start-mapred.sh
Check the log files; they are invaluable. Also, after starting up the services, run ps aux | grep java on every node to see if all the services are running correctly. Congratulations, by now you should have your own Hadoop cluster up and running. Run bin/stop-all.sh to shut down the cluster.
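With this tarball layout, the daemon logs end up under /hadoop/logs by default (an assumption about the default log location; adjust if you set HADOOP_LOG_DIR elsewhere), so a quick way to keep an eye on them is:
ls /hadoop/logs/
# Watch the namenode log; the file name follows hadoop-<user>-<daemon>-<hostname>.log
tail -f /hadoop/logs/hadoop-hadoop-namenode-*.log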
Lastly, I would recommend Michael Noll's tutorials on setting up single-node and multi-node clusters with Ubuntu. They do not use this Cloudera distribution, but they are a pretty good resource.