
Installing Hbase / Hadoop on EC2 cluster


HBase ships with a set of EC2 scripts that get you set up and ready to go very quickly. They let you configure the number of ZooKeeper servers as well as slave nodes, but I'm not sure which versions include them. I'm using 0.20.6. After filling in some of your S3/EC2 information, you can do things like:

/usr/local/hbase-0.20.6/contrib/ec2/bin/launch-hbase-cluster CLUSTERNAME SLAVES ZKSERVERS

to quickly start using the cluster. It's nice because it'll install LZO support for you as well.
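As a concrete illustration, here is a small wrapper sketch around that launch script. Only the script path and the CLUSTERNAME SLAVES ZKSERVERS argument order come from the command above; the function name launch_cluster, the DRY_RUN switch and the example values are hypothetical. It defaults to printing the command rather than starting instances.

```shell
#!/bin/sh
# Location of the contrib EC2 scripts (path taken from the command above).
HBASE_EC2_BIN=${HBASE_EC2_BIN:-/usr/local/hbase-0.20.6/contrib/ec2/bin}
# Hypothetical safety switch: 1 = only print the command, 0 = really launch.
DRY_RUN=${DRY_RUN:-1}

launch_cluster() {
    name=$1; slaves=$2; zks=$3
    if [ -z "$name" ]; then
        echo "usage: launch_cluster NAME SLAVES ZKSERVERS" >&2
        return 1
    fi
    # Both node counts must be non-empty and purely numeric.
    for n in "$slaves" "$zks"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "usage: launch_cluster NAME SLAVES ZKSERVERS" >&2
                return 1 ;;
        esac
    done
    if [ "$DRY_RUN" = 1 ]; then
        echo "$HBASE_EC2_BIN/launch-hbase-cluster $name $slaves $zks"
    else
        "$HBASE_EC2_BIN/launch-hbase-cluster" "$name" "$slaves" "$zks"
    fi
}

# Example: a cluster named "testcluster" with 5 slaves and 3 ZooKeeper servers.
launch_cluster testcluster 5 3
```

Run it with DRY_RUN=0 once the printed command looks right.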

Here are some params from the environment file in the bin directory that might be useful (if you want a 0.20.6 AMI):

# The version of HBase to use.
HBASE_VERSION=0.20.6

# The version of Hadoop to use.
HADOOP_VERSION=0.20.2

# The Amazon S3 bucket where the HBase AMI is stored.
# Change this value only if you are creating your own (private) AMI
# so you can store it in a bucket you own.
#S3_BUCKET=apache-hbase-images
S3_BUCKET=720040977164

# Enable public access web interfaces
ENABLE_WEB_PORTS=false

# Extra packages
# Allows you to add a private Yum repo and pull packages from it as your
# instances boot up. Format is <repo-descriptor-URL> <pkg1> ... <pkgN>
# The repository descriptor will be fetched into /etc/yum/repos.d.
EXTRA_PACKAGES=

# Use only c1.xlarge unless you know what you are doing
MASTER_INSTANCE_TYPE=${MASTER_INSTANCE_TYPE:-c1.xlarge}

# Use only c1.xlarge unless you know what you are doing
SLAVE_INSTANCE_TYPE=${SLAVE_INSTANCE_TYPE:-c1.xlarge}

# Use only c1.medium unless you know what you are doing
ZOO_INSTANCE_TYPE=${ZOO_INSTANCE_TYPE:-c1.medium}
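The instance-type lines use the shell's ${VAR:-default} expansion, so a value already set in the environment takes precedence over the default in the file. A minimal sketch of that behaviour follows; m1.large is just an arbitrary value for illustration, not a recommendation.

```shell
#!/bin/sh
# Case 1: nothing set beforehand, so the default from the file applies.
unset SLAVE_INSTANCE_TYPE
SLAVE_INSTANCE_TYPE=${SLAVE_INSTANCE_TYPE:-c1.xlarge}
default_type=$SLAVE_INSTANCE_TYPE
echo "default:  $default_type"

# Case 2: the variable is already set, so the default is ignored.
SLAVE_INSTANCE_TYPE=m1.large
SLAVE_INSTANCE_TYPE=${SLAVE_INSTANCE_TYPE:-c1.xlarge}
override_type=$SLAVE_INSTANCE_TYPE
echo "override: $override_type"
```

Assuming the launch script reads this environment file, exporting SLAVE_INSTANCE_TYPE before launching should change the slave type without editing the file.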

You also might need to set JAVA_HOME yourself if it is not set in the AMI (and I don't think it is). AMIs for newer versions of HBase are probably available as well; just list the available images (describe images, rather than describe instances, since you are looking for AMIs) and grep for hadoop/hbase to narrow the results.
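Both steps can be sketched roughly as follows. This assumes the classic EC2 API command-line tools are installed, and it searches images (ec2-describe-images) rather than instances, since AMIs are images. The JAVA_HOME fallback path is hypothetical; check where the JDK actually lives on the AMI.

```shell
#!/bin/sh
# Keep an existing JAVA_HOME; otherwise fall back to a guessed location.
# /usr/lib/jvm/java is a hypothetical path, not something the AMI guarantees.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java}"
echo "JAVA_HOME=$JAVA_HOME"

# List all images visible to your account and keep the ones that mention
# hadoop or hbase. Skipped quietly when the EC2 API tools are not installed.
if command -v ec2-describe-images >/dev/null 2>&1; then
    ec2-describe-images -a | grep -i -E 'hadoop|hbase' \
        || echo "no matching images found"
fi
```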


From what I've heard, the easiest and fastest way to get HBase running on EC2 is to use Apache Whirr.
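For orientation, a Whirr setup usually comes down to one properties file plus a launch command. The sketch below is written from memory of the HBase recipe bundled with Whirr, so treat the property names, role names and node counts as assumptions to check against the recipes in your Whirr version. The file location is hypothetical, and the script only prints the launch command.

```shell
#!/bin/sh
# Hypothetical location for the cluster definition.
conf="${TMPDIR:-/tmp}/hbase-ec2.properties"

# Quoted heredoc so the ${env:...} placeholders reach Whirr unexpanded.
cat > "$conf" <<'EOF'
whirr.cluster-name=hbase
whirr.instance-templates=1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master,5 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
EOF

# Print the command instead of running it, so nothing is started by accident.
echo "whirr launch-cluster --config $conf"
```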


Are you aware of Amazon Elastic MapReduce? It doesn't offer HBase, but it does offer plain ol' Hadoop, Hive and Pig (in fairly recent versions). The big win is that they don't start charging you until 90% of your nodes are up; the downside is a slight per-hour premium over normal EC2.

If you really need or want to use HBase, then you may be better off spinning something up yourself. See the following Cloudera blog post for a discussion of Hive and HBase integration: http://www.cloudera.com/blog/2010/06/integrating-hive-and-hbase/