Hadoop, Thrift, and C++ HBase client example with installation and compilation tutorial

When I started to work with HBase, I realized that there were no good examples or tutorials for a C or C++ client. So I decided to show how to create and compile a working HBase client that can become a workhorse for any project that needs to process very large data sets.

Quick reference

Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. More details – https://en.wikipedia.org/wiki/Apache_Hadoop

Thrift is an interface definition language and binary communication protocol that is used to define and create services for numerous languages. More details – https://en.wikipedia.org/wiki/Apache_Thrift

HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and written in Java. It is developed as part of Apache Software Foundation’s Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. More details – https://en.wikipedia.org/wiki/Apache_HBase

Installation

In this tutorial I’m working on Ubuntu Server 15.04, but the installation is quite similar on other Unix-like systems.

Java installation

Hadoop, Thrift, and HBase all require the JAVA_HOME environment variable to be set, so first you need to install Java. I will install the prebuilt OpenJDK packages.

sudo apt-get install openjdk-7-jre

The preferred location for JAVA_HOME or any system-wide variable is /etc/environment. Open /etc/environment in any text editor, such as nano or gedit, and add the following (the Java path on your system may differ).

JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
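
If you are not sure where the JVM lives on your system, you can resolve the real path of the java binary and derive JAVA_HOME from it by stripping the trailing /bin/java (or /jre/bin/java). The path in the comment below is just a typical Ubuntu location, not a guarantee.

readlink -f /usr/bin/java
# e.g. /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java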

Use source to load the variables into the current shell by running this command.

source /etc/environment

Then check the variable by running this command.

echo $JAVA_HOME

Hadoop installation

Just download Hadoop from http://hadoop.apache.org/releases.html; here I fetch release 2.7.0 directly from the Apache archive.

wget http://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
tar xzvf hadoop-2.7.0.tar.gz
cd hadoop-2.7.0/

Set up the environment variables.

export HADOOP_PREFIX="/home/alex/Programs/hadoop-2.7.0" # Change this to where you unpacked hadoop to.
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
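
As a quick sanity check that the variables point at a usable installation, ask Hadoop for its version; for this tarball it should report something like the line below.

$HADOOP_PREFIX/bin/hadoop version
# Hadoop 2.7.0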

For a single-node installation, let’s change the main HDFS configuration file at $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml.

<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///home/alex/Programs/hadoop-2.7.0/hdfs/datanode</value>
        <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///home/alex/Programs/hadoop-2.7.0/hdfs/namenode</value>
        <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
    </property>
</configuration>

The fs.defaultFS property belongs in $HADOOP_PREFIX/etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost/</value>
        <description>NameNode URI</description>
    </property>
</configuration>

To configure YARN, the relevant file is $HADOOP_PREFIX/etc/hadoop/yarn-site.xml.

<configuration>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>128</value>
        <description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2048</value>
        <description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-vcores</name>
        <value>1</value>
        <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this will be set to this value.</description>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>2</value>
        <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
        <description>Physical memory, in MB, to be made available to running containers</description>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value>
        <description>Number of CPU cores that can be allocated for containers.</description>
    </property>
</configuration>
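
As a sanity check on these numbers: a NodeManager advertising 4096 MB and 4 vcores can host at most min(4096 / 128, 4 / 1) = min(32, 4) = 4 minimum-size containers at once, so with these settings the vcore limit binds before the memory limit does.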

Thrift installation

First install all the tools and libraries required to build and install Thrift.

sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libboost-system-dev libboost-filesystem-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev

Download and install Thrift.

wget http://archive.apache.org/dist/thrift/0.9.2/thrift-0.9.2.tar.gz
tar xzvf thrift-0.9.2.tar.gz
cd thrift-0.9.2/
./configure
make
sudo make install
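
make install places libthrift under /usr/local/lib by default, so it is worth refreshing the linker cache and confirming the compiler landed on the PATH; the version line is what I would expect for this release.

sudo ldconfig
thrift -version
# Thrift version 0.9.2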

HBase installation

Visit http://archive.apache.org/dist/hbase/hbase-0.98.9/ and download the file with “bin” in the name. Extract the downloaded file and change to the newly created directory.

wget http://archive.apache.org/dist/hbase/hbase-0.98.9/hbase-0.98.9-hadoop2-bin.tar.gz
tar xzvf hbase-0.98.9-hadoop2-bin.tar.gz
cd hbase-0.98.9-hadoop2

For a pseudo-distributed HBase setup, edit conf/hbase-site.xml, which is the main HBase configuration file.

<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
  </property>
</configuration>

You do not need to create the HBase data directory; HBase will do this for you. Note that the port in hbase.rootdir must match the NameNode RPC port implied by fs.defaultFS (8020 is the default).

Starting

Let’s start all the services, one after another.

Starting Hadoop

It’s time to set up the folders and start the daemons.

## Start HDFS daemons
# Format the namenode directory (DO THIS ONLY ONCE, THE FIRST TIME)
$HADOOP_PREFIX/bin/hdfs namenode -format
# Start the namenode daemon
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode
# Start the datanode daemon
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start datanode

## Start YARN daemons
# Start the resourcemanager daemon
$HADOOP_PREFIX/sbin/yarn-daemon.sh start resourcemanager
# Start the nodemanager daemon
$HADOOP_PREFIX/sbin/yarn-daemon.sh start nodemanager
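
If you have a JDK installed (jps ships with the JDK, not the JRE), it gives a quick view of which daemons actually came up; the PIDs below are illustrative.

jps
# 12341 NameNode
# 12452 DataNode
# 12563 ResourceManager
# 12674 NodeManager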

Starting Thrift

Start the HBase Thrift server, the gateway our C++ client will talk to. By default it listens on port 9090.

hbase-0.98.9-hadoop2/bin/hbase-daemon.sh start thrift

Starting HBase

The bin/start-hbase.sh script is provided as a convenient way to start HBase.

./hbase-0.98.9-hadoop2/bin/start-hbase.sh
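
Running jps again should now also show the HBase processes; in this pseudo-distributed setup the master and a region server run as separate JVMs (PIDs illustrative).

jps
# 13001 HMaster
# 13112 HRegionServer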

If you want to play with HBase, you can start the HBase shell.

./hbase-0.98.9-hadoop2/bin/hbase shell
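
A few standard shell commands to confirm HBase responds; the table name test and the column family cf are arbitrary examples.

status
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
disable 'test'
drop 'test'
exit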

Generating the C++ code with Thrift

Let’s generate the C++ bindings from the Hbase.thrift interface definition that ships with HBase.

thrift --gen cpp hbase-0.98.9-hadoop2/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift
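
If generation succeeds, a gen-cpp directory appears with the bindings that the compilation step below links against (Thrift also emits a skeleton server, which we can ignore here):

ls gen-cpp
# Hbase.cpp  Hbase.h  Hbase_constants.cpp  Hbase_constants.h  Hbase_server.skeleton.cpp  Hbase_types.cpp  Hbase_types.h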

Copy the generated C++ files and the Thrift libraries to your project directory, for example /var/www/hclient. We copy the whole lib folder on the assumption that other languages may be used in the project later.

cp -R gen-cpp /var/www/hclient/gen-cpp
cp -R thrift-0.9.2/lib /var/www/hclient/lib

C++ HBase client code

The client below connects to the Thrift server on localhost:9090, drops any existing demo_table, recreates it, writes rows through mutateRow, and reads them back with gets and scanners.

/* Author: Alex Bod
 * Website: http://www.alexbod.com
 * License: The GNU General Public License, version 2
 * main.cpp: C++ HBase client using the Thrift server
 */
#include <poll.h>

#include <cassert>
#include <cstdio>
#include <iostream>
#include <map>
#include <string>
#include <vector>

#include <boost/lexical_cast.hpp>
#include <thrift/transport/TSocket.h>
#include <thrift/transport/TBufferTransports.h>
#include <thrift/protocol/TBinaryProtocol.h>

#include "gen-cpp/Hbase.h"

using namespace apache::thrift;
using namespace apache::thrift::protocol;
using namespace apache::thrift::transport;
using namespace apache::hadoop::hbase::thrift;

typedef std::vector<std::string> StrVec;
typedef std::map<std::string, std::string> StrMap;
typedef std::vector<ColumnDescriptor> ColVec;
typedef std::map<std::string, ColumnDescriptor> ColMap;
typedef std::vector<TCell> CellVec;
typedef std::map<std::string, TCell> CellMap;

/* The function to print rows */
static void printRow(const std::vector<TRowResult> &);
/* The function to print versions */
static void printVersions(const std::string &row, const CellVec &);

int main(int argc, char** argv)
{
    /* Connection to the Thrift server */
    boost::shared_ptr<TSocket> socket(new TSocket("localhost", 9090));
    boost::shared_ptr<TTransport> transport(new TBufferedTransport(socket));
    boost::shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));

    /* Create the HBase client */
    HbaseClient client(protocol);

    try {
        transport->open();

        std::string t("demo_table");

        /* Scan all tables, look for the demo table and delete it. */
        std::cout << "scanning tables..." << std::endl;
        StrVec tables;
        client.getTableNames(tables);
        for (StrVec::const_iterator it = tables.begin(); it != tables.end(); ++it) {
            std::cout << " found: " << *it << std::endl;
            if (t == *it) {
                if (client.isTableEnabled(*it)) {
                    std::cout << " disabling table: " << *it << std::endl;
                    client.disableTable(*it);
                }
                std::cout << " deleting table: " << *it << std::endl;
                client.deleteTable(*it);
            }
        }

        /* Create the demo table with two column families, entry: and unused: */
        ColVec columns;
        StrMap attr;
        columns.push_back(ColumnDescriptor());
        columns.back().name = "entry:";
        columns.back().maxVersions = 10;
        columns.push_back(ColumnDescriptor());
        columns.back().name = "unused:";

        std::cout << "creating table: " << t << std::endl;
        try {
            client.createTable(t, columns);
        } catch (const AlreadyExists &ae) {
            std::cerr << "WARN: " << ae.message << std::endl;
        }

        ColMap columnMap;
        client.getColumnDescriptors(columnMap, t);
        std::cout << "column families in " << t << ": " << std::endl;
        for (ColMap::const_iterator it = columnMap.begin(); it != columnMap.end(); ++it) {
            std::cout << " column: " << it->second.name << ", maxVer: " << it->second.maxVersions << std::endl;
        }

        /* Test UTF-8 handling */
        std::string invalid("foo-\xfc\xa1\xa1\xa1\xa1\xa1");
        std::string valid("foo-\xE7\x94\x9F\xE3\x83\x93\xE3\x83\xBC\xE3\x83\xAB");

        /* Non-utf8 is fine for data */
        std::vector<Mutation> mutations;
        mutations.push_back(Mutation());
        mutations.back().column = "entry:foo";
        mutations.back().value = invalid;
        client.mutateRow(t, "foo", mutations, attr);

        /* Trying empty strings is not valid:
        mutations.clear();
        mutations.push_back(Mutation());
        mutations.back().column = "entry:";
        mutations.back().value = "";
        client.mutateRow(t, "", mutations, attr); */

        /* This row name is valid utf8 */
        mutations.clear();
        mutations.push_back(Mutation());
        mutations.back().column = "entry:foo";
        mutations.back().value = valid;
        client.mutateRow(t, valid, mutations, attr);

        /* Non-utf8 is now allowed in row names because HBase stores values as binary */
        mutations.clear();
        mutations.push_back(Mutation());
        mutations.back().column = "entry:foo";
        mutations.back().value = invalid;
        client.mutateRow(t, invalid, mutations, attr);

        /* Run a scanner on the rows we just created */
        StrVec columnNames;
        columnNames.push_back("entry:");
        std::cout << "Starting scanner..." << std::endl;
        int scanner = client.scannerOpen(t, "", columnNames, attr);
        try {
            while (true) {
                std::vector<TRowResult> value;
                client.scannerGet(value, scanner);
                if (value.size() == 0)
                    break;
                printRow(value);
            }
        } catch (const IOError &ioe) {
            std::cerr << "FATAL: Scanner raised IOError" << std::endl;
        }
        client.scannerClose(scanner);
        std::cout << "Scanner finished" << std::endl;

        /* Run some operations on a bunch of rows */
        for (int i = 0; i <= 100; i++) {
            /* Format row keys as "00000" to "00100" */
            char buf[32];
            sprintf(buf, "%05d", i);
            std::string row(buf);
            std::vector<TRowResult> rowResult;

            mutations.clear();
            mutations.push_back(Mutation());
            mutations.back().column = "unused:";
            mutations.back().value = "DELETE_ME";
            client.mutateRow(t, row, mutations, attr);
            client.getRow(rowResult, t, row, attr);
            printRow(rowResult);
            client.deleteAllRow(t, row, attr);

            mutations.clear();
            mutations.push_back(Mutation());
            mutations.back().column = "entry:num";
            mutations.back().value = "0";
            mutations.push_back(Mutation());
            mutations.back().column = "entry:foo";
            mutations.back().value = "FOO";
            client.mutateRow(t, row, mutations, attr);
            client.getRow(rowResult, t, row, attr);
            printRow(rowResult);

            /* Sleep to force a later timestamp */
            poll(0, 0, 50);

            mutations.clear();
            mutations.push_back(Mutation());
            mutations.back().column = "entry:foo";
            mutations.back().isDelete = true;
            mutations.push_back(Mutation());
            mutations.back().column = "entry:num";
            mutations.back().value = "-1";
            client.mutateRow(t, row, mutations, attr);
            client.getRow(rowResult, t, row, attr);
            printRow(rowResult);

            mutations.clear();
            mutations.push_back(Mutation());
            mutations.back().column = "entry:num";
            mutations.back().value = boost::lexical_cast<std::string>(i);
            mutations.push_back(Mutation());
            mutations.back().column = "entry:sqr";
            mutations.back().value = boost::lexical_cast<std::string>(i*i);
            client.mutateRow(t, row, mutations, attr);
            client.getRow(rowResult, t, row, attr);
            printRow(rowResult);

            mutations.clear();
            mutations.push_back(Mutation());
            mutations.back().column = "entry:num";
            mutations.back().value = "-999";
            mutations.push_back(Mutation());
            mutations.back().column = "entry:sqr";
            mutations.back().isDelete = true;
            client.mutateRowTs(t, row, mutations, 1, attr); /* Shouldn't override latest */
            client.getRow(rowResult, t, row, attr);
            printRow(rowResult);

            CellVec versions;
            client.getVer(versions, t, row, "entry:num", 10, attr);
            printVersions(row, versions);
            assert(versions.size());
            std::cout << std::endl;

            try {
                std::vector<TCell> value;
                client.get(value, t, row, "entry:foo", attr);
                if (value.size()) {
                    std::cerr << "FATAL: shouldn't get here!" << std::endl;
                    return -1;
                }
            } catch (const IOError &ioe) {
                /* Expected: entry:foo was deleted above */
            }
        }

        /* Scan all rows/columns */
        columnNames.clear();
        client.getColumnDescriptors(columnMap, t);
        std::cout << "The number of columns: " << columnMap.size() << std::endl;
        for (ColMap::const_iterator it = columnMap.begin(); it != columnMap.end(); ++it) {
            std::cout << " column with name: " + it->second.name << std::endl;
            columnNames.push_back(it->second.name);
        }
        std::cout << std::endl;

        std::cout << "Starting scanner..." << std::endl;
        scanner = client.scannerOpenWithStop(t, "00020", "00040", columnNames, attr);
        try {
            while (true) {
                std::vector<TRowResult> value;
                client.scannerGet(value, scanner);
                if (value.size() == 0)
                    break;
                printRow(value);
            }
        } catch (const IOError &ioe) {
            std::cerr << "FATAL: Scanner raised IOError" << std::endl;
        }
        client.scannerClose(scanner);
        std::cout << "Scanner finished" << std::endl;

        transport->close();
    } catch (const TException &tx) {
        std::cerr << "ERROR: " << tx.what() << std::endl;
    }
    return 0;
}

/* The function to print rows */
static void printRow(const std::vector<TRowResult> &rowResult)
{
    for (size_t i = 0; i < rowResult.size(); i++) {
        std::cout << "row: " << rowResult[i].row << ", cols: ";
        for (CellMap::const_iterator it = rowResult[i].columns.begin(); it != rowResult[i].columns.end(); ++it) {
            std::cout << it->first << " => " << it->second.value << "; ";
        }
        std::cout << std::endl;
    }
}

/* The function to print versions */
static void printVersions(const std::string &row, const CellVec &versions)
{
    std::cout << "row: " << row << ", values: ";
    for (CellVec::const_iterator it = versions.begin(); it != versions.end(); ++it) {
        std::cout << (*it).value << "; ";
    }
    std::cout << std::endl;
}

Download the tar archive directly or clone it from GitHub.

Compilation

g++ -Wall -o hclient main.cpp gen-cpp/Hbase_types.cpp gen-cpp/Hbase_constants.cpp gen-cpp/Hbase.cpp -lthrift

Dependency explanation

gen-cpp/Hbase_types.cpp gen-cpp/Hbase_constants.cpp gen-cpp/Hbase.cpp: the generated HBase bindings the client depends on
-lthrift: links against the Thrift library
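
If the compiler cannot find the thrift/ headers or the linker cannot find libthrift, the usual reason is that Thrift installed under the /usr/local prefix; here is a variant of the same command with those paths spelled out, plus the library path the binary needs at run time.

g++ -Wall -I/usr/local/include -L/usr/local/lib -o hclient main.cpp gen-cpp/Hbase_types.cpp gen-cpp/Hbase_constants.cpp gen-cpp/Hbase.cpp -lthrift
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH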

Run the HBase client

./hclient
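
If everything is up, the client walks through its demo and prints output along these lines (abridged; the exact rows and values follow from the code above, and unused: shows the default of three versions):

scanning tables...
creating table: demo_table
column families in demo_table:
 column: entry:, maxVer: 10
 column: unused:, maxVer: 3
Starting scanner...
...
Scanner finished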