Friday, 16 January 2015

Big Data SQL for Oracle NoSQL Database

Introduction


We recently announced Oracle Big Data SQL support for Oracle NoSQL Database. This feature will allow Oracle Database users to connect to external data repositories like Oracle NoSQL Database, and/or Hadoop, and will enable them to fetch data from any or all of the repositories (at once) through single SQL query. We believe this a significant step forward as it streamlines the adoption of big data technologies with traditional technology stack.

If you are following big data space then you would agree that everyone likes the scalability, availability, performance at cost aspect of it but almost everyone struggles to modify their enterprise application to work with big data technologies because they don't talk the same language as traditional database systems do.

For innovations to happen fast what is required is that we lower down the adoption barrier for the developers & the data scientists who have been at the core of this data revolution. And by providing same SQL declarative language for relational and non-relational data we have made it simpler for everyone to adopt these big data technologies.


SQL on Oracle NoSQL Database


Having an access layer using a declarative language is definitely a great thing but a common trend these days is that every NoSQL vendor is writing a custom SQLike layer. Now if you want to use your existing application to access relational data as well as non-relational data stored in the NoSQL database then you can not write a single SQL query to access these two separate stores, which again makes it complicated from application development and manageability point of view.

Oracle Big Data SQL for Oracle NoSQL Database fills this gap by providing exactly the same SQL access layer to both the stores and therefore the queries would look exactly the same no matter which store you access. To achieve this, Oracle Big Data SQL requires that Oracle NoSQL Database support the ability to access its data via Hive. Thus, plugging Oracle NoSQL Database to Oracle Database is as simple as performing 1 (Creating a NoSQL Store and a Table), 2  (Configuring Hive to access NoSQL Table) & 3 (Configuring Oracle Database to talk to HCatalog) steps.

Please note that Big Data SQL provides much better query performance over running a query directly on a generic external tables because with former approach, computation (for selecting, & formatting data) is moved where the data is as apposed to moving all the data to the Oracle Database node first and then applying the predicates. The same proprietary cell technology (developed for Exadata) is used for applying the query predicates in parallel to all the distributed NoSQL partitions, making the overall query to run faster.

In rest of the article I will give you the technical details on how to perform first two steps so you can get the hands-on experience of this feature.

Big Data SQL for Oracle NoSQL Database

Figure 1: Flow diagram showing steps to access data in Oracle NoSQL Database using hive queries.


Setup


Step 1: Create a TABLE in Oracle NoSQL Database and load some data that we would like to access from HIVE later on.
  • For this demonstration we will be using KVLITE but you can very well deploy a real cluster by following steps covered in this webinar. Please remember to set KVHOME as the environment variable that will point to the directory where you unzipped the latest Oracle NoSQL Database binaries  and KVROOT (eg. /u01/kvroot) where you would like metadata files to be stored.

$java -jar $KVHOME/lib/kvstore.jar kvlite -root $KVROOT -host localhost &
  • Once you have Oracle NoSQL Database cluster deployed (in our case it is single instance), connect to administration shell using runadmin command and then from the KV prompt connect to the store (kvstore in my case).
$java -jar $KVHOME/lib/kvstore.jar runadmin -port 5000 -host localhost
kv-> CONNECT STORE -name kvstore 
  • Create MOVIE table with few columns (to capture movie specific information)

kv->EXEC "CREATE TABLE MOVIE( \
             id INTEGER, \
             original_title STRING,\
             overview STRING,\
             poster_path STRING,\
             release_date STRING,\
             vote_count INTEGER,\
             runtime INTEGER,\
             popularity DOUBLE,\
             genres STRING,\
             PRIMARY KEY (id) )" 
  • SHOW table to confirm it is created successfully & DESCRIBE to make sure all the columns types are appropriate as well.

kv-> EXEC 'SHOW TABLES' 
kv-> EXEC 'DESC AS JSON TABLE MOVIE'
  • Next we will load some MOVIE data into the table. You can download a movie.dat file with details of about 10 movies represented as JSON string. We will directly load these strings into MOVIE table.

kv-> PUT TABLE -name MOVIE -file /tmp/movie.dat
Inserted 10 rows in movie table
  • You can now COUNT the rows and GET the rows to make sure data is indeed there.

kv-> AGGREGATE TABLE -name MOVIE -count
Row count: 10

KV-> GET TABLE -name MOVIE 


Step 2: Create Hive External Table using Storage Handler for Oracle NoSQL Database (that got shipped in release 3.2.5) .
  • Before we begin doing anything with Hive we first need to install Hadoop on our system. 
    • I have already installed Apache hadoop-2.2.0 and there is a good article that you can follow to do it yourself. 
  • After Hadoop being installed and tested, download. I am using apache-hive-0.14.0 at the time of this article but you can download any latest version. Unzip the archive and create environment variable HIVE_HOME to point to this directory.
  • Here are some of the environment variables that are set in my ~/.bashrc file: 

export HADOOP_HOME=/u01/home/hadoop/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

export KVHOME=/u01/nosql/kvhome

export HIVE_HOME=/u01/home/hadoop/apache-hive-0.14.0
export HIVE_AUX_JARS_PATH=$KVHOME/lib/kvclient.jar

export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$HIVE_HOME/bin:\
        $HADOOP_HOME/bin:$HADOOP_HOME
  • NOTICE that we have HIVE_AUX_JARS_PATH set to kvclient.jar. We need to set this variable to make sure Hive can find Storage Handler for Oracle NoSQL Database from the client driver itself.
  • Now the only thing that is left is creating an external table to mimic the table exist in Oracle NoSQL Database. Let's connect to hive first and then create the table using this DDL:

$hive

hive-> CREATE EXTERNAL TABLE IF NOT EXISTS 
         MOVIE( id INT, 
                original_title STRING,
                overview STRING, 
                poster_path STRING,
                release_date STRING, 
                vote_count INT, 
                runtime INT, 
                popularity DOUBLE, 
                genres STRING) 
         STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler' 
         TBLPROPERTIES  ( "oracle.kv.kvstore"="kvstore",
                          "oracle.kv.hosts"="localhost:5000",
                          "oracle.kv.hadoop.hosts"="localhost",
                          "oracle.kv.tableName"="MOVIE");

Time taken: 0.174 seconds

Note: STORE BY field is where we define the StorageHandler class for Oracle NoSQL Database and TBLPROPERTIES is where we define name of the store, host name and port where storage node agent is listening at (in our case we have only node), and the name of the table in NoSQL DB. We have MOVIE as the table there so that is what we use as oracle.kv.tableName  field.
  • That is all one need to do to configure Hive to work with Oracle NoSQL Database. Pretty simple, right ? 

Step 2b: Now that we have NoSQL DB & Hive setup done, we just need to test running hive queries.
  • Here are some of the simple queries to fetch MOVIE data using Hive Query Language (HQL):

hive> SELECT id, original_title, release_date, genres 
        FROM MOVIE;

hive> SELECT id, original_title, release_date, genres 
        FROM MOVIE 
        WHERE release_date >1990;

hive> SELECT id, original_title, release_date, genres 
        FROM MOVIE 
        WHERE genres LIKE '%Comedy%'; 
  • Some other queries that would require M/R jobs to be spun off to get the results:

hive> SELECT count(*) 
        FROM MOVIE 
        WHERE genres LIKE '%Drama%'; 

hive> SELECT id, original_title, release_date, genres 
        FROM MOVIE 
        WHERE genres LIKE '%Drama%' 
        ORDER BY release_date;

hive> SELECT MAX(release_date) 
        FROM MOVIE;