Synchronize data from Hive datasource

Overview

When the INDEXIMA table is created with the following SQL query:

SQL

CREATE TABLE my_dataspace FROM my_hive_source IN 'jdbc:hive2://my_hive_server:10000' (INDEX (my_index));

INDEXIMA reads the table structure and creates a dataspace with the same or equivalent data types.

The following SQL command is used to detect differences between partitions in the HIVE source database and the INDEXIMA dataspace. This command will trigger a data load to keep both tables synchronized.

If a HIVE source partition is missing in Indexima, Indexima Data Engine automatically loads such partition into the appropriate table. The synchronize command includes a commit command:

SQL

SYNCHRONIZE my_data_space

Synchronize command is not auto-sync

Indexima doesn't follow/listen to changes in Hive Databases.

This means the SYNCHRONIZE commands must be sent after every modification in the Hive datasource that needs to be replicated in Indexima.

Synchronizing selected partitions from Hive source

Prerequisite: In your core-site.xml, the warehouse URI for the property fs.defaultFS must be the same warehouse as the Hive's one.

INDEXIMA allows synchronizing only selected partitions of a HIVE source created with a set of partition into an INDEXIMA dataspace

The HIVE dataspace/table must already be partitioned: it must be created with the keyword PARTITIONED BY as shown below:

Below are some examples, based on 3 columns partitions, namely year, month, day:

SQL

CREATE TABLE my_hiveTable (
    id           INT,
    name         STRING,
    length         STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
...;

Check if new rows need to be synchronized in partitions corresponding to the date of 2nd April 2016:

SQL

SYNCHRONIZE my_data_space PARTITION (year='2016', month='4', day='2')

Check if new rows need to be synchronized in partitions corresponding to the month of May 2016:

SQL

SYNCHRONIZE my_data_space PARTITION (year='2016', month='5')

Check if new rows need to be synchronized in partitions corresponding to the year 2016:

SQL

SYNCHRONIZE my_data_space PARTITION (year='2016')

Synchronizing with transactional/ACID tables

If the Hive table is a transactional/ACID table, Indexima ignores all changes not committed.
The Hive base files will be synchronized with the base and delta files will be ignored.

NO_CLEANUP property

Synchronize commands in Hive temporarily generate 2 base directories.

In order to properly synchronize in Indexima, it is recommended to set NO_CLEANUP to false in the table properties as shown in the following command to enable the table-level cleanup and in order to clean obsolete directories/files.

CODE

ALTER TABLE <tablename> SET TBLPROPERTIES('NO_CLEANUP'=FALSE);

If the user wants to be sure there is only one base directory per partition, he can check with the following command.

hdfs dfs -ls -R /path/myschema.db/mytable | grep dr | grep base | wc -l

AND WAIT compaction option

In the case of partition modifications in HIVE, it is required to compact (with option=major) in order to get all modifications in the base file so that the Synchronize command in Indexima will work.

We recommend adding the option "AND WAIT" that blocks the operation until compaction completes.

See hive documentation for more details.

Synchronizing the Hive data source via Zookeeper

Hortonwork Hadoop ecosystem requires some additional jar libraries to be fully functional with Zookeeper:

Apply the following procedure to adapt INDEXIMA to such an environment:

Download the following jar to comply with Hortonworks ecosystem and unzip it: Hive thrift driver

Once started, you can synchronize a HIVE data source pointed by ZooKeeper.Stop your INDEXIMA CLUSTER
```
cd galactica
./stop.sh
```
Run multiple times the script stop.sh to ensure that there is no more connection

Remove 2 jars from galactica/lib directory

cd galactica/lib/
rm hive-service-thrift-1.2.1.*.jar
rm hive-jdbc-1.2.1.jar

Copy the 2 new jars in galactica/lib directory

cd galactica/lib/
cp $HOME/hive-service-thrift-HW-2.6.5.jar .
cp  /usr/hdp/current/hive-client/lib/hive-jdbc.jar .

Restart your INDEXIMA CLUSTER

cd galactica
nohup ./start-yarn.sh &
yarn application -list

SQL

DROP TABLE IF EXISTS testzookeper; 
CREATE DIMENSION TABLE testzookeeper FROM sourcehive 
IN 'jdbc:hive2://ns3615.co:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2' ();
synchronize testzookeeper;