Synchronize data from Hive datasource
Overview
When the INDEXIMA table is created with the following SQL query:
CREATE TABLE my_dataspace FROM my_hive_source IN 'jdbc:hive2://my_hive_server:10000' (INDEX (my_index));
INDEXIMA reads the table structure and creates a dataspace with the same or equivalent data types.
The following SQL command is used to detect differences between partitions in the HIVE source database and the INDEXIMA dataspace. This command will trigger a data load to keep both tables synchronized.
If a HIVE source partition is missing in Indexima, Indexima Data Engine automatically loads such partition into the appropriate table. The synchronize command includes a commit command:
SYNCHRONIZE my_data_space
Synchronize command is not auto-sync
Indexima doesn't follow/listen to changes in Hive Databases.
This means the SYNCHRONIZE commands must be sent after every modification in the Hive datasource that needs to be replicated in Indexima.
Synchronizing selected partitions from Hive source
Prerequisite: In your core-site.xml, the warehouse URI for the property fs.defaultFS must be the same warehouse as the Hive's one.
INDEXIMA allows synchronizing only selected partitions of a HIVE source created with a set of partition into an INDEXIMA dataspace
The HIVE dataspace/table must already be partitioned: it must be created with the keyword PARTITIONED BY as shown below:
Below are some examples, based on 3 columns partitions, namely year, month, day:
CREATE TABLE my_hiveTable (
id INT,
name STRING,
length STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
...;
- Check if new rows need to be synchronized in partitions corresponding to the date of 2nd April 2016:
SYNCHRONIZE my_data_space PARTITION (year='2016', month='4', day='2')
- Check if new rows need to be synchronized in partitions corresponding to the month of May 2016:
SYNCHRONIZE my_data_space PARTITION (year='2016', month='5')
- Check if new rows need to be synchronized in partitions corresponding to the year 2016:
SYNCHRONIZE my_data_space PARTITION (year='2016')
Synchronizing with transactional/ACID tables
If the Hive table is a transactional/ACID table, Indexima ignores all changes not committed.
The Hive base files will be synchronized with the base and delta files will be ignored.
NO_CLEANUP property
Synchronize commands in Hive temporarily generate 2 base directories.
In order to properly synchronize in Indexima, it is recommended to set NO_CLEANUP to false in the table properties as shown in the following command to enable the table-level cleanup and in order to clean obsolete directories/files.
ALTER TABLE <tablename> SET TBLPROPERTIES('NO_CLEANUP'=FALSE);
If the user wants to be sure there is only one base directory per partition, he can check with the following command.
hdfs dfs -ls -R /path/myschema.db/mytable | grep dr | grep base | wc -l
AND WAIT compaction option
In the case of partition modifications in HIVE, it is required to compact (with option=major) in order to get all modifications in the base file so that the Synchronize command in Indexima will work.
We recommend adding the option "AND WAIT" that blocks the operation until compaction completes.
See hive documentation for more details.
Synchronizing the Hive data source via Zookeeper
Hortonwork Hadoop ecosystem requires some additional jar libraries to be fully functional with Zookeeper:
Apply the following procedure to adapt INDEXIMA to such an environment:
Download the following jar to comply with Hortonworks ecosystem and unzip it: Hive thrift driver
Once started, you can synchronize a HIVE data source pointed by ZooKeeper.Stop your INDEXIMA CLUSTER
cd galactica ./stop.sh
Run multiple times the script stop.sh to ensure that there is no more connection
Remove 2 jars from galactica/lib directory
cd galactica/lib/ rm hive-service-thrift-1.2.1.*.jar rm hive-jdbc-1.2.1.jar
Copy the 2 new jars in galactica/lib directory
cd galactica/lib/ cp $HOME/hive-service-thrift-HW-2.6.5.jar . cp /usr/hdp/current/hive-client/lib/hive-jdbc.jar .
Restart your INDEXIMA CLUSTER
cd galactica nohup ./start-yarn.sh & yarn application -list
DROP TABLE IF EXISTS testzookeper;
CREATE DIMENSION TABLE testzookeeper FROM sourcehive
IN 'jdbc:hive2://ns3615.co:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2' ();
synchronize testzookeeper;