Monitor an Indexima Cluster
To deliver maximum performance to users, it is recommended to monitor an Indexima cluster. Monitoring makes it possible to anticipate problems before users actually experience them.
It is recommended to perform these actions with a dedicated user. This makes the logs easier to read.
Live Monitoring
Overall Health status
Check Node Status every X minutes
Indexima provides a Cluster status that allows checking the health of each node. Each node should return at least the following results:
- "status": RUNNING
- "attached": TRUE
Rule
If at least one node does not report a running and attached status, the cluster is not operating nominally.
Thanks to the High-Availability feature, Indexima will adapt. However, if this situation was not planned, it needs to be analyzed.
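The rule above can be sketched as a small check, assuming the Cluster status response has already been fetched and parsed into a list of dictionaries. Only the "status" and "attached" fields come from this page; the "name" field and the sample payload are hypothetical.

```python
def unhealthy_nodes(nodes):
    """Return the names of nodes that are not both RUNNING and attached."""
    bad = []
    for node in nodes:
        running = str(node.get("status", "")).upper() == "RUNNING"
        # Accept either a JSON boolean or the string "TRUE".
        attached = str(node.get("attached", "")).upper() == "TRUE"
        if not (running and attached):
            # "name" is an assumed field used only for reporting.
            bad.append(node.get("name", "<unknown>"))
    return bad


# Hypothetical payload, already parsed from the cluster status response.
sample = [
    {"name": "node-1", "status": "RUNNING", "attached": True},
    {"name": "node-2", "status": "STOPPED", "attached": True},
    {"name": "node-3", "status": "RUNNING", "attached": False},
]
print(unhealthy_nodes(sample))
```

An empty result means every node reports a running and attached status; any other result should trigger the analysis described above.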
Send Show Index & Show Dictionaries every X minutes
Use the Indexes & Dictionaries API to execute SHOW MEMORY & SHOW DICTIONARIES.
Rule
If the recurring operations take longer than their average duration (plus a margin), it is recommended to restart the cluster.
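One way to apply this rule is to time each recurring call and compare it with the average of previous runs. The sketch below is independent of how the commands are sent: `operation` stands for any callable that runs them, and the 50% margin is an assumption, since this page does not specify a value.

```python
import time
from statistics import mean


def timed(operation):
    """Run a callable and return its duration in seconds."""
    start = time.monotonic()
    operation()
    return time.monotonic() - start


def is_too_slow(duration, history, margin=0.5):
    """True when duration exceeds the historical average by more than margin.

    margin is a fraction of the average (0.5 means +50%); this default is
    an assumption to be tuned per cluster.
    """
    if not history:
        return False  # no baseline yet
    return duration > mean(history) * (1 + margin)


# Durations (in seconds) of earlier runs, kept by the monitoring job.
history = [0.8, 1.0, 1.2]
print(is_too_slow(1.4, history))
print(is_too_slow(1.6, history))
```

Keeping the history over a sliding window (for example the last day) avoids comparing against measurements taken under a very different load.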
Check Memory low events during the past X minutes
Use the Events History on EACH node to catch the Memory low events.
Memory low events are normal events: when there is not enough memory to answer queries or to load data, the cluster unloads indexes to free space. In a few rare cases, even though the system has been freeing space for some time, there is still not enough space, and the system can no longer answer.
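Since isolated Memory low events are normal, a monitoring job needs to tell them apart from a sustained streak. The sketch below assumes the event timestamps have already been collected from the Events History of a node; the 2-minute `max_gap` used to decide that two events belong to the same streak is an assumption, and the 15-minute limit is the one used in the rule that follows.

```python
from datetime import datetime, timedelta


def memory_low_streak(event_times, max_gap=timedelta(minutes=2)):
    """Duration of the most recent run of events with gaps <= max_gap."""
    times = sorted(event_times)
    if not times:
        return timedelta(0)
    start = end = times[-1]
    # Walk backwards while consecutive events stay close together.
    for t in reversed(times[:-1]):
        if start - t > max_gap:
            break
        start = t
    return end - start


def should_restart(event_times, now, limit=timedelta(minutes=15),
                   max_gap=timedelta(minutes=2)):
    """True when Memory low events are still occurring and have lasted > limit."""
    times = sorted(event_times)
    if not times or now - times[-1] > max_gap:
        return False  # no events, or the streak has already ended
    return memory_low_streak(times, max_gap) > limit


now = datetime(2024, 1, 1, 12, 0)
# One event per minute during the last 20 minutes.
events = [now - timedelta(minutes=m) for m in range(21)]
print(should_restart(events, now))
```

The check has to be run per node, since the events are read on each node separately.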
Rule
If Memory low events persist for more than 15 min, it is recommended to restart the cluster.
Data Analysts usage
Queries performance
Use the Queries History to check the performance of the past SELECT queries.
Rule
If the average response time (over a period) exceeds a defined threshold, users are experiencing some slowness.
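As an illustration, the average over a period can be computed from Queries History entries once they have been retrieved. The field names `finished_at` and `duration_ms`, the 10-minute period, and the 2000 ms threshold are all assumptions; adapt them to the actual history format and to the service level agreed with users.

```python
from datetime import datetime, timedelta
from statistics import mean


def average_response_ms(queries, now, period=timedelta(minutes=10)):
    """Average duration of queries finished within the period, or None."""
    recent = [
        q["duration_ms"]
        for q in queries
        if timedelta(0) <= now - q["finished_at"] <= period
    ]
    return mean(recent) if recent else None


def is_slow(queries, now, threshold_ms=2000, period=timedelta(minutes=10)):
    """True when the average response time over the period exceeds the threshold."""
    avg = average_response_ms(queries, now, period)
    return avg is not None and avg > threshold_ms


now = datetime(2024, 1, 1, 12, 0)
# Hypothetical history entries for SELECT queries.
history = [
    {"finished_at": now - timedelta(minutes=1), "duration_ms": 3000},
    {"finished_at": now - timedelta(minutes=5), "duration_ms": 2000},
    {"finished_at": now - timedelta(minutes=30), "duration_ms": 100},
]
print(average_response_ms(history, now))
print(is_slow(history, now))
```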
Queries Errors
Use the Queries History to check errors in the past SELECT queries.
Rule
If the number of TimeOut errors exceeds a defined threshold, users are experiencing some slowness.
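A simple way to quantify TimeOut errors is to compute their share among the queries of the period. In the sketch below, the `error` field and the way a timeout is recognized (the word "timeout" in the error text) are assumptions about the history format, and the 5% limit is only an example value.

```python
def timeout_ratio(queries):
    """Fraction of queries whose error text mentions a timeout."""
    if not queries:
        return 0.0
    timeouts = sum(
        1 for q in queries if "timeout" in (q.get("error") or "").lower()
    )
    return timeouts / len(queries)


def too_many_timeouts(queries, max_ratio=0.05):
    """True when the timeout ratio exceeds max_ratio (assumed limit)."""
    return timeout_ratio(queries) > max_ratio


# Hypothetical history entries for the monitored period.
history = [
    {"sql": "SELECT 1", "error": None},
    {"sql": "SELECT 2", "error": "TimeOut reached"},
    {"sql": "SELECT 3", "error": "Syntax error"},
    {"sql": "SELECT 4"},
]
print(timeout_ratio(history))
print(too_many_timeouts(history))
```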
[Optional] Send Queries on critical tables
Define a list of critical tables, then regularly send a set of queries on these tables.
Rule
If the average response time (over a period) reaches a threshold, and there are no writing operations on the specific tables, users are experiencing some slowness.
[Optional] Send Queries on a dummy table every Y minutes
Objective: Send a set of queries that reproduces data-analysts usage.
An example is provided here (*). This example has to be split into 2 parts:
- Part 1: Create and load the data in order to set up the dummy table. This part has to be done once.
- Part 2: Two SELECT queries to be sent.
Rule
If the recurring operations take longer than their average duration (plus a margin), it is recommended to restart the cluster.
Keep in mind that when a query's response time exceeds 1 sec (default value), the query gets cached, so its subsequent response time is 1 or 2 ms.
Checking dummy tables does not guarantee that the production tables are usable.
(*) This example uses data from the Indexima bench. Refer to this page to get the data and more information.
Off-line Monitoring
Check memory size used for indexes
Use Indexes & Dictionaries API to check the Memory size of indexes.
Administrators may aggregate the results by functional domain and compare each total against the maximum limit that has been agreed for that domain.
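The aggregation can be sketched as follows, assuming the index sizes have already been retrieved. The `table` and `memory_bytes` fields, the convention that the functional domain is the schema prefix of the table name, and the limits themselves are hypothetical.

```python
from collections import defaultdict


def domain_from_table(table):
    """Assumed convention: 'sales.orders' belongs to the 'sales' domain."""
    return table.split(".", 1)[0]


def memory_by_domain(indexes, domain_of=domain_from_table):
    """Sum index memory per functional domain."""
    totals = defaultdict(int)
    for idx in indexes:
        totals[domain_of(idx["table"])] += idx["memory_bytes"]
    return dict(totals)


def over_limit(totals, limits):
    """Return the domains whose total exceeds their agreed limit."""
    return {d: used for d, used in totals.items()
            if d in limits and used > limits[d]}


GIB = 1024 ** 3
# Hypothetical index sizes and per-domain limits.
indexes = [
    {"table": "sales.orders", "memory_bytes": 6 * GIB},
    {"table": "sales.customers", "memory_bytes": 5 * GIB},
    {"table": "hr.employees", "memory_bytes": 1 * GIB},
]
limits = {"sales": 10 * GIB, "hr": 4 * GIB}
totals = memory_by_domain(indexes)
print(over_limit(totals, limits))
```

Domains without an entry in `limits` are ignored here; listing them separately may help spot tables that were never assigned a budget.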
Check queries performance
Use the page Get Queries History to check the queries performance.