Set up shared storage
Indexima always requires shared storage that is readable and writable by all Indexima nodes in the cluster. This storage holds the warehouse (hyperindex dumps and ingested data) and the log history shared between nodes.
This guide will show you how to configure this shared storage with various compatible file systems.
The 3 parameters in galactica.conf that must target shared storage are:
warehouse
history.dir
history.export
Any storage can be used with any deployment type (Yarn, standalone, Kubernetes).
HDFS
Indexima can use HDFS storage. Customize your galactica.conf as in the following sample:
galactica.conf
warehouse = hdfs://hdfs_server/indexima/warehouse
history.dir = hdfs://hdfs_server/indexima/log/history
history.export = hdfs://hdfs_server/indexima/log/history-export
pages.oneFilePerColumn = false
pages = 16000
When using HDFS, it is highly recommended to set pages.oneFilePerColumn to false, because Indexima is very I/O-intensive during the commit phases.
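As a quick sanity check before starting Indexima, you can pre-create the directories and confirm write access from one of the nodes. This is a minimal sketch assuming the hdfs client is available on the node and the commands are run as the user that starts Indexima:
hdfs dfs -mkdir -p hdfs://hdfs_server/indexima/warehouse \
                   hdfs://hdfs_server/indexima/log/history \
                   hdfs://hdfs_server/indexima/log/history-export
# confirm the directories exist and are writable by the Indexima user
hdfs dfs -ls hdfs://hdfs_server/indexima
hdfs dfs -touchz hdfs://hdfs_server/indexima/warehouse/_write_test
hdfs dfs -rm hdfs://hdfs_server/indexima/warehouse/_write_test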
Amazon S3
Indexima can use Amazon S3 storage. Customize your galactica.conf as in the following sample:
galactica.conf
warehouse = s3a://bucket-name/indexima/warehouse
history.dir = s3a://bucket-name/indexima/log/history
history.export = s3a://bucket-name/indexima/log/history-export
Note that the prefix is 's3a', as defined by the hadoop-aws module.
In order to authorize access to the S3 storage, please follow the hadoop-aws documentation. For a standard setup, you just need to add the AWS environment variables in your galactica-env.sh:
galactica-env.sh
export AWS_ACCESS_KEY_ID=<access_key_id>
export AWS_SECRET_ACCESS_KEY=<secret_key>
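To confirm the credentials are valid before starting the cluster, you can list the warehouse prefix. This is a minimal sketch assuming the AWS CLI is installed (it is not part of Indexima) and that bucket-name matches the bucket used in galactica.conf:
export AWS_ACCESS_KEY_ID=<access_key_id>
export AWS_SECRET_ACCESS_KEY=<secret_key>
# an empty listing is fine; an AccessDenied error means the credentials or bucket policy need fixing
aws s3 ls s3://bucket-name/indexima/warehouse/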
Azure ADLS Gen2
Indexima can use Azure ADLS Gen2 Blob storage. Customize your galactica.conf as in the following sample:
galactica.conf
warehouse = abfs://YOUR_CONTAINER@YOUR_STORAGE_ACCOUNT.dfs.core.windows.net/warehouse
history.dir = abfs://YOUR_CONTAINER@YOUR_STORAGE_ACCOUNT.dfs.core.windows.net/log/history
history.export = abfs://YOUR_CONTAINER@YOUR_STORAGE_ACCOUNT.dfs.core.windows.net/log/history-export
Add the following variable to galactica-env.sh:
galactica-env.sh
export HADOOP_OPTIONAL_TOOLS=hadoop-azure
Hadoop >=3.3.1 is required for native support of AzureBlobFileSystem.
In order to provide access to your Azure storage, add a core-site.xml file to the Indexima galactica/conf folder. See the Hadoop documentation for Azure Blob storage for the various authentication setups.
Sample configuration for core-site.xml :
core-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.abfs.impl</name>
<value>org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem</value>
</property>
<property>
<name>fs.azure.account.auth.type.YOUR_STORAGE_ACCOUNT.dfs.core.windows.net</name>
<value>SharedKey</value>
</property>
<property>
<name>fs.azure.account.key.YOUR_STORAGE_ACCOUNT.dfs.core.windows.net</name>
<value>YOUR_ACCESS_KEY</value>
</property>
</configuration>
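To check the abfs access end to end before starting Indexima, you can list the container with the Hadoop client. This is a minimal sketch assuming the hadoop command is on the PATH and that /path/to/galactica/conf is the folder containing the core-site.xml above (adjust the path to your installation):
export HADOOP_CONF_DIR=/path/to/galactica/conf   # folder containing core-site.xml
export HADOOP_OPTIONAL_TOOLS=hadoop-azure
hadoop fs -ls abfs://YOUR_CONTAINER@YOUR_STORAGE_ACCOUNT.dfs.core.windows.net/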
Azure ADLS Gen1
Indexima can use Azure ADLS Gen1 storage. Customize your galactica.conf as in the following sample:
galactica.conf
warehouse = adl://YOUR_SERVICE_NAME.azuredatalakestore.net/indexima/warehouse
history.dir = adl://YOUR_SERVICE_NAME.azuredatalakestore.net/indexima/log/history
history.export = adl://YOUR_SERVICE_NAME.azuredatalakestore.net/indexima/log/history-export
Add the following variable to galactica-env.sh:
galactica-env.sh
export HADOOP_OPTIONAL_TOOLS=hadoop-azure
Hadoop >=3.1.4 is advised for ADLS integration, as Hadoop 2 does not embed the hadoop-azure module.
For Hadoop 2, please add hadoop-azure.jar and its dependencies manually to the share/hadoop/common folder.
In order to provide access to your Azure storage, add a core-site.xml file to the Indexima galactica/conf folder. See the Hadoop documentation for ADLS Gen1 for the various authentication setups.
Sample configuration for core-site.xml :
core-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.adl.oauth2.client.id</name>
<value>YOUR_AZURE_APPLICATION_CLIENT_ID</value>
</property>
<property>
<name>fs.adl.oauth2.access.token.provider.type</name>
<value>ClientCredential</value>
</property>
<property>
<name>fs.adl.oauth2.refresh.url</name>
<value>https://login.microsoftonline.com/YOUR_AZURE_APPLICATION_TENANT_ID/oauth2/token</value>
</property>
<property>
<name>fs.adl.oauth2.credential</name>
<value>YOUR_AZURE_APPLICATION_SECRET</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>adl://YOUR_AZURE_DATALAKE_NAME.azuredatalakestore.net</value>
</property>
<property>
<name>fs.adl.impl</name>
<value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.adl.impl</name>
<value>org.apache.hadoop.fs.adl.Adl</value>
</property>
</configuration>
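As with ADLS Gen2, you can validate the configuration with the Hadoop client before starting the cluster. This is a minimal sketch assuming the hadoop command is on the PATH, the Azure module is on the classpath (HADOOP_OPTIONAL_TOOLS set as above), and /path/to/galactica/conf contains the core-site.xml above:
export HADOOP_CONF_DIR=/path/to/galactica/conf   # folder containing core-site.xml
hadoop fs -ls adl://YOUR_SERVICE_NAME.azuredatalakestore.net/indexima/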
GCP Cloud Storage
Indexima can use GCP Cloud Storage. Customize your galactica.conf as in the following sample:
galactica.conf
warehouse = gs://bucket-name/path/to/warehouse
You need to set Google credentials for this to work. The only supported way is to use a credentials.json file generated with the GCP IAM service.
For more information about creating credentials for Google Storage access, refer to the official documentation for service-accounts and using-iam-permissions.
In order to use these credentials, the file must be present on every Indexima node. Then add the following line to your galactica-env.sh file:
galactica-env.sh
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
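To make sure the service account behind credentials.json can actually reach the bucket from each node, you can test it with the Google Cloud SDK. This is a minimal sketch assuming gcloud and gsutil are installed (they are not part of Indexima) and that the bucket and path match the galactica.conf sample above:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
# gsutil does not read GOOGLE_APPLICATION_CREDENTIALS directly, so authenticate with the same key file
gcloud auth activate-service-account --key-file="$GOOGLE_APPLICATION_CREDENTIALS"
gsutil ls gs://bucket-name/path/to/warehouse/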
CEPH
Indexima can use CEPH storage with various setups. The standard setup is to use a RADOS gateway to access CEPH through the Hadoop S3A client.
Please refer to the detailed documentation for this setup.
The configuration of conf/galactica.conf needs to contain the warehouse.s3.* and s3.* parameters.
In order to perform ORC or Parquet loads (using the standard Hadoop libraries), you also need to add a core-site.xml configuration file with the fs.s3a.* parameters.
Sample configuration for core-site.xml :
core-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
<name>fs.s3a.endpoint</name>
<value>https://YOUR_CEPH_ENDPOINT</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>YOUR_ACCESS_KEY</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>YOUR_SECRET_KEY</value>
</property>
<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>
<property>
<name>fs.s3a.path.style.access</name>
<value>true</value>
</property>
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>true</value>
</property>
</configuration>
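To validate the RADOS gateway endpoint and credentials outside Indexima, you can query it over the S3 protocol. This is a minimal sketch assuming the AWS CLI is installed (it is not part of Indexima) and that your-bucket is the bucket configured for the warehouse:
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
# path-style access matches fs.s3a.path.style.access=true in core-site.xml
aws s3 ls s3://your-bucket/ --endpoint-url https://YOUR_CEPH_ENDPOINT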
Local / NFS
Indexima can use a network file system mounted locally as shared storage. You need to mount the shared disk (NFS) so that each node can read and write the warehouse. To do so, use the following configuration in the galactica.conf file.
galactica.conf
warehouse = /path/to/warehouse
In this example, all nodes must mount the disk to the same /path/to/warehouse path.
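For reference, a minimal NFS mount on each node could look like the following sketch, assuming nfs-server:/export/indexima is the exported share (adjust the server, export, and mount options to your environment):
# on every Indexima node, mount the share at the same local path
sudo mkdir -p /path/to/warehouse
sudo mount -t nfs nfs-server:/export/indexima /path/to/warehouse
# make the mount persistent across reboots
echo "nfs-server:/export/indexima /path/to/warehouse nfs defaults 0 0" | sudo tee -a /etc/fstab
# verify the user running Indexima can write
touch /path/to/warehouse/_write_test && rm /path/to/warehouse/_write_test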