LOAD DATA (Files Load)
The LOAD DATA command loads raw data from files into Indexima tables.
Syntax
Load data from files on a local disk or a datalake
LOAD DATA [LOCAL]
INPATH '<protocol>://<absolute_path_to_data>'
INTO TABLE <table_name>
FORMAT CSV|ORC|PARQUET|JSON
SEPARATOR ',|;|\t'
[SKIP <number_of_rows_to_skip>]
[LIMIT <limit_number>]
[PATTERN <file_pattern>]
[DPATTERN <folder_pattern>]
[PREFIX <prefix>]
[SUFFIX <suffix>]
[DATE_FORMAT <date_format>]
[TIMESTAMP_FORMAT <timestamp_format>]
[LOCALE <locale_identifier>]
[NOCHECK]
Parameters
Parameter name | Description | Possible values |
---|---|---|
LOCAL | If used, this keyword tells the Master node to read data from its own filesystem. | Empty or LOCAL |
protocol | The type of datalake you want to load data from. | s3, adl, gs, hdfs |
absolute_path_to_data | The absolute path to a folder containing all the files you want to load data from. If LOCAL is specified, only the Master node needs to have read access to this path. | Any absolute path readable by the Indexima nodes. |
table_name | The name of the Indexima table you want to load the data into. | The exact name of the Indexima table you want to load data into. The structure of the files must match the structure of the table. |
number_of_rows_to_skip | The number of rows to skip before starting the load, typically the first one or two rows, which are usually header rows. | Any positive number. |
limit_number | The maximum number of rows that will be read from each file. | Any positive number. |
file_pattern | Only files whose name begins with the defined pattern will be loaded. Must end with a star. | |
folder_pattern | Loads all matching files contained in <absolute_path_to_data>. Use "*" (star) as the wildcard and "/" for subfolders. | |
prefix | See "Load partitions from Hive table" below. | |
suffix | Supported only with the CSV format; same concept as PREFIX, applied to missing columns at the end of each line. | |
DATE_FORMAT | Specifies the format used to parse the dates in the input file. | |
TIMESTAMP_FORMAT | Specifies the format used to parse the timestamps in the input file, e.g. TIMESTAMP_FORMAT 'dd/MM/yyyy HH\|mm\|ss'. The pattern must comply with Java DateTimeFormatter (https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html). | |
LOCALE | The locale identifier used to parse the input file. It describes the expected format for decimal fields, e.g. LOCALE 'fr'. | |
NOCHECK | Without this option, Indexima checks whether loaded values are compatible with the datatypes set when creating the table and raises an error when they do not match. With NOCHECK, this check is skipped. | |
Load Example
LOAD DATA
INPATH 'mypath'
INTO TABLE myschema.mytable
FORMAT CSV
SEPARATOR ','
SKIP 1
LIMIT 10
PATTERN 'Mypattern*'
PREFIX 'myprefix'
DATE_FORMAT 'dd_MM_yyyy'
TIMESTAMP_FORMAT 'dd/MM/yyyy HH|mm|ss'
NOCHECK
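The LOCAL keyword does not appear in the example above. As a minimal sketch (the path and table name are illustrative; a plain absolute path on the Master node's filesystem is assumed, as in the LOCALE example further below), a local load could look like:
LOAD DATA LOCAL
INPATH '/home/indexima/export/data.csv'
INTO TABLE myschema.mytable
FORMAT CSV
SEPARATOR ';'
SKIP 1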
Load with comma as decimal separator in input file
lang-en.csv
1;10.05;hello
lang-fr.csv
2;12,98;salut
Commands
CREATE SCHEMA tmp;
CREATE TABLE tmp.t1(i int, d double, s string, main index idx_0(i));
LOAD DATA INPATH '/home/csv/lang-en.csv' INTO TABLE tmp.t1 FORMAT CSV SEPARATOR ';' LOCALE 'en';
LOAD DATA INPATH '/home/csv/lang-fr.csv' INTO TABLE tmp.t1 FORMAT CSV SEPARATOR ';' LOCALE 'fr';
COMMIT tmp.t1;
SELECT * FROM tmp.t1;
Result
i | d | s |
---|---|---|
1 | 10.05 | hello |
2 | 12.98 | salut |
Load partitions from Hive table
If you want to load a specific set of partitions rather than all partitions of a Hive table at once, provide the path of the partition folder and add PREFIX 'column_values' to supply the values of the partition columns (these values are not stored in the data files themselves, so they are prepended to each loaded row).
LOAD DATA INPATH 'hdfs://data_node:8020/apps/hive/warehouse/my_imported_data_orc/year=2018/month=12/day=10'
INTO TABLE my_test
FORMAT orc
PREFIX '2018,12,10,'
If the Hive table is stored in a tab-separated format, the separator must be \t, so the clause becomes PREFIX '2018\t12\t10\t'.
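As a sketch of the tab-separated case (the path below is hypothetical, and it assumes the tab-separated Hive data is read as CSV with a tab separator), the full statement could look like:
LOAD DATA INPATH 'hdfs://data_node:8020/apps/hive/warehouse/my_imported_data_tsv/year=2018/month=12/day=10'
INTO TABLE my_test
FORMAT CSV
SEPARATOR '\t'
PREFIX '2018\t12\t10\t'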
Load Example with a suffix
Add SUFFIX 'column_values' to specify the values that will complete the CSV line when it contains fewer columns than expected (as defined by the column list in the table creation).
LOAD DATA INPATH 'hdfs://data_node:8020/apps/hive/warehouse/myORCtable'
INTO TABLE my_test
FORMAT CSV
SUFFIX 'XX,YY'
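As an illustration (assuming my_test has four columns and the files are comma-separated), an input line such as
1,foo
is loaded as if it read
1,foo,XX,YY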
Load Example with DPATTERN
Consider the following tree view
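For example, assume a layout like the one below (the file names and per-file line counts are illustrative assumptions, chosen to be consistent with the results of the statements that follow):
s3a://mybucket/dparttern_test/
├── folder1/
│   ├── lines_a.csv (2 lines)
│   ├── lines_b.csv (1 line)
│   └── other.csv (1 line)
└── folder1_sub1/
    └── lines_c.csv (2 lines)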
Here is a list of examples with the expected behavior.
-- All files within s3a://mybucket/dparttern_test/ --> inserts 6 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN '**';
-- All files within any subfolder whose name begins with folder1_sub --> inserts 2 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN '**/folder1_sub*';
-- All files within any subfolder whose name begins with folder1_sub AND whose name begins with lines --> inserts 2 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN '**/folder1_sub*' PATTERN 'lines*';
-- All files within any folder named folder1 AND whose name begins with lines --> inserts 3 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN 'folder1/**' PATTERN 'lines*';
-- All files within any folder whose name begins with folder1 AND whose name begins with lines --> inserts 5 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN 'folder1*/**' PATTERN 'lines*';
-- All files within any subfolder whose name contains sub --> inserts 2 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN '**/*sub*';