Skip to main content
Skip table of contents

LOAD DATA (Files Load)

The LOAD DATA command allows you to write raw data into Indexima tables.

Syntax

Load Data from local disk

SQL
LOAD DATA [LOCAL] 
INPATH '<protocol>://<absolute_path_to_data>'
INTO TABLE <table_name> 
FORMAT CSV|ORC|PARQUET|JSON
SEPARATOR ',|;|\t' 
[SKIP <number_of_row_to_skip> ]
[LIMIT <limit_number>]
[PATTERN <file_pattern>]
[DPATTERN <folder_pattern>]
[PREFIX <prefix>]
[SUFFIX <suffix>]
[DATE_FORMAT <date_format>]
[TIMESTAMP_FORMAT <timestamp_format>]
[LOCALE <locale_identifier>]
[NOCHECK]

Parameters

Parameter nameDescriptionPossible values

LOCAL

If used, this keyword tells the Master node to read data from his own filesystem.
When not specified, every node must be able to read the data located in <absolute_path_to_data>
When specified, <protocol> is not mandatory.

Empty | LOCAL

protocol

The type of datalake you want to load data from.s3 | adl | gs | hdfs
absolute_path_to_dataThe absolute path to a folder containing all the files you want to load data from. If LOCAL is specified, only the Master node needs to have read access to this path.Any absolute path is readable by the Indexima nodes.
table_nameThe name of the Indexima table you want to load the data into.The exact name of the Indexima table you want to load data into. The structure of the files must match the structure of the table.
number_of_row_to_skipThe number of rows to skip before starting the load. This is used to skip the first or the 2 first rows, which are usually header rows.Any positive number.
limit_numberMaximum Number of rows that will be read from each file.Any positive number.
file_patternOnly files beginning with the defined pattern would be loaded. Must end with a star.
folder_pattern

load all files contained in <absolute_path_to_data>: Use "*" (star) as the wilcard and "/" for subfolder

  • DPATTERN '**' : Any folder and sub-folder within <absolute_path_to_data> (this option is equivalent to skip the use of DPATTERN)
  • DPATTERN '**/folder1_sub*' : any sub-folder within <absolute_path_to_data> with the name beginning with folder1_sub
  • DPATTERN 'folder1/**': Any folder within <absolute_path_to_data>with the name beginning with folder1

prefixsee below in "Load partitions from Hive table".
Suffixsupports only CSV format, same concept as prefix applied for missing column at the end of each line.
DATE_FORMAT

Specify a format to parse the dates in input file.
Eg: DATE_FORMAT 'dd_MM_yyyy'
The pattern must comply to Java DateTimeFormatter (https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html)


TIMESTAMP_FORMATSpecify a format to parse the timestamps in input file.
Eg: TIMESTAMP_FORMAT 'dd/MM/yyyy HH|mm|ss' 
The pattern must comply to Java DateTimeFormatter (https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html)

LOCALEThe local identifier used to parse the input file. This identifier describe the expected format for decimal fields.
Eg: LOCALE 'fr' 

NoCheck

Without this option, INDEXIMA checks if loaded values are compatible with the datatype set when creating the table. INDEXIMA triggers errors when datatypes do not match.
Specifying NOCHECK skips this test and improve dramatically the performance of the loading. This is a good practice to run once LOAD DATA with data check enable. Then NOCHECK can be set for recurrent data loading to speed up the load process.


Please do not confuse the LOCAL parameter and the LOCALE parameter.

Load Example

SQL
LOAD DATA 
INPATH 'mypath'
INTO TABLE myschema.mytable
FORMAT CSV
SEPARATOR ',' 
SKIP 1
LIMIT 10
PATTERN 'Mypattern*'
PREFIX 'myprefix'
DATE_FORMAT 'dd_MM_yyyy'
TIMESTAMP_FORMAT 'dd/MM/yyyy HH|mm|ss' 
NOCHECK

Load with comma as decimal separator in input file

lang-en.csv

TEXT
1;10.05;hello

lang-fr.csv

TEXT
2;12,98;salut

Commands

SQL
CREATE SCHEMA tmp;
CREATE TABLE tmp.t1(i int, d double, s string, main index idx_0(i));

LOAD DATA IN PATH '/home/csv/lang-en.csv' into table tmp.t1 format CSV SEPARATOR ';' LOCALE 'en';
LOAD DATA IN PATH '/home/csv/lang-fr.csv' into table tmp.t1 format CSV SEPARATOR ';' LOCALE 'fr';
COMMIT tmp.t1;
SELECT * FROM tmp.t1;

Result

TEXT
d		s
10.05	hello
12.98	salut

Load partitions from Hive table

If you want to load a specified set of partitions and not all partitions from a Hive table at once, then provide a path of the root partition and add PREFIX 'column_values' to specify the values for the partition columns.

SQL
LOAD DATA INPATH 'hdfs://data_node:8020/apps/hive/warehouse/my_imported_data_orc/year=2018/month=12/day=10'
INTO TABLE my_test 
FORMAT orc 
PREFIX '2018,12,10,'

If the Hive table is in tab-separated format, the separator must be \t so the statement looks like PREFIX '2018\t12\t10\t'.

Load Example with a suffix

Add SUFFIX 'column_values' to specify the values that will finish the CSV line when there is fewer columns than expected (defined by the list of columns in the table creation).

SQL
LOAD DATA INPATH 'hdfs://data_node:8020/apps/hive/warehouse/myORCtable'
INTO TABLE my_test 
FORMAT CSV
SUFFIX 'XX,YY'

Load Example with DPATTERN

Consider the following tree view

Here is a list of examples with the expected behavior.

SQL
-- All files within s3a://mybucket/dparttern_test/ --> Insert 6 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN '**'; 

--All files within any subfolders with a name beginning with folder1_sub --> Insert 2 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN '**/folder1_sub*';

--All files within the any subfolders with a name beginning with folder1_sub AND files names beginning with lines --> Insert 2 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN '**/folder1_sub*' PATTERN 'lines*';

--All files within any folders with THE name folder1 AND files names beginning with lines --> Insert 3 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN 'folder1/**' PATTERN 'lines*';

--All files within any folders with a name beginning with folder1 AND files names beginning with lines --> Insert 5 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN 'folder1*/**' PATTERN 'lines*';

--All files within any subfolders with a name containing with 'sub' --> Insert 2 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN '**/*sub*';
JavaScript errors detected

Please note, these errors can depend on your browser setup.

If this problem persists, please contact our support.