LOAD DATA (Files Load)
The LOAD DATA command allows you to write raw data into Indexima tables.
Syntax
Load Data from local disk
LOAD DATA [LOCAL]
INPATH '<protocol>://<absolute_path_to_data>'
INTO TABLE <table_name>
FORMAT CSV|ORC|PARQUET|JSON
SEPARATOR ',|;|\t'
[SKIP <number_of_row_to_skip> ]
[LIMIT <limit_number>]
[PATTERN <file_pattern>]
[DPATTERN <folder_pattern>]
[PREFIX <prefix>]
[SUFFIX <suffix>]
[NOCHECK]
Parameters
Parameter name | Description | Possible values |
---|---|---|
LOCAL | If used, this keyword tells the Master node to read data from his own filesystem. | Empty | LOCAL |
protocol | The type of datalake you want to load data from. | s3 | adl | gs | hdfs |
absolute_path_to_data | The absolute path to a folder containing all the files you want to load data from. If LOCAL is specified, only the Master node needs to have read access to this path. | Any absolute path is readable by the Indexima nodes. |
table_name | The name of the Indexima table you want to load the data into. | The exact name of the Indexima table you want to load data into. The structure of the files must match the structure of the table. |
number_of_row_to_skip | The number of rows to skip before starting the load. This is used to skip the first or the 2 first rows, which are usually header rows. | Any positive number. |
limit_number | Maximum Number of rows that will be read from each file. | Any positive number. |
file_pattern | Only files beginning with the defined pattern would be loaded. Must end with a star. | |
folder_pattern | load all files contained in <absolute_path_to_data>: Use "*" (star) as the wilcard and "/" for subfolder
| |
prefix | see below in "Load partitions from Hive table". | |
Suffix | supports only CSV format, same concept as prefix applied for missing column at the end of each line. | |
NoCheck | Without this option, INDEXIMA checks if loaded values are compatible with the datatype set when creating the table. INDEXIMA triggers errors when datatypes do not match. |
Load Example
Example
LOAD DATA
INPATH 'mypath'
INTO TABLE myschema.mytable
FORMAT CSV
SEPARATOR ','
SKIP 1
LIMIT 10
PATTERN 'Mypattern*'
PREFIX 'myprefix'
NOCHECK
Load partitions from Hive table
If you want to load a specified set of partitions and not all partitions from a Hive table at once, then provide a path of the root partition and add PREFIX 'column_values' to specify the values for the partition columns.
Load Data from JDBC
LOAD DATA INPATH 'hdfs://data_node:8020/apps/hive/warehouse/my_imported_data_orc/year=2018/month=12/day=10'
INTO TABLE my_test
FORMAT orc
PREFIX '2018,12,10,'
If the Hive table is in tab-separated format, the separator must be \t so the statement looks like PREFIX '2018\t12\t10\t'.
Load Example with a suffix
Add SUFFIX 'column_values' to specify the values that will finish the CSV line when there is fewer columns than expected (defined by the list of columns in the table creation).
Load Data from JDBC using suffix
LOAD DATA INPATH 'hdfs://data_node:8020/apps/hive/warehouse/myORCtable'
INTO TABLE my_test
FORMAT CSV
SUFFIX 'XX,YY'
Load Example with DPATTERN
Consider the following tree view
Here is a list of examples with the expected behavior.
DPATTERN examples
-- All files within s3a://mybucket/dparttern_test/ --> Insert 6 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN '**';
--All files within any subfolders with a name beginning with folder1_sub --> Insert 2 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN '**/folder1_sub*';
--All files within the any subfolders with a name beginning with folder1_sub AND files names beginning with lines --> Insert 2 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN '**/folder1_sub*' PATTERN 'lines*';
--All files within any folders with THE name folder1 AND files names beginning with lines --> Insert 3 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN 'folder1/**' PATTERN 'lines*';
--All files within any folders with a name beginning with folder1 AND files names beginning with lines --> Insert 5 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN 'folder1*/**' PATTERN 'lines*';
--All files within any subfolders with a name containing with 'sub' --> Insert 2 lines
LOAD DATA INPATH 's3a://mybucket/dparttern_test/' INTO TABLE tmp_sco.tbl1 FORMAT CSV SEPARATOR ',' DPATTERN '**/*sub*';