Integrating S3 Files

(new in SQLstream s-Server version 7.2.1)

This topic contains the following subtopics:

  • Introduction
  • Foreign Stream Options for Reading from Files using VFS
  • File system specific options
  • Provenance Columns
  • FILE_LOCATION
  • Encoding reserved characters in the URL
  • Sample SQL

Introduction

This topic describes how to use the S3 source plugin to read compressed and uncompressed files from Amazon S3, or from other cloud or on-premises S3 API-compatible sources such as MinIO.

Foreign Stream Options for Reading from Files using VFS

FILE_LOCATION (mandatory)
The URI of the target directory containing the files to be read - for example, file:///tmp/newdirectory represents a directory on the local file system. For more information see URI Formats Supported by FILE-VFS and the FILE_LOCATION section below.
No default.

FILENAME_PATTERN (mandatory)
A Java regular expression used to select which files are read from the FILE_LOCATION directory. When the SORT_FIELD option is set to TIME_IN_FILENAME or LEXICAL_IN_FILENAME, the first capture group in the regex identifies the substring on which files are sorted.
Example: buses_(\d{14})_[a-z]+.* - files matching this regex are picked, and the capture group (\d{14}) (14 digits) is used to sort them.
No default.

FILE_TYPE
The type of file to be read. Supported values:
  • none : uncompressed files
  • gzip : GZip-compressed files
Default: none

SORT_FIELD
The file sorting mechanism to use:
  • MODIFIED_FILE_TIME : sort files by their modified time. If two files have the same modified time they are sorted lexicographically.
  • TIME_IN_FILENAME : sort files by a timestamp embedded in the file name. The first capture group of the FILENAME_PATTERN regex isolates the timestamp, and FILENAME_TIME_FORMAT specifies its format (see the example below). If two files have the same time they are sorted lexicographically.
  • LEXICAL_IN_FILENAME : sort files by a substring of the file name. The first capture group of the FILENAME_PATTERN regex isolates the substring.
See file-based T-sort for more information.
Default: MODIFIED_FILE_TIME

FILENAME_TIME_FORMAT
The format of the timestamp in the file name when the SORT_FIELD option is set to TIME_IN_FILENAME. This option is mandatory when using TIME_IN_FILENAME.
No default.

STARTING_FILE_NAME_AND_OFFSET
When SORT_FIELD is set to LEXICAL_IN_FILENAME, reading starts from the given file name and offset. The default value EARLIEST means all files are picked. For example:
  • abc.txt:20 reads from line 20 in file abc.txt
  • abc.txt reads from the start of file abc.txt
NOTE: Ensure the starting file exists when the STARTING_FILE_NAME_AND_OFFSET option is used.
Default: EARLIEST

STARTING_TIME
The starting time when the SORT_FIELD option is set to TIME_IN_FILENAME, in the format yyyy-MM-dd HH:mm:ss.SSS.
Default: 1752-09-14 00:00:00.000

STARTING_POSITION
The starting position for file reads when the SORT_FIELD option is set to MODIFIED_FILE_TIME. The format is modified_time_epoch[:file_name[:offset]] - file_name and offset are optional. For example:
  • 1609423921 picks all files with a modified time of epoch 1609423921 or later
  • 1609423921:abc.txt ignores the modified time and picks file abc.txt or later
  • 1609423921:abc.txt:10 ignores the modified time, picks file abc.txt and starts from line number 11 (line numbers start from 0)
Default: 1752-09-14 00:00:00.000

INGRESS_DELAY_INTERVAL
The period (in milliseconds) for which file-based T-sort waits for late files - applies only if SORT_FIELD is set to MODIFIED_FILE_TIME or TIME_IN_FILENAME. Valid values are:
  • -1 : disable file-based T-sort completely
  • 0 : do not wait - late files are dropped (the default)
  • >= 1 : wait for this interval in milliseconds
Default: 0

INGRESS_FILE_SCAN_WAIT
The interval (in milliseconds) at which the reader thread checks the queue for new files to be read.
Default: 2000

IGNORE_CORRUPT_FILES
Whether the VFS reader should ignore a file if a fatal IOException occurs while reading it. If true, the corrupt file is skipped; otherwise the pump crashes.
Default: false

NUMBER_OF_BUFFERS
The number of buffers the File-VFS plugin can use to read data before it needs to recycle them. Increasing the number of buffers may improve the plugin's performance at the cost of memory.
Default: 2

Depending on the settings of SORT_FIELD and INGRESS_DELAY_INTERVAL, the file-based T-sort feature may come into operation.
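
The following is a minimal sketch of how these options combine when files are sorted on a timestamp embedded in the file name. The schema, stream and column names, the local directory, and the file naming convention (files such as buses_20210101120000_gps.csv) are illustrative assumptions, not part of the product documentation; only the VFS-related options matter here.

-- Hypothetical foreign stream sorting files by a timestamp embedded in the file name.
-- The first regex capture group (\d{14}) isolates the timestamp, FILENAME_TIME_FORMAT
-- tells the plugin how to parse it, and STARTING_TIME skips files older than the given time.
CREATE OR REPLACE FOREIGN STREAM "sample"."buses_fs"
(
    "bus_id" VARCHAR(32),
    "lat"    DOUBLE,
    "lon"    DOUBLE
)
    SERVER "FILE_VFS_SERVER"
OPTIONS (
    "PARSER" 'CSV',
    "CHARACTER_ENCODING" 'UTF-8',
    "SEPARATOR" ',',
    "SKIP_HEADER" 'false',
    "FILE_LOCATION" 'file:///tmp/buses',
    "FILENAME_PATTERN" 'buses_(\d{14})_[a-z]+\.csv',
    "SORT_FIELD" 'TIME_IN_FILENAME',
    "FILENAME_TIME_FORMAT" 'yyyyMMddHHmmss',
    "STARTING_TIME" '2021-01-01 00:00:00.000',
    "INGRESS_DELAY_INTERVAL" '60000'    -- wait up to 60 seconds for late files
);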

File system specific options

vfs.hdfs.kerberos.enabled (HDFS)
Set to 'true' to read from HDFS with Kerberos authentication.
Default: 'false'

vfs.hdfs.kerberos.keytab (HDFS)
The location of the keytab, as a string. Required when reading from HDFS with Kerberos authentication.
No default.

vfs.hdfs.kerberos.principal (HDFS)
The Kerberos principal, as a string. Required when reading from HDFS with Kerberos authentication.
No default.

vfs.s3.useHttps (S3)
Set to 'false' to read from S3 or MinIO over HTTP rather than HTTPS.
Default: 'true'
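
For instance, a foreign stream reading CSV objects from an on-premises MinIO deployment over plain HTTP might set vfs.s3.useHttps as follows. This is only a sketch: the schema, stream and column names are placeholders, and the FILE_LOCATION value must be replaced with a real endpoint, bucket and percent-encoded credentials as described under FILE_LOCATION below.

-- Hypothetical foreign stream reading from MinIO over HTTP rather than HTTPS.
CREATE OR REPLACE FOREIGN STREAM "sample"."minio_fs"
(
    "id"      BIGINT,
    "payload" VARCHAR(256)
)
    SERVER "FILE_VFS_SERVER"
OPTIONS (
    "PARSER" 'CSV',
    "CHARACTER_ENCODING" 'UTF-8',
    "SEPARATOR" ',',
    -- substitute a real endpoint/bucket and percent-encoded credentials
    "FILE_LOCATION" 's3://[Access-Key]:[Secret-Key]@[Endpoint-uri-containing-bucket-name]/data',
    "FILENAME_PATTERN" '.*\.csv',
    "SORT_FIELD" 'MODIFIED_FILE_TIME',
    "vfs.s3.useHttps" 'false'           -- connect over HTTP, not HTTPS
);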

Provenance Columns

Column Data type Description
SQLSTREAM_PROV_FILE_SOURCE_FILE VARCHAR Adds the name of the file being read to the output columns.
SQLSTREAM_PROV_TEXT_PARSE_LINE_NUMBER BIGINT Adds the current line number to the output columns. Line numbers start from 0.
SQLSTREAM_PROV_FILE_MODIFIED_TIME BIGINT Adds the modified time of the file being read to the output columns.
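
Provenance columns are declared in the foreign stream definition like ordinary columns (as in the Sample SQL at the end of this topic) and can then be selected alongside the data columns. For example, assuming the "sample"."fs1" stream defined in that sample:

-- Report each record together with the file and line it came from.
SELECT STREAM
    "SQLSTREAM_PROV_FILE_SOURCE_FILE"       AS "source_file",
    "SQLSTREAM_PROV_TEXT_PARSE_LINE_NUMBER" AS "line_number",
    "src-addr",
    "bytes"
FROM "sample"."fs1";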

FILE_LOCATION

FILE_LOCATION is the URI of the target bucket containing the files to be read. The following URI styles are currently supported:

With the region included in the endpoint URI:

s3://[Access-Key]:[Secret-Key]@Endpoint-uri-containing-bucket-name-and-region/prefix-for-objects

With the region given as a separate component:

s3://[Access-Key]:[Secret-Key]:[Region]@Endpoint-uri-containing-bucket-name/prefix-for-objects

Given:

  • Access-Key: ABCEDABCEDABCEDABCED
  • Secret-Key: Sd9ADVCfHEIETyXSXU9niFn9/w8fMkN9lcZpUD+y
  • endpoint uri : test-01.s3.us-east-1.amazonaws.com
  • prefix (directory) : multi

The URI would be

s3://ABCEDABCEDABCEDABCED:Sd9ADVCfHEIETyXSXU9niFn9%2Fw8fMkN9lcZpUD%2By@test-01.s3.us-east-1.amazonaws.com/multi

Given:

  • Access-Key:ABCEDABCEDABCEDABCED
  • Secret-Key: Sd9ADVCfHEIETyXSXU9niFn9/w8fMkN9lcZpUD+y
  • endpoint uri : test-01.s3.amazonaws.com
  • Region : us-east-1
  • prefix (directory) : multi

The URI would be

s3://ABCEDABCEDABCEDABCED:Sd9ADVCfHEIETyXSXU9niFn9%2Fw8fMkN9lcZpUD%2By:us-east-1@test-01.s3.amazonaws.com/multi

NOTE: The endpoint URI must not include an http:// or https:// prefix. Whether the connection uses HTTP or HTTPS is controlled by the vfs.s3.useHttps option.

Encoding reserved characters in the URL

The components of the URI must not contain unencoded reserved characters. Where they do occur, they must be percent-encoded as shown in the table below. For example, if your password is root@1, it should be encoded as root%401.

Reserved character Encoded value
Line Feed %0A
Carriage Return %0D
space %20
# %23
$ %24
% %25
& %26
/ %2F
: %3A
; %3B
< %3C
> %3E
? %3F
@ %40
[ %5B
\ %5C
] %5D
^ %5E
` %60
{ %7B
| %7C
} %7D
~ %7E
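
Putting this together, a secret key such as ab/cd+ef (purely illustrative, like the access key, endpoint and stream name below) would be written as ab%2Fcd%2Bef inside FILE_LOCATION:

-- Hypothetical example: the secret key ab/cd+ef is percent-encoded as ab%2Fcd%2Bef in the URI.
CREATE OR REPLACE FOREIGN STREAM "sample"."encoded_fs"
(
    "line" VARCHAR(1024)
)
    SERVER "FILE_VFS_SERVER"
OPTIONS (
    "PARSER" 'CSV',
    "SEPARATOR" ',',
    "FILE_LOCATION" 's3://AKIAIOSFODNN7EXAMPLE:ab%2Fcd%2Bef@my-bucket.s3.us-east-1.amazonaws.com/data',
    "FILENAME_PATTERN" '.*\.csv'
);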

Sample SQL

CREATE OR REPLACE FOREIGN STREAM "sample"."fs1"
(
    "start-time" VARCHAR(32),
    "end-time" VARCHAR(32),
    "duration" DECIMAL,
    "src-addr" VARCHAR(64),
    "dst-addr" VARCHAR(64),
    "protocol" VARCHAR(32),
    "src-port" INTEGER,
    "dst-port" INTEGER,
    "tcp-flags" VARCHAR(6),
    "packets" BIGINT,
    "bytes" BIGINT,
    "sensor" VARCHAR(64),
    "in-if" INTEGER,
    "out-if" INTEGER,
    "log-time" VARCHAR(32),
    "src-asn" BIGINT,
    "dst-asn" BIGINT,
    "flow-src" VARCHAR(32),
    "SQLSTREAM_PROV_FILE_SOURCE_FILE" VARCHAR(128),
    "SQLSTREAM_PROV_TEXT_PARSE_LINE_NUMBER" BIGINT,
    "SQLSTREAM_PROV_FILE_MODIFIED_TIME" BIGINT
)
    SERVER "FILE_VFS_SERVER"
OPTIONS (
    "CHARACTER_ENCODING" 'UTF-8',
    "FILE_LOCATION" 's3://ABCEDABCEDABCEDABCED:Sd9ADVCfHEIETyXSXU9niFn9%2Fw8fMkN9lcZpUD%2By@test-01.s3.us-east-1.amazonaws.com/multi',
    "FILENAME_PATTERN" '.*\.csv',
    "SEPARATOR" ',',
    "ROW_SEPARATOR" u&'\000A',
    "PARSER" 'CSV',
    "FILE_TYPE" 'none',
    "SKIP_HEADER" 'false',
    "SORT_FIELD" 'MODIFIED_FILE_TIME',
    "STARTING_POSITION" '1600682400:sample_2020-12-25-14-00.gz:300000',
    "INGRESS_DELAY_INTERVAL" '60000',
    "vfs.s3.useHttps" 'true'
);