Integrating Files using VFS

(new in SQLstream s-Server version 7.2.1)

This topic contains the following subtopics:

Introduction

The File-VFS source plugin is based on the Apache Commons VFS library, which provides support for multiple sources and compression formats. Currently, SQLstream's FILE-VFS ECDA plugin supports:

  • Reading files from different locations such as Local File System, HDFS, S3 or SFTP.
  • Reading data in Gzip/compressed and uncompressed formats.

Foreign Stream Options for Reading from Files using VFS

Option Description Default
FILE_LOCATION (Mandatory). Represents the URI of target directory containing the files to be read - for example file:///tmp/newdirectory represents a directory on the local file system. For more information see URI Formats Supported by FILE-VFS.
FILENAME_PATTERN (Mandatory). A Java regular expression match filter selecting which files to be read from the FILE_LOCATION directory. When the SORT_FIELD option has the value TIME_IN_FILENAME or LEXICAL_IN_FILENAME the first capture group in the regex will be used to identify the sub-regex on which the files will be sorted.
Example:
buses_(\d{14})_[a-z]+.* - files matching this regex will be picked and the group \d{14} (14 digits) will be used to sort files.
FILE_TYPE The type of file you wish to read. Supported options:
  • none : for uncompressed files [Default]
  • gzip : for GZip compressed files
none
SORT_FIELD The file sorting mechanism you wish to use:
  • MODIFIED_FILE_TIME: Sort the files on the basis of the file’s modified time. If two files have the same modified time the files will be sorted lexicographically [Default]
  • TIME_IN_FILENAME: Sort the files on the basis of the time mentioned in the file name. The first group of the regex (as mentioned in FILENAME_PATTERN) will be used to isolate the timestamp in the filename and FILENAME_TIME_FORMAT will be used to specify the format of the timestamp. If two files have the same time the files will be sorted lexicographically
  • LEXICAL_IN_FILENAME: Sort the files on the basis of a substring in the file name. The first group of the regex (as mentioned in FILENAME_PATTERN) will be used to isolate the substring in the filename.
See file-based T-sort for more information.
MODIFIED_FILE_TIME
FILENAME_TIME_FORMAT The format of the timestamp in case the SORT_FIELD option is set to TIME_IN_FILENAME.
  • Note: This option is mandatory while using TIME_IN_FILENAME
STARTING_FILE_NAME_AND_OFFSET Reads will start with the given starting file name and its offset value when SORT_FIELD option is set to LEXICAL_IN_FILENAME. The default value EARLIEST means all files will be picked. For example:
  • abc.txt:20 will read from line 20 in file abc.txt
  • abc.txt will read from the start of file abc.txt
NOTE: Ensure the starting file exists when STARTING_FILE_NAME_AND_OFFSET option is used.
EARLIEST
STARTING_TIME The starting time in case the SORT_FIELD option is set to TIME_IN_FILENAME - in the format yyyy-MM-dd HH:mm:ss.SSS. 1752-09-14 00:00:00.000.
STARTING_POSITION The starting position value for file read in case the option SORT_FIELD is set to MODIFIED_FILE_TIME. The format for STARTING_POSITION is modified_time_epoch[:file_name[:offset]] - file_name and offset are optional.
For example,
  • 1609423921 will pick all files with modified time matching epoch 1609423921 or later
  • 1609423921:abc.txt will ignore modified time and will pick file abc.txt or later
  • 1609423921:abc.txt:10 will ignore modified time and will pick file abc.txt and start from line number 11 (line number starts from 0)
1752-09-14 00:00:00.000
INGRESS_DELAY_INTERVAL The period (in milliseconds) for file-based T-sort to wait for late files - applies only if SORT_FIELD is set to MODIFIED_FILE_TIME or TIME_IN_FILENAME. Valid values are:
-1disable file-based t-sort completely
0 Do not wait - late files will be dropped (the default)
>= 1wait for this interval in milliseconds
0
INGRESS_FILE_SCAN_WAIT The duration (in milliseconds) in which the reader thread checks the Queue for new files to be read. 2000.
IGNORE_CORRUPT_FILES In case of any fatal IOExceptions that may come up while reading from a file, should the VFS reader should ignore the file? If true, the corrupt file is skipped else the pump is crashed False
NUMBER_OF_BUFFERS The number of buffers the File-VFS plugin can use to read data before it needs to recycle the buffers. Increasing the number of buffers might increase the plugin’s performance at the cost of memory. 2

Depending on the settings of SORT_FIELD and INGRESS_DELAY_INTERVAL, the file-based T-sort feature may come into operation.

File system specific options

Option File System Default Description
vfs.hdfs.kerberos.enabled HDFS 'false' In order to read from HDFS with kerberos authentication, the value for this option should be 'true'
vfs.hdfs.kerberos.keytab HDFS no default value In order to read from HDFS with kerberos authentication, the user must provide the location of the keytab here as a string
vfs.hdfs.kerberos.principal HDFS no default value In order to read from HDFS with kerberos authentication, the user must provide the value of the principal here as a string
vfs.s3.useHttps S3 'true' In order to read from s3 or minio over HTTP (and not HTTPS) the value for this option should be 'false'

Provenance Columns

Option Data type Description
SQLSTREAM_PROV_FILE_SOURCE_FILE VARCHAR Adds the name of the file being read to the output columns
SQLSTREAM_PROV_TEXT_PARSE_LINE_NUMBER BIGINT Adds the current line number to the output columns. Line number starts from 0.
SQLSTREAM_PROV_FILE_MODIFIED_TIME BIGINT Adds the modified time of the file being read to the output columns.

URI Formats Supported by FILE-VFS

The file systems that are currently supported by this plugin, and their URI formats to be used in specifying the FILE_LOCATION option are as follows:

Local File System

For reading from the local file system, the URI format is file:// followed by the absolute path of the directory from which the files have to be read:

file:///absolute/path/to/directory
  • When the file:// prefix is used, the leading / of the absolute path makes a total of 3 forward slash characters.

HDFS

For reading files from HDFS, the URI format is:

hdfs://hostname[:port][absolute-path]

A sample URI is hdfs://sample-mst-02.cloud.in.com:8020/sample-absolute-path/data.

S3

For reading files from over S3, the URI format is:

s3://[Access-Key]:[Secret-Key]@Endpoint-uri-containing-bucket-name-and-region/prefix-for-objects

SFTP

For reading files from a remote file system over SFTP, the URI format is:

sftp://username[:password]@hostname[:port]relative-path 
  1. Password is optional - reading files over SFTP requires either a password in the URI or the .ssh/id_rsa file in the SQLstream user’s home directory (such as /home/sqlstream/) for authentication. In the second case the RSA private key is being used for authentication. Ensure that the user sqlstream has read permissions on .ssh/id_rsa.
  2. Port is optional; if port is not provided, port 22 will be used.
  3. The components of the URI must not contain HTML reserved characters. If they do, they should be percent-encoded as mentioned in the following table.
  4. The directory’s path (on the remote machine) is provided as the relative-path in the URL (relative to the remote user's home directory). To read from /root/inputsource with the username root, the relative path becomes /inputsource since /root is the root user’s home directory.

Encoding HTML reserved characters in the URL

The components of the URI must not contain any HTML reserved characters. In case they do, they should be percent-encoded as shown in the table below. Hence if your password is root\@1, it should be encoded as root%401.

Reserved character Encoded value
Line Feed %0A
Carriage Return %0D
space %20
# %23
$ %24
% %25
& %26
/ %2F
: %3A
; %3B
< %3C
> %3E
? %3F
@ %40
[ %5B
\ %5C
^ %5E
] %5D
` %60
{ %7B
| %7C
} %7D
~ %7E

Example

If:

    user = root,
    password = root@123,
    hostname = sample-mst-01.cloud.in.guavus.com
    path =  /root/inputsource

the URI becomes:

    sftp://root:root%40123@sample-mst-01.cloud.in.guavus.com/inputsource

Sample SQL

CREATE OR REPLACE FOREIGN STREAM "7s"."fs1"
(
    "start-time" VARCHAR(32),
    "end-time" VARCHAR(32),
    "duration" DECIMAL,
    "src-addr" VARCHAR(64),
    "dst-addr" VARCHAR(64),
    "protocol" VARCHAR(32),
    "src-port" INTEGER,
    "dst-port" INTEGER,
    "tcp-flags" VARCHAR(6),
    "packets" BIGINT,
    "bytes" BIGINT,
    "sensor" VARCHAR(64),
    "in-if" INTEGER,
    "out-if" INTEGER,
    "log-time" VARCHAR(32),
    "src-asn" BIGINT,
    "dst-asn" BIGINT,
    "flow-src" VARCHAR(32),
    "SQLSTREAM_PROV_FILE_SOURCE_FILE" VARCHAR(128),
    "SQLSTREAM_PROV_TEXT_PARSE_LINE_NUMBER" BIGINT,
    "SQLSTREAM_PROV_FILE_MODIFIED_TIME" BIGINT
)
    SERVER "FILE_VFS_SERVER"
OPTIONS (
    "CHARACTER_ENCODING" 'UTF-8',
    "FILE_LOCATION" '/home/sqlstream/input',
    "FILENAME_PATTERN" '.*(\d{14}).*\.csv',
    "SEPARATOR" ',',
    "ROW_SEPARATOR" u&'\000A',
    "PARSER" 'CSV',
    "FILE_TYPE" 'none',
    "SKIP_HEADER" 'false',
    "SORT_FIELD" 'MODIFIED_FILE_TIME',
    "STARTING_POSITION" '1600682400:SPGWPH5_wap_adult-http-edr0_2020-12-25-14-00.gz:300000',
    "INGRESS_DELAY_INTERVAL" '60000'
);