(new in SQLstream s-Server version 7.2.1)
File-based T-Sort is a feature of the File-VFS and S3 source plugins that mimics the T-Sort mechanism in SQLstream for files instead of rows. It relies on a priority queue to sort files by implied timestamp as described below.
|comparison_tuple||Depends on the value of SORT_FIELD. For MODIFIED_FILE_TIME: (last modified time of the file, filename). For TIME_IN_FILENAME: (timestamp extracted from the filename, filename).|
|Thead||The comparison tuple of the file currently at the head of the queue|
|Ttail||The comparison tuple of the file currently at the tail of the queue|
|Tread||The comparison tuple of the last file popped from the queue, i.e. the file that was most recently read or is currently being read|
|Tnew||The comparison tuple of the new file that is a candidate to be added to the queue|
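As an illustration of the comparison tuple, the following Python sketch builds one for each SORT_FIELD value. It is not the plugin's implementation: the function name, the `pattern` and `time_format` parameters, and the choice to express the filename time as minutes since midnight are all assumptions made for this example.

```python
import os
import re
from datetime import datetime


def comparison_tuple(path, sort_field, pattern=r"(\d{4})", time_format="%H%M"):
    """Return a (time, filename) tuple used to order files.

    `pattern` and `time_format` are hypothetical parameters for this sketch;
    they are not options of the File-VFS or S3 plugins.
    """
    name = os.path.basename(path)
    if sort_field == "MODIFIED_FILE_TIME":
        # Last modified time of the file, in seconds since the epoch.
        return (os.path.getmtime(path), name)
    if sort_field == "TIME_IN_FILENAME":
        match = re.search(pattern, name)
        if match is None:
            raise ValueError(f"no timestamp found in {name!r}")
        parsed = datetime.strptime(match.group(1), time_format)
        # Assumption: the filename carries HHMM; use minutes since midnight.
        return (parsed.hour * 60 + parsed.minute, name)
    raise ValueError(f"unsupported SORT_FIELD: {sort_field!r}")


t_a = comparison_tuple("/data/file_0004.csv", "TIME_IN_FILENAME")
t_b = comparison_tuple("/data/file_0012.csv", "TIME_IN_FILENAME")
print(t_a < t_b)  # tuples compare by time first, then by filename
```

Because Python compares tuples element by element, the filename only acts as a tie-breaker when two files carry the same time component.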
The T-Sort mechanism when INGRESS_DELAY_INTERVAL >= 0 is as follows:
Before adding a file to the queue, check that the new file does not precede (have an earlier timestamp/filename than) the file currently being read; that is, Tnew must not be earlier than Tread. A file that fails this check is not added to the queue.
The file at the head of the queue is processed only if there exists a file whose comparison_tuple (time component) is greater than or equal to the comparison_tuple (time component) of the head file plus the INGRESS_DELAY_INTERVAL.
This system ensures that files are processed in the correct order and that any late-arriving files are either sorted into the right order (if they arrive within the delay interval) or dropped (if they are too late).
Below is a simple example that shows how File-based T-Sort works. Assume that the INGRESS_DELAY_INTERVAL for this use case is 10 minutes and that, for a file named file_0000.csv, the value of the tuple T is (00:00, file_0000.csv).
|Prior queue state (head, ... , tail)||Prior delay interval (minutes)||New file||New file added to queue?||Is a file read (popped) from queue?|
|file_0000.csv, file_0004.csv, file_0008.csv||8||file_0006.csv (late file)||yes, in time order||-|
|file_0000.csv, file_0004.csv, file_0006.csv, file_0008.csv||8||file_0012.csv||yes||-|
|file_0000.csv, file_0004.csv, file_0006.csv, file_0008.csv, file_0012.csv||12||-||-||file_0000.csv can now be popped as soon as file_0012.csv has arrived|
|file_0004.csv, file_0006.csv, file_0008.csv, file_0012.csv||8||file_0002.csv (late file)||yes, in time order, at head of queue||-|
|file_0002.csv, file_0004.csv, file_0006.csv, file_0008.csv, file_0012.csv||10||-||-||as soon as file_0002.csv is added to the queue, the delay interval becomes 10 minutes and we can pop the head of the queue - file_0002.csv|
|file_0004.csv, file_0006.csv, file_0008.csv, file_0012.csv||8||file_0001.csv (late file)||no - file_0001.csv is rejected as its timestamp is earlier than the most recently read file_0002.csv||-|
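The sequence in the table can be replayed with a small Python model of the mechanism. This is a minimal sketch, not the plugin's code: the class `FileTSort`, its methods `offer` and `poll`, and the helper `replay` are hypothetical names, times are plain minutes, and the model assumes that a file is rejected only when its tuple is strictly earlier than Tread and that at most one file is popped per arrival.

```python
import heapq


class FileTSort:
    """Toy model of file-based T-Sort; tuples are (minutes, filename)."""

    def __init__(self, ingress_delay_interval):
        self.delay = ingress_delay_interval
        self.queue = []      # min-heap ordered by comparison tuple
        self.t_read = None   # Tread: tuple of the last file popped

    def offer(self, t_new):
        """Try to enqueue Tnew; return False if it precedes Tread."""
        if self.t_read is not None and t_new < self.t_read:
            return False     # too late: earlier than the file being read
        heapq.heappush(self.queue, t_new)
        return True

    def poll(self):
        """Pop Thead if Ttail is at least `delay` minutes later, else None."""
        if not self.queue:
            return None
        t_head = self.queue[0]
        t_tail = max(self.queue)
        if t_tail[0] >= t_head[0] + self.delay:
            self.t_read = heapq.heappop(self.queue)
            return self.t_read
        return None


def replay(arrival_minutes, delay):
    """Feed files in arrival order; log (filename, accepted, popped filename)."""
    sorter = FileTSort(delay)
    log = []
    for minute in arrival_minutes:
        t_new = (minute, f"file_{minute:04d}.csv")
        accepted = sorter.offer(t_new)
        popped = sorter.poll()
        log.append((t_new[1], accepted, popped[1] if popped else None))
    return log, sorter


# Arrival order from the example: 0000, 0004, 0008, then 0006, 0012, 0002, 0001.
log, sorter = replay([0, 4, 8, 6, 12, 2, 1], delay=10)
for row in log:
    print(row)
```

In this model file_0012.csv triggers the pop of file_0000.csv, file_0002.csv is accepted and popped immediately, and file_0001.csv is rejected, which leaves file_0004.csv through file_0012.csv in the queue.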