File-based T-Sort

(new in SQLstream s-Server version 7.2.1)

File-based T-Sort is a feature of the File-VFS and S3 source plugins that mimics the T-Sort mechanism in SQLstream for files instead of rows. It relies on a priority queue to sort files by implied timestamp as described below.

Terminology Used

| Term | Description |
| --- | --- |
| comparison_tuple | Depends on the value of SORT_FIELD. For MODIFIED_FILE_TIME: (last modified time of the file, filename). For TIME_IN_FILENAME: (timestamp extracted from the filename, filename). |
| Thead | The comparison tuple of the file currently at the head of the queue. |
| Ttail | The comparison tuple of the file currently at the tail of the queue. |
| Tread | The comparison tuple of the last file popped from the queue, i.e. the file that was last read or is currently being read. |
| Tnew | The comparison tuple of the new file that is a candidate to be added to the queue. |

  • The INGRESS_DELAY_INTERVAL option defines the T-Sort window, in milliseconds.
    • A value greater than 0 means that both the push and pop conditions described below must be honoured; the specified delay interval allows late-arriving files to be processed.
    • A value of 0 means files are still processed in timestamp order under the same rules, but because the delay interval is zero, the next file is popped from the queue as soon as the previous one has been processed. “Late” files from one period must therefore arrive and be queued before any file from a subsequent period starts being processed.
    • A value of -1 disables the T-Sort mechanism completely. Files are processed in order of arrival, regardless of timestamp or name.
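
To make the comparison_tuple from the terminology above concrete, here is a minimal Python sketch of how such a tuple could be built for each SORT_FIELD value. It is an illustration only, not the plugin's implementation: the function name `comparison_tuple`, the `HHMM` pattern, and the assumption that the filename carries a four-digit hour/minute stamp (as in file_0004.csv) are all hypothetical.

```python
import os
import re

# Hypothetical filename pattern: a 4-digit HHMM stamp, as in file_0004.csv.
HHMM = re.compile(r"_(\d{2})(\d{2})\.")

def comparison_tuple(path, sort_field):
    """Return the (time, filename) pair used to order files."""
    name = os.path.basename(path)
    if sort_field == "MODIFIED_FILE_TIME":
        # Last modified time in seconds since the epoch.
        return (os.path.getmtime(path), name)
    if sort_field == "TIME_IN_FILENAME":
        m = HHMM.search(name)
        if m is None:
            raise ValueError(f"no timestamp found in {name!r}")
        # Express the stamp as minutes since midnight.
        minutes = int(m.group(1)) * 60 + int(m.group(2))
        return (minutes, name)
    raise ValueError(f"unknown SORT_FIELD: {sort_field!r}")
```

Because Python compares tuples element by element, two files with the same time component are ordered by filename, which matches the "timestamp, then filename" ordering used by the queue.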

T-Sort Mechanism

The T-Sort mechanism when INGRESS_DELAY_INTERVAL >= 0 is as follows:

  • A priority queue is used, ordered by timestamp (if used) and then filename.

Adding or Pushing a File to the Queue

Before adding a file to the queue, check that the new file does not precede (have an earlier timestamp/filename than) the file currently being read.

  • If (Tnew < Tread), drop the new file: it is too late to be used, because its comparison tuple is earlier than that of the file currently being processed.
  • If (Tnew > Tread), add the file to the queue; it is not late.
  • (Tnew = Tread) is impossible; even if the timestamps are the same, the filenames must differ.
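
The push rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the plugin's code: the function name `try_push` is hypothetical, comparison tuples are modelled as (minutes, filename) pairs, and the standard `heapq` module stands in for the priority queue.

```python
import heapq

def try_push(queue, t_new, t_read):
    """Add t_new to the heap unless it precedes the last file read.

    queue  -- list managed by heapq, holding (time, filename) tuples
    t_new  -- comparison tuple of the candidate file
    t_read -- comparison tuple of the last file popped, or None if
              nothing has been read yet
    Returns True if the file was queued, False if it was dropped.
    """
    if t_read is not None and t_new < t_read:
        return False  # too late: queuing it would break the ordering
    heapq.heappush(queue, t_new)
    return True

queue = []
try_push(queue, (4, "file_0004.csv"), None)                  # queued
try_push(queue, (2, "file_0002.csv"), (0, "file_0000.csv"))  # queued at the head
try_push(queue, (1, "file_0001.csv"), (2, "file_0002.csv"))  # dropped
```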

Popping a File from the Queue

The file at the head of the queue is processed only when the queue contains a file whose comparison_tuple time component is greater than or equal to the time component of the head file plus the INGRESS_DELAY_INTERVAL.

  • If (Ttail - Thead >= INGRESS_DELAY_INTERVAL), pop the queue and read the popped file. Otherwise, wait INGRESS_FILE_SCAN_WAIT milliseconds, then check whether a new file has been added to the queue and, if so, re-check the delay interval.

This system ensures that files are processed in the correct order and that any late-arriving files are either sorted into the right order (if they arrive within the delay interval) or dropped (if they are too late).
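
The pop condition can be sketched in Python as below. Again this is only an illustration: `try_pop` is a hypothetical name, time components are plain numbers in the same unit as the delay, and the tail is found with `max()` because a `heapq` heap only exposes its smallest element directly.

```python
import heapq

def try_pop(queue, delay):
    """Pop and return the head if (Ttail - Thead >= delay), else None.

    queue -- list managed by heapq, holding (time, filename) tuples
    delay -- the delay interval, in the same unit as the time components
    """
    if not queue:
        return None
    head_time = queue[0][0]      # smallest tuple sits at index 0 of a heap
    tail_time = max(queue)[0]    # largest tuple is the tail
    if tail_time - head_time >= delay:
        return heapq.heappop(queue)
    return None                  # caller waits, rescans, and tries again

queue = [(0, "file_0000.csv"), (4, "file_0004.csv"), (8, "file_0008.csv")]
heapq.heapify(queue)
first = try_pop(queue, 10)       # None: the queue spans only 8 minutes
heapq.heappush(queue, (12, "file_0012.csv"))
second = try_pop(queue, 10)      # (0, 'file_0000.csv'): the span is now 12
```

In a real reader, a `None` result would be followed by a wait of INGRESS_FILE_SCAN_WAIT milliseconds before scanning for new files and testing the condition again.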

Example of files arriving out of order

Below is a simple example to show how File-based T-Sort works. Assume that the INGRESS_DELAY_INTERVAL for this use case is 10 minutes, and that for a file named file_0000.csv the comparison tuple is (00:00, file_0000.csv).

| Prior queue state (head, ... , tail) | Prior delay interval (minutes) | New file | New file added to queue? | Is a file read (popped) from queue? |
| --- | --- | --- | --- | --- |
| <empty> | 0 | file_0000.csv | yes | |
| file_0000.csv | 0 | file_0004.csv | yes | |
| file_0000.csv, file_0004.csv | 4 | file_0008.csv | yes | |
| file_0000.csv, file_0004.csv, file_0008.csv | 8 | file_0006.csv (late file) | yes, in time order | |
| file_0000.csv, file_0004.csv, file_0006.csv, file_0008.csv | 8 | file_0012.csv | yes | |
| file_0000.csv, file_0004.csv, file_0006.csv, file_0008.csv, file_0012.csv | 12 | | | file_0000.csv can now be popped as soon as file_0012.csv has arrived |
| file_0004.csv, file_0006.csv, file_0008.csv, file_0012.csv | 8 | file_0002.csv (late file) | yes, in time order, at head of queue | |
| file_0002.csv, file_0004.csv, file_0006.csv, file_0008.csv, file_0012.csv | 10 | | | as soon as file_0002.csv is added to the queue, the delay interval becomes 10 minutes and the head of the queue, file_0002.csv, can be popped |
| file_0004.csv, file_0006.csv, file_0008.csv, file_0012.csv | 8 | file_0001.csv (late file) | no; file_0001.csv is rejected because its timestamp is earlier than that of the most recently read file, file_0002.csv | |
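
The example above can be replayed with a short Python simulation. This is a sketch under simplifying assumptions rather than the plugin's actual behaviour: the `simulate` function is hypothetical, time components are minutes taken from the filenames, and a file is treated as read the moment it is popped.

```python
import heapq

DELAY = 10  # INGRESS_DELAY_INTERVAL for this example, in minutes

def simulate(arrivals, delay=DELAY):
    """Replay arrivals of (minutes, filename) tuples in arrival order.

    Returns (read_order, dropped, remaining) as lists of filenames.
    """
    queue, read_order, dropped = [], [], []
    t_read = None  # comparison tuple of the last file popped
    for t_new in arrivals:
        # Push rule: drop files that precede the last file read.
        if t_read is not None and t_new < t_read:
            dropped.append(t_new[1])
            continue
        heapq.heappush(queue, t_new)
        # Pop rule: pop while the queue spans at least the delay interval.
        while queue and max(queue)[0] - queue[0][0] >= delay:
            t_read = heapq.heappop(queue)
            read_order.append(t_read[1])
    remaining = [name for _, name in sorted(queue)]
    return read_order, dropped, remaining

arrivals = [
    (0, "file_0000.csv"), (4, "file_0004.csv"), (8, "file_0008.csv"),
    (6, "file_0006.csv"), (12, "file_0012.csv"), (2, "file_0002.csv"),
    (1, "file_0001.csv"),
]
read_order, dropped, remaining = simulate(arrivals)
print(read_order)  # ['file_0000.csv', 'file_0002.csv']
print(dropped)     # ['file_0001.csv']
print(remaining)   # ['file_0004.csv', 'file_0006.csv', 'file_0008.csv', 'file_0012.csv']
```

The files left in `remaining` stay queued until a later file arrives whose time component is at least 10 minutes after that of file_0004.csv.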