Aggregation Windows

With conventional databases, you execute a SELECT query, the query runs, and the result is all the rows in selected tables that match the restrictions of the SELECT statement. That works because tables are finite: a SELECT statement grabs whatever rows are available at the moment the query is run. If someone were to add rows to a table after you’ve run the query, those rows would not be part of the query’s results. In a conventional relational database application, calculations such as SUM, COUNT, or MAX or operations such as JOIN, GROUP BY, or PARTITION, can know all the data they need in order to produce a correct result.

But when you are querying streams, the result of a SELECT query continues to add more rows, because the sources feeding these rows are continually updating data. So in a streaming context, the SELECT statement cannot know in advance all the rows that will be part of the query’s result. That means that if you’re going to run calculations on streaming data, you will need some way to block off chunks of data to which you will apply calculations.

To solve this problem, you apply functions and operations over a subset of records called a window. Each window contains, at any given time, a subset of streaming rows defined by time, number of rows, or another numeric expression. Examples of windows include “every ten minutes” or “the last thirty seconds” or “every millimeter,” and so on.

We’re going to call the set of rows to which the analytic is applied the window frame. In the diagram below, the window frame, highlighted in blue, lasts from 0 seconds after 2:00 on February 2, 2022, to 10 seconds after 2:00 on the same date.

If you’re used to working with tables that have fixed number of rows, you can think of windows as a kind of “virtual” table–windows set up a fixed window so that you can run queries on them. Actually, they’re not really fixed windows–they evolve over time, and are sometimes open ended. But it’s still a rough idea of how windows work in streaming SQL.

Time-based aggregation windows use rows’ rowtime in order to determine windows. Many windows end at current row, which is the latest rowtime that s-Server has seen.

Here are some examples of aggregation windows:

Windows That Use a Set Period of Time

A sliding window always applies an analytic to a set period of time, which moves along steadily. That is, a sliding window is always exactly 10 minutes of data (or, as we’ll see, 10 rows).

A hopping window also applies an analytic to a set period of time, but moves ahead in jumps, such that, for example, one minute is subtracted from the analytic window every minute.

Hopping windows are supported as of s-Server 5.3.

Windows That End Earlier Than the Present

Offset windows also use a set period of time–actually, these can be sliding or hopping–but the time period ends at some time earlier than the present–such as “10 seconds ago” or “1 minute ago.” These say to s-Server “give me all the rows for a ten minute interval that ends 1 minute ago.”

Offset windows are supported as of s-Server 5.3.

Windows that Start Calculating at Time X, and Keep Calculating.

An unbounded window calculates from first available data until the current row. A snapping window applies an analytic to a time period that began one hour ago, and starts over at the top of the hour.

Tumbling Time Windows

Tumbling windows accumulate a batch of rows of the input to produce a set of aggregated result rows. Technically, they are not “windows”–they’re the result of a GROUP BY statement run on rowtime.

Further Modifications for Windows

Distinct Values Only

For all of these windows, you can specify DISTINCT using SELECT STREAM DISTINCT.

Using Partitions

You can also use PARTITION BY to segment results. The window function is applied to each partition separately and computation restarts for each partition.

The Special Way s-Server Handles Current Row

Many windows end with the present, which is what we call current row.

There are some slightly tricky things about current row. Mostly, you need to know that s-Server thinks of current row as “the first time we see a new rowtime,” and starts making calculations accordingly, The tricky thing is that other rows may come in with the same rowtime. These will be incorporated into later calculations, but calculations begin as soon as the first row hits s-Server. That’s a good thing if you want your calculation to tell you when the temperature of a set of sensors exceeds an average–so that you can shut down whatever is causing the temperature to go up before things light on fire. But it can be a little weird to imagine that windows that involve current row incorporate rows that have the current rowtime, but not all of them.