GROUP BY clause

The GROUP BY clause of the SELECT statement defines groups of rows which are aggregated (collected together) so that aggregate functions (such as COUNT, AVG, and SUM can be applied to the groups. GROUP BY groups a result into subsets that have matching values for one or more columns. In other words, the GROUP BY clause returns a single summary row of information for each group of rows that have identical values in columns to which GROUP BY has been applied.

In a streaming context, one of the columns to which GROUP BY is applied must be monotonically increasing. See Syntax for Streaming GROUP BY and What Counts as Monotonically Increasing? below. Often, though not always, this column will be ROWTIME with a FLOOR, CEILING, or STEP function applied, so that rows are first grouped by a time period (such as ten seconds, five minutes, or twelve hours), then by another column with limited values, such as “departments” or “regions.”

In s-Server, data is implicitly ordered by source ROWTIME, and physical order is maintained within ROWTIME.

We recommend using the STEP function with GROUP BY, which lets you “step” through timed intervals. See Using the STEP Function with GROUP BY below.

You can also use the FLOOR or CEILING functions with GROUP BY. See Using the FLOOR or CEILING Functions with GROUP BY below.

You can also use FIRST_VALUE and LAST_VALUE with GROUP BY. This is an extension to the SQL standard, which relies on the implicit ordering of data by source ROWTIME (and physical order within ROWTIME). You can use IGNORE NULLS with FIRST_VALUE and LAST_VALUE. See Using FIRST_VALUE and LAST_VALUE with GROUP BY below.

You should apply GROUP BY to the smallest unit of time you want grouped, and to that unit only. Grouping by both a larger and smaller unit is known as “nesting,” and does not produce useful results. See Avoiding Nested Groups below.

Syntax for Streaming GROUP BY

The syntax for streaming GROUP BY is as follows:

GROUP BY <monotonically increasing or time-based expression>,<column name-or-expression,...>.

The following query, which computes the product counts per hour, uses the monotonically increasing expression FLOOR(Orders.ROWTIME TO HOUR). This type of query is known as a tumbling window.

SELECT STREAM FLOOR(Orders.ROWTIME TO HOUR) AS theHour, prodId, COUNT(*)
FROM Orders
GROUP BY FLOOR(Orders.ROWTIME TO HOUR), prodId;

GROUP BY FLOOR(S.ROWTIME) TO HOUR will yield one output row per hour for the previous hour’s input rows.

As shown here, GROUP BY can specify additional partitioning terms. For example, GROUP BY FLOOR(S.ROWTIME) TO HOUR, USERID will yield one output row per hour per USERID value.

Without a monotonic column or expression in the GROUP BY clause, s-Server will not allow the query. The following query, for example, will be rejected by s-Server’s validator:

SELECT STREAM prodId, COUNT(*)
FROM Orders
GROUP BY prodId

The query is intended to compute the number of orders for each product as a stream. However, since Orders is an infinite stream, SQLstream s-Server can never know that it has seen all orders for a given product, and therefore can never complete a particular row’s total and output a row. Rather than allow a query that can never emit a row, SQLstream s-Server’s validator rejects the query.

What Counts as Monotonically Increasing?

ROWTIME is defined as monotonically increasing, and can always be used in a streaming GROUP BY. If you know that an expression is monotonic, you can declare it so by using the MONOTONIC function. If the actual data are not monotonic, the resulting system behavior is indeterminate: results may not be as expected or desired. See the topic MONOTONIC function in this guide for more details.

Duplicate rowtimes can occur in a stream, and as long as the ROWTIME value is the same, the GROUP BY operation will keep accumulating rows. In order to emit a row, the ROWTIME value has to change at some point.

Using the FLOOR or CEILING Functions with GROUP BY

The examples above used the FLOOR function to establish the limit for the group. When called with a date, time, or timestamp expression, FLOOR returns the largest value equal to or smaller than the input subject to the precision specified by an indicated time unit, such as to MINUTE or to SECOND. CEILING, or CEIL, works similarly, but returns the smallest value equal to or larger than the input, subject to the precision specified by the time unit.

The following example applies the AVERAGE function to one-minute interval groups established by FLOOR, then groups rows within these one-minute intervals by the column “bearing.”

SELECT STREAM AVG("speed") AS "avg_speed","bearing"
FROM "StreamLab_Output_buses_demo"."guide_1_out_step_1" AS s
GROUP BY FLOOR(s.ROWTIME to MINUTE), "bearing";

This example first groups rows into one minute intervals, then groups them by the column “bearing.” When you apply a group by, s-Server emits one row for each group that appears in the windowed time period. This allows you to display aggregated data for each department, or region, or building, and so on.

Using the STEP Function with GROUP BY

The STEP function performs arithmetic similar to the FLOOR function, but lets you work through repeated “floors” as data flows–in “steps.” We recommend using this function because it it is more flexible than FLOOR or CEILING. You can use the STEP function in a tumbling window with a GROUP BY clause, such as the following:

SELECT STREAM partition_id, SUM(measure)
FROM s
GROUP BY STEP(s.ROWTIME BY INTERVAL '5' MINUTE), partition_id;

In the example above, all streaming data for stream s is grouped by partition_id every 5 minutes. In the example above, STEP function is monotonically increasing since its first parameter, s.ROWTIME, is monotonic. The aggregator correctly generates punctuations when STEP function is used in a GROUP BY clause.

For all rows with ROWTIME between ‘2016-01-01 07:30:00’ and ‘2016-01-01 07:34.59.999’, the aggregated result has a ROWTIME of

STEP(s.ROWTIME BY INTERVAL '5' MINUTE) + INTERVAL '5' MINUTE
= '2016-01-01 07:35:00'

In this case, when aggregated results are emitted, the aggregator also emits a punctuation of 5 minutes later since the next aggregation result will have a ROWTIME of at least ‘2016-01-01 07:40:00’

Example

The following code groups the column rowtime by ten-second intervals, then by the column “bearing,” performing a MAX aggregation on the column “speed”.

SELECT STREAM MAX("speed") AS "max_speed","bearing"
FROM "StreamLab_Output_buses_demo"."guide_1_out_step_1" AS s
GROUP BY STEP(s.ROWTIME by interval '10' second), "bearing";

The result are rows emitted every 10 seconds for each value in the column “bearing.” In this case, those values are “north” and “northeast.”

+-----------+-----------+
| max_speed | bearing |
+-----------+-----------+
| 99| north |
| 117| northeast |
| --- | --- |
| 93| north |
| 81| northeast |
| 78| north |
| 94| northeast |
| 107| northeast |
| --- | --- |
| 55| north |
+-----------------------+

Using FIRST_VALUE and LAST_VALUE with GROUP BY

When you aggregate rows using GROUP BY, you lose all columns other than the columns included in the aggregation. You can use FIRST_VALUE, LAST_VALUE, MIN, or MAX with GROUP BY to return meaningful values to these columns.

The example below uses LAST_VALUE to return the column “highway”. This example would be most useful if you know “highway” does not change frequently.

SELECT STREAM AVG("speed") AS "avg_speed",
LAST_VALUE("highway")  AS "highway"
FROM "StreamLab_Output_buses-data"."dashboard_pipe1_step_3" AS "input"
GROUP BY STEP(s.ROWTIME BY INTERVAL '15' MINUTE),;

You can use IGNORE NULLS to ensure that a value is returned for FIRST_VALUE or LAST_VALUE. For example, let’s say a weather station sends weather data that may include readings for temperature and pressure in a single row.

Each row (report) may not have non-null values for both temperature & pressure. Let’s say you want get the most recent report for the temperature as well as the pressure from a weather station at the top of every 15 minutes. You can use LAST_VALUE to get the temperature and pressure at the 15 minute mark, using code similar to the following:

   SELECT STREAM
       station_id,
       LAST_VALUE(temp) IGNORE NULLS AS temp,
       LAST_VALUE(pressure) IGNORE NULLS AS pressure
   FROM weather_data AS s
   GROUP BY STEP(s.ROWTIME BY INTERVAL '15' MINUTE), station_id;

Nested Groups

Nested groups return the more fine-grained of the two groupings, and should be avoided. When you group rowtime by both and hour, for example, the results are the same as if you group rowtime by minute. When there is more than one grouping key, the effect is to group by the intersection of the groups.