The GROUP BY clause of the SELECT statement defines groups of rows which are aggregated (collected together) so that aggregate functions (such as COUNT, AVG, and SUM can be applied to the groups. GROUP BY groups a result into subsets that have matching values for one or more columns. In other words, the GROUP BY clause returns a single summary row of information for each group of rows that have identical values in columns to which GROUP BY has been applied.
This topic includes the following sections:
The syntax for streaming GROUP BY is as follows:
GROUP BY <monotonically increasing or time-based expression> [,<column name-or-expression> ]*
The following query, which computes the product counts per hour, uses the monotonically increasing expression FLOOR(Orders.ROWTIME TO HOUR). This type of query is known as a tumbling window.
SELECT STREAM FLOOR(Orders.ROWTIME TO HOUR) AS theHour, prodId, COUNT(*) FROM Orders GROUP BY FLOOR(Orders.ROWTIME TO HOUR), prodId;
GROUP BY FLOOR(S.ROWTIME) TO HOUR will yield one output row per hour for the previous hour’s input rows.
As shown here, GROUP BY can specify additional partitioning terms. For example, GROUP BY FLOOR(S.ROWTIME) TO HOUR, USERID will yield one output row per hour per USERID value.
Without a monotonic column or expression in the GROUP BY clause, s-Server will not allow the query. The following query, for example, will be rejected by s-Server’s validator:
SELECT STREAM prodId, COUNT(*) FROM Orders GROUP BY prodId
The query is intended to compute the number of orders for each product as a stream. However, since Orders is an infinite stream, SQLstream s-Server can never know that it has seen all orders for a given product, and therefore can never complete a particular row’s total and output a row. Rather than allow a query that can never emit a row, SQLstream s-Server’s validator rejects the query.
In a streaming context, one of the columns to which GROUP BY is applied must be monotonically increasing. Often, though not always, this column will be ROWTIME with a FLOOR, CEILING, or STEP function applied, so that rows are first grouped by a time period (such as ten seconds, five minutes, or twelve hours), then by another column with limited values, such as “departments” or “regions”.
In s-Server, data is implicitly ordered by source ROWTIME, and physical order is maintained within ROWTIME.
You can also use the FLOOR or CEILING functions with GROUP BY. See Using the FLOOR or CEILING Functions with GROUP BY below.
ROWTIME is defined as monotonically increasing, and can always be used in a streaming GROUP BY. If you know that an expression is monotonic, you can declare it so by using the MONOTONIC function. If the actual data are not monotonic, the resulting system behavior is indeterminate: results may not be as expected or desired. See the topic MONOTONIC function in this guide for more details.
If you need to use a timestamp that is not monitonically increasing, please see Grouping by non-monotonic timestamp column below.
Duplicate rowtimes can occur in a stream, and as long as the ROWTIME value is the same, the GROUP BY operation will keep accumulating rows. In order to emit a row, the ROWTIME value has to change at some point.
The examples above used the FLOOR function to establish the limit for the group. When called with a date, time, or timestamp expression, FLOOR returns the largest value equal to or smaller than the input subject to the precision specified by an indicated time unit, such as to MINUTE or to SECOND. CEILING, or CEIL, works similarly, but returns the smallest value equal to or larger than the input, subject to the precision specified by the time unit.
The following example applies the AVERAGE function to one-minute interval groups established by FLOOR, then groups rows within these one-minute intervals by the column “bearing”.
SELECT STREAM AVG("speed") AS "avg_speed","bearing" FROM "StreamLab_Output_buses_demo"."guide_1_out_step_1" AS s GROUP BY FLOOR(s.ROWTIME to MINUTE), "bearing";
This example first groups rows into one minute intervals, then groups them by the column “bearing”. When you apply a group by, s-Server emits one row for each group that appears in the windowed time period. This allows you to display aggregated data for each department, or region, or building, and so on.
The STEP function performs arithmetic similar to the FLOOR function, but lets you work through repeated “floors” as data flows–in “steps”. We recommend using this function because it it is more flexible than FLOOR or CEILING. You can use the STEP function in a tumbling window with a GROUP BY clause, such as the following:
SELECT STREAM partition_id, SUM(measure) FROM s GROUP BY STEP(s.ROWTIME BY INTERVAL '5' MINUTE), partition_id;
In the example above, all streaming data for stream s is grouped by partition_id every 5 minutes. In the example above, STEP function is monotonically increasing since its first parameter, s.ROWTIME, is monotonic. The aggregator correctly generates punctuations when STEP function is used in a GROUP BY clause.
For all rows with ROWTIME between ‘2016-01-01 07:30:00’ and ‘2016-01-01 07:34.59.999’, the aggregated result has a ROWTIME of
STEP(s.ROWTIME BY INTERVAL '5' MINUTE) + INTERVAL '5' MINUTE = '2016-01-01 07:35:00'
In this case, when aggregated results are emitted, the aggregator also emits a punctuation of 5 minutes later since the next aggregation result will have a ROWTIME of at least ‘2016-01-01 07:40:00’
The following code groups the column rowtime by ten-second intervals, then by the column “bearing”, performing a MAX aggregation on the column “speed”.
SELECT STREAM MAX("speed") AS "max_speed","bearing" FROM "StreamLab_Output_buses_demo"."guide_1_out_step_1" AS s GROUP BY STEP(s.ROWTIME by interval '10' second), "bearing";
The result are rows emitted every 10 seconds for each value in the column “bearing”. In this case, those values are “north” and “northeast”.
+-----------+-----------+ | max_speed | bearing | +-----------+-----------+ | 99 | north | | 117 | northeast | | 93 | north | | 81 | northeast | | 78 | north | | 94 | northeast | | 107 | northeast | | 55 | north | +-----------------------+
You should apply GROUP BY to the smallest unit of time you want grouped, and to that unit only. Grouping by both a larger and smaller unit is known as “nesting”, and does not produce useful results.
Nested groups return the more fine-grained of the two groupings, and should be avoided. When you group rowtime by both minute and hour, for example, the results are the same as if you group rowtime by minute. When there is more than one grouping key, the effect is to group by the intersection of the groups.
(Functionality added for 7.2.4)
SELECT STREAM ... FROM ... GROUP BY FLOOR((EVENT_TIME WITHIN INTERVAL '5' MINUTE) TO HOUR);
In this example, rows can arrive up to 5 minute out of order. We don’t emit the result for 01:00:00 until at least 5 minutes after the end of the period, when we have seen a row (or rowtime bound) timed at or after 01:05:00.
Rows that arrive more than 5 minutes out of order will be treated as late rows and discarded - they will not be included in the aggregates.
In these respects, this GROUP BY will emit the same results as if you had used a T-Sort on the detail records using the ORDER BY … WITHIN clause, promoted the event timestamp to ROWTIME, and then performed a GROUP BY based on ROWTIME.
For more information on the features and benefits of this functionality see Combining T-Sort with GROUP BY.
When you aggregate rows using GROUP BY, the emitted columns will contain only the GROUP BY key expressions, plus all the aggregate function expressions you define using functions such as COUNT, AVG, SUM, MIN, and MAX.
The example below uses LAST_VALUE to return the column “highway”. This example would be most useful if you know “highway” does not change frequently.
SELECT STREAM AVG("speed") AS "avg_speed", LAST_VALUE("highway") AS "highway" FROM "StreamLab_Output_buses-data"."dashboard_pipe1_step_3" AS "input" GROUP BY STEP(s.ROWTIME BY INTERVAL '15' MINUTE),;
You can use IGNORE NULLS to ensure that a value is returned for FIRST_VALUE or LAST_VALUE. For example, let’s say a weather station sends weather data that may include readings for temperature and pressure in a single row.
Each row (report) may not have non-null values for both temperature & pressure. Let’s say you want get the most recent report for the temperature as well as the pressure from a weather station at the top of every 15 minutes. You can use LAST_VALUE to get the temperature and pressure at the 15 minute mark, using code similar to the following:
SELECT STREAM station_id, LAST_VALUE(temp) IGNORE NULLS AS temp, LAST_VALUE(pressure) IGNORE NULLS AS pressure FROM weather_data AS s GROUP BY STEP(s.ROWTIME BY INTERVAL '15' MINUTE), station_id;