Analytics

Analytics commands let you apply a number of different analytics to the Output view in StreamLab. Some of these produce new columns, and others change selected columns themselves.

  • Aggregate. This command lets you create quick “buckets” of data for a particular column, including COUNT, SUM, and AVERAGE.
  • Calculate. This command lets you perform an arithmetical calculation on a column, such as adding 2000 to the column or dividing the column in 2. Selected columns must have numerical values, such as DOUBLE.
  • Categorize. This command lets you take a continuous value, such as a bus speed, and replace it with strings such as “fast”, “medium”, or “slow” by applying conditions to the column.
  • Running Average. This command lets you create a new column that is a running average of another column over a given window of time.
  • Time Sort. This command uses a sliding time-based window of incoming rows to reorder those rows by the selected column or by ROWTIME in order to address time synchronization problems with data.
  • Window. This command lets you perform a windowed aggregation on a selected column. The command produces a new column with the result of the operation.
  • Geo-Locate IP. This command lets you determine the latitude and longitude associated with IP addresses worldwide.
  • GroupRank. This command lets you group rows by the values in one column (group by) and then rank the rows within that group according to another column (rank by) across a window defined by you.
  • Partition Window. This command lets you partition rows into groups using the values in a particular column.
  • Table Lookup. This command lets you enrich a guide with data stored in an external database.

Aggregate

The Aggregate analytic lets you make calculations on a column of numerical data, such as the sum, the average, the maximum value, the minimum value, the first value, or the last value. You can also choose statistical formulas such as population variance, population standard deviation, or sample variance. See Aggregation Operations below.

Because StreamLab works with streaming data, you first need to designate a window across which you will perform the aggregation. Windows can be sliding or tumbling, and can be defined either as a time frame (from seconds to years) or as a number of rows (such as the last 200 rows).

For tumbling windows, you can also group aggregations into categories, such as departments, industries, regions, states, and so on, using the Group By option, including grouping results by multiple categories, such as region, then department (use the + to add additional grouping columns).

You can apply multiple aggregations by using the + button in the right corner of the aggregation section of the Aggregate tab.

Implementing an Aggregate Analytic

To implement an Aggregate analytic:

  1. Select a mode for the aggregation, sliding or tumbling. (See definitions below.)
  2. Select a length and unit for the aggregation window, such as “1 hour” or “15 minutes” or “200 rows.”
  3. Select a column to aggregate. This column must contain numerical data.
  4. Select an operation to aggregate values for the new column. This might be an average temperature, or a running total. See Aggregation Operations below. You can add multiple analytics by using the + button.

  5. Enter a column name for the resulting column or columns.

    For sliding windows, you can also choose to group the column’s results. To do so, choose a column with a limited set of values, such as region, department, product line, and so on. For time-based sliding windows, you can use time until to designate a time period that ends before the current row. See Using Time Until with Time-based Sliding Windows below. You can also choose to have the window hop forward once the stream reaches a designated time period. This means that, for example, rows will clear out and start accumulating anew once you reach the 5 minute mark for the hopping period. See Using a Hopping Period with Time-based Sliding Windows below. You can partition all sliding windows by column. See Partitioning Sliding Windows below.

    For tumbling windows (which display data at specified intervals), you need to choose what to do with other columns: either drop these (so that only aggregated columns display), use the first value in the window, or use the last value in the window. With tumbling windows, you can also choose to apply a GROUP BY using a column of limited values. See Grouping Tumbling Windows below.

  6. Click the + icon to add the command to the Guide script.

  7. The results of the script appear in the Output View window.

Using Sliding Windows

All streaming rows have a timestamp called ROWTIME. This means that you can tell StreamLab “aggregate all the rows in the past hour” or “aggregate all the rows in the past minute” and so on. Long window frames may affect performance, so experiment accordingly.

Sliding windows display data continuously. Each row contains one or more running aggregations for the specified time period. In the illustration below, avg_order and max_order are calculated using all data in the window frame.

sliding-window
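
For reference, a sliding-window aggregation like the one shown above corresponds to a windowed aggregation in streaming SQL. The sketch below is illustrative only: the orders stream and its amount column are hypothetical, and StreamLab generates the equivalent SQL for you when you add the command.

    -- Running aggregates over a one-hour sliding window (illustrative only).
    SELECT STREAM o.ROWTIME, o.amount,
           AVG(o.amount) OVER last_hour AS avg_order,
           MAX(o.amount) OVER last_hour AS max_order
    FROM orders AS o
    WINDOW last_hour AS (RANGE INTERVAL '1' HOUR PRECEDING);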

Using Time-based Sliding Windows

Time-based sliding windows identify groups of rows based on a specified time period in relation to the current row.

You define time-based windows in terms of

  1. a time when the window begins.
  2. a time when the window ends.
  3. whether or not the window should “hop.” Hopping windows accumulate data in a specified time period (1 minute, 5 minutes, 1 hour) and then reset once the time period has elapsed.

Sliding windows emit one row for every row that enters the query (through a simple INSERT, a pump or an external source accessed by a foreign stream). The window frame of a sliding window moves along one row at a time. Rows exit the window as they move out of the defined frame. If you define a window as beginning five minutes ago, rows move out of the window once their timestamp is greater than 5 minutes past the current row. If you define a window as 200 rows, rows move out of the window once the 201st row enters the window.

sl_aggregate-sliding-time

Using Time Until with Time-based Sliding Windows

By default, windows have an upper bound of the current row. For some calculations, you will want to exclude the current row by designating a window that ends before the current row. Otherwise, a given calculation may miss additional rows that arrive with the same ROWTIME as the current row, since StreamLab begins emitting output rows as soon as the current row reaches the system. Using time until ensures that you have captured all rows with the same ROWTIME in a calculation.

However, when you define windows this way, you introduce some latency. In cases where it is not important to calculate rows with the same ROWTIME, you can use the default upper bound.

Windows that use time until let you exclude the current row from calculations such as AVG or MAX. This makes statistical tools like Bollinger bands or the Naive Bayes classifier statistically valid.

In the illustration below, values for the row with a ROWTIME of ‘2019-03-30 03:05:30.0’ include rows from ‘2019-03-30 03:00:30.0’ to ‘2019-03-30 03:04:30.0’, but do not include the current row. If no data is available for the window, as in the first row, values return null.

offset-window
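
In SQL terms, time until shifts the upper bound of the window frame back from the current row. A minimal sketch, matching the five-minute window in the illustration (the orders stream and amount column are hypothetical):

    -- The window frame ends one minute before the current row ("time until"),
    -- so rows sharing the current ROWTIME are excluded from the calculation.
    SELECT STREAM o.ROWTIME, o.amount,
           AVG(o.amount) OVER (RANGE BETWEEN INTERVAL '5' MINUTE PRECEDING
                               AND INTERVAL '1' MINUTE PRECEDING) AS avg_amount
    FROM orders AS o;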

Using a Hopping Period with Time-based Sliding Windows

When you designate a hopping period, the window slides forward at the top of each designated hopping period rather than with every row. With a one-hour hopping period, for example, rows with ROWTIMEs in the 3:00 hour accumulate in the window until a row arrives with a ROWTIME of 4:00 or later. At that point, the window starts over; the first calculation will include only one row, the next two rows, and so on, until a row arrives with a ROWTIME of 5:00 or later. When you use a hopping window with COUNT, for example, counts go up for the entire hour, then start at zero at the beginning of the new hour. Data from the previous hour is “dropped” from aggregation.

Hopping windows are particularly useful for aggregations like AVG, VARIANCE, or STDDEV on large time windows (such as hours, days, or weeks) over high data-rate streams.

Hopping windows have significant benefits.

  • They reduce the amount of system resources required to maintain the window, since the window only needs to maintain aggregated history for each time interval (1 hour in the example above).
  • When used with an offset, such as RANGE BETWEEN INTERVAL ‘60’ MINUTE PRECEDING AND INTERVAL ‘1’ MINUTE PRECEDING, the results for each input row are computed on a window that excludes data from the current row. This is highly desirable since any anomaly in streaming data does not adversely impact computation of Bollinger bands or Naive Bayes classifiers. We recommend using hopping windows with an offset. The offset makes both ends of a window frame hop forward in a 1 minute interval. That is, as each row arrives for the current minute interval, it does not contribute to the aggregation results for that row since it’s not in the window frame for the current row.
  • When used without an offset, results for hopping windows are computed for each input row with zero latency.

The illustration below shows the results of a query run on the view defined above. In the illustration, calculations for the first 9 rows include rows from ‘2019-03-30 03:02:00.0’ to ‘2019-03-30 03:59:45.0’. Once a row arrives later than ‘2019-03-30 03:59:59.999’, calculations reset to newly arriving rows. In the last 5 rows, calculations include rows from ‘2019-03-30 04:02:00.0’ to ‘2019-03-30 04:18:00.0’.

hopping-window

Using Row-based Sliding Windows

Row-based sliding windows identify groups of rows based on N number of rows in relation to the current row.

You define row-based windows in terms of

  1. when the window begins, either unbounded (all rows received up until the current row) or a specified number of rows (such as 200 rows).
  2. when the window ends, either the current row or N rows preceding (such as “until 10 rows ago”). Row-based windows are defined as the current row plus X number of rows preceding. A window such as ROWS ‘3’ PRECEDING will actually contain 4 rows: 3 preceding rows plus the current row.
sl_aggregate-sliding-row

In the illustration below, for the first row, only that row is included in the window. The 6th row calculates values using data from rows 1, 2, 3, 4, 5, and 6. The 10th row calculates values using data from rows 5, 6, 7, 8, 9, and 10.

rows-based-window
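
The six-row window in the illustration corresponds to a ROWS-based frame of the current row plus the five preceding rows. A minimal sketch (the orders stream and amount column are hypothetical):

    -- A six-row window: the current row plus the five preceding rows.
    SELECT STREAM o.ROWTIME, o.amount,
           AVG(o.amount) OVER (ROWS 5 PRECEDING) AS avg_amount,
           SUM(o.amount) OVER (ROWS 5 PRECEDING) AS total_amount
    FROM orders AS o;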

Partitioning Sliding Windows

Both time-based and row-based sliding windows can be partitioned by one or more columns. When you apply partitioning, calculations incorporate both the window frame and the partitioned column value. This means that, for a SUM of the amount column partitioned by ticker, the sum for an ORCL row includes only the amounts of rows whose ticker matches ORCL and that arrive within the window frame.

In the illustration, all rows with the value ‘IBM’ for the column ticker that have been received up until this point and fall within the window frame are calculated together.

partitioned-window
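
In streaming SQL, partitioning a sliding window corresponds to a PARTITION BY clause in the window specification. A rough sketch, using the ticker and amount columns from the illustration (the orders stream name is hypothetical):

    -- Each row's total covers only rows with the same ticker inside the window frame.
    SELECT STREAM o.ROWTIME, o.ticker, o.amount,
           SUM(o.amount) OVER (PARTITION BY o.ticker
                               RANGE INTERVAL '1' HOUR PRECEDING) AS ticker_total
    FROM orders AS o;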

Using Tumbling Windows

This data window also uses ROWTIME, and lets you tell StreamLab “aggregate data for an hour, then clear data, then give me data for the next hour,” at, for example, 2:00 PM, 3:00 PM, 4:00 PM, and so on.

Tumbling windows display data only at specified intervals. That is, if you choose 5 seconds, a new row appears every five seconds with aggregated data. For non-aggregated rows (the rows that are, for example, averaged for the aggregation), you can choose to either drop non-aggregated columns, display the first value for non-aggregated columns, or display the last value.

With tumbling windows, you can also choose to use a group by, drawing on a column with limited values, such as “dept A, dept B, dept C.” When you apply a group by, StreamLab emits one row for each group that appears in the window: that is, every 10 seconds, one row each for dept A, dept B, and dept C, provided that each group has appeared in the window.

Options for Tumbling Windows

Tumbling windows emit one row for every group of rows aggregated. As with sliding windows, a calculation such as sum or average incorporates all the rows in the window, but a row will appear only once during the window frame. So if you designate a window of 200 rows, one row will appear for every 200 rows that enter StreamLab. Similarly, if you designate a window of 5 minutes, one row will appear for all rows that enter StreamLab during a 5 minute interval.

ROWTIME                    AMOUNT    TICKER
2019-03-30 03:02:00.000    20        ORCL
2019-03-30 03:03:00.000    30        IBM
2019-03-30 03:04:00.000    100       ORCL
2019-03-30 03:04:30.000    40        IBM
2019-03-30 03:05:00.000    100       MSFT
2019-03-30 03:05:30.000    50        MSFT
2019-03-30 04:04:00.000    100       GOOGL
2019-03-30 04:06:00.000    100       ORCL
2019-03-30 04:18:00.000    1000      GOOGL
2019-03-30 04:44:00.000    1000      ORCL
sl_aggregate-tumbling
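
One common way to express a tumbling window in streaming SQL is to GROUP BY a function that rounds ROWTIME down to the window interval. The sketch below is illustrative only (the orders stream is hypothetical, and the SQL StreamLab actually generates may differ):

    -- One output row per one-hour window (a tumbling window expressed with GROUP BY).
    SELECT STREAM FLOOR(o.ROWTIME TO HOUR) AS window_start,
           SUM(o.amount) AS total_amount,
           MAX(o.amount) AS max_amount
    FROM orders AS o
    GROUP BY FLOOR(o.ROWTIME TO HOUR);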

Grouping Tumbling Windows

With tumbling windows, you can apply a group by, drawing on a column with limited values. When you apply a group by, StreamLab emits one row for each group that appears in the windowed time period. This allows you to display aggregated data for each department, or region, or building, and so on.

In the screen shot below, for example, StreamLab emits one aggregated row every ten seconds for trip_no 500s and 600s. During the first window, no rows with 600s appeared, so StreamLab only emits an aggregated row for 500s.

sl_aggregate_group_by_result
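
Adding the grouping column to the GROUP BY clause is what produces one row per group per window. A hedged sketch, using a one-minute interval for illustration (the buses stream and speed column are hypothetical; trip_no comes from the example above):

    -- One row per group (trip_no) per window interval.
    SELECT STREAM FLOOR(b.ROWTIME TO MINUTE) AS window_start,
           b.trip_no,
           AVG(b.speed) AS avg_speed
    FROM buses AS b
    GROUP BY FLOOR(b.ROWTIME TO MINUTE), b.trip_no;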

Aggregation Operations

  • Average. Returns the average of all the value expressions evaluated for each row in the aggregation.
  • Count. Returns the number of qualifying rows in the aggregation.
  • Count Distinct. Returns the number of qualifying non-duplicate rows in the aggregation.
  • First Value. Returns the value expression evaluated for the first row that qualifies for the aggregate.
  • First Value Ignore Nulls. Returns the first non-null value in the aggregation.
  • First Value Respect Nulls. Returns the first value, including nulls, in the aggregation.
  • Last Value. Returns the value expression evaluated for the last row that qualifies for the aggregate.
  • Last Value Ignore Nulls. Returns the last non-null value in the aggregation.
  • Last Value Respect Nulls. Returns the last value, including nulls, in the aggregation.
  • Maximum. Returns the maximum value of all the value expressions evaluated for each row in the aggregation. For string values, MAX is determined by which string is last in the collating sequence.
  • Minimum. Returns the minimum value of all the value expressions evaluated for each row in the aggregation.
  • Sum. Returns the sum of all the value expressions evaluated for each row in the aggregation.
  • Population Variance. This is a standard statistical analytic method that returns the population variance of a non-null set of numbers (nulls being ignored). Population variance uses the following calculation: (SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / COUNT(expr). In other words, for the non-null values in the aggregation window, using S1 as the sum of the values and S2 as the sum of the squares of the values, Population Variance returns the result (S2 - S1*S1/N)/N.
  • Population Standard Deviation. This is a standard statistical analytic method that returns the square root of the population variance (defined above) of the row’s value, evaluated for each row in the aggregated window.
  • Sample Variance. This is a standard statistical analytic method that returns the sample variance of a non-null set of numbers (nulls being ignored). Sample Variance uses the following calculation: (SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / (COUNT(expr)-1). In other words, for a given set of non-null values, using S1 as the sum of the values and S2 as the sum of the squares of the values, Sample Variance returns the result (S2 - S1*S1/N)/(N-1).
  • Sample Standard Deviation. This is a standard statistical analytic method that returns the statistical standard deviation of all values in number-expression, evaluated for each row remaining in the group and defined as the square root of the sample variance (defined above).
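
As a quick worked check of the two variance formulas: for the values 2, 4, and 6, S1 = 12, S2 = 4 + 16 + 36 = 56, and N = 3, so Population Variance returns (56 - 12*12/3)/3 = 8/3 ≈ 2.67, while Sample Variance returns (56 - 12*12/3)/(3 - 1) = 4. The corresponding standard deviations are the square roots of these values.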

Calculate

The Calculate command lets you perform an arithmetical calculation on a column. Use @ for the value of the column. Selected columns must have numerical values, such as DOUBLE. Calculations might be something like the following:

Calculation    Result
@+2000         Adds 2000 to the column value.
@-800          Subtracts 800 from the column value.
@/2            Divides the column value by 2.
@*4            Multiplies the column value by 4.

To implement a Calculate command:

  • Select the column(s) to which to apply the calculation. Each column must contain numerical data.
  • Enter an arithmetical calculation for the column.
  • Click the + icon  in the Suggestions list to add the command to the Guide script.
  • The results of the script appear in the Output View window.
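
Under the hood, a calculation such as @+2000 becomes a simple arithmetic expression on the column in the generated SQL. A rough sketch, assuming a hypothetical input stream and amount column (StreamLab fills in the real names when it builds the Guide script):

    -- Adds 2000 to each incoming value of the hypothetical "amount" column.
    SELECT STREAM o.amount + 2000 AS amount
    FROM input_stream AS o;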

Running Average

The Running Average analytic command lets you create a new column that is a running average of another column over a given window of time. This analytic works especially well for streaming data. The diagram below depicts the creation of a running average on a stock’s closing price.

You can also add Bollinger bands to this running average. Bollinger Bands are a volatility indicator, typically used for stock prices.

Bollinger Bands consist of:

  • a moving average (avg) over a given period T
  • an upper band at K times the T-period standard deviation above the moving average (avg + Kσ)
  • a lower band at K times the T-period standard deviation below the moving average (avg − Kσ)

StreamLab also lets you define inner and outer values for the upper and lower bands.
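
A rough sketch of how the running average and its Bollinger bands might look in streaming SQL, using a 20-row window and K = 2. The stock_quotes stream and closing_price column are hypothetical, and this assumes the standard STDDEV_SAMP aggregate; the SQL StreamLab actually generates may differ.

    -- Moving average plus upper/lower Bollinger bands at +/- 2 standard deviations.
    SELECT STREAM q.ROWTIME, q.closing_price,
           AVG(q.closing_price) OVER last20 AS moving_avg,
           AVG(q.closing_price) OVER last20
               + 2 * STDDEV_SAMP(q.closing_price) OVER last20 AS upper_band,
           AVG(q.closing_price) OVER last20
               - 2 * STDDEV_SAMP(q.closing_price) OVER last20 AS lower_band
    FROM stock_quotes AS q
    WINDOW last20 AS (ROWS 19 PRECEDING);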

Time Sort

The Time Sort command uses a sliding time-based window of incoming rows to reorder those rows by the selected column or by ROWTIME in order to address time synchronization problems with data. For example, data arriving from multiple sources may not be time-synchronized. Such problems can produce unreliable results from analytics, which typically depend on arriving data being ordered by ROWTIME. You will likely only need to use this command if you are seeing unreliable, time-related results in your data.

In the Time Sort command, you can also choose to promote a selected column to rowtime. You do so in cases where you know that another column contains a timestamp more relevant than the time the row entered s-Server. This might be, for example, the time an earthquake actually occurred instead of the time the row enters s-Server. These values can have slight lags, making data calculations less reliable. To implement a Time Sort command:

  • Select a column containing a time stamp.
  • Enter a window length and unit on which to sort rows.
  • Choose whether or not to promote the selected column to ROWTIME.
  • Click the + icon  in the Suggestions list to add the command to the Guide script.
  • The results of the script appear in the Output View window.

Window

The Window command lets you perform a windowed aggregation on a selected column. The command produces a new column with the result of the operation. To implement a Window command:

  • Enter a column name for the resulting column.
  • Select an operation to aggregate values for the new column. This might be an average temperature, or a running total. See Operations Used in Commands for more details on these operations.
  • Select a column on which to perform the aggregation. This column must contain numerical data.
  • Select a length and unit for the aggregation window, such as “1 hour” or “15 minutes.”
  • Click the + icon to add the command to the Guide script.
  • The results of the script appear in the Output View window.

Geo-Locate IP

The Geo-Locate IP command lets you determine the latitude and longitude associated with IP addresses worldwide.

To use the Geo-Locate IP command:

  1. Select a column that contains IP addresses in dotted-decimal format.
  2. Enter a prefix for the new latitude and longitude columns.
  3. A suggestion appears. Click the + icon to add the command to the Guide script.
  4. The results of the script appear in the Output View window.

The Geo-Locate IP command uses the MaxMind GeoIp City database. See http://www.maxmind.com/en/city for more details on the MaxMind GeoIp City database.

Note: s-Server ships with a free version of this database. The s-Server documentation describes how the plugin finds the MaxMind data file and how to use the licensed version for production. You can use the free default database for testing, but for production use, you should use at least an up-to-date GeoLiteCity.dat, or preferably a licensed data file.

GroupRank

GroupRank lets you group rows by the values in one column (group by) and then rank the rows within that group according to another column (rank by) across a window defined by you. To implement a GroupRank operation:

  • Enter a name for the column that will display the rank.
  • Select a column that StreamLab will use to group rows over the group window.
  • Select a column that StreamLab will use to rank rows within the group defined above.
  • Select an operation to aggregate values within the group. See Operations Used in Commands for more details on these operations.
  • Select a length and unit for the group window, such as “1 hour” or “15 minutes.”
  • Choose to sort group rows in either ascending or descending order.
  • Choose to order output rows in either ascending or descending order.
  • Choose rolling or tumbling window type. See below for more information on these.
  • If desired, define a max idle time for holding a group for ranking. When maxIdle expires the current group is released to the stream. A value of zero indicates no idle time out.
  • If desired, indicate the maximum number of rows the UDX will output in a given group. A value of 0 indicates no limit.
  • Click the + icon to add the command to the Guide script.
  • The results of the script appear in the Output View window.

Partition Window

The Partition Window command lets you partition rows into groups using the values in a particular column. To implement a Partition Window command:

  • Enter a column name for the resulting column.
  • Select a column whose values you will use to group rows.
  • Select an operation to aggregate values for the new column. This might be an average temperature, or a running total. See Operations Used in Commands for more details on these operations.
  • Select a column on which to perform the aggregation. This column must contain numerical data.
  • Select a length and unit for the aggregation window, such as “1 hour” or “15 minutes.”
  • Click the + icon to add the command to the Guide script.
  • The results of the script appear in the Output View window.

Table Lookup

The Table Lookup analytic lets you enrich a guide with data stored in an external database. You do so by joining one of the columns in the guide with one of the columns in the database table. For example, if your guide has columns of data on bus positions and speed, and one of the columns contains a driver identification number, you might join the guide with a database table that matches drivers’ names to their identification numbers. By doing so, you can add one or more new columns to your guide with enriched information.

To use table lookup, you first need to create an external connection. See the topic StreamLab External Connections Overview in this guide for more details. You then need to add one of the tables from the external connection as a source. See the topic Adding an External Table Source in this guide for more details.

Once you have an external table added as a source, it will be available in the table menu of the Table Lookup command.

To use Table Lookup:

  • Select a table from the table popup menu.
  • Choose a column from the current guide (left) and external table (right). These columns will be matched by StreamLab, and used to fill in the guide with additional columns from the external table.
  • If you know the number of rows contained in the external table, you can speed processing by entering a cache size and clicking the Prefetch Rows option.
  • You can use a feature called fuzzy lookup if your join column contains numbers that only need to be approximately equal rather than exactly equal. Working with latitudes and longitudes is a good use case for this setting, since you will not necessarily get an exact match when joining on latitude or longitude. In this case, with fuzzy set to true, numbers that are close to each other will match.
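
Conceptually, the lookup is a join between the streaming data in the guide and the external table. A minimal sketch, assuming a hypothetical buses stream with a driver_id column and an external drivers table (StreamLab generates the actual join for you):

    -- Enrich each streaming row with the matching driver name from the external table.
    SELECT STREAM b.*, d.driver_name
    FROM buses AS b
    JOIN drivers AS d
      ON b.driver_id = d.driver_id;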

Advanced Analytics

The following analytics provide access to advanced s-Server features. Using them requires advanced knowledge of SQL.

  • Call UDX. This command lets you enter the name of a predefined custom user transform. A user-defined transform is a user-defined function that returns a table or a stream. Its input arguments can be scalars or cursors.
  • Custom View. This command lets you enter SQL that will be used to transform the stream you are building in the Guide.
  • Inject SQL. This command lets you inject any block of SQL into s-Server through StreamLab.

Call UDX

This command lets you enter the name of a predefined custom user transform. A user-defined transform is a user-defined function that returns a table or a stream. Its input arguments can be scalars or cursors. A cursor is an object representing a subquery, which the user-defined transform can use to read the subquery results. The subquery can be relational or streaming.

The UDX must be predefined, and it cannot take arguments. It can return either a table or a stream. If it returns a table, you can specify a column that will contain the ROWTIME.

Custom View

The SQL you enter here will be used to transform the stream you are building in the Guide. The actual SQL added to the script will be in the following form:

   CREATE VIEW (output) AS SELECT (stream) <<your SQL here>> FROM (input)

(input), (output) and (stream) are all filled in by StreamLab.
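
For example, a filled-in custom view might look like the following; the view, stream, and column names here are purely illustrative:

    CREATE VIEW bus_speeds_kph AS
        SELECT STREAM speed * 1.60934 AS speed_kph, bus_id
        FROM bus_positions;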

Inject SQL

The Inject SQL analytic lets you inject any block of SQL into s-Server.