Analytics commands let you apply a number of different analytics to the Output view in StreamLab. Some of these produce new columns, and others change selected columns themselves.
Analytic | Description |
---|---|
Aggregate | This command lets you aggregate a column using operations such as COUNT, SUM, and AVERAGE. You apply these over sliding or tumbling windows. A window defines the set of rows, bounded by time or by row count, over which the aggregation is applied. |
Bollinger Bands | This command lets you create a new column that is a running average of another column over a given window of time. You can also add Bollinger bands to this running average. Bollinger Bands are a volatility indicator, typically used for stock prices. |
Calculate | This command lets you perform an arithmetical calculation on a column, such as adding 2000 to the column or dividing the column by 2. Selected columns must have numerical values, such as DOUBLE. |
Categorize | This command lets you take a continuous value, such as a bus speed, and replace it with strings such as "fast", "medium", or "slow" by applying conditions to the column. |
Distinct | The Distinct operation lets you eliminate rows with duplicate columns over a specified timeframe. |
Geo-Locate IP | This command lets you determine the latitude and longitude associated with IP addresses worldwide. |
GroupRank | This command lets you group rows by the values in one column (group by) and then rank the rows within that group according to another column (rank by) across a window defined by you. |
Join | Join lets you combine data in the current pipeline guide with data from another stream or table. Joins allow you to enrich data in the current view with data from another source, as long as that source shares a column with the source for the current guide. |
Time Sort | This command uses a sliding time-based window of incoming rows to reorder those rows by the selected column or by ROWTIME in order to address time synchronization problems with data. |
New Column | The New Column analytic lets you add a new column based on a custom SQL expression. |
The Aggregate analytic lets you make calculations on a column of numerical data, such as the sum, the average, the maximum value, the minimum value, the first value, or the last value. You can also choose statistical formulas such as population variance, population standard deviation, or sample variance. See Aggregation Operations below.
Because StreamLab works with streaming data, you first need to designate a window across which you will perform the aggregation. Windows can be sliding or tumbling and designated as either a time frame (from seconds to years) or a number of rows (such as the last 200 rows). With sliding windows, you can apply partitions and sessions.
For tumbling windows, you can also group aggregations into categories, such as departments, industries, regions, states, and so on, using the Group By option, including grouping results by multiple categories, such as region, then department (use the + to add additional grouping columns).
You can apply multiple aggregations by using the + button in the right corner of the aggregation section of the Aggregate tab.
To implement an Aggregate analytic:
Select an operation to aggregate values for the new column. This might be an average temperature, or a running total. See Aggregation Operations below. You can add multiple analytics by using the + button.
Enter a column name for the resulting column or columns.
For sliding windows, you can also choose to group the column's results. To do so, choose a column with a limited set of values, such as region, department, product line, and so on. For time-based sliding windows, you can use time until to designate a time period that ends before the current row. See Using Time Until with Time-based Sliding Windows below. You can also choose to have the window hop forward once the stream reaches a designated time period. This means that, for example, rows will clear out and start accumulating anew once you reach the 5 minute mark for the hopping period. See Using a Hopping Period with Time-based Sliding Windows below. You can partition all sliding windows by column. See Partitioning Sliding Windows below.
For tumbling windows (which display data at specified intervals), you need to choose what to do with other columns: either drop these (so that only aggregated columns display), use the first value in the window, or use the last value in the window. With tumbling windows, you can also choose to apply a GROUP BY using a column of limited values. See Grouping Tumbling Windows below.
Click the + icon to add the command to the Guide script.
The results of the script appear in the Output View window.
All streaming rows have a timestamp called ROWTIME. This means that you can tell StreamLab "aggregate all the rows in the past hour" or "aggregate all the rows in the past minute" and so on. Long window frames may affect performance, so experiment accordingly.
Sliding windows display data continuously. Each row contains one or more running aggregations for the specified time period. In the illustration below, avg_order and max_order are calculated using all data in the window frame.
Population variance: (SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / COUNT(expr)

Sample variance: (SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / (COUNT(expr)-1)
Time-based sliding windows identify groups of rows based on a specified time period in relation to the current row.
You define time-based windows in terms of: 1) a time when the window begins; 2) a time when the window ends; and 3) whether or not the window should "hop". Hopping windows accumulate data over a specified time period (1 minute, 5 minutes, 1 hour) and then reset once the time period has elapsed.
Sliding windows emit one row for every row that enters the query (through a simple INSERT, a pump, or an external source accessed by a foreign stream). The window frame of a sliding window moves along one row at a time. Rows exit the window as they move out of the defined frame. If you define a window as beginning five minutes ago, rows move out of the window once their timestamps are more than five minutes earlier than the current row's. If you define a window as 200 rows, rows move out of the window once the 201st row enters the window.
By default, windows have an upper bound of the current row. For some calculations, you will want to exclude the current row by designating a window that starts before the current row. Otherwise, a given calculation may miss additional rows that arrive with the same ROWTIME as the current row, since StreamLab begins emitting output rows as soon as the current row reaches its system. Using time until ensures that you have captured all rows with the same ROWTIME in a calculation.
However, when you define windows this way, you introduce some latency. In cases where it is not important to calculate rows with the same ROWTIME, you can use the default upper bound.
Windows that use time until let you exclude the current row from calculations such as AVG or MAX. This makes statistical tools such as Bollinger Bands or the Naive Bayes classifier statistically valid.
In the illustration below, values for the row with a ROWTIME of '2019-03-30 03:05:30.0' include rows from '2019-03-30 03:00:30.0' to '2019-03-30 03:04:30.0', but do not include the current row. If no data is available for the window, as in the first row, values return null.
When you designate a hopping period, the window slides forward at the top of each designated hopping period. For example, with a one-hour hopping period, rows with ROWTIMEs in the 3:00 hour accumulate in the window until a row arrives with a ROWTIME of 4:00 or later. At that point, the window starts over; the first calculation will include only one row, the next two rows, and so on, until a row arrives with a ROWTIME of 5:00 or later. When you use a hopping window with COUNT, for example, counts will rise for the entire hour, then start at zero at the beginning of the new hour. Data from the previous hour is "dropped" from aggregation.
Hopping windows are particularly useful for aggregations such as AVG, VARIANCE, and STDDEV over large time windows (hours, days, or weeks) on high-data-rate streams.
The illustration below shows the results of a query run on the view defined above. In the illustration, calculations for the first 9 rows include rows from '2019-03-30 03:02:00.0' to '2019-03-30 03:59:45.0'. Once a row arrives later than '2019-03-30 03:59:59.999', calculations reset to newly arriving rows. In the last 5 rows, calculations include rows from '2019-03-30 04:02:00.0' to '2019-03-30 04:18:00.0'.
Row-based sliding windows identify groups of rows based on N number of rows in relation to the current row.
You define row-based windows in terms of a number of rows preceding (and including) the current row.
In the illustration below, for the first row, only that row is included in the window. The 6th row calculates values using data from rows 1, 2, 3, 4, 5, and 6. The 10th row calculates values using data from rows 5, 6, 7, 8, 9, and 10.
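The row-based frame described above can be sketched in Python. This is illustrative; the six-row window matches the illustration:

```python
from collections import deque

def sliding_row_avg(values, window_rows=6):
    """Emit, for every input row, the average over the last
    `window_rows` rows (fewer rows at the start of the stream)."""
    frame = deque(maxlen=window_rows)
    out = []
    for v in values:
        frame.append(v)   # the 7th row pushes out row 1, and so on
        out.append(sum(frame) / len(frame))
    return out
```

For inputs 1 through 10, row 6 averages rows 1-6 (3.5) and row 10 averages rows 5-10 (7.5).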
Both time-based and row-based sliding windows can be partitioned by one or more columns. (You cannot use partitions with tumbling windows.)
When you apply partitioning, calculations incorporate both the window frame and the partitioned column value. This means that for an aggregation such as average, the calculated average will include values for all rows that 1) arrive within the window frame and 2) contain the same value in the partition column.
For example, if the partition column contains stock tickers, and you have chosen average for an aggregation, the average will be calculated over all rows that contain, for example, ORCL, and fall within the window frame.
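Partitioned aggregation can be sketched in Python. This is illustrative; the tickers and window size are example values:

```python
from collections import defaultdict, deque

def partitioned_avg(rows, window_rows=10):
    """For each (ticker, amount) row, emit the running average over the
    last `window_rows` rows *with the same ticker* (the partition)."""
    frames = defaultdict(lambda: deque(maxlen=window_rows))
    out = []
    for ticker, amount in rows:
        frame = frames[ticker]
        frame.append(amount)
        out.append((ticker, sum(frame) / len(frame)))
    return out
```

Each emitted average draws only on rows sharing the current row's partition value.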
To partition a window:
In the illustration, all rows with the value 'IBM' for the column ticker that have been received up until this point and fall within the window frame are calculated together.
(new in StreamLab version 2.3.1)
You can also define sessions for a column or columns. You do not need to choose a column to partition in order to use sessions. (You cannot use sessions with tumbling windows.)
With sessions, you can determine the location and size of a partitioned window using the input data itself. For example, you might have a session applied to a customer's id, which you could use to develop aggregates for the customer's orders over the past day.
Sessions can start and end in one of two ways:
You can use these flags together.
Note: Sessions are not currently supported for windows defined with GROUP BY.
To use sessions:
This data window also uses ROWTIME, and lets you tell StreamLab: "aggregate data for an hour, then clear data, then give me data for the next hour," at, for example, 2:00 PM, 3:00 PM, 4:00 PM, and so on.
Tumbling windows display data only at specified intervals. That is, if you choose 5 seconds, a new row appears every five seconds with aggregated data. For non-aggregated rows (the rows that are, for example, averaged for the aggregation), you can choose to either drop non-aggregated columns, display the first value for non-aggregated columns, or display the last value.
With tumbling windows, you can also choose to use a group by, drawing on a column with limited values, such as "dept A, dept B, dept C". When you apply a group by, StreamLab emits one row for each group that appears in the window; that is, every 10 seconds, a row for dept A, B, and C, provided that each group has appeared in the window.
Tumbling windows emit one row for every group of rows aggregated. As with sliding windows, a calculation such as sum or average incorporates all the rows in the window, but a row will appear only once during the window frame. So if you designate a window of 200 rows, one row will appear for every 200 rows that enter StreamLab. Similarly, if you designate a window of 5 minutes, one row will appear for all rows that enter StreamLab during a 5 minute interval.
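Tumbling-window emission can be sketched in Python. This is illustrative; the three-row window and the "last value" policy for the non-aggregated column are example choices:

```python
def tumbling_sum(rows, window_rows=3):
    """Emit one aggregated row per `window_rows` input rows. The
    non-aggregated ticker column uses the 'last value in the window'."""
    out = []
    batch = []
    for ticker, amount in rows:
        batch.append((ticker, amount))
        if len(batch) == window_rows:
            total = sum(a for _, a in batch)
            last_ticker = batch[-1][0]   # "use the last value in the window"
            out.append((last_ticker, total))
            batch = []   # window tumbles: start accumulating anew
    return out
```

Six input rows with a three-row window produce exactly two output rows.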
ROWTIME | AMOUNT | TICKER |
---|---|---|
2019-03-30 03:02:00.000 | 20 | ORCL |
2019-03-30 03:03:00.000 | 30 | IBM |
2019-03-30 03:04:00.000 | 100 | ORCL |
2019-03-30 03:04:30.000 | 40 | IBM |
2019-03-30 03:05:00.000 | 100 | MSFT |
2019-03-30 03:05:30.000 | 50 | MSFT |
2019-03-30 04:04:00.000 | 100 | GOOGL |
2019-03-30 04:06:00.000 | 100 | ORCL |
2019-03-30 04:18:00.000 | 1000 | GOOGL |
2019-03-30 04:44:00.000 | 1000 | ORCL |
With tumbling windows, you can apply a group by, drawing on a column with limited values. When you apply a group by, StreamLab emits one row for each group that appears in the windowed time period. This allows you to display aggregated data for each department, or region, or building, and so on.
In the screen shot below, for example, StreamLab emits one aggregated row every ten seconds for trip_no 500s and 600s. During the first window, no rows with 600s appeared, so StreamLab only emits an aggregated row for 500s.
The Calculate command lets you perform an arithmetical calculation on a column. Use @ for the value of the column. Selected columns must have numerical values, such as DOUBLE. Calculations might be something like the following:
Calculation | Result |
---|---|
@+2000 | Adds 2000 to the column value. |
@-800 | Subtracts 800 from the column value. |
@/2 | Divides the column value by 2. |
@*4 | Multiplies the column value by 4. |
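The @ substitution can be sketched in Python. This is purely illustrative: StreamLab parses these expressions itself, and eval() is used here only for demonstration (it is unsafe on untrusted input in real code):

```python
def calculate(values, expression):
    """Apply a Calculate-style expression, where '@' stands for the
    column value, to every value in a numeric column."""
    out = []
    for v in values:
        # Substitute the column value for '@', then evaluate
        out.append(eval(expression.replace("@", "v"), {"v": v}))
    return out
```

For example, `calculate([10, 20], "@+2000")` yields `[2010, 2020]`.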
To implement a Calculate command:
The Categorize command lets you take a continuous value, such as a bus speed, and replace it with strings such as "fast", "medium", or "slow" by applying conditions to the column. You can process more than one column at once. Because this command replaces the column with one containing strings, you may want to make a copy of the column first, using the New Column command.
To implement a Categorize command:
To add more conditions, click the + icon in the Category window.
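The categorization logic can be sketched in Python. The speed thresholds here (40 and 15) are invented for illustration, not StreamLab defaults:

```python
def categorize(speed):
    """Replace a continuous bus-speed value with a category string,
    applying conditions from highest threshold down."""
    if speed > 40:
        return "fast"
    elif speed > 15:
        return "medium"
    else:
        return "slow"
```

Each condition is checked in order, so overlapping ranges resolve to the first match.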
The Distinct operation lets you eliminate rows with duplicate columns over a specified timeframe.
Choosing Promote to ROWTIME? will produce an output view showing only rows with distinct columns. You can also choose to not promote the new timestamp to ROWTIME, which will allow you to see the new timestamp for which rows are evaluated.
Note: If you do not promote to ROWTIME, new timestamps will display as one millisecond behind the row's rowtime.
To implement Distinct:
For example, if you chose to eliminate duplicate rows over a 1-minute timeframe, and StreamLab received the data below, the last 5 rows would be eliminated because they duplicate the IBM row received at 05:03:00 within one minute. (The second GOOGL row, which arrives at the same time as the first, would also be eliminated.)
ROWTIME | TICKER | AMOUNT |
---|---|---|
2019-03-30 04:18:00.000 | GOOGL | 100 |
2019-03-30 04:18:00.000 | GOOGL | 100 |
2019-03-30 04:18:00.000 | IBM | 15 |
2019-03-30 04:43:00.000 | IBM | 60 |
2019-03-30 04:44:00.000 | ORCL | 1000 |
2019-03-30 04:46:00.000 | ORCL | 3000 |
2019-03-30 05:03:00.000 | IBM | 30 |
2019-03-30 05:03:01.000 | IBM | 30 |
2019-03-30 05:03:02.000 | IBM | 30 |
2019-03-30 05:03:03.000 | IBM | 30 |
2019-03-30 05:03:04.000 | IBM | 30 |
2019-03-30 05:03:05.000 | IBM | 30 |
The Bollinger Bands analytic command lets you create a new column that is a running average of another column over a given window of time. This analytic works especially well for streaming data. The diagram below depicts the creation of a running average on a stock's closing price.
You can also add Bollinger bands to this running average. Bollinger Bands are a volatility indicator, typically used for stock prices.
Bollinger Bands consist of:
StreamLab also lets you define inner and outer values for the upper and lower bands.
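The band arithmetic can be sketched in Python. This is illustrative; the window size and the multiplier k are example values:

```python
from collections import deque
from statistics import mean, pstdev

def bollinger(prices, window_rows=5, k=2.0):
    """For each closing price, emit (running average, upper band,
    lower band) over the last `window_rows` prices. The bands sit
    k standard deviations above and below the average."""
    frame = deque(maxlen=window_rows)
    out = []
    for p in prices:
        frame.append(p)
        avg = mean(frame)
        sd = pstdev(frame)
        out.append((avg, avg + k * sd, avg - k * sd))
    return out
```

When prices are flat the bands collapse onto the average; when prices swing, the bands widen, which is why Bollinger Bands serve as a volatility indicator.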
The Time Sort command uses a sliding, time-based window of incoming rows to reorder those rows by the selected column or by ROWTIME in order to address time synchronization problems with data. For example, sometimes data arriving from multiple sources may not be time-synchronized. Such problems can produce unreliable results from analytics, which typically depend on arriving data being ordered by ROWTIME. You will likely only need to use this command if you are seeing unreliable, time-related results in your data.
In the Time Sort command, you can also choose to promote a selected column to rowtime. You do so in cases where you know that another column contains a timestamp more relevant than the time the row entered s-Server. This might be, for example, the time an earthquake actually occurred instead of the time the row enters s-Server. These values can have slight lags, making data calculations less reliable.
To implement a Time Sort command:
The Window command lets you perform a windowed aggregation on a selected column. The command produces a new column with the result of the operation. To implement a Window command:
The Geo-Locate IP command lets you determine the latitude and longitude associated with IP addresses worldwide.
To use the Geo-Locate IP command:
The Geo-Locate IP command uses the MaxMind GeoIp City database. See https://www.maxmind.com/en/geoip2-city for more details on the MaxMind GeoIp City database.
Note: s-Server ships with a free version of this database. This topic describes how the plugin finds the MaxMind data file and how to use the licensed version for production. You can use the free default database for testing, but for production use, you should use at least an up-to-date GeoLiteCity.dat, and preferably a licensed data file.
GroupRank lets you group rows by the values in one column (group by) and then rank the rows within that group according to another column (rank by) across a window defined by you. To implement a GroupRank operation:
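The grouping-and-ranking semantics over a single window can be sketched in Python. This is illustrative; descending rank order is an example choice:

```python
def group_rank(rows):
    """Group (ticker, amount) rows by ticker (group by), then rank the
    rows within each group by amount, descending (rank by)."""
    groups = {}
    for ticker, amount in rows:
        groups.setdefault(ticker, []).append(amount)
    ranked = []
    for ticker, amounts in groups.items():
        for rank, amount in enumerate(sorted(amounts, reverse=True), start=1):
            ranked.append((ticker, amount, rank))
    return ranked
```

Each group gets its own independent ranking, so two groups can both contain a rank-1 row.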
Join lets you combine data in the current pipeline guide with data from another stream or table. They allow you to enrich data in the current view with data from another source, as long as that source shares a column with the source for the current guide.
For example, if your guide has columns of data on bus positions and speed, and one of the columns contains a driver identification number, you might join the guide with a database table that has information on the driver's names matched to their identification numbers. By doing so, you can enrich your guide by adding columns from the DRIVERS table.
StreamLab supports stream/stream, stream/table, table/stream and table/table joins. These are implemented using the SQL JOIN clause. It also offers a mode called lookup, which provides some control over whether and how records in the lookup table are cached. The lookup mode is implemented using the TableLookup UDX.
StreamLab supports four types of joins:
Join type | Explanation |
---|---|
INNER JOIN | Returns all pairs of rows from left and right for which the join condition evaluates to TRUE. |
LEFT JOIN | As INNER JOIN, but rows from the left are kept even if they do not match any rows on the right; NULL values are generated on the right. |
RIGHT JOIN | As INNER JOIN, but rows from the right are kept even if they do not match any rows on the left; NULL values are generated on the left for these rows. |
FULL JOIN | As INNER JOIN, but rows from both sides are kept even if they do not match any rows on the other side; NULL values are generated on the other side for these rows. |
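LEFT JOIN semantics, for example, can be sketched in Python. This is illustrative; the driver data echoes the hypothetical bus/DRIVERS example earlier in this topic:

```python
def left_join(left, right, key_left, key_right):
    """LEFT JOIN: every left row is kept; rows without a match on the
    right are paired with None (NULL values generated on the right)."""
    index = {}
    for row in right:
        index.setdefault(row[key_right], []).append(row)
    out = []
    for lrow in left:
        matches = index.get(lrow[key_left])
        if matches:
            for rrow in matches:
                out.append((lrow, rrow))
        else:
            out.append((lrow, None))   # NULLs generated on the right
    return out
```

A bus row whose driver_id has no entry in the drivers table still appears in the output, with None for the driver columns.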
To implement a Join analytic:
Choose whether to join to a stream, a table, or a lookup table. Only streams and tables that have been added to StreamLab are available here. Do this first, because your later options change depending on your choice.
Next, choose whether to conduct the join over a window. You'll get better results if you do. You can configure windows on both the left (the pipeline guide input) and the right (the stream/table to be joined).
For each window you've chosen:
Choose whether to create a time-based or row-based window.
Enter values for the window's beginning and end, such as "10 minutes to 0 minutes ago" or "1 hour to 10 minutes ago" or "100 rows to 20 rows ago".
The window types and lengths for the left window and the right window do not need to match.
Choose a join type: inner, left, right, or full. For lookup, you do not need to choose a join type.
Choose a stream or table to join to. The selection will change depending on what you chose in step 1. For stream you see a list of streams, for table a list of tables, and for lookup a list of external tables.
Finally, choose a column from the current stream and the stream/table to be joined. These must match in order for the join to work. You can match on more than one pair of columns (up to three) by clicking + in the top right of the parameter box.
Click the + icon to add the command to the Guide script.
The results of the script appear in the Output View window.
When you define a window using rows, you are defining a ROWS BETWEEN clause in the generated SQL.
When you define a window using time, you will see time from and time until parameters; you are defining a RANGE BETWEEN INTERVAL clause in the generated SQL.
Each match is treated as an equi-join, whereby pipeline guide column = join to column.
Lookup supports a maximum of three predicates.
To learn more about windows in streaming joins, see the topic WINDOW clause in the s-Server Streaming SQL Reference Guide.
The join to can be an SQLstream Stream, or an External Stream that has been created with a Render Mode of either Standard: FS->pump->stream or Directly as stream (implemented as a FOREIGN STREAM).
SQLstream Streams can be defined from Sources (either by creating them individually or by using Import SQLstream Schema).
It is legitimate to join a stream to itself (especially when using windowing).
The join to can be an External Table, or an External Stream that has been created with a Render Mode of Directly as table (which is implemented as a FOREIGN TABLE).
It is legitimate to join a table to itself. For example, you might want to join SALES.EMPS to itself to get information about an employee's manager.
Note: the entire table will be read into memory when the guide starts running; this means that:
The New Column analytic lets you add a new column based on a custom SQL expression.
To implement a New Column command:
The New Column analytic lets you add a column to the guide using a SQL expression. Using this feature requires both knowledge of SQL and knowledge of s-Server's current contents (in that you will need to work with streams and tables in s-Server). For more information on s-Server's streaming SQL, see SQLstream SQL Reference Guide.
The following analytics provide access to advanced s-Server features. Using them requires advanced knowledge of SQL.
This command lets you enter the name of a predefined custom user transform. A user-defined transform is a user-defined function that returns a table or a stream. Its input arguments can be scalars or cursors. A cursor is an object representing a subquery, which the user-defined transform can use to read the subquery results. The subquery can be relational or streaming.
The UDX must be predefined, and it cannot take arguments. It can return either a table or a stream. If it returns a table, you can specify a column that will contain the ROWTIME.
The SQL you enter here will be used to transform the stream you are building in the Guide. The actual SQL added to the script will be in the following form:
CREATE VIEW (output) AS SELECT (stream) <<your SQL here>> FROM (input)
(input), (output) and (stream) are all filled in by StreamLab.
The Custom SQL analytic lets you inject any block of SQL into s-Server.