This tutorial walks you through using Guavus SQLstream's SQL-generating, browser-based graphical tool, StreamLab.
StreamLab is a web application development environment that automatically generates streaming SQL. StreamLab lets you set up projects where you can parse incoming streaming data, generate real time analytics for it, visualize it, so that you can take action on it. You can gather streaming data from log files, sensors, message buses, network feeds, applications, databases, and other sources. StreamLab does not require any knowledge of SQL programming to use it. But if you do know SQL, you can use StreamLab to generate blocks of SQL, and then modify these in another application.
The tutorial uses a set of data from buses in Sydney, Australia. Each bus sends out a row of data to a server, which continually writes this data to a text file known as a log file. Log files are used by most servers to output activity information for the server.
In this tutorial, you do the following:
Let's begin by connecting to the log file and testing access to it.
To begin working with data in StreamLab, the first thing you need to do is to connect to a streaming data source. The kind of data source will depend on your system's needs, but might include log files, network feeds, Kafka topics, and so on.
In the following exercise, we will capture data from a streaming log file and use this data to identify buses that exceed the speed limit in the Sydney area. Because this data is streaming, speeders will be identified in real time. We will derive this data by connecting to a simulated streaming log file located on your local machine. In this case, the log file records messages sent out by buses in Sydney, Australia.
To begin, you need to start StreamLab and WebAgent. If you are running Guavus SQLstream in a Docker container or other virtual machine, you can do so from the Guavus SQLstream Cover Page. If you are running Guavus SQLstream on Linux on a regular install, you can start both StreamLab and the streaming data source manually.
This file features data in the following categories:
|id||DOUBLE||Identification number for the bus.|
|reported_at||TIMESTAMP||Time location was reported.|
|shift_no||DOUBLE||Shift number for the bus's driver.|
|driver_no||DOUBLE||Driver identification for number.|
|prescribed||VARCHAR(4096)||The direction on the motorway (into Sydney or out of Sydney).|
|highway||DOUBLE||Highway number if available.|
|gps||VARCHAR||GPS information with latitude, longitude, and bearing in JSON format.|
You will use StreamLab to capture information from this file.
Launch StreamLab by entering localhost:5590 in your browser. We recommend Chrome on Windows, Safari on Mac, and Chromium on Linux.
If you're running in a virtual machine, use the browser in your host OS for the best performance.
Click Discover Format and click Start in the window that opens. Discover Format should return XML as format.
Click Accept Format. StreamLab automatically configures the source:
Test this configuration using the Sample 5 Rows from Source button. StreamLab reads 5 rows from the source and stops. If no data arrives, check that you are still running the Stream Bus Data script and that you have entered the parameters correctly and try Sample 5 again.
Click the Up Arrow to exit the source, and up again to exit the Sources page.
With most real world data, you will need to do some formatting before you can work with it. Usually, this comes down to two things: changing the names of columns so that they are meaningful, and changing the data types for the columns.
Like databases, s-Server stores columns with a type. The data type lets s-Server understand how SQL will interact with the column's data. For example, you cannot perform numerical calculations on columns with a text data type.
You can use StreamLab Pipeline Guides to format the data.
Pipeline Guides are collections of commands, suggestions, and scripts that let you CREATE VIEWs on data sources. These views are composed of SQL. You use the Guide interface to generate this SQL. Begin by adding a new guide to StreamLab.
Guides are stored as collections of views in s-Server. VIEWs are reusable queries on a stream or table--in this case, on the stream called StreamLab_Output.line1.
If you want, you can type in a more meaningful prefix to be used in naming these views. You have now added the log file as a streaming source for the guide.
As you make changes, changes appear in the Script. As you add items to the Script, the streaming database in s-Server is modified.* You can always undo script changes.
StreamLab suggests changes that you might make to the data. As data comes into StreamLab, a piece of software called the Scrutinizer examines the data, looking for possible commands to apply to the data. For example, it might notice that the file contains regular commas, and suggest splitting the file into columns at every comma.
The table in the middle, the output view, shows the stream resulting from the steps in the script, the output of this Guide.
StreamLab, and s-Server, depend on a special column called rowtime to deliver accurately timed data. Often, as a developer, you will want to take a column from the existing data with a timestamp and "promote" it to rowtime.
Here, you want to get the timestamp from the _reportedat field.
To do so:
Note that the reported_at column has disappeared and that the ROWTIME column now lists timestamps beginning with 2014.
You are now working with the times at which the bus data was originally recorded, instead of the time it was written to the /tmp/buses.log file.
The gps column is currently in unparsed JSON. You can parse the JSON using the Parse JSON command.
To do so:
The Output View should now display lat, lon, and bearing as distinct columns.
In the next step, you will change the data type of the Speed column, so that StreamLab can analyze it as a number. (Remember, you cannot perform numerical calculations on columns with a text data type.)
Note that the lat and lon columns are outlined in green. This means that StreamLab thinks they are numbers.
In this case we only want to cast one column, so we'll do it by hand.
As you have been applying commands, StreamLab has been generating SQL.
SQL stands for Structured Query Language and has been a standard since 1986. SQLstream uses a slight variant of SQL called streaming SQL.
For all projects, you can view the SQL generated by the project, as well as the SQL for individual pages.
To do so, click the Goals and Scripts button in the upper right corner of StreamLab. . Note that StreamLab has created a series of views that make up the Guide. At this point, you have formatted your imported data, and you are ready to perform a calculation on it.
Now that you have formatted the imported data, you are ready to perform a calculation on the data. A calculation can be as simple as multiplying the value of a column by two, or pretty much as complex as you need.
Click on the Analytics tab at the top of ScreenLab. Note the group of commands listed:
Analytics let you apply a number of different analytics to the Output view in StreamLab. Some of these produce new columns, and others change selected columns themselves.
|Aggregate||This command lets you aggregate a column, including COUNT, SUM, and AVERAGE. You apply these over sliding or tumbling windows. Windows express an amount of rows defined by time or number of rows, over which the aggregation is applied.|
|Bollinger Bands||This command lets you create a new column that is a running average of another column over a given window of time. You can also add Bollinger bands to this running average. Bollinger Bands are a volatility indicator, typically used for stock prices.|
|Calculate||This command lets you perform an arithmetical calculation on a column, such as adding 2000 to the column or dividing the column in 2. Selected columns must have numerical values, such as DOUBLE.|
|Categorize||This command lets you take a continuous value, such as a bus speed, and replace it with strings such as "fast", "medium", or "slow" by applying conditions to the column.|
|Distinct||The Distinct operation lets you eliminate rows with duplicate columns over a specified timeframe.|
|Geo-Locate IP||This command lets you determine the latitude and longitude associated with IP addresses worldwide.|
|GroupRank||This command lets you group rows by the values in one column (group by) and then rank the rows within that group according to another column (rank by) across a window defined by you.|
|Join||Join lets you combine data in the current pipeline guide with data from another stream or table. They allow you to enrich data in the current view with data from another source, as long as that source shares a column with the source for the current guide.|
|Time Sort||This command uses a sliding time-based window of incoming rows to reorder those rows by the selected column or by ROWTIME in order to address time synchronization problems with data.|
|New Column||The New Column analytic lets you add a new column based on a custom SQL expression.|
Here, we're going to apply a simple calculation to the speed column that identifies buses going faster than 45 kilometers an hour.
To do so, you use an analytic called Categorize.
Now that you've formatted and applied a calculation to your data, the next step is to create a dashboard to display your data.
A dashboard is a graphical display of data. You create these by assigning them parameters, launching them from StreamLab, and then changing preferences for the dashboard.
Using the speed column that you categorized in the previous section, you can create a dashboard with a map that shows speeding buses as red and non-speeding buses as green.
To do so:
Click Map and Table This dashboard requires three columns: the latitude and longitude for each event and a value to be shown at that location.
You can leave map type at its default. Select the lat column for latitude and the lon column for longitude.
After entering information for the dashboard, click the + button in the Suggestions window to add the category to the Script.
Click the View Dashboard icon in the Script (it looks like an eye) to launch the dashboard. (You can always launch dashboards from the Script for any particular step in the Script.)
A dashboard appears, with two panels created. The same stream is open in both panels, the top panel shows it as a table while the bottom shows StreamLab's World Events map.
Now that you have accessed the log file through StreamLab, formatted the data's columns, and created a dashboard based on this data, you can share the dashboard with other users with the URL you copied into the clipboard.
You can test the dashboard by opening a new browser and pasting the dashboard address into the address bar.
You first need to start the StreamLab server. If you've installed StreamLab with a regular installation, there should be a folder on your desktop called StreamLab. Open this folder and click the Start StreamLab Server icon:
to launch StreamLab.
Now that the StreamLab server is running, open a terminal and enter the following:
Once you hit enter, an information window opens.
Keep this window open for the remainder of the exercise. As long as this window is open, sample data from buses in the Sydney area is flowing into a file located at /tmp/buses.log at 50 rows per second. This simulates a log file being continually updated. We're going to convert this log file into a stream.