The Streaming Data Source option lets you select a source that you know is streaming: a file, an HTTP endpoint, a WebSocket, a network socket, an AMQP message bus, a Kafka topic, an AWS Kinesis stream, or a Teradata listener. StreamLab will automatically determine the format for the source.
To add a Streaming Data source:
Enter connection information for the input source. For example, to access a File source, you need to enter directory and filename pattern information for the file. By default, StreamLab uses the project schema for the new source. If you wish to use a different schema, click the dropdown menu to the right of Schema.Stream. You can also choose a different name for the stream by clicking the dropdown menu that reads “data_1”.
Click the Discover Format button. This feature examines the file to determine its file format. Currently, the Discovery parser can identify CSV, XML, JSON, and Avro files. StreamLab can also work with ProtoBuf files, but you need to add these as their own source. Avro files may require additional configuration to work.
The Discover Format dialog box opens. You can set the number of bytes for the Discover Format feature to read and a timeout for the feature. See Troubleshooting Discovery below. In most cases, the defaults should suffice.
Click Start. The Discover Format feature runs. The left section of the dialog box should display a format: either CSV, JSON, XML, or Binary.
Click Accept. The indicated format should be automatically selected under Format. You can also choose the Line format, which lets you access files line-by-line.
Next, fill in the list of columns and their SQL types. You can use the Clipboard to copy column names and types from another form.
Test the source by clicking the Sample 5 Rows from Source button.
Click the Go Up arrow to exit the Edit Source page.
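To make the idea of format discovery concrete, here is a minimal Python sketch of how a byte sample could be classified. It is not StreamLab's Discovery parser; the function name `guess_format` and the heuristics it uses are assumptions chosen purely for illustration.

```python
import csv
import json


def guess_format(sample: bytes) -> str:
    """Hypothetical heuristic: classify a byte sample as JSON, XML, CSV, BINARY, or UNKNOWN."""
    try:
        text = sample.decode("utf-8").strip()
    except UnicodeDecodeError:
        # Bytes that are not valid UTF-8 are treated as a binary payload.
        return "BINARY"
    if not text:
        return "UNKNOWN"

    # JSON: the first line (newline-delimited JSON) parses as an object or array.
    first_line = text.splitlines()[0]
    try:
        if isinstance(json.loads(first_line), (dict, list)):
            return "JSON"
    except json.JSONDecodeError:
        pass

    # XML: a text sample that opens with a tag. The sample may end mid-record,
    # so this sketch does not attempt a full parse.
    if text.startswith("<"):
        return "XML"

    # CSV: the standard-library sniffer finds a consistent delimiter.
    try:
        csv.Sniffer().sniff(text, delimiters=",;\t|")
        return "CSV"
    except csv.Error:
        return "UNKNOWN"
```

A sample that ends partway through a record can still be classified by these checks, which is one reason a sketch like this looks only at the start of the data rather than parsing all of it.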
The Sample Bytes field determines how many bytes Discovery reads before analyzing the input. This number can greatly affect Discovery’s performance. If you set it too high and your data is arriving slowly, you won’t see any response from Discovery until it has read that many bytes. If you set it lower than the size of a single record in your input, Discovery will have difficulty determining your file’s format. A good rule of thumb is to set Sample Bytes to about 5X the size of a record in your input, so that Discovery sees multiple records and can make a better guess as to the data types of the columns it finds. For example, if each record is 80 bytes, 5X works out to 400 bytes; a larger value such as 4096 also works, provided the source delivers data quickly enough to fill it.
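The rule of thumb can be worked out with a few lines of Python. This is only a sketch: the helper `suggest_sample_bytes` is hypothetical, and it assumes you have a few representative records on hand to measure.

```python
from typing import Sequence


def suggest_sample_bytes(example_records: Sequence[bytes], multiple: int = 5) -> int:
    """Hypothetical helper: return `multiple` times the largest example record size."""
    if not example_records:
        raise ValueError("need at least one example record")
    if multiple < 1:
        raise ValueError("multiple must be at least 1")
    # Use the largest record so the sample is never smaller than one record.
    largest = max(len(record) for record in example_records)
    return largest * multiple


# An 80-byte record with the default multiple of 5 gives 400 bytes.
print(suggest_sample_bytes([b"x" * 80]))
```

Measuring the largest record rather than the average keeps the suggestion from falling below the size of any single record, which is the failure case described above.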
If you know or suspect that your streaming data source is Apache Avro, you may need to take additional steps to configure this source.
Before running Discover Format, select Binary for the Format option.
Under Binary Format, select AVRO.
If you know your Avro payload has a schema, check the This Payload Has a Schema String as a Prefix box and enter the location of the schema for the AVRO Schema Location option. (The option formerly named AVRO_SCHEMA_FILE is now AVRO_SCHEMA_LOCATION.) This option can be either an HTTP URL from which to fetch the schema or a path to a file on the server host machine or VM.
Note: If you do not select Binary as the format, Discovery may either recommend “UNKNOWN” or return a CSV with a single column.
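Because the schema location can be either an HTTP URL or a file path, it can help to confirm that the location resolves to valid schema JSON before entering it. The sketch below is an assumption about how you might check this yourself in Python; `load_avro_schema` is a hypothetical helper and is not part of StreamLab.

```python
import json
from pathlib import Path
from typing import Any
from urllib.parse import urlparse
from urllib.request import urlopen


def load_avro_schema(location: str) -> Any:
    """Hypothetical helper: read an Avro schema (JSON) from an HTTP(S) URL or a local path."""
    scheme = urlparse(location).scheme.lower()
    if scheme in ("http", "https"):
        # Fetch the schema text over HTTP; the 10-second timeout is an arbitrary choice.
        with urlopen(location, timeout=10) as response:
            text = response.read().decode("utf-8")
    else:
        # Anything else is treated as a path on the local machine.
        text = Path(location).read_text(encoding="utf-8")
    # Avro schemas are JSON documents; this raises json.JSONDecodeError if the text is not valid JSON.
    return json.loads(text)
```

If the call raises an error for the location you plan to use, the path or URL is likely wrong or the file does not contain valid JSON.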