Generating Test Data with DataGen

DataGen is a program which generates test data for SQLstream applications, and pushes the data into a defined s-Server stream using the SQLstream JDBC Driver as described in the JDBC Driver section of the Integration Guide. It is supplied either as part of the distributed SQLstream product or as part of the Client Tools download from the SQLstream website (via SQLstream--clienttools-linux.run or SQLstream-client-tools-VERSION-windows.exe).

DataGen uses two kinds of spec files written in XML, stored in a directory called “config”. These can be combined into the same XML file, though separating the two is a recommended process.

  • A TupleRule XML document tells Datagen what kind of data to generate, describing the rules for generating sample data for each column in a tuple (row). The format of a TupleRule document is defined by TupleRule.xsd.
  • A DataSet XML document tells Datagen how to deliver this data, in terms of timeframe, pattern, intensity, and so on, describing the rules for generating a sequence of tuples. The format of a DataSet document is defined by DataSet.xsd. A DataSet doc may either refer to an external TupleRule file (recommended) or may contain an embedded TupleRule element.

SQLstream s-Server ships with sample spec files, which are installed in $opt/sqlstream/VERSION/clienttools/datagen/conf/samples.

However, you will most likely want to write TupleRules and DataSets which mimic your real-world conditions. DataGen performs a validating parse of TupleRule and DataSet instance documents using the XSD schema definition documents. To take advantage of schema validation and self-documenting annotations in the XSDs, we recommend you use a a schema-aware authoring tool such as XMLSpy to create TupleRule and DataSet instance documents.

Executing DataGen

You execute DataGen from the command line as follows. Linux:

datagen.sh [options] rulefile rulefile ...

Windows:

datagen.cmd [options] rulefile rulefile ...

where “rulefile” is a spec file such as sales.bids.random.set.xml

DataGen Options

Options are specified from the command line and are defined below. In Linux typical use might be something like the following: datagen.sh -t -v -u jdbc:sqlstream:sdp://localhost:5570 conf/samples/sales.bids.fast.set.xmlA Windows example would work similarly, with “datagen.cmd” substituted for “datagen.sh”.

Option Parameters Explanation
-c, –config= <dir> Specifies filesystem path to local DataGen config directory.
–repeat <num> number of times to repeat the data set default is 1
–streamSuffixBegin <num> Specifies beginning value of the integer suffix for the stream name. Not commonly used.
–streamSuffixEnd <num> Specifies ending value of the integer suffix for the stream name. Not commonly used.
-f, –file= <filename> Specifies filesystem path to output CSV file. Only applicable if “csv” selected for sinktype.
-h, –help Displays help.
-n, –username= <username> Specifies username (if necessary) for connection to SQLstream server.
-p, –password= <password> Specifies password (if necessary) for connection to SQLstream server.
-s, –sink= <sinktype> Specifies data sink type: “stream”, “null”, or “csv"Default=stream.
-t, –timing Requests a final message with times & row rates, which goes to stdout.
-u, –uri= <uri> Specifies JDBC URI for SQLstream server.Example: jdbc:sqlstream:sdp://host:port
-v, –verbose Enable verbose output. This option requests a few start/stop/progress messages, which go to stdout.

XML spec files

XML files are specified from the command line after options, and can be one or one or more Tuple or DataSet rules.

<TupleRule>

The TupleRule determines the content and form of the date itself. You write these to mimic the kind of data you want to test in SQLstream s-Server. The tag itself contains two attributes

  • note:* Optional note to describe purpose or characteristics of this tuple rule, for change history and so on. Attributes: date, version.
  • sink:* Data sink specification. Attributes: schema (name for sink object), name (stream, view, table)

The following rules are enclosed within the <TupleRule> tag. Any of the following rules can be specified in any order. Attributes are defined in another table following.

Rule Description Attributes and Example
LongDataRule Rule for generating long data. CommonRuleAttributes: name, randomize, seed, wrap. LongRuleAttributes: first, last, step.Example: This rule would generate a series of values for a column named “shares”, starting at 300 and increasing in increments of 20 until it reaches 70.
TimeDataRule Rule for generating datetime data. CommonRuleAttributes: name, randomize, seed, wrap. TimeRuleAttributes: first, last, pattern, step. Example: Note: “*” tells Datagen to start with current time. This rule would generate a series of values for a column named “time”, starting at the current time when the rule is called and increasing in increments of 600000 milliseconds until it reaches 7200000 milliseconds from the starting time.
DoubleDataRule Rule for generating double-precision data. CommonRuleAttributes: name, randomize, seed, wrap. DoubleRuleAttributes: first, last, precision, step. Example:. This rule uses a double value in order to implement floating point variables (15.5, 16.5, 17.5, and so on). See https://www.homeandlearn.co.uk/java/double_variables.html for more information. This rule would generate a series of values for a column named “price”, starting at 15 and increasing in increments of .50 until it reaches 20. Note: Precision refers to the number of decimal points retained. If undeclared or zero, all decimal points are preserved.
TimeRule first, last, pattern, step Example: <TimeDataRule name=“time” first="” last="+7200000" step=“600000”/> Note: “*” tells Datagen to start with current time. This rule would generate a series of values for a column named “time”, starting at the current time when the rule is called and increasing in increments of 600000 milliseconds until it reaches 7200000 milliseconds from the starting time.
DoubleRule first, last, precision, step. Example: <DoubleDataRule name=“price” first=“15.00” last=“20.00” step=“0.50” precision=“2”/> This rule uses a double value in order to implement floating point variables (15.5, 16.5, 17.5, and so on). See https://www.homeandlearn.co.uk/java/double_variables.html for more information. This rule would generate a series of values for a column named “price”, starting at 15 and increasing in increments of .50 until it reaches 20. Note: Precision refers to the number of decimal points retained. If undeclared or zero, all decimal points are preserved.
StringDataRule Rule for generating string data. CommonRuleAttributes: name, randomize, seed, wrap. StringRuleAttributes: phraseCount Example: <StringDataRule name="_commentAlpha"> <item>A</item><item>B</item><br/><item>C</item></StringDataRule>
CompoundDataRule Rule for generating formatted data from multiple constituent rules. Name, format.CompoundDataRule also constituent DataRules enclosed in its tag.

Example:

<CompoundDataRule name="comment" format="sample comment">
     <StringDataRule name="\_commentAlpha">
       <item>A</item>
       <item>B</item>
       <item>C</item>
     </StringDataRule>
     <LongDataRule name="_commentNum" first="1" last="3" step="1"/>
   </CompoundDataRule>

This rule would generate a concatenated column called “comment”, with values “A1”, “B2”, and “C3”. |

Definition of Attributes for Above

Attribute Type Description Values
name STRING Name of column
randomize BOOLEAN Whether randomization should be done true => generate randomized data;false => generate linear data that might also wrap
seed LONG A positive randomization seed Only used if randomize is true.
wrap BOOLEAN Wrap to beginning of data range. true, falseOnly applies if randomize is false.
first Use the same type as in the rule being applied. Used as starting or minimum value in generating data values.
last Use the same type as in the rule being applied. Used as ending or maximum value.
step Use the same type as in the rule being applied. Step size (default is 1). 0
precision INT Precision refers to the number of decimal points retained. If undeclared or zero, all decimal points are preserved. < 0 means ALL (only applies to DoubleDataRule)

<TupleRule> Example

The rule below produces tuples with the following columns: time, ticker, shares, price, expiryMs, comment.

<?xml version="1.0" encoding="UTF-8"?>
<TupleRule
       xmlns="http://www.SQLstream.com/xml/datagen/0"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.SQLstream.com/xml/datagen/0 ../TupleRule.xsd">
   <note version="0.3" date="2020-06-18">linear rules for sales.bids stream</note>
   <sink schema="SALES" name="BIDS" />
   <!-- stream columns: time, ticker, shares, price, expiryMs, comment -->
   <TimeDataRule name="time" first="*" last="*+7200000" step="600000"/>
   <StringDataRule name="ticker">
       <item>SQLS</item>
       <item>ORCL</item>
       <item>MSFT</item>
       <item>ADBE</item>
   </StringDataRule>
   <LongDataRule name="shares" first="300" last="700" step="200"/>
   <DoubleDataRule name="price" first="15.00" last="20.00" step="0.50" precision="2"/>
   <LongDataRule name="expiryMs" first="600000" last="18000000" step="60000"/>
   <CompoundDataRule name="comment" format="sample comment %s-%d">
     <StringDataRule name="_commentAlpha">
       <item>A</item>
       <item>B</item>
       <item>C</item>
     </StringDataRule>
     <LongDataRule name="_commentNum" first="1" last="3" step="1"/>
   </CompoundDataRule>
</TupleRule>

DataGen and Rowtimes

When working with streams, ROWTIME is a special column. ROWTIME is typically a system-defined column of type TIMESTAMP, included in every stream definition. Rows in a stream always have monotonically ascending values for ROWTIME. However, you can also explicitly set a value for ROWTIME. When you do so, the explicit value of ROWTIME substitutes for the explicit system-defined ROWTIME column.

To specify a ROWTIME value, you will need to write a TimeDataRule with a column named ROWTIME. The name “ROWTIME” tells s-Server to substitute this value for the system-generated value.

Using fixed ROWTIMEs

When you set an explicit ROWTIME, you cause the sink (target) stream rowtime bounds to be set to a datetime in the past.

Before running DataGen with an explicitly defined ROWTIME, you should restart SQLstream s-Server to reset the sink stream’s rowtimes before running a DataSet with fixed rowtimes Otherwise, rows from the second run will be rejected as being older than the last ROWTIME as set by the 1st run. Restarting SQLstream leaves stream rowtime bounds unconstrained until the first row (or rowtime bound) has been processed.

Note: Even if you set ROWTIME to a future time, are used in the test, the start timestamp is still earlier than the end timestamp, and the SQLstream s-Server should be restarted to clear stream rowtime bounds.

You can also clear stream rowtimes by dropping and recreating the stream.

Using current ROWTIMEs

Current times can be expressed using simple contemporaneous date-time specs wherever a timestamp value is required: Current times can be expressed using simple contemporaneous date-time specs wherever a timestamp value is required:

Simple Spec Definition
“*” current-time when constructor runs.
“**” current-time each time a row is generated.
“*+nnnn” current-time plus delta in milliseconds.
“*-nnnn” current-time minus delta in milliseconds.

Attributes

The tag determines how data is delivered. The following attributes, specified in a DataSet document, describe the amount, pattern, and frequency of tuples (rows) generated. The following are all attributes of the <DataSet> tag.

Attribute Type Description
start DATE Date/time when DataGen should begin streaming data. If value is zero, DataGen will start immediately. Current times can be expressed using simple contemporaneous date-time specs wherever a timestamp value is required:Example:2020-10-22T21:30:00.0Z
stop DATE Date/time when DataGen should stop streaming data. If value is zero, DataGen will run forever or until tuple count is reached. Example:2020-10-22T21:30:00.0Z
commit BOOLEAN Auto or manual commit mode.True = Autocommit on
tupleStyle CHAR The tupleStyle enum must be one of {none, steady, random, bursty} These indicate the pattern of data generation: none. This generates no tuples, rowtime bounds (punctuation) are sent if punctStyle=TUPLE.steady: tuples delivered by a millisecond interval determined by tupleIntervalrandom: tuples delivered at random interval, n millis, where 0 <= n <= tupleIntervalbursty: tuples delivered in bursts set by burstCount. Interval between tuples in milliseconds is determined by tupleInterval. Interval between bursts in milliseconds is determined by burstInterval.
burstCount LONG Number of tuples per burst.Only applies if bursty is selected for tupleStyle.
burstInterval LONG Milliseconds between bursts. Only applies if bursty is selected for tupleStyle.
punctStyle STATIC ENUM The punctStyle enum must be one of {none, tuple, burst}. Whether and how to send a rowtime bound.none: no explicit rowtime bound (punctuation) sent.tuple: explicit rowtime bound (punctuation) sent after each tuple.burst: explicit rowtime bound (punctuation) sent after each burst.
finalPunct DATE Date/time of final rowtime bound. If value is zero, no final rowtime bound will be sent.
finalPause LONG Milliseconds between last tuple and stream close.
maximumRate LONG Maximum rate in tuples per second.

A DataSet document can also have a <note> tag, which has version and date attributes and contains a description of the dataset:

 <note version="0.1" date="2007-07-30">5000-row bursty test with linear data and burst punctuation</note>

A DataSet document can also link to an external TupleRuleFile document. We recommend defining TupleRules in external documents. This lets you define multiple datasets using the same TupleRule:

 <TupleRuleFile href="sales.bids.linear.row.xml"/>

Example

The following example defines a data set and links to an external TupleRule file.

<?xml version="1.0" encoding="UTF-8"?>
<DataSet
 xmlns="http://www.SQLstream.com/xml/datagen/0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.SQLstream.com/xml/datagen/0 ../DataSet.xsd"
 start="2006-10-22T21:30:00.0Z"
 stop="2006-10-22T21:30:30.0Z"
 commit="auto"
 tupleStyle="bursty"
 tupleCount="5000"
 tupleInterval="0"
 burstCount="25"
 burstInterval="500"
 punctStyle="burst"
 finalPause="2000" >
 <note version="0.1" date="2007-07-30">5000-row bursty test with linear data and burst punctuation</note>
 <TupleRuleFile href="sales.bids.linear.row.xml"/>
</DataSet>

Combined Example

You can combine <DataSet> and <TupleRule> in the same file (though this practice is not recommended because it provides less flexibility in terms of separating data format from data delivery.

<?xml version="1.0" encoding="UTF-8"?>
<DataSet
 xmlns="http://www.SQLstream.com/xml/datagen/0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation=
       "http://www.SQLstream.com/xml/datagen/0 ../../datagen/DataSet.xsd"
 start="2006-10-22T21:30:00.0Z"
 stop="2006-10-22T21:30:30.0Z"
 tupleStyle="steady"
 tupleCount="5000"
 tupleInterval="100"
 burstCount="0"
 burstInterval="0"
 punctStyle="none"
 finalPause="2000" >
 <note version="0.1" date="2020-07-10">5000 random bids on several tickers, 10/sec</note>
 <TupleRule>
   <note version="0.1"
         date="2020-07-10">random rules for sales.bids stream</note>
   <sink schema="SALES" name="BIDS" />
   <!-- stream columns: time, ticker, shares, price, expiryMs, comment -->
   <TimeDataRule name="time" randomize="true" seed="13" first="*" last="*+7200000"/>
   <StringDataRule name="ticker" randomize="true" seed="17">
       <item>SQLS</item>
       <item>ORCL</item>
       <item>MSFT</item>
       <item>ADBE</item>
   </StringDataRule>
   <LongDataRule name="shares" randomize="true" seed="19"
       first="300" last="700" step="100"/>
   <DoubleDataRule name="price" randomize="true" seed="23"
       first="15.00" last="200.00" step="5.00" precision="2"/>
 </TupleRule>
</DataSet>

DataGen Properties File

Using the properties file datagen.properties you can specify URI, username and password. For SQLstream these properties can usually be defaulted:

Property name Default Description
datagen.datasink.jdbc.uri jdbc:sqlstream: default sink (target) JDBC URI
datagen.datasink.jdbc.user sa username for server connections
datagen.datasink.jdbc.pass (none) password for server connections