Parsers

The Parsers commands let you apply predefined parsers to a stream of data. These are especially useful with log files.

Parse Timestamp

The Parse Timestamp lets you create a timestamp out of any correctly formatted input string of CHAR, VARCHAR2, NCHAR, or NVARCHAR2 datatype to a TIMESTAMP datatype.

You can specify which parts of the timestamp string you wish to use in subsequent processing, and create a TIMESTAMP value containing only those. To do so, you specify a template that identifies the parts of the timestamp you want. The input date-time string can contain any parts of a full timestamp (‘yyyy-MM-dd hh:mm:ss’). If all these elements are present in your input string, and ‘yyyy-MM-dd hh:mm:ss’ is the template you supply, then the input-string elements are interpreted in that order as year, month, day, hour, minute, and seconds, such as in ‘2009-09-16 03:15:24’. The yyyy cannot be uppercase; the hh can be uppercase to mean using a 24-hour clock.

Those fields in the resulting TIMESTAMP will then contain the corresponding data taken from your input-date-time string; fields not specified in your template will use default values (see below). The format of the template used by SYS_BOOT.MGMT.CHAR_TO_TIMESTAMP is defined by the Java SimpleDateFormat class. See also the Date and Time patterns discussion in this SQLstream SQL Reference Guide.

Regex Split

The Regex Split command parses (separates) a character string based on Java Regular Expression patterns as defined in java.util.regex.pattern. For more information on using regular expressions, see http://docs.oracle.com/javase/tutorial/essential/regex/

The columns returned will be COLUMN1 through COLUMNn, where n is the number of groups in the regular expression. The columns will be of type varchar(1024).

To use the Regex Split parser, you select the column to which you want to apply a Java Regular Expression pattern, then enter the pattern in the regular expression box. This topic provides a cursory explanation of RegEx, but you may want to consult an expert for help in creating RegEx strings. For example, the following Regular Expression returns two columns with zero or more of [0-9] from the string:

([^0-9]*)1*([^0-9]*)2*([^0-9]*)

So if the selected column contained a string like

'abcde111fghij22klm'

the columns returned would be the following:

         +----------+-----------+
         | COLUMN1  | COLUMN2   |
         +----------+-----------+
         | 111      | 22        |
         +----------+-----------+
         1 row selected

If the Regular Expression you enter is invalid, StreamLab returns an error.

You can select from Full or Fast parsing. Full is more accurate, while Fast, as it suggests, works more quickly.

Full

Columns are based on match groups defined in the regex-pattern. Each group defines a column, and the groups are processed from left to right. Failure to match produces a NULL value result: If the regular expression does not match the the string passed as the first parameter, NULL is returned.

Fast

The FAST_REGEX_LOG_PARSE works by first decomposing the regular expression into a series of regular expressions, one for each expression inside a group and one for each expression outside a group. Any fixed length portions at the start of any expressions are moved to the end of the previous expression.

If any expression is entirely fixed length, it is merged with the previous expression. The series of expressions is then evaluated using lazy semantics with no backtracking. (In regular expression parsing parlance, “lazy” means don’t parse more than you need to at each step. “Greedy” means parse as much as you can at each step.)

Quick Regex Reference

For full details on Regex, see java.util.regex.pattern

  • [xyz] Find single character of: x, y or z
  • [^abc] Find any single character except: x, y, or z
  • [r-z] Find any single character between r-z
  • [r-zR-Z] Find any single character between r-z or R-Z
  • \^ Start of line
  • $ End of line
  • \A Start of string
  • \z End of string Any single character
  • \s Find any whitespace character
  • \S Find any non-whitespace character
  • \d Find any digit
  • \D Find any non-digit
  • \w Find any word character (letter, number, underscore)
  • \W Find any non-word character
  • \b Find any word boundary
  • (…) Capture everything enclosed
  • (x | y) Find x or y (also works with symbols such as \d or \s)
  • x? Find zero or one of x (also works with symbols such as \d or \s)
  • x* Find zero or more of x (also works with symbols such as \d or \s)
  • x+ Find one or more of x (also works with symbols such as \d or \s)
  • x{3} Find exactly 3 of x (also works with symbols such as \d or \s)
  • x{3,} Find 3 or more of x (also works with symbols such as \d or \s)
  • x{3,6} Find between 3 and 6 of x (also works with symbols such as \d or \s)

Parse W3C Log

W3C_LOG_PARSE supports access to logs generated by W3C-compliant applications like the Apache web server, producing output rows with one column for each specifier. The data types are derived from the log entry description specifiers listed in the Apache mod_log_config specification.

You can choose from one of five pre-defined parsers. Each parser breaks the log into predefined columns with one column for each format specifier. The column’s type is determined by the specifier’s output. For example, %b represents the number of bytes sent in processing an HTTP request, so the column type will be numeric.

Format Name W3C Name Format Specifiers
COMMON Common Log Format (CLF) %h %l %u %t “%r” %>s %b
COMMON WITH VHOST Common Log Format with Virtual Host %v %h %l %u %t “%r” %>s %b
NCSA EXTENDED NCSA extended/combined log format %h %l %u %t “%r” %>s %b “%[Referer]i”“%[User-agent]i”
REFERER Referer log format %[Referer]i —> %U

Parse CDR

CDRs are Caller Data Records. StreamLab provides four built-in parsers for these:

  • SONUS-Common.
  • SONUS-Stop.
  • Broadsoft.
  • Genband

You can customize the delimiter, escape, and quotation characters for each of these formats.