Parsing W3C Data

The W3C option lets you parse logs generated by W3C-compliant applications. You describe file entries using the data specifiers defined in the Apache mod_log_config documentation. The W3C parser uses the W3C_LOG_PARSE function, described in the topic W3C_LOG_PARSE in the SQLstream Streaming SQL Reference Guide. You can also call that function directly anywhere in your code. The W3C parser for the Extensible Common Data Framework parses W3C log data as it arrives in s-Server, which may be desirable for performance or other reasons.
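As a rough illustration of calling the function directly, the sketch below parses raw Common Log Format lines that are already in a stream. The stream "mozilla"."RawLogStream" and its column "logLine" are hypothetical, and the output field names COLUMN1 through COLUMN7 are an assumption about how the function labels its results; check the W3C_LOG_PARSE topic for the exact signature and field names before relying on this.

```sql
-- Hypothetical input: a stream "mozilla"."RawLogStream" with a single
-- VARCHAR column "logLine" holding one unparsed log entry per row.
-- Assumption: W3C_LOG_PARSE returns a record whose fields are named
-- COLUMN1..COLUMN7 for the seven specifiers of the COMMON format.
SELECT STREAM
    l.r.COLUMN1 AS "host",
    l.r.COLUMN2 AS "ident",
    l.r.COLUMN3 AS "authUser",
    l.r.COLUMN4 AS "reqTime",
    l.r.COLUMN5 AS "request",
    l.r.COLUMN6 AS "status",
    l.r.COLUMN7 AS "bytes"
FROM (
    SELECT STREAM W3C_LOG_PARSE("logLine", 'COMMON')
    FROM "mozilla"."RawLogStream"
) AS l(r);
```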

To use the Extensible Common Data Adapter with W3C files, set parser to W3C, then pass in groups of filters that will map to columns. The W3C parser takes one additional property, FORMAT, which takes data specifiers defined in the Apache mod_log_config documentation.

Examples of these are provided below.

Column names cannot be dynamically assigned with W3C files. You need to declare them as part of the foreign stream or table.

You can also input data in larger chunks and parse it later using the Parser UDX. This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server in this guide.

The s-Server trace log includes information on readers' and parsers' progress. See Periodic Parser Statistics Logging in the Administering Guavus SQLstream guide.

Note: SQLstream handles Apache log format specifiers without alteration.

Foreign Stream Options for Parsing W3C Files

Option | Definition
FORMAT | Format specification, such as "%h %l %u %t "%r" %>s %b". See http://httpd.apache.org/docs/current/mod/mod_log_config.html

Examples of Commonly Used Log Format Strings

Format Name       | W3C Name                            | Format Specifiers
COMMON            | Common Log Format (CLF)             | %h %l %u %t "%r" %>s %b
COMMON WITH VHOST | Common Log Format with Virtual Host | %v %h %l %u %t "%r" %>s %b
NCSA EXTENDED     | NCSA extended/combined log format   | %h %l %u %t "%r" %>s %b "%[Referrer]i" "%[User-agent]i"
REFERRER          | Referrer log format                 | %[Referrer]i ---> %U
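To illustrate how a format string from this table lines up with declared columns, here is a minimal sketch of a foreign stream for the COMMON format. It assumes one declared column per specifier, in order, and a FILE server named "mozilla_server" like the one defined in the sample foreign stream in this topic. The stream name "CommonLogStream", the column names, and the column widths are hypothetical choices, not values from the product documentation.

```sql
-- Sketch only: seven columns for the seven COMMON specifiers
--   %h     %l      %u        %t       "%r"     %>s     %b
-- Assumes the schema "mozilla" and the server "mozilla_server" already exist.
CREATE OR REPLACE FOREIGN STREAM "mozilla"."CommonLogStream"
(
    "host"     VARCHAR(64),   -- %h  remote host
    "ident"    VARCHAR(16),   -- %l  remote logname
    "authUser" VARCHAR(32),   -- %u  remote user
    "reqTime"  VARCHAR(32),   -- %t  time the request was received
    "request"  VARCHAR(512),  -- %r  first line of the request
    "status"   VARCHAR(3),    -- %>s final status code
    "bytes"    VARCHAR(12)    -- %b  response size in bytes
)
SERVER "mozilla_server"
OPTIONS (
    directory '/tmp',
    filename_pattern 'access_\d{4}(-\d\d){3}(\.\d+)?',
    parser 'W3C',
    format '%h %l %u %t "%r" %>s %b'
);
```

All columns are declared as VARCHAR here to keep the sketch conservative; whether the parser can populate other column types directly is not covered in this topic.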

Sample Foreign Stream to Parse W3C Files

The following example parses columns called "ip", "ident", "userId", "reqTime", "reqMethod", "reqLine", and "httpVer" from files in /tmp whose names match the pattern given in filename_pattern.

Note: Information on file location, file name pattern and character encoding can also be set as server options.

CREATE OR REPLACE FOREIGN DATA WRAPPER MOZILLA_ECDA
LIBRARY 'class com.sqlstream.aspen.namespace.common.CommonDataWrapper'
LANGUAGE JAVA;

CREATE OR REPLACE SERVER "mozilla_server"
TYPE 'FILE'
FOREIGN DATA WRAPPER MOZILLA_ECDA;

-- Assumes the schema "mozilla" already exists.
CREATE OR REPLACE FOREIGN STREAM "mozilla"."BaseLogStream"
(
    "ip" VARCHAR(15),
    "ident" VARCHAR(5),
    "userId" VARCHAR(5),
    "reqTime" VARCHAR(26),
    "reqMethod" VARCHAR(7),
    "reqLine" VARCHAR(256),
    "httpVer" VARCHAR(5)
)
SERVER "mozilla_server"
OPTIONS (
    -- Location, file name pattern, and encoding of the log files to read
    directory '/tmp',
    filename_pattern 'access_\d{4}(-\d\d){3}(\.\d+)?',
    encoding 'UTF-8',
    -- Select the W3C parser and describe each log entry
    parser 'W3C',
    format '%h %l %u [%t] \"%r %r HTTP/%r\" %>s %b \"%r\" \"%r\"');
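Once the foreign stream is defined, its parsed columns can be read like any other stream. A minimal sketch, assuming the wrapper, server, and foreign stream above have been created and that matching files exist in /tmp:

```sql
-- Read selected parsed columns from the foreign stream defined above.
SELECT STREAM "ip", "reqTime", "reqMethod", "reqLine", "httpVer"
FROM "mozilla"."BaseLogStream";
```

Because the column names were declared in double quotes with mixed case, they need to be quoted the same way in queries.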

Sample Properties Implementing ECD Agent to Parse W3C Files

To parse W3C files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:

ROWTYPE=RECORDTYPE(VARCHAR(15) ip, VARCHAR(5) ident, VARCHAR(5) userId, VARCHAR(26) reqTime, VARCHAR(7) reqMethod, VARCHAR(256) reqLine, VARCHAR(5) httpVer)
DIRECTORY=/tmp
FILENAME_PATTERN=access_\d{4}(-\d\d){3}(\.\d+)?
CHARACTER_ENCODING=UTF-8
SKIP_HEADER=TRUE
SEPARATOR=\u000A
parser=W3C
format=%h %l %u [%t] \"%r %r HTTP/%r\" %>s %b \"%r\" \"%r\"