Using the Discovery Parser

If you do not have information about the file format to be read, or need help filling in foreign stream or agent properties for a file format, you can use the Discovery parser to determine information about a file’s format. When you point it at a file location, this parser reads a sample of the file and returns information on file format options. Currently, the Discovery parser can identify CSV, XML, JSON, and Avro files.

The Discovery parser may return a result of ‘BINARY’ or ‘UNKNOWN’ for Avro files if it cannot find an Avro schema in the sampled data. In these cases, you can specify an Avro schema for the Discovery parser. See Using the Discovery Parser with Avro below.

The Discovery parser inspects sample data from an indicated data source and recommends values for a parser, along with all of the relevant options for the “discovered” parser. Some recommended options are column-specific, applying to columns that you define in the foreign stream definition.

The Discovery parser returns stream and column options as a table in s-Server, which you query to get information on options for the file type. You can then use these options to write a foreign stream or agent properties file to parse data from the source. See Parser Types for Reading for an overview of how to parse data from these sources.

Using the Discovery Parser on a source file

To use the Discovery parser, you first set up a foreign table in s-Server with options that set the parser type to “DISCOVERY”, indicate the file’s location, and specify a regular-expression pattern for the file name. The following code creates and sets a schema, then creates the foreign table with the required options, using the prebuilt FILE_SERVER server.


CREATE OR REPLACE SCHEMA "Discovery";
SET SCHEMA '"Discovery"';

CREATE OR REPLACE FOREIGN TABLE discovery_table (
  "d_name" VARCHAR(1024), -- recommended name for a column
  "d_path" VARCHAR(1024), -- recommended "PATH" to extract the column
  "d_type" VARCHAR(1024), -- recommended data type for the column
  "d_precision" INTEGER,  -- recommended precision for the column
  "d_scale" INTEGER,      -- recommended scale for the column
  "d_nullable" BOOLEAN,   -- is the column nullable ?
  "d_sample" VARCHAR(4096), -- a sample value for the recommended column
  "d_properties" VARCHAR(4096) -- any column-specific options for the column
                               -- options are space-separated
)
SERVER "FILE_SERVER"
OPTIONS
(
  DIRECTORY '/home/sqlstream/',
  FILENAME_PATTERN 'y\.txt',
  DISCOVERY_TIMEOUT '1000',
  MAX_EXAMPLE_BYTES '1024',
  PARSER 'DISCOVERY'
);

Once you have created this foreign table, you run a simple SELECT query against the table to invoke the Discovery parser:

SELECT * FROM discovery_table;

The query inspects data from /home/sqlstream/y.txt (as specified by the DIRECTORY and FILENAME_PATTERN options above) to discover its data format among CSV, XML, JSON, and Avro. If the discovery process fails to identify one of these data formats, it returns ‘UNKNOWN’ as the recommended “PARSER” option.

The query returns one row for each “recommended” column of the foreign stream to be created. A brief description of each result column appears in the in-line comments of the foreign table definition above.

An additional row returns, in its “d_properties” column, the stream-level options that are not specific to any recommended column.

An example result of a simple SELECT query:

SELECT * FROM discovery_table;
'd_name','d_path','d_type','d_precision','d_scale','d_nullable','d_sample','d_properties'
'Field1','$.Field1','SMALLINT','4','0','false','1231','Field1_PATH=$.Field1'
'Field2','$.Field2','TIMESTAMP','0','0','false','2016-01-01 23:11:23.653','Field2_DATE_FORMAT=yyyy-MM-dd HH:mm:ss
Field2_PATH=$.Field2'
'Field4','$.Field3.Field4','SMALLINT','4','0','false','1233','Field4_PATH=$.Field3.Field4'
'Field5','$.Field3.Field5','SMALLINT','4','0','false','1234','Field5_PATH=$.Field3.Field5'
'Array1','$.Array1[0:]','VARCHAR(1024)','0','0','false','[]','Array1_PATH=$.Array1[0:]'
'Field8','$.Array2[0:].Field8','SMALLINT','5','0','false','12313','Field8_PATH=$.Array2[0:].Field8'
'Field9','$.Array2[0:].Field9','SMALLINT','5','0','false','12314','Field9_PATH=$.Array2[0:].Field9'
'Array3','$.Array2[0:].Array3[0:]','VARCHAR(1024)','0','0','false','[]','Array3_PATH=$.Array2[0:].Array3[0:]'
'Field11','$.Field11','SMALLINT','5','0','false','12321','Field11_PATH=$.Field11'
'Array4','$.Field12.Array4[0:]','VARCHAR(32)','0','0','true','[12322,12323,12324]','Array4_PATH=$.Field12.Array4[0:]'
'Field13','$.Field12.Field13','SMALLINT','5','0','false','12325','Field13_PATH=$.Field12.Field13'
'Field14','$.Field14','DECIMAL','8','4','false','1238.9824','Field14_PATH=$.Field14'
'','$','','','','','','PARSER=JSON
Field8_PATH=$.Array2[0:].Field8
Field5_PATH=$.Field3.Field5
Field2_PATH=$.Field2
Field9_PATH=$.Array2[0:].Field9
ROW_PATH=$
CHARACTER_ENCODING=UTF-8
Array3_PATH=$.Array2[0:].Array3[0:]
Field13_PATH=$.Field12.Field13
Array4_PATH=$.Field12.Array4[0:]
Field4_PATH=$.Field3.Field4
Field14_PATH=$.Field14
Field2_DATE_FORMAT=yyyy-MM-dd HH:mm:ss
Array1_PATH=$.Array1[0:]
Field1_PATH=$.Field1
Field11_PATH=$.Field11'
13 rows selected (0.172 seconds)

The second-to-last row contains the foreign stream options for the recommended PARSER, which in this case is ‘JSON’. Options such as CHARACTER_ENCODING and ROW_PATH are required for this parser.

Similarly, if the Discovery parser had identified ‘CSV’ as the parser, it would have recommended options such as SEPARATOR, ROW_SEPARATOR, and QUOTE_CHARACTER.
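To illustrate, the discovered options can be assembled into a foreign stream definition that actually parses the JSON source. The stream name below is hypothetical, and only a subset of the recommended columns is shown; the types, paths, and stream-level options are taken from the discovery output above.

```sql
-- Hypothetical foreign stream assembled from the discovery output above.
-- Column names/types come from the d_name and d_type result columns;
-- the OPTIONS values come from the d_properties rows.
CREATE OR REPLACE FOREIGN STREAM "Discovery"."json_stream" (
  "Field1" SMALLINT,
  "Field2" TIMESTAMP,
  "Field4" SMALLINT,
  "Field14" DECIMAL(8,4)
  -- remaining recommended columns omitted for brevity
)
SERVER "FILE_SERVER"
OPTIONS (
  DIRECTORY '/home/sqlstream/',
  FILENAME_PATTERN 'y\.txt',
  PARSER 'JSON',
  CHARACTER_ENCODING 'UTF-8',
  ROW_PATH '$',
  "Field1_PATH" '$.Field1',
  "Field2_PATH" '$.Field2',
  "Field2_DATE_FORMAT" 'yyyy-MM-dd HH:mm:ss',
  "Field4_PATH" '$.Field3.Field4',
  "Field14_PATH" '$.Field14'
);
```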

The last row of the discovery result returns “NUM_BYTES_READ”, the number of bytes read and inspected by the Discovery parser.

Using the Discovery Parser with another source

To discover details of the data format from a different source type, simply replace FILE_SERVER in the example with the appropriate pre-built server, or create your own SERVER for the source.
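For example, to discover the format of messages arriving on a Kafka topic, you could define the same discovery columns against a Kafka server. The server name and the TOPIC and SEED_BROKERS option names below are illustrative assumptions; check the Kafka source documentation for the exact options your s-Server version supports.

```sql
-- Hypothetical: running discovery against a Kafka topic instead of a file.
-- TOPIC and SEED_BROKERS are assumed option names for the Kafka source.
CREATE OR REPLACE FOREIGN TABLE kafka_discovery_table (
  "d_name" VARCHAR(1024),
  "d_path" VARCHAR(1024),
  "d_type" VARCHAR(1024),
  "d_precision" INTEGER,
  "d_scale" INTEGER,
  "d_nullable" BOOLEAN,
  "d_sample" VARCHAR(4096),
  "d_properties" VARCHAR(4096)
)
SERVER "KAFKA10_SERVER"
OPTIONS (
  TOPIC 'buses',
  SEED_BROKERS 'localhost:9092',
  DISCOVERY_TIMEOUT '1000',
  MAX_EXAMPLE_BYTES '1024',
  PARSER 'DISCOVERY'
);
```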

Sample Properties for Using the Discovery Parser with the ECD Agent

To discover files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:

# Location, name, and type of file
DIRECTORY=/tmp
FILENAME_PATTERN=buses\.log
SEPARATOR=,
PARSER=DISCOVERY
DISCOVERY_TIMEOUT=1000
MAX_EXAMPLE_BYTES=1024
CHARACTER_ENCODING=UTF-8
# Schema, name, and parameter signature of destination table
SCHEMA_NAME=Discovery
TABLE_NAME=discoveryTable
ROWTYPE=RECORDTYPE(VARCHAR(1024) d_name, VARCHAR(1024) d_path, VARCHAR(1024) d_type, INTEGER d_precision, INTEGER d_scale, BOOLEAN d_nullable, VARCHAR(4096) d_sample, VARCHAR(4096) d_properties)

Using the Discovery Parser with Avro

Apache Avro is a popular, compact data format predominantly used in Hadoop and other big data platforms. See the topic Reading Avro for detailed documentation of how s-Server parses messages in Avro format into stream rows.

The discovery process for the Avro data format is quite different from that for text-based data formats like CSV or XML. For those text-based formats, the Discovery parser uses built-in heuristics to create an object-relational mapping. Since Avro is a binary data format, the Discovery parser handles it differently.

Avro files typically embed a schema for all the data stored in those files. Using the Avro API, the Discovery parser first checks to see if the Avro schema is embedded in the sample data being collected from the data source. If the Discovery parser finds the Avro schema in the sample data, it uses that schema to recommend object relational mapping for the Avro data.

If the schema is not embedded in the sample data, the Discovery parser will return ‘UNKNOWN’ or ‘BINARY’ as a recommended parser. If you get this result, you can specify an AVRO_SCHEMA_LOCATION option in Foreign Stream options and try using the Discovery parser again. See CREATE FOREIGN STREAM in the SQLstream Streaming SQL Reference Guide.

Once you specify the AVRO_SCHEMA_LOCATION, the Discovery parser uses the referenced schema file to recommend object relational mapping for Avro data.
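For example, the discovery foreign table shown earlier could be recreated with an AVRO_SCHEMA_LOCATION option pointing at an external schema file. The file name pattern and schema path below are illustrative:

```sql
-- Hypothetical: re-running discovery on an Avro file whose schema is not
-- embedded, supplying the schema location explicitly. Paths are illustrative.
CREATE OR REPLACE FOREIGN TABLE avro_discovery_table (
  "d_name" VARCHAR(1024),
  "d_path" VARCHAR(1024),
  "d_type" VARCHAR(1024),
  "d_precision" INTEGER,
  "d_scale" INTEGER,
  "d_nullable" BOOLEAN,
  "d_sample" VARCHAR(4096),
  "d_properties" VARCHAR(4096)
)
SERVER "FILE_SERVER"
OPTIONS (
  DIRECTORY '/home/sqlstream/',
  FILENAME_PATTERN 'data\.avro',
  AVRO_SCHEMA_LOCATION '/home/sqlstream/schemas/data.avsc',
  DISCOVERY_TIMEOUT '1000',
  MAX_EXAMPLE_BYTES '1024',
  PARSER 'DISCOVERY'
);
```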

In either case, the Discovery parser recommends a data type mapping between Avro and SQL types. See Avro Ingestion Rules in the Reading Avro topic for more details. Discovery may also recommend that one of the embedded arrays be flattened; the heuristic is simply to recommend the array whose elements contain the most fields.