Parsing Avro

s-Server parses Apache Avro data similarly to the way it parses JSON data. (Avro schemas are defined with JSON.) All columns are parsed out of hierarchical Avro records using a specified "PATH" for corresponding Avro fields in the Avro schema. This is the schema stored in all Avro payloads, which is used to parse each message. (Some sources, such as Kafka or AMQP, may break the Avro payload up into one or more serialized Avro records. s-Server handles this by allowing you to indicate that Avro schema may be separate from a given batch of Avro data.)

s-Server parses Avro messages into rows in a stream that you define. See the subtopic How s-Server Uses Paths to Parse JSON in Reading JSON for more details.

s-Server matches data types according to its Ingestion Rules for Avro.

You can also input data in larger chunks and parse it later using the Parser UDX. This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server in this guide.

The s-Server trace log includes information on readers' and parsers' progress. See Periodic Parser Statistics Logging in the Administering Guavus SQLstream guide.

Avro processing works like this:

  • s-Server receives a message from a data source such as the file system, kafka, or AMQP. Each message may contain one or more Avro records.
  • As with all data sources, s-Server reads the Avro records into a foreign stream. Each Avro record may get parsed into multiple rows of the stream.
  • s-Server parses these Avro records using an Avro path that you define for each column of the foreign stream. An Avro path describes the location of a field in an Avro record. Avro Paths are analogous to XPath expressions for XML files.

To implement the AVRO parser, you create a foreign stream for a source, as described in the topics under Reading from Other Sources. Under stream or column options, you need to set the following stream options:

  • PARSER This needs to be AVRO.
  • AVROSCHEMAFILE This option can either be an HTTP URL to fetch the schema or it can be a path to a file on server host machine or VM.
  • _SCHEMAHEADER is a required option to indicate if the Avro schema is embedded in the Avro data. This option needs to be set to false for data sources like Kafka or AMQP, where each message can be one or more serialized Avro records without a schema.
  • _ROWPATH This is the Avro path for the row to be found. The Avro parser uses a row path to find Avro objects containing a row, then the path for each column is used to find the value to be extracted.
  • \PATH Path for each column in the ECDA column set. This defaults to $... Here, a column named 'FOO' would have an option named FOO_PATH that defaulted to *$..FOO* which would return the first property named FOO under the Avro object found by the ROW_PATH.

In s-Server 5.2, messages (typically in message queues such as Kafka) may be tagged with AVRO schema fingerprints. Those fingerprints will be matched with that for the AVRO_SCHEMA_FILE. Messages will be skipped if fingerprints don't match that of the AVRO_SCHEMA_FILE.

Example

CREATE OR REPLACE SERVER "AVROREADERSERVER" TYPE 'FILE'
FOREIGN DATA WRAPPER ECDA;

CREATE SCHEMA "AVROSCHEMA";
SET SCHEMA '"AVROSCHEMA"';

CREATE FOREIGN STREAM "AVROSTREAM" (
slotnamevarchar(32),
slotname2 varchar(32)
)
SERVER "AVROREADERSERVER"
OPTIONS (
PARSER 'AVRO',
AVRO_SCHEMA_FILE '/schemas/2.1/fs.avsc',
SCHEMA_HEADER 'false',
ROW_PATH '$',
SLOTNAME_PATH '$.slots[0:].slotName',
SLOTNAME2_PATH '$.slots2[0:].slotName'
);

Foreign Stream Options for Parssing AVRO {parsing-avro}

Avro source value s-Server Data Type
All numeric types All numeric SQL types
string/utf8 CHAR/VARCHAR
bytes/fixed BINARY/VARBINARY or DECIMAL
Boolean boolean
Long TIMESTAMP (millisFromEpoch)
Int TIME (millisFromMidnight)
Int DATE (daysFromEpoch)

Ingestion Rules for Avro

This section describes data type mappings between Avro types and SQL types.

Note: Arrays can be flattened. An array is flattened if any column path refers to field(s) of array elements using the [0:] prefix in the path. Only one array field with a ROW_PATH prefix can be flattened in the schema.

All other data types can be parsed as stringified VARCHAR values.

Avro source value s-Server Data Type
All numeric types All numeric SQL types
string/utf8 CHAR/VARCHAR
bytes/fixed BINARY/VARBINARY or DECIMAL
Boolean boolean
Long TIMESTAMP (millisFromEpoch)
Int TIME (millisFromMidnight)
Int DATE (daysFromEpoch)

Where a mapping to corresponding SQL type is not listed in the table above, for all AVRO data types, we return a JSON string representation of the AVRO datum as a VARCHAR value.