Parsing Avro

s-Server parses Apache Avro data similarly to the way it parses JSON data. (Avro schemas are defined with JSON.) All columns are parsed out of hierarchical Avro records using a specified "PATH" for corresponding Avro fields in the Avro schema. This is the schema stored in all Avro payloads, which is used to parse each message. (Some sources, such as Kafka or AMQP, may break the Avro payload up into one or more serialized Avro records. s-Server handles this by allowing you to indicate that Avro schema may be separate from a given batch of Avro data.)

s-Server parses Avro messages into rows in a stream that you define. See the subtopic How s-Server Uses Paths to Parse JSON in Reading JSON for more details.

s-Server matches data types according to its Ingestion Rules for Avro.

Note: You can also input data in larger chunks and parse it later using the Parser UDX ). This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server.

The s-Server trace log includes information on readers' and parsers' progress. See Periodic Parser Statistics Logging in the Administering Guavus SQLstream guide. These errors are also logged in the Global Error Stream.

Avro processing works like this:

  • s-Server receives a message from a data source such as the file system, kafka, or AMQP. Each message may contain one or more Avro records.
  • As with all data sources, s-Server reads the Avro records into a foreign stream. Each Avro record may get parsed into multiple rows of the stream.
  • s-Server parses these Avro records using an Avro path that you define for each column of the foreign stream. An Avro path describes the location of a field in an Avro record. Avro Paths are analogous to XPath expressions for XML files.

Foreign Stream Options For Parsing AVRO

To implement the AVRO parser, you create a foreign stream for a source, as described in the topics under Reading from Other Sources. Under stream or column options, you need to set the following stream options:

Options Description
PARSER This needs to be AVRO
AVRO_SCHEMA_LOCATION This option can either be an HTTP URL to fetch the schema or it can be a path to a file on server host machine or VM
SCHEMA_HEADER This option is required option to indicate if the Avro schema is embedded in the Avro data. This option needs to be set to false for data sources like Kafka or AMQP, where each message can be one or more serialized Avro records without a schema
ROW_PATH This is the Avro path for the row to be found. The Avro parser uses a row path to find Avro objects containing a row, then the path for each column is used to find the value to be extracted
PATH This represents the path for each column in the ECDA column set. This defaults to $... Here, a column named 'FOO' would have an option named FOO_PATH that defaulted to $.FOO which would return the first property named FOO under the Avro object found by the ROW_PATH

In s-Server, messages (typically in message queues such as Kafka) may be tagged with AVRO schema fingerprints. Those fingerprints (schema IDs) will be matched against the schema found at AVRO_SCHEMA_LOCATION. Messages will be discarded if the fingerprints don't match.

Example

CREATE SCHEMA "AVROSCHEMA";
SET SCHEMA '"AVROSCHEMA"';

CREATE FOREIGN STREAM "AVROSTREAM" (
slotnamevarchar(32),
slotname2 varchar(32)
)
SERVER "FILE_SERVER"
OPTIONS (
PARSER 'AVRO',
AVRO_SCHEMA_LOCATION '/schemas/2.1/fs.avsc',
SCHEMA_HEADER 'false',
ROW_PATH '$',
SLOTNAME_PATH '$.slots[0:].slotName',
SLOTNAME2_PATH '$.slots2[0:].slotName'
);

Ingestion Rules for Avro

This section describes data type mappings between Avro types and SQL types.

Note: Arrays can be flattened. An array is flattened if any column path refers to field(s) of array elements using the [0:] prefix in the path. Only one array field with a ROW_PATH prefix can be flattened in the schema.

All other data types can be parsed as stringified VARCHAR values.

Avro source value s-Server Data Type
All numeric types All numeric SQL types
string/utf8 CHAR/VARCHAR
bytes/fixed BINARY/VARBINARY or DECIMAL
Boolean boolean
Long TIMESTAMP (millisFromEpoch)
Int TIME (millisFromMidnight)
Int DATE (daysFromEpoch)

Where a mapping to corresponding SQL type is not listed in the table above, for all AVRO data types, we return a JSON string representation of the AVRO datum as a VARCHAR value.