s-Server parses Apache Avro data much as it parses JSON data. (Avro schemas are themselves defined in JSON.) All columns are parsed out of hierarchical Avro records using a specified "PATH" to the corresponding fields in the Avro schema. This is the schema stored in the Avro payload, which is used to parse each message. (Some sources, such as Kafka or AMQP, may break the Avro payload up into one or more serialized Avro records without an embedded schema. s-Server handles this by letting you indicate that the Avro schema is supplied separately from a given batch of Avro data.)
s-Server matches data types according to its Ingestion Rules for Avro.
Note: You can also ingest data in larger chunks and parse it later using the Parser UDX. This UDX invokes the parsers listed above as a function. For more information on using functions, see the topic Transforming Data in s-Server.
The s-Server trace log includes information on readers' and parsers' progress. See Periodic Parser Statistics Logging in the Administering Guavus SQLstream guide. Parsing errors are also logged in the Global Error Stream.
Avro processing works like this:
To implement the Avro parser, you create a foreign stream for a source, as described in the topics under Reading from Other Sources. In the foreign stream's options, you need to set the following:
| Option | Description |
| --- | --- |
| PARSER | This needs to be AVRO. |
| AVRO_SCHEMA_FILE | Either an HTTP URL from which to fetch the schema, or a path to a file on the server host machine or VM. |
| SCHEMA_HEADER | Required. Indicates whether the Avro schema is embedded in the Avro data. This option needs to be set to false for data sources like Kafka or AMQP, where each message can be one or more serialized Avro records without a schema. |
| ROW_PATH | The Avro path at which rows are found. The Avro parser uses the row path to find Avro objects containing a row, then uses the path for each column to find the value to be extracted. |
| PATH | The path for each column in the ECDA column set (specified per column, as in the SLOTNAME_PATH example below). Defaults to $. |
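For sources such as Kafka, where messages arrive without an embedded schema, you set SCHEMA_HEADER to 'false' and supply the schema through AVRO_SCHEMA_FILE. The sketch below illustrates this combination; the server name, schema URL, and column are hypothetical placeholders, and the Kafka transport options (topic, brokers, and so on) are omitted here — see the topics under Reading from Other Sources for those.

```sql
-- Hedged sketch: schema-less Avro records from a message queue.
-- "KAFKAREADERSERVER", the schema URL, and the column are placeholders.
CREATE FOREIGN STREAM "KAFKA_AVRO_STREAM" (
    slotname VARCHAR(32)
)
SERVER "KAFKAREADERSERVER"
OPTIONS (
    PARSER 'AVRO',
    -- Messages carry no embedded schema, so fetch it separately:
    AVRO_SCHEMA_FILE 'http://schema-host/schemas/fs.avsc',
    SCHEMA_HEADER 'false',
    ROW_PATH '$',
    SLOTNAME_PATH '$.slotName'
    -- Kafka transport options (topic, brokers, ...) would go here.
);
```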
In s-Server, messages (typically in message queues such as Kafka) may be tagged with Avro schema fingerprints. s-Server compares each message's fingerprint with the fingerprint of the schema specified by AVRO_SCHEMA_FILE, and skips any message whose fingerprint does not match.
```sql
CREATE OR REPLACE SERVER "AVROREADERSERVER"
    TYPE 'FILE'
    FOREIGN DATA WRAPPER ECDA;

CREATE SCHEMA "AVROSCHEMA";
SET SCHEMA '"AVROSCHEMA"';

CREATE FOREIGN STREAM "AVROSTREAM" (
    slotname VARCHAR(32),
    slotname2 VARCHAR(32)
)
SERVER "AVROREADERSERVER"
OPTIONS (
    PARSER 'AVRO',
    AVRO_SCHEMA_FILE '/schemas/2.1/fs.avsc',
    SCHEMA_HEADER 'false',
    ROW_PATH '$',
    SLOTNAME_PATH '$.slots[0:].slotName',
    SLOTNAME2_PATH '$.slots2[0:].slotName'
);
```
This section describes data type mappings between Avro types and SQL types.
Note: Arrays can be flattened. An array is flattened if any column path refers to field(s) of array elements using an array index or range (for example, the `[0:]` syntax shown in the example above).
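To make flattening concrete, here is a hedged sketch. Assume incoming Avro records contain a `slots` array; a column path that drills into the array elements (as in SLOTNAME_PATH above) causes the parser to emit one output row per array element. The stream name and sample record below are illustrative assumptions.

```sql
-- Hedged sketch: flattening an Avro array into rows.
-- Given an Avro record such as
--   { "slots": [ {"slotName": "a"}, {"slotName": "b"} ] }
-- a column path that addresses the array elements produces one
-- output row per element.
CREATE FOREIGN STREAM "FLATTENED_SLOTS" (
    slotname VARCHAR(32)
)
SERVER "AVROREADERSERVER"
OPTIONS (
    PARSER 'AVRO',
    AVRO_SCHEMA_FILE '/schemas/2.1/fs.avsc',
    SCHEMA_HEADER 'false',
    ROW_PATH '$',
    SLOTNAME_PATH '$.slots[0:].slotName'  -- refers into array elements
);
-- The sample record above would yield two rows:
-- slotname = 'a', then slotname = 'b'.
```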
All other data types can be parsed as stringified VARCHAR values.
| Avro source value | s-Server data type |
| --- | --- |
| All numeric types | All numeric SQL types |
| bytes/fixed | BINARY/VARBINARY or DECIMAL |
For any Avro data type whose mapping to a corresponding SQL type is not listed in the table above, s-Server returns a JSON string representation of the Avro datum as a VARCHAR value.
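For instance, an Avro field with a complex type, such as a nested record, can be declared as a VARCHAR column, and the parser then delivers the JSON string form of that Avro datum. The sketch below assumes a hypothetical `location` field holding a nested record; the stream name and paths are illustrative.

```sql
-- Hedged sketch: receiving an unmapped Avro type as JSON text.
-- Suppose the Avro record contains a nested "location" record; declared
-- as VARCHAR, it arrives as its JSON string representation, e.g.
--   {"lat": 51.5, "lon": -0.1}
CREATE FOREIGN STREAM "MIXED_TYPES" (
    slotname VARCHAR(32),
    location VARCHAR(256)   -- nested Avro record, received as JSON text
)
SERVER "AVROREADERSERVER"
OPTIONS (
    PARSER 'AVRO',
    AVRO_SCHEMA_FILE '/schemas/2.1/fs.avsc',
    SCHEMA_HEADER 'false',
    ROW_PATH '$',
    SLOTNAME_PATH '$.slotName',
    LOCATION_PATH '$.location'
);
```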