Parsing JSON

You can read from JSON data sources that are continually updated (streaming) using s-Server’s Extensible Common Data framework. s-Server’s JSON parsing is based on the Jackson JSON processor. More information on this processor is found at https://github.com/FasterXML/jackson-docs. s-Server matches data types according to its Ingestion Rules for JSON.

You can also input data in larger chunks and parse it later using the Parser UDX. This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server in this guide.

The s-Server trace log includes information on readers' and parsers' progress. See Periodic Parser Statistics Logging in the Administering Guavus SQLstream guide.

JSON processing works like this:

  • s-Server receives a message from a data source such as the file system, Kafka, or AMQP. Each message may contain one or more JSON records.
  • As with all data sources, s-Server reads the JSON records into a foreign stream. Each JSON record may get parsed into multiple rows of the stream.
  • s-Server parses these JSON records using a JSON path that you define for each column of the foreign stream. A JSON path describes the location of a field in a JSON record. JSON paths are analogous to XPath expressions for XML files.

To implement the JSON parser, you create a foreign stream for a source, as described in the topics under Reading from Other Sources. You then need to set the following stream or column options:

  • PARSER This needs to be JSON.
  • ROW_PATH This is the JSON path for the row to be found. The JsonPath parser uses a row path to find JSON objects containing a row, then the path for each column is used to find the value to be extracted.
  • <COLUMN_NAME>_PATH Optional for each column in the ECDA column set. This defaults to $..<COLUMN_NAME>. Here, a column named 'FOO' would have an option named FOO_PATH that defaults to $..FOO, which returns the first property named FOO under the JSON object found by the ROW_PATH.

How s-Server Uses Paths to Parse JSON

Both kinds of paths begin with '$', indicating the root JSON object or array, followed by either ..* (scan), a property key, or [] syntax indicating either a list of property keys or an array selection. A row pattern might be $[0:], which indicates all the objects in an array. A column pattern such as $..address would then extract the text of the address property from each JSON object in the top-level array.

For example, a JSON record like:

{
"book": "A brief History Of Time",
"author": "Stephan Hawking"
}

can be parsed into a stream with two columns, "book" and "author". To accomplish this, you define two JSON paths: $.book for the column "book" and $.author for the column "author". The character "$" refers to the root of the JSON record, and the character "." selects a named property of that record. There are more complicated ways to express JSON paths for columns, but this is the basic form.
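Declared as foreign stream options, the paths above might look as follows. This is a minimal sketch; the server name FileReaderServer and any reader-specific options are assumptions that depend on your data source (see Reading from Other Sources).

```sql
-- Sketch: parse {"book": ..., "author": ...} records into a
-- two-column stream. The server name and reader options are
-- placeholders for your actual source configuration.
CREATE OR REPLACE FOREIGN STREAM "book_stream" (
    "book" VARCHAR(128),
    "author" VARCHAR(64)
)
SERVER "FileReaderServer"
OPTIONS (
    PARSER 'JSON',
    ROW_PATH '$',             -- each JSON record is one row
    "book_PATH" '$.book',     -- value for the "book" column
    "author_PATH" '$.author'  -- value for the "author" column
);
```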

Here’s a more complicated example:

{
"books": [
{ "book": "A brief History Of Time",
"author": "Stephan Hawking"
},
{ "book": "The world is flat",
"author": "Tom Friedman"
},
{ "book": "A theory of justice",
"author": "John Rawls"
}
]
}

Here, each JSON record contains an array of embedded JSON records. In this example, you first need to define a ROW_PATH of '$.books[0:]'. s-Server uses ROW_PATH to determine the end of a row when parsing JSON records with embedded arrays. You then define the following column paths. Each path becomes a column in the stream:

Column name | Column path name | Path definition
book        | book_PATH        | $.books[0:].book
author      | author_PATH      | $.books[0:].author

JSON paths can use a wildcard '..' to indicate an arbitrary path prefix or suffix. The paths described above can also be defined simply as:

ROW_PATH => '$.books[0:]'
book_PATH => '$..book',   -- default path when not specified
author_PATH => '$..author' -- default path when not specified

Note: s-Server does not support referencing an array element with a specific subscript. It supports only generic references to "every" element of the array through the wildcard [0:].

In the example above, a single JSON record with an array of embedded JSON records gets parsed into three rows of the stream "books". Whenever there is an embedded array, it can be parsed into as many rows as the number of elements of the array.
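Putting these options together, a stream definition for this nested example might look as follows. The server name and any reader-specific options are assumptions; adapt them to your source.

```sql
-- Sketch: unnest the "books" array so that each array element
-- becomes one row of the stream.
CREATE OR REPLACE FOREIGN STREAM "books_stream" (
    "book" VARCHAR(128),
    "author" VARCHAR(64)
)
SERVER "FileReaderServer"
OPTIONS (
    PARSER 'JSON',
    ROW_PATH '$.books[0:]',   -- each array element ends one row
    "book_PATH" '$..book',    -- equivalent to '$.books[0:].book'
    "author_PATH" '$..author' -- equivalent to '$.books[0:].author'
);
```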

The JSON parser creates a "tree" representation of a JSON record. Each field is a "leaf" node of the tree, while embedded JSON records or arrays are intermediate (non-leaf) nodes. As the JSON parser "walks" over this tree representation, an "end of row" event is generated whenever the parser "walks back" to the node identified by the ROW_PATH defined above.

For a row path, the parser may find a matching node, such as "books" or "alerts", multiple times. For a column path, the parser finds a value once per row, such as "Great Expectations" or "Pride and Prejudice".

Unnesting an Array

The process of turning an embedded array into a stream of rows that correspond to elements of the array is called "unnesting" the array into the containing JSON record. s-Server can only unnest one array in a JSON record. Other arrays can be extracted as a single VARCHAR column. These restrictions are described at the end of this topic.

For example:

{
"store": "Amazon",
"report_date": "2020-04-01 00:00:00",
"books": [
{ "book": "A Brief History of Time",
"author": "Stephen Hawking"
},
{ "book": "The World Is Flat",
"author": "Tom Friedman"
},
{ "book": "A Theory of Justice",
"author": "John Rawls"
}
]
}
can be turned into a stream with four columns as follows:

book                      | author            | store    | report_date
"A Brief History of Time" | "Stephen Hawking" | "Amazon" | "2020-04-01 00:00:00"
"The World Is Flat"       | "Tom Friedman"    | "Amazon" | "2020-04-01 00:00:00"
"A Theory of Justice"     | "John Rawls"      | "Amazon" | "2020-04-01 00:00:00"

The JSON path for each column would be:

book_PATH => '$.books[0:].book',
author_PATH => '$.books[0:].author',
store_PATH => '$.store',
report_date_PATH => '$.report_date',
ROW_PATH => '$'

Note that "ROW_PATH" is simply '$' as each row is a result of cartesian product of

  • the fields parsed from each element of the books' array
  • the fields parsed from the array containing JSON record.
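A stream definition combining these paths might look as follows. The server name and any reader-specific options are assumptions.

```sql
-- Sketch: unnest "books" while repeating the record-level
-- "store" and "report_date" values on every output row.
CREATE OR REPLACE FOREIGN STREAM "book_report_stream" (
    "book" VARCHAR(128),
    "author" VARCHAR(64),
    "store" VARCHAR(32),
    "report_date" TIMESTAMP
)
SERVER "FileReaderServer"
OPTIONS (
    PARSER 'JSON',
    ROW_PATH '$',
    "book_PATH" '$.books[0:].book',
    "author_PATH" '$.books[0:].author',
    "store_PATH" '$.store',
    "report_date_PATH" '$.report_date'
);
```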

You can extract additional nested arrays as single VARCHAR columns, and then parse these using the Parser UDX. See the topic Parser UDX for more information.

Customizing JSON Paths

You can also use a custom parser for a column path by adding the following parameter to the stream or server definition:

Option | Definition
CUSTOM_TYPEPARSER | Allows overriding of an individual column's parsing. Specifies a fully qualified Java class name that implements com.sqlstream.aspen.namespace.common.TypeParser.

Restrictions to Parsing JSON

There are some additional restrictions on parsing JSON records with arrays embedded at multiple levels.

Only one embedded array can be "unnested." The JSON parser will not unnest more than one embedded array. Arrays that are not being unnested can only be parsed as VARCHAR columns whose value is the entire array object as unparsed JSON text.

That is, JSON paths like '$..array1[0:]..array2[0:]..field' are not supported. In other words, the JSON parser will not unnest more than one embedded array. If you specify JSON paths that require unnesting multiple arrays, this is reported as a SQL validation error. (In some cases, "multiple unnesting" can be determined only during execution and is reported as a runtime error.)

For example, given the following JSON, the path definitions "col1_PATH" = '$..col1' and "col2_PATH" = '$..col2' would require unnesting multiple arrays, and would therefore be invalid.

{
"array1": [ { ..., "col1": 1234.56, ... }, ... ],
"array2": [ { ..., "col2": "val2", ... }, ... ],
"col3": "val3"
}

For cases where there are multiple embedded arrays, you can parse JSON records by unnesting one of those embedded arrays while passing other embedded arrays as "unparsed JSON text" and then parsing these using the Parser UDX.

For the JSON record above, the streaming abstraction can be:

CREATE OR REPLACE FOREIGN STREAM json_stream (
"col1" DOUBLE,
"array2Col" VARCHAR(1024),
"col3" VARCHAR(128)
)
SERVER FileReaderServer
OPTIONS (
...,
"col1_PATH" '$.array1[0:].col1',
"array2Col_PATH" '$.array2', -- alternatively, '$.array2[0:]'
"col3_PATH" '$.col3',
"ROW_PATH" '$',
...
);

However, when ROW_PATH itself represents a path for an embedded array, s-Server will unnest another embedded array within its elements. That is, the path "col_PATH" = '$.array1[0:].array2[0:].col' is allowed, even though it means unnesting both array1 and array2, as long as ROW_PATH is '$.array1[0:]'.

JSON and Ambiguous Paths

The JSON parser cannot handle ambiguous paths. Ambiguous paths that cannot be detected at stream definition time will return undefined results. Given the following JSON record,

{
"operation": "someOperation",
"slots": [{"slotName": "1"}, {"slotName": "2"}],
"slots2": [{"slotName": "3"}, {"slotName": "4"}]
}

the following column path for a SLOTNAME column will be ambiguous, because '$..slotName' fits more than one of the nested arrays.

SLOTNAME_PATH '$..slotName' -- any field with a name 'slotName'.

For an ambiguous path that goes undetected, the field that matches the path last in a JSON record is read as the value. When s-Server can detect the ambiguity at stream definition time, the definition fails with an exception:

CREATE FOREIGN STREAM fs (
slotname varchar(32),
slotname2 varchar(32)
)
SERVER ...
OPTIONS (
PARSER 'JSON',
...,
ROW_PATH '$',
SLOTNAME_PATH '$.slots[0:].slotName',
SLOTNAME2_PATH '$.slots2[0:].slotName'
);

java.sql.SQLException: Cannot parse fields of embedded arrays "$.slots2[0:]" and "$.slots[0:]" simultaneously. One of these embedded arrays must be extracted as an 'unparsed' column of VARCHAR type by specifying its path as column path.

However, the following stream definition will unnest slots[0:] while extracting the slots2 array as a single unparsed VARCHAR value:

CREATE FOREIGN STREAM fs (
slotname varchar(32),
slots2 varchar(64)
)
SERVER ...
OPTIONS (
PARSER 'JSON',
...,
ROW_PATH '$',
SLOTNAME_PATH '$.slots[0:].slotName',
SLOTS2_PATH '$.slots2[0:]'
);

You can parse the nested array 'slots2' with a cascaded view definition that invokes another instance of the JsonPath parser as a UDX, if necessary. See the topic Using the Parser UDX in this guide for more details.

Foreign Stream Options for Parsing JSON Data

Option | Definition
PARSER | This needs to be JSON.
ROW_PATH | The JSON path for the row to be found. The JsonPath parser uses the row path to find JSON objects containing a row; the path for each column is then used to find the value to be extracted.
<COLUMN_NAME>_PATH | Optional for each column in the ECDA column set. Defaults to $..<COLUMN_NAME>. Here, a column named 'FOO' would have an option named FOO_PATH that defaults to $..FOO, which returns the first property named FOO under the JSON object found by the ROW_PATH.

Ingestion Rules for JSON

s-Server applies the following coercion rules when parsing JSON data. All NULL literals are cast as NULL. For arrays and objects, s-Server raises an error in all cases except for CHAR and VARCHAR, in which case the source text of the array or object is truncated or padded as necessary.

Target SQL type | TRUE/FALSE literal | Number | String
BIGINT, DECIMAL, DOUBLE, INT, SMALLINT, REAL, TINYINT | TRUE → 1; FALSE → 0 | Raises an exception if the JSON number lies beyond the maximum or minimum boundary of the target type. If the number has any decimal digits (including the vacuous .0) and is being ingested into an integer column, raises an error. | If the string, without quotes, is a valid JSON number, that number is parsed and the rules for ingesting JSON numbers apply. Otherwise (the unquoted value isn't a valid JSON number), an error is raised.
VARBINARY | TRUE → sets the lowest bit of the resulting byte array and clears all other bits; FALSE → clears all bits in the resulting byte array. These coercions result in a 1-byte array. | Raises an error. | If the string, without quotes, is a valid SQL Standard BINARY literal, s-Server ingests it, truncating and 0-filling as necessary. A Standard BINARY literal has the form X'...' where ... is a sequence of (case-insensitive) hex digits. Otherwise, raises an error.
BOOLEAN | TRUE → TRUE; FALSE → FALSE | 0 → FALSE; all other numbers → TRUE | s-Server applies the rules of java.lang.Boolean.parseBoolean(): if the lowercased string equals "true", the result is TRUE; otherwise, the result is FALSE.
VARCHAR | TRUE → 'TRUE'; FALSE → 'FALSE'. The result string is truncated as necessary to fit the target value's width. | The number is rendered as a string and then ingested subject to the truncation/padding rules for strings. | If the string won't fit in the target SQL character value, the excess trailing characters are discarded.
DATE, TIME, TIMESTAMP | Raises an error. | Raises an error. | OK if the string parses as a DATE/TIME/TIMESTAMP according to s-Server's liberal datetime parsing rules. Otherwise, raises an error.
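As a sketch of how these rules play out, consider the stream below. The stream and server names are assumptions; the comments describe the coercion each column would undergo for a sample record.

```sql
-- Sketch: how sample JSON values coerce under the rules above.
-- Given the record {"qty": "42", "active": "true", "note": 3.14}:
--   "qty"    INT      <- the string "42" is a valid JSON number,
--                        so it is ingested as the integer 42
--   "active" BOOLEAN  <- parseBoolean("true") yields TRUE
--   "note"   VARCHAR  <- the number 3.14 is ingested as the
--                        string '3.14', truncated to fit if needed
CREATE OR REPLACE FOREIGN STREAM "coercion_example" (
    "qty" INT,
    "active" BOOLEAN,
    "note" VARCHAR(16)
)
SERVER "FileReaderServer"
OPTIONS (
    PARSER 'JSON',
    ROW_PATH '$'
    -- column paths default to '$..qty', '$..active', '$..note'
);
```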