Parsing XML

To read XML data with the ECD Adapter and Agent (ECDA Framework), you need to specify a location for the data to be read and information on the data structure.

Overview

The XML parser converts data from the XML file format into regular SQLstream rows. To read data from XML files using the ECD Adapter or Agent, you define a foreign stream based on the input sources that uses the XML parser, or, with the ECDA agent, define a properties file with similar options. As with other input data formats, you specify the location of the files and other attributes that control how the files are discovered and read. Like any stream definition, the foreign stream has a definite row-type: a set of columns with names and data-types.

XML is a complex and flexible format, so it is also necessary to specify how a portion of the XML input is recognized as the source of one row, and how parts of that XML input are recognized as the column values of that row. This is done by pattern matching, with patterns defined as simple xpath expressions. There is an xpath that matches a row, and an xpath for each column. More information on XPath can be found at http://en.wikipedia.org/wiki/XPath.

See Ingestion Rules for XML ) for details on how s-Server parses XML data.

Note: You can also input data in larger chunks and parse it later using the Parser UDX ). This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server.

The s-Server trace log includes information on readers’ and parsers’ progress. See Periodic Parser Statistics Logging in the Administering Guavus SQLstream guide.

How it Works

As input files are read, their contents are appended to a long stream comprised of XML data. As data flows in, s-Server parses the data, watching for xpath matches. Whenever the row xpath matches a chunk of input, s-Server emits an output row has been found. Inside this chunk, as each column xpath matches, the value of that column has been found as text.

Only simple xpaths are allowed as patterns. A pattern may be absolute (starts with a slash) or relative (doesn’t start with a slash).

s-Server supports patterns built from tag-names (like /a/b/c), attributes (like /a/b/c@d), and simple conditions (like /a/b/c[@d=“foo”]).

s-Server supports patterns built from tag-names (like /a/b/c), attributes (like /a/b/c@d), and simple conditions (like /a/b/c[@d=“foo”]).

A pattern like /a/b/c matches tag <c> when nested inside a <b> inside an <a>. The value of the match is the text inside the tag: all the text between this and the matching . Leading and trailing whitespace are stripped.

A pattern like /a/b/c@d is almost the same: it matches tag <c> inside a <b> inside an <a>, but the value of match is the d attribute of the tag. That is to say, is a match with the test value foo (no quotes, and not stripped of whitespace). If there is no attribute d, there is no match - ie the value is null.

s-Server also supports simple conditions in xpaths, namely [@a] (which means that attribute a is present) and [@a=x] (which means that attribute a is present and has the value “x’). Here x must be a string literal. So the path /a/b[@type=“cheese”]/c is like /a/b/c, but with the added requirement that the <b> element found has the attribute type=“cheese”.

An absolute xpath for a column means start matching from the xml document root. Note that the input stream is a series of well balanced xml documents, so the root occurs over and over again.

A relative xpath for a column means start matching from the start of the row. The row xpath itself must be absolute.

Note: More complicated xpaths are not supported. In particular, xpath conditions that test calculated expressions are not supported. You can get the same effect by filtering the foreign stream with a SQL WHERE clause. See the topic WHERE Clause of the SELECT Statement ) in the SQL Reference Guide for more details.

Note: You cannot have two columns in the same stream defined by exactly the same pattern. But two xpath patterns can be similar or overlap.

Nested Rows

It is common for an XML item to contain one or more nested subitems. This is naturally represented in SQL as two foreign streams parsed from the same input files: a stream of the main items, and a related stream of the subitems.

On the SQL side, the streams are tied together by

  • a primary key column (or columns) in the main stream
  • a foreign key in the subitem stream.

To produce this, just use the same xpath for the primary key and the foreign key. For each subitem, the value of the foreign key will be found outside the subitem xml, inside the main item xml. Hence the xpath must absolute, and not based on the subitem row. However, the foreign key must occur in the main item XML before any subitems.

Sample SQL

Here is a very simple example that illustrates the above points.

Note: The examples below use the file system as an input system. To parse XML over other systems, such as Kafka, Kinesis, a network socket, a WebSocket, HTTP or AMQP, you would need to specify options for these formats. See Reading from Other Sources ) for more details.

The XML input looks like:

<row><name>Elmer Fudd</name><occupation>hunter</occupation></row>
<row><name>Bugs Bunny</name><occupation>wabbit</occupation></row>

the foreign server was defined as:

CREATE SERVER FileReaderServer TYPE 'FILE'
FOREIGN DATA WRAPPER ECDA;

and the stream definition is

CREATE FOREIGN STREAM PLAYERS (
namevarchar(16),
occupation varchar(16)
) SERVER FileReaderServer OPTIONS (
DIRECTORY '/data/cartoons', FILENAME_PATTERN '.*xml',
PARSER 'XML',
PARSER_XML_ROW_TAGS'/row',
name_xpath 'name',
occupation_xpath 'occupation'
);

A Bigger Example

Here is a more realistic example, taken from an actual application. The input data is a record of retail sales transactions. Each transaction contains several line items, some of which are things purchased, but others are only cash register actions. The xml data is very detailed, and we will pick out a few (defined as xpaths).

We will produce two related streams, a stream of sales transactions and a stream of line items. To tie them together, we recognize some identifier fields in the sales transactions and use the same values as foreign keys in the stream of line items. First, a sample input data file, much abridged. Skip past this and refer back from the explanation of the DDL:

<POSLog xmlns:tri="http://www.triversity.com/TE/integration/">
<Header>
<MessageId />
<Timestamp>2020-05-15T11:47:05</Timestamp>
<Originator>TE TransactionPostWorkflow</Originator>
<SequenceNumber>1</SequenceNumber>
</Header>
<Body>
<RetailStoreID>Triversity.1</RetailStoreID>
<WorkstationID>POS.1.1</WorkstationID>
<SequenceNumber>198</SequenceNumber>
<BusinessDayDate>2020-04-26</BusinessDayDate>
<BeginDateTime>2020-05-15T11:46:12</BeginDateTime>
<EndDateTime>2020-05-15T11:47:05</EndDateTime>
<OperatorID>222</OperatorID>
<CurrencyCode>USD</CurrencyCode>
<TillID>3</TillID>
<TillSupervisor>222</TillSupervisor>
<LineItem>
<SequenceNumber>2</SequenceNumber>
<EndDateTime>2020-05-15T11:46:12</EndDateTime>2</tri:LineNumber>Item</tri:LineType>SELL</tri:ActionCode>
</tri:NativeLineDetail>
<SupplementalData>
<ITEM_KEY type="String">12345000001</ITEM_KEY>
<QUANTITY type="BigDecimal">3</QUANTITY>
<_datasource_ type="String">Keyboard</_datasource_>
<msrpPrice type="Money">0.00</msrpPrice>
<CUSTOMER_PROFILE_ID type="String" />
</SupplementalData>
<EntryMethod>Keyed</EntryMethod>
<Sale ItemType="Stock">
<POSIdentity>
<POSItemID>12345000001</POSItemID>
</POSIdentity>
<MerchandiseHierarchy Level="Department">
2221
</MerchandiseHierarchy>
<Description>SILK TIE</Description>
<RegularSalesUnitPrice>45.00</RegularSalesUnitPrice>
<ActualSalesUnitPrice>45.00</ActualSalesUnitPrice>
<ExtendedAmount>135.00</ExtendedAmount>
<Quantity Units="1" UnitOfMeasureCode="Each">
3
</Quantity>
<Tax>
<TaxGroupID>TG1</TaxGroupID>
<TaxableAmount>-1</TaxableAmount>
</Tax>
</Sale>
</LineItem>
<LineItem>
<SequenceNumber>3</SequenceNumber>
<EndDateTime>2020-05-15T11:46:17</EndDateTime>3</tri:LineNumber>Item</tri:LineType>SELL</tri:ActionCode>
</tri:NativeLineDetail>
<SupplementalData>
<ITEM_KEY type="String">12345000001</ITEM_KEY>
<_datasource_ type="String">Keyboard</_datasource_>
<msrpPrice type="Money">0.00</msrpPrice>
<CUSTOMER_PROFILE_ID type="String" />
</SupplementalData>
<EntryMethod>Keyed</EntryMethod>
<Sale ItemType="Stock">
<POSIdentity>
<POSItemID>12345000001</POSItemID>
</POSIdentity>
<MerchandiseHierarchy Level="Department">
2221
</MerchandiseHierarchy>
<Description>SILK TIE</Description>
<RegularSalesUnitPrice>45.00</RegularSalesUnitPrice>
<ActualSalesUnitPrice>35.00</ActualSalesUnitPrice>
<ExtendedAmount>35.00</ExtendedAmount>
<Quantity Units="1" UnitOfMeasureCode="Each">
1
</Quantity>
<RetailPriceModifier MethodCode="PriceOverride">
<SequenceNumber>1</SequenceNumber>
<Amount Action="Subtract">10</Amount>
<PreviousPrice>45.00</PreviousPrice>
<SupplementalData>
<DISCOUNT_DESCRIPTION />
<SOURCE_LINE_NUMBER>0</SOURCE_LINE_NUMBER>
<LINE_NUMBER>3</LINE_NUMBER>
<PRICE>35.00</PRICE>
<REASON_CODE>Markdown</REASON_CODE>
LINE_NUMBER</PARAMETER>
<_datasource_>Keyboard</_datasource_>
<loyaltyPromotionMessage />
<loyaltyPromotionAcceptedQualifier />
</SupplementalData>
</RetailPriceModifier>
<Tax>
<TaxGroupID>TG1</TaxGroupID>
<TaxableAmount>-1</TaxableAmount>
</Tax>
</Sale>
</LineItem>
<Total TotalType="TransactionNetAmount"> <Amount>219.98</Amount> </Total>
<Total TotalType="TransactionGrandAmount"> <Amount>237.58</Amount> </Total>
<Total TotalType="TransactionDueAmount"> <Amount>237.58</Amount> </Total>
<Associate> <AssociateID>222</AssociateID> </Associate>
</tri:RetailTransaction>
</Body>
</POSLog>

Here is some DDL that matches the above data, and produces two related streams. produces one row, so the row tag xpath (which must be a full, absolute xpath that starts at the document root) is /PosLog/Body/tri:RetailTransaction.

All the columns come from inside this XML element, so they are defined as xpaths relative to it (no leading slash). We take the first three subelements to be a 3-part primary key that identifies the transaction. Later we will use this to tie the LineItem rows to the RetailTransaction rw.

  • Some columns are direct subelements of the row element, such as RetaitStoreID and TillID. This is the simplest case. The xpath for a column defaults to the name of the column as a relative path, so these options could be omitted.
  • Some columns are deeper sublements of the row element, like TransactionType
  • Some columns are defined by an xpath with a condition.
  • The transaction has three distinct total amounts, which become the three columns TransactionNetTotal, TransactionGrandTotal, and TransactionAmountDue. These come from three XML elements that have the same path, Total/Amount*, but that are distinguished by different values of the attribute .

Note that the column “BeginDateTime” is declared as VARCHAR, though it is actually a timestamp, because the xml parser cannot produce a SQL timestamp value.

CREATE FOREIGN STREAM "RetailTransactions"(
       -- unique key is (RetailStoreID, WorkstationID, SequenceNumber)
       "RetailStoreID" VARCHAR(128),
       "WorkstationID" VARCHAR(40),
       "SequenceNumber"BIGINT,
       "BeginDateTime" VARCHAR(25),-- TIMESTAMP,
       "DeviceID" BIGINT,
       "SubdeviceID"BIGINT,
       "TransactionType"VARCHAR(40),
       "TransactionItemCount" INT,
       "KeyToken" VARCHAR(128),
       "OperatorID"INT,
       "TillID"INT,
       "TransactionNetTotal"DECIMAL(14,2),
       "TransactionGrandTotal" DECIMAL(14,2),
       "TransactionAmountDue" DECIMAL(14,2)
       ) SERVER "FileReaderServer" OPTIONS (
       DIRECTORY '/test/data',
       FILENAME_PATTERN '.*xml',
       CHARACTER_ENCODING 'ISO-8859-1',
       parser 'XML',
       PARSER_XML_ROW_TAGS '/POSLog/Body/tri:RetailTransaction',
       "RetailStoreID_XPATH"'RetailStoreID',
       "WorkstationID_XPATH"'WorkstationID',
       "SequenceNumber_XPATH" 'SequenceNumber',
       "BeginDateTime_XPATH"'BeginDateTime',
       "DeviceID_XPATH"'tri:NativeTrxDetail/tri:DeviceID',
       "SubdeviceID_XPATH" 'tri:NativeTrxDetail/tri:SubdeviceID',
       "TransactionType_XPATH" 'tri:NativeTrxDetail/tri:TransactionType',
       "TransactionItemCount_XPATH"
       'tri:NativeTrxDetail/tri:TransactionItemCount',
       "KeyToken_XPATH"'tri:NativeTrxDetail/tri:KeyToken',
       "OperatorID_XPATH" 'OperatorID',
       "TillID_XPATH" 'TillID',
       "TransactionNetTotal_XPATH" 'Total[@TotalType="TransactionNetAmount"]/Amount',
       "TransactionGrandTotal_XPATH"'Total[@TotalType="TransactionGrandAmount"]/Amount',
       "TransactionAmountDue_XPATH"'Total[@TotalType="TransactionDueAmount"]/Amount'
);

Finally, we look at the definition of the stream of LineItems, which illustrates a few more features of the parser. Each row comes from an element <LineItem> which is inside a <tri:RetailTransaction>, as specified by the row tag. The first three columns are foreign keys that refer to a row in the parent stream RetailTransactions. In fact the parser finds the values of the foreign keys by looking inside the enclosing <tri:RetailTransaction> element, since the xpaths for these three fields are absolute paths that lead into the enclosing element.

Again we see columns that are direct subelements (like ItemSequenceNumber) and deeper sublements (like LineNumber). Note that the subelement /Sale/Quantity is the source of two columns: the body of the element produces the column “Quantity” and the attribute UnitOfMeasureCode produces the column Units.

CREATE FOREIGN STREAM "LineItems" (
-- foreign keys: (RetailStoreID, WorkstationID, TransactionSequenceNumber)
"RetailStoreID" VARCHAR(128),
"WorkstationID" VARCHAR(40),
"TransactionSequenceNumber" BIGINT,
"ItemSequenceNumber"BIGINT, -- unique
"LineNumber"INT,
"ActionCode"VARCHAR(40),
"ITEM_KEY" VARCHAR(40),
"EntryMethod"VARCHAR(40),
"ItemType" VARCHAR(40),
"POSItemID" BIGINT,
"Department"INT,
"Quantity" INT,
"Units" VARCHAR(40),
"RegularSalesUnitPrice" DECIMAL(14,2),
"ActualSalesUnitPrice" DECIMAL(14,2),
"ExtendedAmount"DECIMAL(14,2),
)
SERVER "FileReaderServer"
OPTIONS (
DIRECTORY '/work/dt/src/cust/bigbox/test/data',
FILENAME_PATTERN '.*xml',
CHARACTER_ENCODING 'ISO-8859-1',
parser 'XML',
PARSER_XML_ROW_TAGS'/POSLog/Body/tri:RetailTransaction/LineItem',
"RetailStoreID_XPATH" '/POSLog/Body/tri:RetailTransaction/RetailStoreID',
"WorkstationID_XPATH" '/POSLog/Body/tri:RetailTransaction/WorkstationID',
"TransactionSequenceNumber_XPATH" '/POSLog/Body/tri:RetailTransaction/SequenceNumber',
"ItemSequenceNumber_XPATH"'SequenceNumber',
"LineNumber_XPATH"'tri:NativeLineDetail/tri:LineNumber',
"ActionCode_XPATH"'tri:NativeLineDetail/tri:ActionCode',
"ITEM_KEY_XPATH" 'SupplementalData/ITEM_KEY',
"EntryMethod_XPATH"'EntryMethod',
"ItemType_XPATH" 'Sale@ItemType',
"POSItemID_XPATH" 'Sale[@ItemType="Stock"]/POSIdentity/POSItemID',
"Department_XPATH"'Sale[@ItemType="Stock"]/MerchandiseHierarchy[@Level="Department"]',
"RegularSalesUnitPrice_XPATH" 'Sale[@ItemType="Stock"]/RegularSalesUnitPrice',
"ActualSalesUnitPrice_XPATH" 'Sale[@ItemType="Stock"]/ActualSalesUnitPrice',
"ExtendedAmount_XPATH"'Sale[@ItemType="Stock"]/ExtendedAmount',
"Quantity_XPATH" 'Sale[@ItemType="Stock"]/Quantity',
"Units_XPATH" 'Sale[@ItemType="Stock"]/Quantity@UnitOfMeasureCode'
);

Sample Properties Implementing ECD Agent to Parse XML Files

To parse XML files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:

ROWTYPE=RecordType (VARCHAR(128) COL1, VARCHAR(40) COL2, BIGINT COL3, BIGINT COL4, INT COL5, VARCHAR(40), VARCHAR(40) COL6, VARCHAR(40) COL7, VARCHAR(40) COL9, INT COL 10, INT COL12, VARCHAR(40) COL13, DECIMAL(14,2) COL14, DECIMAL(14,2) COL15, DECIMAL(14,2) COL16)
DIRECTORY=/work/dt/src/cust/bigbox/test/data
PARSER=XML
CHARACTER_ENCODING=UTF-8
SKIP_HEADER=TRUE
PARSER_XML_ROW_TAGS=/POSLog/Body/tri:RetailTransaction/LineItem
RetailStoreID_XPATH=/POSLog/Body/tri:RetailTransaction/RetailStoreID
WorkstationID_XPATH=/POSLog/Body/tri:RetailTransaction/WorkstationID
TransactionSequenceNumber_XPATH=/POSLog/Body/tri:RetailTransaction/SequenceNumber
ItemSequenceNumber_XPATH'SequenceNumber
LineNumber_XPATH'tri:NativeLineDetail/tri:LineNumber
ActionCode_XPATH'tri:NativeLineDetail/tri:ActionCode
ITEM_KEY_XPATH =SupplementalData/ITEM_KEY
EntryMethod_XPATH =EntryMethod
ItemType_XPATH =Sale@ItemType
POSItemID_XPATH=Sale[@ItemType=Stock]/POSIdentity/POSItemID
Department_XPATH'Sale[@ItemType=Stock]/MerchandiseHierarchy[@Level=Department]
RegularSalesUnitPrice_XPATH=Sale[@ItemType=Stock]/RegularSalesUnitPrice
ActualSalesUnitPrice_XPATH =Sale[@ItemType=Stock]/ActualSalesUnitPrice
ExtendedAmount_XPATH'Sale[@ItemType=Stock]/ExtendedAmount
Quantity_XPATH =Sale[@ItemType=Stock]/Quantity
Units_XPATH=Sale[@ItemType=Stock]/Quantity@UnitOfMeasureCode

Ingestion Rules for XML

s-Server applies the following coercion rules when parsing XML data. All missing cells are cast as NULL.

XML source attribute/element numeric string non-numeric string empty cell
BIGINT, DECIMAL, DOUBLE, INT, SMALLINT, REAL, TINYINT Raise an exception if the number lies beyond the maximum or minimum boundary of the target type. If the number has any decimal digits (including the vacuous .0) and it is being ingested into an integer column, then raise an error. Raise an error. Error (same as “)
BOOLEAN FALSE If the string is exactly “true”, then the result is true. Otherwise, the result is false. False (same as “)
BINARY, VARBINARY Raise an error. If the string, without quotes, is a valid SQL Standard BINARY literal, then we will ingest it, truncating and 0-filling as necessary. A Standard BINARY literal has the form X’…’ where … is a sequence of (case-insensitive) hex digits. Error (same as “)
CHAR, VARCHAR Put double-quotes around the number and then ingest subject to the truncation/padding rules for strings. If the string won’t fit in the target SQL character value, then excess trailing characters are discarded. CHAR values are space-padded up to their declared length. Treat as empty string “, space-padding as necessary.
TIME, TIMESTAMP Raise an error. OK if the string parses as a DATE/TIME/TIMESTAMP. Otherwise, raise an error. Strings are parsed per ISO standards at https://www.w3.org/TR/NOTE-datetime](https://www.w3.org/TR/NOTE-datetime) Error (same as “)