To read XML data with the ECD Adapter and Agent (ECDA Framework), you need to specify a location for the data to be read and information on the data structure.
The XML parser converts data from the XML file format into regular SQLstream rows. To read data from XML files using the ECD Adapter or Agent, you define a foreign stream based on the input sources that uses the XML parser, or, with the ECDA agent, define a properties file with similar options. As with other input data formats, you specify the location of the files and other attributes that control how the files are discovered and read. Like any stream definition, the foreign stream has a definite row-type: a set of columns with names and data-types.
XML is a complex and flexible format, so it is also necessary to specify how a portion of the XML input is recognized as the source of one row, and how parts of that XML input are recognized as the column values of that row. This is done by pattern matching, with patterns defined as simple xpath expressions. There is an xpath that matches a row, and an xpath for each column. More information on XPath can be found at http://en.wikipedia.org/wiki/XPath.
See Ingestion Rules for XML )-for details on how s-Server parses XML data.
Note: You can also input data in larger chunks and parse it later using the Parser UDX ). This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server.
The s-Server trace log includes information on readers’ and parsers’ progress. See Periodic Parser Statistics Logging in the Administering Guavus SQLstream guide.
As input files are read, their contents are appended to a long stream comprised of XML data. As data flows in, s-Server parses the data, watching for xpath matches. Whenever the row xpath matches a chunk of input, s-Server emits an output row has been found. Inside this chunk, as each column xpath matches, the value of that column has been found as text.
Only simple xpaths are allowed as patterns. A pattern may be absolute (starts with a slash) or relative (doesn’t start with a slash). s-Server supports patterns built from tag-names (like /a/b/c), attributes (like /a/b/c@d), and simple conditions (like /a/b/c[@d=“foo”]).
s-Server supports patterns built from tag-names (like /a/b/c), attributes (like /a/b/c@d), and simple conditions (like /a/b/c[@d=“foo”]).
A pattern like /a/b/c matches tag <c> when nested inside a <b> inside an <a>. The value of the match is the text inside the tag: all the text between this
A pattern like /a/b/c@d is almost the same: it matches tag <c> inside a <b> inside an <a>, but the value of match is the d attribute of the tag. That is to say,
s-Server also supports simple conditions in xpaths, namely [@a] (which means that attribute a is present) and [@a=x] (which means that attribute a is present and has the value “x’). Here x must be a string literal. So the path /a/b[@type=“cheese”]/c is like /a/b/c, but with the added requirement that the <b> element found has the attribute type=“cheese”.
An absolute xpath for a column means start matching from the xml document root. Note that the input stream is a series of well balanced xml documents, so the root occurs over and over again.
A relative xpath for a column means start matching from the start of the row. The row xpath itself must be absolute.
Note: More complicated xpaths are not supported. In particular, xpath conditions that test calculated expressions are not supported. You can get the same effect by filtering the foreign stream with a SQL WHERE clause. See the topic WHERE Clause of the SELECT Statement ) in the SQL Reference Guide for more details.
Note: You cannot have two columns in the same stream defined by exactly the same pattern. But two xpath patterns can be similar or overlap.
It is common for an XML item to contain one or more nested subitems. This is naturally represented in SQL as two foreign streams parsed from the same input files: a stream of the main items, and a related stream of the subitems. On the SQL side, the streams are tied together by
To produce this, just use the same xpath for the primary key and the foreign key. For each subitem, the value of the foreign key will be found outside the subitem xml, inside the main item xml. Hence the xpath must absolute, and not based on the subitem row. However, the foreign key must occur in the main item XML before any subitems.
Here is a very simple example that illustrates the above points.
Note: The examples below use the file system as an input system. To parse XML over other systems, such as Kafka, Kinesis, a network socket, a WebSocket, HTTP or AMQP, you would need to specify options for these formats. See Reading from Other Sources ) for more details.
The XML input looks like:
<row><name>Elmer Fudd</name><occupation>hunter</occupation></row> <row><name>Bugs Bunny</name><occupation>wabbit</occupation></row>
the foreign server was defined as:
CREATE SERVER FileReaderServer TYPE 'FILE' FOREIGN DATA WRAPPER ECDA;
and the stream definition is
CREATE FOREIGN STREAM PLAYERS ( namevarchar(16), occupation varchar(16) ) SERVER FileReaderServer OPTIONS ( DIRECTORY '/data/cartoons', FILENAME_PATTERN '.*xml', PARSER 'XML', PARSER_XML_ROW_TAGS'/row', name_xpath 'name', occupation_xpath 'occupation' );
Here is a more realistic example, taken from an actual application. The input data is a record of retail sales transactions. Each transaction contains several line items, some of which are things purchased, but others are only cash register actions. The xml data is very detailed, and we will pick out a few (defined as xpaths).
We will produce two related streams, a stream of sales transactions and a stream of line items. To tie them together, we recognize some identifier fields in the sales transactions and use the same values as foreign keys in the stream of line items. First, a sample input data file, much abridged. Skip past this and refer back from the explanation of the DDL:
<POSLog xmlns:tri="http://www.triversity.com/TE/integration/"> <Header> <MessageId /> <Timestamp>2020-05-15T11:47:05</Timestamp> <Originator>TE TransactionPostWorkflow</Originator> <SequenceNumber>1</SequenceNumber> </Header> <Body> <RetailStoreID>Triversity.1</RetailStoreID> <WorkstationID>POS.1.1</WorkstationID> <SequenceNumber>198</SequenceNumber> <BusinessDayDate>2020-04-26</BusinessDayDate> <BeginDateTime>2020-05-15T11:46:12</BeginDateTime> <EndDateTime>2020-05-15T11:47:05</EndDateTime> <OperatorID>222</OperatorID> <CurrencyCode>USD</CurrencyCode> <TillID>3</TillID> <TillSupervisor>222</TillSupervisor> <LineItem> <SequenceNumber>2</SequenceNumber> <EndDateTime>2020-05-15T11:46:12</EndDateTime>2</tri:LineNumber>Item</tri:LineType>SELL</tri:ActionCode> </tri:NativeLineDetail> <SupplementalData> <ITEM_KEY type="String">12345000001</ITEM_KEY> <QUANTITY type="BigDecimal">3</QUANTITY> <_datasource_ type="String">Keyboard</_datasource_> <msrpPrice type="Money">0.00</msrpPrice> <CUSTOMER_PROFILE_ID type="String" /> </SupplementalData> <EntryMethod>Keyed</EntryMethod> <Sale ItemType="Stock"> <POSIdentity> <POSItemID>12345000001</POSItemID> </POSIdentity> <MerchandiseHierarchy Level="Department"> 2221 </MerchandiseHierarchy> <Description>SILK TIE</Description> <RegularSalesUnitPrice>45.00</RegularSalesUnitPrice> <ActualSalesUnitPrice>45.00</ActualSalesUnitPrice> <ExtendedAmount>135.00</ExtendedAmount> <Quantity Units="1" UnitOfMeasureCode="Each"> 3 </Quantity> <Tax> <TaxGroupID>TG1</TaxGroupID> <TaxableAmount>-1</TaxableAmount> </Tax> </Sale> </LineItem> <LineItem> <SequenceNumber>3</SequenceNumber> <EndDateTime>2020-05-15T11:46:17</EndDateTime>3</tri:LineNumber>Item</tri:LineType>SELL</tri:ActionCode> </tri:NativeLineDetail> <SupplementalData> <ITEM_KEY type="String">12345000001</ITEM_KEY> <_datasource_ type="String">Keyboard</_datasource_> <msrpPrice type="Money">0.00</msrpPrice> <CUSTOMER_PROFILE_ID type="String" /> </SupplementalData> <EntryMethod>Keyed</EntryMethod> <Sale ItemType="Stock"> <POSIdentity> <POSItemID>12345000001</POSItemID> </POSIdentity> <MerchandiseHierarchy Level="Department"> 2221 </MerchandiseHierarchy> <Description>SILK TIE</Description> <RegularSalesUnitPrice>45.00</RegularSalesUnitPrice> <ActualSalesUnitPrice>35.00</ActualSalesUnitPrice> <ExtendedAmount>35.00</ExtendedAmount> <Quantity Units="1" UnitOfMeasureCode="Each"> 1 </Quantity> <RetailPriceModifier MethodCode="PriceOverride"> <SequenceNumber>1</SequenceNumber> <Amount Action="Subtract">10</Amount> <PreviousPrice>45.00</PreviousPrice> <SupplementalData> <DISCOUNT_DESCRIPTION /> <SOURCE_LINE_NUMBER>0</SOURCE_LINE_NUMBER> <LINE_NUMBER>3</LINE_NUMBER> <PRICE>35.00</PRICE> <REASON_CODE>Markdown</REASON_CODE> LINE_NUMBER</PARAMETER> <_datasource_>Keyboard</_datasource_> <loyaltyPromotionMessage /> <loyaltyPromotionAcceptedQualifier /> </SupplementalData> </RetailPriceModifier> <Tax> <TaxGroupID>TG1</TaxGroupID> <TaxableAmount>-1</TaxableAmount> </Tax> </Sale> </LineItem> <Total TotalType="TransactionNetAmount"> <Amount>219.98</Amount> </Total> <Total TotalType="TransactionGrandAmount"> <Amount>237.58</Amount> </Total> <Total TotalType="TransactionDueAmount"> <Amount>237.58</Amount> </Total> <Associate> <AssociateID>222</AssociateID> </Associate> </tri:RetailTransaction> </Body> </POSLog>
Here is some DDL that matches the above data, and produces two related streams. produces one row, so the row tag xpath (which must be a full, absolute xpath that starts at the document root) is /PosLog/Body/tri:RetailTransaction.
All the columns come from inside this XML element, so they are defined as xpaths relative to it (no leading slash). We take the first three subelements to be a 3-part primary key that identifies the transaction. Later we will use this to tie the LineItem rows to the RetailTransaction rw.
Note that the column “BeginDateTime” is declared as VARCHAR, though it is actually a timestamp, because the xml parser cannot produce a SQL timestamp value.
CREATE FOREIGN STREAM "RetailTransactions"( -- unique key is (RetailStoreID, WorkstationID, SequenceNumber) "RetailStoreID" VARCHAR(128), "WorkstationID" VARCHAR(40), "SequenceNumber"BIGINT, "BeginDateTime" VARCHAR(25),-- TIMESTAMP, "DeviceID" BIGINT, "SubdeviceID"BIGINT, "TransactionType"VARCHAR(40), "TransactionItemCount" INT, "KeyToken" VARCHAR(128), "OperatorID"INT, "TillID"INT, "TransactionNetTotal"DECIMAL(14,2), "TransactionGrandTotal" DECIMAL(14,2), "TransactionAmountDue" DECIMAL(14,2) ) SERVER "FileReaderServer" OPTIONS ( DIRECTORY '/test/data', FILENAME_PATTERN '.*xml', CHARACTER_ENCODING 'ISO-8859-1', parser 'XML', PARSER_XML_ROW_TAGS '/POSLog/Body/tri:RetailTransaction', "RetailStoreID_XPATH"'RetailStoreID', "WorkstationID_XPATH"'WorkstationID', "SequenceNumber_XPATH" 'SequenceNumber', "BeginDateTime_XPATH"'BeginDateTime', "DeviceID_XPATH"'tri:NativeTrxDetail/tri:DeviceID', "SubdeviceID_XPATH" 'tri:NativeTrxDetail/tri:SubdeviceID', "TransactionType_XPATH" 'tri:NativeTrxDetail/tri:TransactionType', "TransactionItemCount_XPATH" 'tri:NativeTrxDetail/tri:TransactionItemCount', "KeyToken_XPATH"'tri:NativeTrxDetail/tri:KeyToken', "OperatorID_XPATH" 'OperatorID', "TillID_XPATH" 'TillID', "TransactionNetTotal_XPATH" 'Total[@TotalType="TransactionNetAmount"]/Amount', "TransactionGrandTotal_XPATH"'Total[@TotalType="TransactionGrandAmount"]/Amount', "TransactionAmountDue_XPATH"'Total[@TotalType="TransactionDueAmount"]/Amount' );
Finally, we look at the definition of the stream of LineItems, which illustrates a few more features of the parser. Each row comes from an element <LineItem> which is inside a <tri:RetailTransaction>, as specified by the row tag. The first three columns are foreign keys that refer to a row in the parent stream RetailTransactions. In fact the parser finds the values of the foreign keys by looking inside the enclosing <tri:RetailTransaction> element, since the xpaths for these three fields are absolute paths that lead into the enclosing element.
Again we see columns that are direct subelements (like ItemSequenceNumber) and deeper sublements (like LineNumber). Note that the subelement /Sale/Quantity is the source of two columns: the body of the element produces the column “Quantity” and the attribute UnitOfMeasureCode produces the column Units.
CREATE FOREIGN STREAM "LineItems" ( -- foreign keys: (RetailStoreID, WorkstationID, TransactionSequenceNumber) "RetailStoreID" VARCHAR(128), "WorkstationID" VARCHAR(40), "TransactionSequenceNumber" BIGINT, "ItemSequenceNumber"BIGINT, -- unique "LineNumber"INT, "ActionCode"VARCHAR(40), "ITEM_KEY" VARCHAR(40), "EntryMethod"VARCHAR(40), "ItemType" VARCHAR(40), "POSItemID" BIGINT, "Department"INT, "Quantity" INT, "Units" VARCHAR(40), "RegularSalesUnitPrice" DECIMAL(14,2), "ActualSalesUnitPrice" DECIMAL(14,2), "ExtendedAmount"DECIMAL(14,2), ) SERVER "FileReaderServer" OPTIONS ( DIRECTORY '/work/dt/src/cust/bigbox/test/data', FILENAME_PATTERN '.*xml', CHARACTER_ENCODING 'ISO-8859-1', parser 'XML', PARSER_XML_ROW_TAGS'/POSLog/Body/tri:RetailTransaction/LineItem', "RetailStoreID_XPATH" '/POSLog/Body/tri:RetailTransaction/RetailStoreID', "WorkstationID_XPATH" '/POSLog/Body/tri:RetailTransaction/WorkstationID', "TransactionSequenceNumber_XPATH" '/POSLog/Body/tri:RetailTransaction/SequenceNumber', "ItemSequenceNumber_XPATH"'SequenceNumber', "LineNumber_XPATH"'tri:NativeLineDetail/tri:LineNumber', "ActionCode_XPATH"'tri:NativeLineDetail/tri:ActionCode', "ITEM_KEY_XPATH" 'SupplementalData/ITEM_KEY', "EntryMethod_XPATH"'EntryMethod', "ItemType_XPATH" 'Sale@ItemType', "POSItemID_XPATH" 'Sale[@ItemType="Stock"]/POSIdentity/POSItemID', "Department_XPATH"'Sale[@ItemType="Stock"]/MerchandiseHierarchy[@Level="Department"]', "RegularSalesUnitPrice_XPATH" 'Sale[@ItemType="Stock"]/RegularSalesUnitPrice', "ActualSalesUnitPrice_XPATH" 'Sale[@ItemType="Stock"]/ActualSalesUnitPrice', "ExtendedAmount_XPATH"'Sale[@ItemType="Stock"]/ExtendedAmount', "Quantity_XPATH" 'Sale[@ItemType="Stock"]/Quantity', "Units_XPATH" 'Sale[@ItemType="Stock"]/Quantity@UnitOfMeasureCode' );
To parse XML files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:
ROWTYPE=RecordType (VARCHAR(128) COL1, VARCHAR(40) COL2, BIGINT COL3, BIGINT COL4, INT COL5, VARCHAR(40), VARCHAR(40) COL6, VARCHAR(40) COL7, VARCHAR(40) COL9, INT COL 10, INT COL12, VARCHAR(40) COL13, DECIMAL(14,2) COL14, DECIMAL(14,2) COL15, DECIMAL(14,2) COL16) DIRECTORY=/work/dt/src/cust/bigbox/test/data PARSER=XML CHARACTER_ENCODING=UTF-8 SKIP_HEADER=TRUE PARSER_XML_ROW_TAGS=/POSLog/Body/tri:RetailTransaction/LineItem RetailStoreID_XPATH=/POSLog/Body/tri:RetailTransaction/RetailStoreID WorkstationID_XPATH=/POSLog/Body/tri:RetailTransaction/WorkstationID TransactionSequenceNumber_XPATH=/POSLog/Body/tri:RetailTransaction/SequenceNumber ItemSequenceNumber_XPATH'SequenceNumber LineNumber_XPATH'tri:NativeLineDetail/tri:LineNumber ActionCode_XPATH'tri:NativeLineDetail/tri:ActionCode ITEM_KEY_XPATH =SupplementalData/ITEM_KEY EntryMethod_XPATH =EntryMethod ItemType_XPATH =Sale@ItemType POSItemID_XPATH=Sale[@ItemType=Stock]/POSIdentity/POSItemID Department_XPATH'Sale[@ItemType=Stock]/MerchandiseHierarchy[@Level=Department] RegularSalesUnitPrice_XPATH=Sale[@ItemType=Stock]/RegularSalesUnitPrice ActualSalesUnitPrice_XPATH =Sale[@ItemType=Stock]/ActualSalesUnitPrice ExtendedAmount_XPATH'Sale[@ItemType=Stock]/ExtendedAmount Quantity_XPATH =Sale[@ItemType=Stock]/Quantity Units_XPATH=Sale[@ItemType=Stock]/Quantity@UnitOfMeasureCode
s-Server applies the following coercion rules when parsing XML data. All missing cells are cast as NULL.
|XML source attribute/element||numeric string||non-numeric string||empty cell|
|BIGINT, DECIMAL, DOUBLE, INT, SMALLINT, REAL, TINYINT||Raise an exception if the number lies beyond the maximum or minimum boundary of the target type. If the number has any decimal digits (including the vacuous .0) and it is being ingested into an integer column, then raise an error.||Raise an error.||Error (same as “)|
|BOOLEAN||FALSE||If the string is exactly “true”, then the result is true. Otherwise, the result is false.||False (same as “)|
|BINARY, VARBINARY||Raise an error.||If the string, without quotes, is a valid SQL Standard BINARY literal, then we will ingest it, truncating and 0-filling as necessary. A Standard BINARY literal has the form X’…’ where … is a sequence of (case-insensitive) hex digits.||Error (same as “)|
|CHAR, VARCHAR||Put double-quotes around the number and then ingest subject to the truncation/padding rules for strings.||If the string won’t fit in the target SQL character value, then excess trailing characters are discarded. CHAR values are space-padded up to their declared length.||Treat as empty string “, space-padding as necessary.|
|TIME, TIMESTAMP||Raise an error.||OK if the string parses as a DATE/TIME/TIMESTAMP. Otherwise, raise an error. Strings are parsed per ISO standards at https://www.w3.org/TR/NOTE-datetime](https://www.w3.org/TR/NOTE-datetime)||Error (same as “)|