Parsing CSV

To read CSV formatted data, you need to specify a separator character (such as a comma or a pipe) for values. You can also indicate whether or not the file has a header. When the Extensible Common Data Framework parses CSV, each row becomes a row in the stream. Columns are identified by the separator character, which can be a comma, a pipe, or any other character you designate. s-Server matches data types according to its Ingestion Rules for CSV.

Note: You can also input data in larger chunks and parse it later using the Parser UDX ). This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server.

The s-Server trace log includes information on readers' and parsers' progress. See Periodic Parser Statistics Logging in the Administering Guavus SQLstream guide.

Sample Foreign Stream Implementing ECD Adapter to Parse CSV Files

The following example sets up a server called CSVReaderServer, then creates and sets a schema called schema_csv. It then sets up a foreign stream that will parse columns called id, reported_at, shift_no, trip_no, lat, lon, speed, bearing, driver_no, and highway from a file in /tmp/ called buses.log. To parse CSV over other input/output systems, such as Kafka, Kinesis, a network socket, a WebSocket, HTTP or AMQP, you would need to specify options for these formats. See [Reading from Other Sources]/integrating-sqlstream/reading-data/) for more details.

Note: The examples below use the file system as an input system. To parse CSV over other systems, such as Kafka or AMQP, you would need to specify options for these formats. See Reading from Other Sources for more details.

create or replace server CSVReaderServer type 'FILE'
foreign data wrapper ECDA;
create or replace schema schema_csv;
set schema 'schema_csv';
CREATE OR REPLACE FOREIGN STREAM read_from_csv
(
    "id" BIGINT,
    "reported_at" TIMESTAMP,
    "shift_no" VARCHAR(8),
    "trip_no" VARCHAR(4),
    "lat" DOUBLE,
    "lon" DOUBLE,
    "speed" INTEGER,
    "bearing" INTEGER,
    "driver_no" BIGINT,
    "highway" VARCHAR(8)
)

    SERVER CSVReaderServer

    OPTIONS (
        "PARSER" 'CSV',
        "CHARACTER_ENCODING" 'UTF-8',
        "SEPARATOR" ',',
        "SKIP_HEADER" 'false',
        "DIRECTORY" '/tmp/',
        "FILENAME_PATTERN" 'buses\.log'
    );

To actually begin reading from the file, use a statement along the lines of

SELECT stream * from CSVReaderStream;

Mapping Columns with the CSV Parser

You can use this option to parse only some of the columns,in a CSV file, or to reorder the columns from CSV data.

To implement this option, you implement code along the following lines:

CREATE OR REPLACE FOREIGN STREAM my-stream (
      COMMON_NAME VARCHAR(64),
      SPECIES VARCHAR(64),  
      GENUS VARCHAR(64)
)
SERVER FileReaderServer
OPTIONS(
   DIRECTORY '/animals/',
   FILENAME_PATTERN 'animals.csv',
   COLUMN_MAPPING ',COMMON_NAME,GENUS,SPECIES',
   PARSER 'CSV',
   STATIC_FILES 'true',
   SKIP_HEADER 'true'
);

When you run a SELECT against this stream, the parser will take the following actions:

  • It will not parse the first column in the CSV file (note leading comma).
  • It will map the second column in the CSV file to COMMON_NAME.
  • It will map the third column in the CSV file to GENUS (even though the order of stream columns differs from the CSV column order).
  • It will map the fourth column in the CSV file to SPECIES (even though the order of stream columns differs from the CSV column order).

Foreign Stream Options for Parsing CSV Data

Option Definition
SKIP_HEADER True or false; defaults to false. Specifies if the parser should skip the first row of input files. Usually, you would do this if the first row of data contains column headers.
QUOTE_CHARACTER Lets you specify an expected quotation character. There is no default for quote character.
ROW_SEPARATOR Lets you specify a character that splits lines in the source. Defaults to /n.For example, in the case of reading from a DOS format file where line separators are CR/LF rather than just LF, you will need to specifyrow_separator U&'\000D\000A'(This is the hexidecimal Unicode value for CR/LF. Seehttp://unicode.org/standard/reports/tr13/tr13-5.html.
SEPARATOR Lets you specify a character that separates key-value pairs. Default is comma (,).
CHARACTER_ENCODING Character set for data.

Sample Properties Implementing ECD Agent to Parse CSV Files

To parse CSV files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:

ROWTYPE=RECORDTYPE(VARCHAR(2040) id, VARCHAR(2040) reported_at, VARCHAR(2040) shift_no, VARCHAR(2040) trip_no, VARCHAR(2040) route_variant_id) DIRECTORY=/TMP FILENAME_PATTERN=TRANSACTIONS\.LOG PARSER=CSV CHARACTER_ENCODING=UTF-8 SKIP_HEADER=TRUE

Ingestion Rules for CSV

s-Server applies the following coercion rules when parsing CSV data. All empty cells are cast as NULL.

CSV Source Cell numeric string non-numeric string
BIGINT, DECIMAL, DOUBLE, INT, SMALLINT, REAL, TINYINT Raise an exception if the JSON number lies beyond the maximum or minimum boundary of the target type. If the number has any decimal digits (including the vacuous .0) and it is being ingested into an integer column, then raise an error. Raise an error.
VARBINARY Raise an error. If the string, without quotes, is a valid SQL Standard BINARY literal, then we will ingest it, truncating and 0-filling as necessary. A Standard BINARY literal has the form X'...' where ... is a sequence of (case-insensitive) hex digits.
CHAR, VARCHAR Put double-quotes around the number and then ingest subject to the truncation/padding rules for strings. If the string won't fit in the target SQL character value, then excess trailing characters are discarded. CHAR values are space-padded up to their declared length.
TIME, TIMESTAMP Raise an error. OK if the string parses as a DATE/TIME/TIMESTAMP. Otherwise, raise an error.
Strings are parsed per ISO standards at https://www.w3.org/TR/NOTE-datetime