Cassandra Input



Configure Cassandra Input
Cassandra Input is an input step that reads data from a Cassandra column family (table) as part of an ETL transformation.

Option

Definition

Step name

The name of this step as it appears in the transformation workspace.

Cassandra host

Host name of the Cassandra server to connect to.

Cassandra port

Port number for the connection to the Cassandra host.

Username

User name for authenticating against the target keyspace and/or column family (table).

Password

Password for authenticating against the target keyspace and/or column family (table).

Keyspace

Name of the keyspace (database).

Use query compression

If checked, the step compresses the text of the CQL query before sending it to the server.

Show schema

Opens a dialog that shows metadata for the column family named in the CQL SELECT query.




CQL SELECT Query
The large text box at the bottom of the dialog is where you enter the CQL SELECT statement to be executed. The step accepts only a single SELECT query, which has the following form:
SELECT [FIRST N] [REVERSED] <SELECT EXPR>
FROM <COLUMN FAMILY> [USING <CONSISTENCY>] [WHERE <CLAUSE>] [LIMIT N];
Important: Cassandra Input does not support the CQL range notation, for instance name1..nameN, for specifying columns in a SELECT query.
SELECT queries may name columns explicitly (in a comma-separated list) or use the * wildcard. If the wildcard is used, only those columns defined in the metadata for the column family are returned. If columns are selected explicitly, the name of each column must be enclosed in single quotation marks.
Because Cassandra, like HBase, is a sparse, column-oriented database, rows can contain varying numbers of columns, and those columns might or might not be defined in the metadata for the column family. The Cassandra Input step can emit columns that are not defined in the metadata for the column family if they are explicitly named in the SELECT clause.
Cassandra Input uses the type information present in the metadata for a column family. At a minimum, this includes a default type (column validator) for the column family. If explicit metadata is available for an individual column, it is used for that column's type information; otherwise, the default validator is used.
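The quoting rule and the type fallback described above can be illustrated with a minimal Python sketch. It is not part of PDI or Cassandra: the function names build_select and resolve_type, the column family users, and the validator names in the usage lines are hypothetical and chosen only for illustration. The sketch assumes column names contain no single quotation marks and leaves out the optional USING clause.

```python
from typing import Dict, Optional, Sequence


def build_select(column_family: str,
                 columns: Optional[Sequence[str]] = None,
                 first: Optional[int] = None,
                 reversed_order: bool = False,
                 where: Optional[str] = None,
                 limit: Optional[int] = None) -> str:
    """Assemble a single SELECT statement in the form shown above.

    Hypothetical helper: the optional USING clause is omitted, and column
    names are assumed to contain no single quotation marks.
    """
    parts = ["SELECT"]
    if first is not None:
        parts.append(f"FIRST {first}")
    if reversed_order:
        parts.append("REVERSED")
    if columns:
        # Explicitly named columns are each enclosed in single quotation marks.
        parts.append(", ".join(f"'{name}'" for name in columns))
    else:
        # No explicit columns: fall back to the wildcard.
        parts.append("*")
    parts.append(f"FROM {column_family}")
    if where:
        parts.append(f"WHERE {where}")
    if limit is not None:
        parts.append(f"LIMIT {limit}")
    return " ".join(parts) + ";"


def resolve_type(column: str,
                 column_types: Dict[str, str],
                 default_validator: str) -> str:
    """Use explicit per-column metadata if present, else the default validator."""
    return column_types.get(column, default_validator)


# Usage with illustrative names only.
query = build_select("users", ["name", "email"], limit=50000)
age_type = resolve_type("age", {"age": "LongType"}, "UTF8Type")
nick_type = resolve_type("nick", {"age": "LongType"}, "UTF8Type")
```

Here query holds a statement with both column names quoted, age_type comes from the explicit column metadata, and nick_type falls back to the default validator because the sample metadata has no entry for that column.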

Option

Definition

LIMIT

If omitted, Cassandra assumes a default limit of 10,000 rows to be returned by the query. If the query is expected to return more than 10,000 rows, an explicit LIMIT clause must be added to the query.

FIRST N

Returns the first N column values from each row, where the column sorting strategy used for the column family determines which columns come first. If the column family is sparse, the number of column values returned can differ from one row to the next and can be fewer than N. Because PDI deals with a constant number of fields between steps in a transformation, Cassandra rows that lack particular columns are output with null field values for those columns. If FIRST is omitted from the query, Cassandra defaults to 10,000 columns. If a query is expected to return more than 10,000 columns, an explicit FIRST must be added to the query.

REVERSED

Reverses the sort order of the columns that Cassandra returns for each row. This may affect which values result from a FIRST N option, but does not affect the order of the columns output by Cassandra Input.

WHERE clause

Filters the rows that appear in the results. The clause can filter on a key name or a range of keys and, for indexed columns, on column values. Key filters are specified using the KEY keyword, a relational operator (one of =, >, >=, <, and <=), and a term value.
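The FIRST N entry above notes that rows lacking particular columns are output with null field values, because PDI expects a constant number of fields between steps. The following Python sketch illustrates that idea only; it is not PDI's implementation. The helper pad_rows and the sample data are hypothetical, and Python's None stands in for a null field value.

```python
from typing import Dict, List, Optional, Sequence


def pad_rows(rows: Sequence[Dict[str, object]],
             fields: Sequence[str]) -> List[List[Optional[object]]]:
    """Emit every row with the same ordered fields.

    A column that is absent from a sparse row becomes None (null).
    """
    return [[row.get(name) for name in fields] for row in rows]


# Two sparse rows: the second has no 'email' column.
sparse_rows = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Bob"},
]
padded = pad_rows(sparse_rows, ["name", "email"])
```

Each output row has exactly two fields in the same order, and the second row carries None in the position of the missing column.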