HBase Input
This step reads data from an HBase table according to user-defined column metadata.
Configure Query
This tab contains connection details and basic query information. You can configure a connection in one of two ways: either via a comma-separated list of the hostnames where the ZooKeeper quorum resides, or via an hbase-site.xml (and, optionally, hbase-default.xml) configuration file. If both the ZooKeeper hosts and the HBase XML configuration options are supplied, the ZooKeeper hosts take precedence.
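The precedence rule described above can be sketched in a few lines. This is only an illustration, not the step's actual implementation: `resolve_quorum` and `quorum_from_site_xml` are hypothetical helper names, the host names are made up, and the sketch assumes the quorum is stored in hbase-site.xml under the standard `hbase.zookeeper.quorum` property.

```python
import xml.etree.ElementTree as ET

# Standard HBase property that holds the ZooKeeper quorum host list.
QUORUM_PROPERTY = "hbase.zookeeper.quorum"


def split_hosts(text):
    """Split a comma-separated host string into a clean list."""
    return [h.strip() for h in (text or "").split(",") if h.strip()]


def quorum_from_site_xml(xml_text):
    """Return the quorum hosts found in hbase-site.xml text, or []."""
    root = ET.fromstring(xml_text.strip())
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == QUORUM_PROPERTY:
            return split_hosts(prop.findtext("value", ""))
    return []


def resolve_quorum(zookeeper_hosts, site_xml_text=None):
    """Hypothetical helper: an explicit host list wins over the XML file."""
    explicit = split_hosts(zookeeper_hosts)
    if explicit:
        return explicit
    if site_xml_text:
        return quorum_from_site_xml(site_xml_text)
    return []


# Made-up example configuration.
SITE_XML = """
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk-a.example.com,zk-b.example.com</value>
  </property>
</configuration>
"""

# Both supplied: the explicit ZooKeeper host list is used.
print(resolve_quorum("zk1.example.com, zk2.example.com", SITE_XML))
# Host field left blank: fall back to the XML configuration.
print(resolve_quorum("", SITE_XML))
```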
Option | Definition |
---|---|
Step name | The name of this step as it appears in the transformation workspace. |
Zookeeper host(s) | Comma-separated list of hostnames for the zookeeper quorum. |
URL to hbase-site.xml | Address of the hbase-site.xml file. |
URL to hbase-default.xml | Address of the hbase-default.xml file. |
HBase table name | The source HBase table to read from. Click Get Mapped Table Names to populate the drop-down list of possible table names. |
Mapping name | A mapping to decode and interpret column values. Click Get Mappings For the Specified Table to populate the drop-down list of available mappings. |
Start key value (inclusive) for table scan | A starting key value to retrieve rows from. This is inclusive of the value entered. |
Stop key value (exclusive) for table scan | A stopping key value for the scan. This is exclusive of the value entered. Both fields or the stop key field may be left blank. If the stop key field is left blank, then all rows from (and including) the start key will be returned. |
Scanner row cache size | The number of rows that should be cached each time a fetch request is made to HBase. Leaving this blank uses the default, which is to perform no caching; one row would be returned per fetch request. Setting a value in this field will increase performance (faster scans) at the expense of memory consumption. |
\# | The order of query limitation fields. |
Alias | The name that the field will be given in the output stream. |
Key | Indicates whether the field is the table's key field or not. |
Column family | The column family in the HBase source table that the field belongs to. |
Column name | The name of the column in the HBase table (family + column name uniquely identifies a column in the HBase table). |
Type | The PDI data type for the field. |
Format | A formatting mask to apply to the field. |
Indexed values | Indicates whether the field has a predefined set of values that it can assume. |
Get Key/Fields Info | Assuming the connection information is complete and valid, this button will populate the field list and display the name of the key. |
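To make the scan options concrete, the following sketch simulates how the start key (inclusive), the stop key (exclusive), and the scanner row cache size interact. It is a simplified model over an in-memory list of keys, not HBase client code: `scan` and `fetch_requests` are hypothetical names, the row keys are made up, and the fetch count is an idealized estimate that ignores any extra round trips a real client may make.

```python
import math
from bisect import bisect_left


def scan(sorted_keys, start_key=None, stop_key=None):
    """Return keys in [start_key, stop_key); a blank (None) bound is open."""
    lo = 0 if start_key is None else bisect_left(sorted_keys, start_key)
    hi = len(sorted_keys) if stop_key is None else bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


def fetch_requests(row_count, cache_size=None):
    """Idealized number of fetch requests for a scan of row_count rows."""
    rows_per_fetch = cache_size or 1  # blank field: one row per fetch request
    return math.ceil(row_count / rows_per_fetch)


# Row keys are compared as bytes, in lexicographic order.
keys = sorted([b"row-001", b"row-002", b"row-003", b"row-004", b"row-005"])

# Start key is included, stop key is excluded.
print(scan(keys, start_key=b"row-002", stop_key=b"row-004"))
# Blank stop key: everything from (and including) the start key.
print(scan(keys, start_key=b"row-004"))

# 1000 rows with no caching versus a cache size of 100.
print(fetch_requests(1000))
print(fetch_requests(1000, cache_size=100))
```

A larger cache size reduces the number of round trips, but each fetch holds more rows in memory at once, which is the trade-off the Scanner row cache size option describes.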
Create/Edit Mappings
This tab creates or edits a mapping for a given HBase table. A mapping simply defines metadata about the values that are stored in the table. Since most information is stored as raw bytes in HBase, this enables PDI to decode values and execute meaningful comparisons for column-based result set filtering.
Option | Definition |
---|---|
HBase table name | Displays a list of table names. Connection information in the previous tab must be valid and complete in order for this drop-down list to populate. |
Mapping name | Names of any mappings that exist for the table. This box will be empty if there are no mappings defined for the selected table, in which case you can enter the name of a new mapping. |
\# | The order of the mapping operation. |
Alias | The name you want to assign to the HBase table key. This is required for the table key column, but optional for non-key columns. |
Key | Indicates whether or not the field is the table's key. |
Column family | The column family in the HBase source table that the field belongs to. Non-key columns must specify a column family and column name. |
Column name | The name of the column in the HBase table. |
Type | Data type of the column. Key columns can be of type: String, Integer, Unsigned integer (positive only), Long, Unsigned long (positive only), Date, Unsigned date. Non-key columns can be of type: String, Integer, Long, Float, Double, Boolean, Date, BigNumber, Serializable, Binary. |
Indexed values | String columns may optionally have a set of legal values defined for them by entering comma-separated data into this field. |
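The role of a mapping can be illustrated with a small sketch that turns raw cell bytes into typed values. Everything here is an assumption made for illustration: the `MAPPING` table, the column families, and the aliases are invented, and the decoders assume UTF-8 strings and big-endian numeric encodings, which may differ from how a given table was actually written.

```python
import struct

# Hypothetical mapping: (column family, column name) -> (alias, type).
MAPPING = {
    ("info", "name"): ("customer_name", "String"),
    ("info", "age"): ("age", "Integer"),
    ("stats", "balance"): ("balance", "Double"),
}

# Assumed encodings: UTF-8 text and big-endian fixed-width numbers.
DECODERS = {
    "String": lambda raw: raw.decode("utf-8"),
    "Integer": lambda raw: struct.unpack(">i", raw)[0],
    "Long": lambda raw: struct.unpack(">q", raw)[0],
    "Float": lambda raw: struct.unpack(">f", raw)[0],
    "Double": lambda raw: struct.unpack(">d", raw)[0],
}


def decode_row(cells):
    """Decode mapped cells to typed values; unmapped columns are skipped."""
    decoded = {}
    for (family, column), raw in cells.items():
        if (family, column) in MAPPING:
            alias, type_name = MAPPING[(family, column)]
            decoded[alias] = DECODERS[type_name](raw)
    return decoded


# A made-up row as it might come back from a scan: raw bytes only.
raw_row = {
    ("info", "name"): b"Ada",
    ("info", "age"): struct.pack(">i", 36),
    ("stats", "balance"): struct.pack(">d", 1250.5),
    ("stats", "unmapped"): b"\x00\x01",  # not in the mapping, so not emitted
}

print(decode_row(raw_row))
```

Without the type information, the four bytes stored for `age` are indistinguishable from a short string or part of a longer number, which is why meaningful comparisons require a mapping.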
Filter Result Set
This tab provides two fields that limit the range of key values returned by a table scan. Leaving both fields blank will result in all rows being retrieved from the source table.
Option | Definition |
---|---|
Match all / Match any | When multiple column filters have been defined, you have the option of returning only those rows that match all filters, or those that match any single filter. Bounded ranges on a single numeric column can be defined by defining two filters (upper and lower bounds) and selecting Match all; similarly, open-ended ranges can be defined by selecting Match any. |
\# | The order of the filter operation. |
Alias | A drop-down box of column alias names from the mapping. |
Type | Data type of the column. This is automatically populated when you select a field after choosing the alias. |
Operator | A drop-down box that contains either equality/inequality operators for numeric, date, and boolean fields; or substring and regular expression operators for string fields. |
Comparison value | A comparison constant to use in conjunction with the operator. |
Format | A formatting mask to apply to the field. |
Signed comparison | Specifies whether or not the comparison constant and/or field values involve negative numbers (for non-string fields only). If field values and comparison constants are only positive for a given filter, then HBase's native lexicographical byte-based comparisons are sufficient. If this is not the case, then it is necessary for column values to be deserialized from bytes to actual numbers before performing the comparison. |
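The reason for the Signed comparison option can be seen in a short sketch. It assumes integers are stored as 4-byte big-endian two's-complement values (an assumption about the encoding, not something this page specifies); the function names are hypothetical.

```python
import struct


def encode_int(value):
    """Encode an integer as 4 big-endian two's-complement bytes (assumed layout)."""
    return struct.pack(">i", value)


def byte_less_than(raw_a, raw_b):
    """Lexicographic comparison on the raw bytes, with no deserialization."""
    return raw_a < raw_b


def signed_less_than(raw_a, raw_b):
    """Deserialize both values to numbers first, then compare."""
    return struct.unpack(">i", raw_a)[0] < struct.unpack(">i", raw_b)[0]


# Both values positive: the two approaches agree.
print(byte_less_than(encode_int(5), encode_int(10)))     # True
print(signed_less_than(encode_int(5), encode_int(10)))   # True

# A negative value: -1 is encoded as ff ff ff ff, which sorts after 10
# when compared byte by byte, so the raw comparison gives the wrong answer.
print(byte_less_than(encode_int(-1), encode_int(10)))    # False
print(signed_less_than(encode_int(-1), encode_int(10)))  # True
```

Deserializing every value costs more than a raw byte comparison, so the option is only worth enabling when negative numbers can actually occur in the column or the comparison constant.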
Performance Considerations
Specifying fields in the Configure Query tab results in scans that return only those columns. Because HBase is a sparse, column-oriented database, it must check whether each row contains each requested column. More lookups mean slower scans, although Bloom filters (if enabled on the table in question) mitigate this to a certain extent.
If, on the other hand, the fields table in the Configure Query tab is left blank, the scan returns every column that exists in each row, not only those that have been defined in the mapping. The HBase Input step still emits only the columns defined in the mapping being used. Because all columns are returned, HBase does not have to do any lookups. However, if the table in question contains many columns and is dense, more data is transferred over the network.