HBase Output
This step writes data to an HBase table according to user-defined column metadata.
Configure Connection
This tab contains HBase connection information. You can configure a connection in one of two ways: either with a comma-separated list of the hostnames where the ZooKeeper quorum resides, or with an hbase-site.xml (and, optionally, hbase-default.xml) configuration file. If both the ZooKeeper hosts and the HBase XML configuration files are supplied, the ZooKeeper setting takes precedence.
Option | Definition |
---|---|
Step name | The name of this step as it appears in the transformation workspace. |
Zookeeper host(s) | Comma-separated list of hostnames for the ZooKeeper quorum. |
URL to hbase-site.xml | Address of the hbase-site.xml file. |
URL to hbase-default.xml | Address of the hbase-default.xml file. |
HBase table name | The HBase table to write to. Click Get Mapped Table Names to populate the drop-down list of possible table names. |
Mapping name | A mapping to decode and interpret column values. Click Get Mappings For the Specified Table to populate the drop-down list of available mappings. |
Disable write to WAL | Disables writing to the Write Ahead Log (WAL). The WAL is used as a lifeline to restore the status quo if the server goes down while data is being inserted. Disabling the WAL increases performance at the cost of that recovery capability. |
Size of write buffer (bytes) | The size of the write buffer used to transfer data to HBase. A larger buffer consumes more memory (on both the client and server), but results in fewer remote procedure calls. The default (set in hbase-default.xml) is 2MB (2097152 bytes), which is used if the field is left blank. |
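The precedence rule for the two connection options can be sketched as follows. This is a minimal illustration, not PDI's implementation: the function name `resolve_quorum_source`, its parameters, and the host names in the usage note are hypothetical.

```python
from typing import List, Tuple


def resolve_quorum_source(zookeeper_hosts: str = "",
                          site_xml_url: str = "") -> Tuple[str, List[str]]:
    """Decide which connection option is used (hypothetical helper).

    Returns ("zookeeper", [hosts...]) when the host field is filled in,
    otherwise ("xml", []) when only an hbase-site.xml address is given.
    """
    # Split the comma-separated host field, ignoring blanks and stray spaces.
    hosts = [h.strip() for h in zookeeper_hosts.split(",") if h.strip()]
    if hosts:
        # The ZooKeeper host list wins even if an XML file is also supplied.
        return "zookeeper", hosts
    if site_xml_url.strip():
        return "xml", []
    raise ValueError("supply ZooKeeper host(s) or an hbase-site.xml address")
```

For instance, `resolve_quorum_source("zk1, zk2", "file:///conf/hbase-site.xml")` selects the host list `["zk1", "zk2"]`, while passing an empty host field falls back to the XML file.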
Create/Edit Mappings
This tab creates or edits a mapping for a given HBase table. A mapping simply defines metadata about the values that are stored in the table. Since just about all information is stored as raw bytes in HBase, this allows PDI to decode values and execute meaningful comparisons for column-based result set filtering.
The names of fields entering the step are expected to match the aliases of fields defined in the mapping. Every incoming field must have a matching counterpart in the mapping. There may be fewer incoming fields than the mapping defines, but an incoming field with no counterpart in the mapping causes an error. Furthermore, one of the incoming fields must match the key defined in the mapping.
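The field-matching rules above can be expressed as a short check. This is a sketch under stated assumptions: `check_incoming_fields` and its arguments are hypothetical names, and the step's real validation and error messages may differ.

```python
def check_incoming_fields(incoming, mapping_aliases, key_alias):
    """Apply the matching rules; return mapped aliases that receive no data.

    Hypothetical helper: raises ValueError when a rule is violated.
    """
    incoming_set = set(incoming)
    aliases = set(mapping_aliases)

    # Rule 1: every incoming field needs a counterpart in the mapping.
    unmapped = sorted(incoming_set - aliases)
    if unmapped:
        raise ValueError(f"incoming fields not in mapping: {unmapped}")

    # Rule 2: one incoming field must match the mapping's key.
    if key_alias not in incoming_set:
        raise ValueError(f"no incoming field matches the key {key_alias!r}")

    # Fewer incoming fields than mapped fields is allowed.
    return sorted(aliases - incoming_set)
```

With a mapping of `id` (key), `name`, and `age`, incoming fields `id` and `name` pass and leave `age` unused, whereas an extra incoming field such as `email` raises an error.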
Option | Definition |
---|---|
HBase table name | Displays a list of table names. Connection information in the previous tab must be valid and complete in order for this drop-down list to populate. |
Mapping name | Names of any mappings that exist for the table. This box will be empty if there are no mappings defined for the selected table, in which case you can enter the name of a new mapping. |
# | The order of the mapping operation. |
Alias | The name you want to assign to the HBase table key. This is required for the table key column, but optional for non-key columns. |
Key | Indicates whether or not the field is the table's key. |
Column family | The column family in the HBase source table that the field belongs to. Non-key columns must specify a column family and column name. |
Column name | The name of the column in the HBase table. |
Type | Data type of the column. Key columns can be of type: String, Integer, Unsigned integer (positive only), Long, Unsigned long (positive only), Date, Unsigned date. Non-key columns can be of type: String, Integer, Long, Float, Double, Boolean, Date, BigNumber, Serializable, Binary. |
Indexed values | String columns may optionally have a set of legal values defined for them by entering comma-separated data into this field. |
Get incoming fields | Retrieves a field list using the given HBase table and mapping names. |
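To illustrate why a mapping must declare a Type for each column, the sketch below decodes the same raw bytes under different declared types. The big-endian, fixed-width layouts and the `decode` helper are assumptions made for illustration; the byte layouts PDI actually writes may differ.

```python
import struct

# Assumed layouts for illustration only: big-endian, fixed-width numbers.
_DECODERS = {
    "String": lambda raw: raw.decode("utf-8"),
    "Integer": lambda raw: struct.unpack(">i", raw)[0],  # 4 bytes
    "Long": lambda raw: struct.unpack(">q", raw)[0],     # 8 bytes
    "Double": lambda raw: struct.unpack(">d", raw)[0],   # 8 bytes
}


def decode(raw: bytes, type_name: str):
    """Interpret raw bytes according to the type named in the mapping."""
    if type_name not in _DECODERS:
        raise ValueError(f"unsupported type in this sketch: {type_name}")
    return _DECODERS[type_name](raw)


# The same eight bytes yield different values depending on the declared type.
raw = struct.pack(">q", 1)
as_long = decode(raw, "Long")
as_double = decode(raw, "Double")
```

Here `as_long` is the integer 1, while `as_double` is a tiny floating-point number, so without the mapping's type metadata the stored bytes are ambiguous.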
Performance Considerations
The Configure Connection tab provides a field for setting the size of the write buffer used to transfer data to HBase. A larger buffer consumes more memory (on both the client and server), but results in fewer remote procedure calls. The default (defined in the hbase-default.xml file) is 2MB. When the field is left blank, the buffer is 2MB, auto flush is enabled, and Put operations are executed immediately, so each row is transmitted to HBase as soon as it arrives at the step. Entering a number (even one equal to the default) disables auto flush, and incoming rows are transferred only once the buffer is full.
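The effect of the write buffer setting on the number of remote procedure calls can be simulated with a small model. This is a simplified sketch, not the HBase client: the `BufferedWriter` class is hypothetical, and it counts one call per flush.

```python
DEFAULT_BUFFER_BYTES = 2097152  # 2MB default from hbase-default.xml


class BufferedWriter:
    """Toy model of the step's write path (hypothetical, for illustration)."""

    def __init__(self, write_buffer_size=None):
        # A blank field (None) means auto flush stays enabled.
        self.auto_flush = write_buffer_size is None
        self.buffer_size = (DEFAULT_BUFFER_BYTES if write_buffer_size is None
                            else write_buffer_size)
        self.pending_bytes = 0
        self.rpc_calls = 0

    def put(self, row: bytes):
        self.pending_bytes += len(row)
        # Auto flush sends every row at once; otherwise wait for a full buffer.
        if self.auto_flush or self.pending_bytes >= self.buffer_size:
            self.flush()

    def flush(self):
        if self.pending_bytes:
            self.rpc_calls += 1
            self.pending_bytes = 0


def count_calls(rows, write_buffer_size=None):
    writer = BufferedWriter(write_buffer_size)
    for row in rows:
        writer.put(row)
    writer.flush()  # send whatever is left at the end
    return writer.rpc_calls


rows = [b"x" * 1000 for _ in range(100)]
calls_blank = count_calls(rows)                       # auto flush
calls_small = count_calls(rows, 10_000)               # 10 rows per flush
calls_default = count_calls(rows, DEFAULT_BUFFER_BYTES)
```

In this model, 100 rows of 1000 bytes cost 100 calls with the field blank, 10 calls with a 10000-byte buffer, and a single final call when the default size is entered explicitly.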
There is also a checkbox for disabling writing to the Write Ahead Log (WAL). The WAL is used as a lifeline to restore the status quo if the server goes down while data is being inserted. Writing to the WAL costs speed, so disabling it improves write performance at the expense of error recovery.
The Create/edit mappings tab has options for creating new tables. In the HBase table name field, you can suffix the name of the new table with parameters for specifying what kind of compression to use, and whether or not to use Bloom filters to speed up lookups. The options for compression are: NONE, GZ and LZO; the options for Bloom filters are: NONE, ROW, ROWCOL. If nothing is selected (or only the name of the new table is defined), then the default of NONE is used for both compression and Bloom filters. For example, the following string entered in the HBase table name field specifies that a new table called "NewTable" should be created with GZ compression and ROWCOL Bloom filters:
NewTable (GZ)ROWCOL
Due to licensing constraints, HBase does not ship with LZO compression libraries. These must be manually installed on each node if you want to use LZO compression.
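The allowed option values and their defaults can be summarized in a small validation sketch. The function `resolve_table_options` and its return keys are hypothetical; it takes the two options as separate arguments and does not attempt to parse the table name string.

```python
COMPRESSION_OPTIONS = ("NONE", "GZ", "LZO")
BLOOM_FILTER_OPTIONS = ("NONE", "ROW", "ROWCOL")


def resolve_table_options(compression=None, bloom_filter=None):
    """Validate new-table options, defaulting both to NONE (hypothetical helper)."""
    compression = (compression or "NONE").upper()
    bloom_filter = (bloom_filter or "NONE").upper()
    if compression not in COMPRESSION_OPTIONS:
        raise ValueError(f"compression must be one of {COMPRESSION_OPTIONS}")
    if bloom_filter not in BLOOM_FILTER_OPTIONS:
        raise ValueError(f"Bloom filter must be one of {BLOOM_FILTER_OPTIONS}")
    return {
        "compression": compression,
        "bloom_filter": bloom_filter,
        # LZO libraries do not ship with HBase and need a manual install.
        "needs_lzo_install": compression == "LZO",
    }
```

Calling `resolve_table_options("GZ", "ROWCOL")` corresponds to the example above, and calling it with no arguments yields NONE for both options.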