This guide shows how to create a PDI transformation that sources data from a flat file and writes it to an HBase table.
Note
For brevity's sake, we will use a prepared dataset and a simple transformation. In practice, you would use the full power of PDI's transformation functionality to transform and prepare your data for HBase loads.
The data you will be loading contains pageview counts by IP address, year, and month. The HBase table's key will be a concatenation of the IP address and year. One column will exist per month of the year, containing the pageview count for that key.
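The row-key and column layout described above can be sketched in Python. The separator between IP address and year and the zero-padded month format are assumptions for illustration; the guide does not specify them:

```python
def make_row_key(ip, year, sep=" "):
    """Build the HBase row key by concatenating IP address and year.
    The separator is an assumption; the actual sample data may differ."""
    return f"{ip}{sep}{year}"

def make_cell(month, count):
    """One column per month in the 'pageviews' column family,
    holding the pageview count for that row key."""
    return f"pageviews:{month:02d}", str(count)

# A record for IP 10.0.0.1 in January 2012 with 526 pageviews would land at:
row_key = make_row_key("10.0.0.1", 2012)   # "10.0.0.1 2012"
column, value = make_cell(1, 526)          # ("pageviews:01", "526")
```

Each incoming row thus contributes one cell to the row identified by its IP-plus-year key.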
Prerequisites
In order to follow along with this how-to guide you will need the following:
- Hadoop
- Pentaho Data Integration
- HBase
Sample Files
The sample data file needed for this guide is:
File Name | Content |
weblog_hbase.txt | Prepared data for HBase load |
Step-By-Step Instructions
Setup
Start Hadoop if it is not already running.
Start HBase if it is not already running.
Create an HBase Table
- Open the HBase Shell: Connect to the HBase shell by entering 'hbase shell' in an SSH terminal.
- Create the Table in HBase: Enter the following in the HBase shell.
create 'weblogs', 'pageviews'
This creates the weblogs table with a single column family named pageviews.
- Close the HBase Shell: Type 'quit' to exit the HBase shell.
Create a Transformation to Load Data into HBase
In this task you will load a file into HBase.
Speed Tip
Downloading the Kettle Transformation load_hbase.ktr will save time as it is already configured to load the HBase data.
- Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
- Add an Input Step: Expand the 'Input' section of the Design palette and drag a 'Text File Input' step onto the transformation canvas.
- Edit the Input Step: Double-click on the 'Text File Input' node to edit its properties. Enter this information:
- File or Directory: Browse to the weblog_hbase.txt file.
- Click the 'Add' button.
When you are done, your window should look like this:
- Configure File Content: Switch to the 'Content' tab and do the following:
- Separator: Clear and click the 'Insert TAB' button.
- Header: Check the 'Header' checkbox
- Format: Select 'Unix'
When you are done, your window should look like this:
- Configure the Input Fields: Switch to the 'Fields' tab and do the following:
- Click the 'Get Fields' Button
- When prompted for 'Number of sample lines' use 100 and click 'OK'
- Change the 'Type' for the 'key' field to 'String' and the 'Length' to 20.
When you are done, your window should look like this:
Click 'OK' to close the window.
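The input settings above amount to parsing a tab-separated file with a header row and Unix line endings. A minimal Python equivalent of that parse, using assumed column names for illustration, looks like this:

```python
import csv
import io

# Tab-separated sample in the shape the guide describes: a 'key' column
# (IP address concatenated with year) plus one pageview count per month.
# The exact column names here are assumptions for illustration.
sample = (
    "key\tmonth_01\tmonth_02\n"
    "10.0.0.1 2012\t526\t312\n"
)

# Header row present, TAB separator, Unix line endings -- mirroring the
# 'Content' tab settings of the Text File Input step.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)
print(rows[0]["key"])  # 10.0.0.1 2012
```

Every field arrives as a string, which is why the 'key' field's type and length are set explicitly in the step.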
- Add an HBase Output Step: You are going to store your data in HBase, so expand the 'Big Data' section of the Design palette and drag an 'HBase Output' step onto the transformation canvas. Your transformation should look like this:
- Connect the Input and Output Steps (if they are not already connected): Hover the mouse over the 'Text file input' step and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'HBase Output' step. Your canvas should look like this:
- Edit the HBase Output Step: Double-click on the 'HBase Output' node to edit its properties.
- Select the cluster from the drop-down menu.
- Click the 'Get table names' button and select 'weblogs' from the drop-down.
NOTE: If the table/mapping names are not present (the drop-down is empty), you will need to create them under the 'Create/Edit mappings' tab and save the mapping; it will then appear in the drop-down menu.
- Create an HBase Mapping: You need to tell Pentaho how to store the data in HBase, so switch to the 'Create/Edit mappings' tab and do the following:
- HBase table name: Select 'weblogs'.
- Mapping name: Enter 'pageviews'.
- Click the 'Get incoming fields' button.
- For the alias 'key', change the 'Key' column to 'Y', empty the 'Column family' and 'Column name' fields, and set the 'Type' field to 'String'.
- Click the 'Save mapping' button.
When you are done your window should look like:
- Finish Configuring the Connection: You need to tell the HBase output to use the mapping you just created, so switch back to the 'Configure connection' tab and do the following:
- Click the 'Get table names' button.
- HBase table name: Select 'weblogs'.
- Click the 'Get mappings for the specified table' button.
- Mapping name: Select 'pageviews'.
When you are done your window should look like:
Click 'OK' to close the window.
- Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'load_hbase.ktr' into a folder of your choice.
- Run the Transformation: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the transformation toolbar. An 'Execute a transformation' window will open; click the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window, showing the progress of the transformation as it runs. After several seconds the transformation should finish successfully:
If any errors occurred, the step that failed will be highlighted in red, and you can use the 'Logging' tab to view error messages.
Check HBase
- Open the HBase Shell: Open the HBase shell so you can check that your table loaded by entering 'hbase shell' at the command line.
- Scan the Table: You want to scan the table to ensure the data loaded, so run the following command.
scan 'weblogs', {LIMIT => 10}
- Close the HBase Shell: You are done with the HBase Shell for now, so close it by entering 'quit' in the HBase Shell.
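If you prefer to verify the load from Python rather than the HBase shell, a small formatting helper makes scanned rows easy to read. The helper below is hypothetical and is shown with illustrative sample bytes; the commented scan sketch assumes the happybase client library and a local cluster:

```python
def format_row(key, data):
    """Render one scanned HBase row as 'key  column=value ...'.

    `data` maps column bytes (b'family:qualifier') to value bytes,
    which is the shape happybase's Table.scan() yields."""
    cells = " ".join(f"{col.decode()}={val.decode()}"
                     for col, val in sorted(data.items()))
    return f"{key.decode()}  {cells}"

# Example using the layout this guide loads (sample values are illustrative):
row = format_row(b"10.0.0.1 2012",
                 {b"pageviews:01": b"526", b"pageviews:02": b"312"})
print(row)  # 10.0.0.1 2012  pageviews:01=526 pageviews:02=312

# Against a live cluster you could scan much as the shell does (connection
# details are assumptions):
#   import happybase
#   table = happybase.Connection("localhost").table("weblogs")
#   for key, data in table.scan(limit=10):
#       print(format_row(key, data))
```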
Summary
In this guide you learned how to load data into HBase using PDI. Other guides in this series cover how to get data out of HBase and how to report on data in HBase.