How to use a PDI transformation that sources data from a flat file and writes to an HBase table.
Note
For brevity, this guide uses a prepared dataset and a simple transformation. In practice, you can use the full power of PDI transformations to transform and prepare your data for HBase loads.
The data we will be loading contains pageview counts by IP Address, Year and Month. The HBase table's key will be a concatenation of IP Address and Year. One column will exist per month in the year containing the pageview count for the key.
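The table layout described above can be sketched in Python. This is only an illustration and is not part of the sample files: the '|' separator between IP address and year, the sample IP address, and the three-letter month column names are assumptions made for the example.

```python
# Sketch of the row layout: one HBase row per (IP address, year),
# with one column per month holding the pageview count.
# ASSUMPTIONS: the "|" key separator and the month column names are
# illustrative; check weblogs_hbase.txt for the actual conventions.

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def make_row_key(ip_address, year, separator="|"):
    """Concatenate IP address and year into a single row key."""
    return f"{ip_address}{separator}{year}"


def pivot_pageviews(records):
    """Group (ip, year, month, count) records into one row per key.

    `month` is a number from 1 to 12.
    Returns {row_key: {month_name: total_count}}.
    """
    rows = {}
    for ip_address, year, month, count in records:
        if not 1 <= month <= 12:
            raise ValueError(f"month out of range: {month}")
        columns = rows.setdefault(make_row_key(ip_address, year), {})
        name = MONTHS[month - 1]
        # Counts for the same key and month are summed.
        columns[name] = columns.get(name, 0) + count
    return rows


if __name__ == "__main__":
    # Made-up sample records, not taken from the real dataset.
    sample = [("10.0.0.1", 2012, 1, 5),
              ("10.0.0.1", 2012, 2, 7),
              ("10.0.0.1", 2011, 12, 3)]
    for key, columns in pivot_pageviews(sample).items():
        print(key, columns)
```

Putting the year in the key keeps each row to at most twelve count columns, one per month.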
Prerequisites
To follow along with this how-to guide, you will need the following:
- MapR
- Pentaho Data Integration
- HBase
Sample Files
The sample data file needed for this guide is:
File Name | Content
weblogs_hbase.txt (attached to 'Using Pentaho with MapR' as weblogs_hbase.txt.zip) | Prepared data for HBase load
Step-By-Step Instructions
Setup
Start MapR if it is not already running.
Create an HBase Table
- Open the HBase Shell: Open the HBase shell by entering 'hbase shell' at the command line.
- Create the Table in HBase: Enter the following in the HBase shell.
This creates the weblogs table with a single column family named pageviews.
create 'weblogs', 'pageviews'
- Close the HBase Shell: You are done with the HBase Shell for now, so close it by entering 'quit'.
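The same statement can also be sent to the shell without an interactive session, by piping it to 'hbase shell' on standard input. The Python sketch below assumes the 'hbase' launcher is on the PATH; both helper names are hypothetical and exist only for this example.

```python
import subprocess


def create_table_command(table, *families):
    """Build an HBase shell 'create' statement for a table and its families."""
    if not families:
        raise ValueError("an HBase table needs at least one column family")
    quoted = ", ".join(f"'{name}'" for name in (table, *families))
    return f"create {quoted}"


def run_in_hbase_shell(*statements):
    """Feed statements to 'hbase shell' on stdin and return what it prints.

    ASSUMPTION: the 'hbase' launcher is on the PATH of the current user.
    """
    # End the script with 'quit' so the shell exits after the last statement.
    script = "\n".join(statements) + "\nquit\n"
    result = subprocess.run(["hbase", "shell"], input=script,
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling run_in_hbase_shell(create_table_command('weblogs', 'pageviews')) sends the same create statement used in this step.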
Create a Transformation to Load Data into HBase
In this task you will create a transformation that loads the sample file into the HBase table.
Speed Tip
You can download the completed Kettle transformation, load_hbase.ktr.
- Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
- Add an Input Step: You need to tell PDI where to get the data from, so expand the 'Input' section of the Design palette and drag a 'Text File Input' step onto the transformation canvas.
- Edit the Input Step: Double-click on the 'Text File Input' node to edit its properties. Enter this information:
- File or Directory: Browse to the weblogs_hbase.txt file.
- Click the 'Add' button.
When you are done your window should look like:
- Configure File Content: Switch to the 'Content' tab and do the following:
- Separator: Clear and click the 'Insert TAB' button.
- Header: Check the 'Header' checkbox
- Format: Select 'Unix'
When you are done your window should look like:
- Configure the Input Fields: Switch to the 'Fields' tab and do the following:
- Click the 'Get Fields' Button
- When prompted for 'Number of sample lines' use 100 and click 'OK'
- Change the 'Type' for the 'key' field to 'String' and the 'Length' to 20.
When you are done your window should look like:
Click 'OK' to close the window.
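For readers who want to inspect the file outside PDI, the settings above (tab separator, header row, string-typed key) correspond roughly to the following Python sketch. It assumes that every column other than 'key' holds an integer count, and the sample line in the demo is made up.

```python
import csv
import io


def read_weblogs(stream, key_field="key"):
    """Yield one dict per data line of a tab-separated file with a header row.

    The key is kept as a string; every other column is converted to int.
    ASSUMPTION: all non-key columns contain integer pageview counts.
    """
    reader = csv.DictReader(stream, delimiter="\t")
    for row in reader:
        yield {name: (value if name == key_field else int(value))
               for name, value in row.items()}


if __name__ == "__main__":
    # Made-up two-column sample; the real file has more month columns.
    sample = io.StringIO("key\tjan\tfeb\n10.0.0.1|2012\t5\t7\n")
    for record in read_weblogs(sample):
        print(record)
```

To read the real file, pass an open file handle instead of the in-memory sample, for example open('weblogs_hbase.txt', newline='').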
- Add an HBase Output Step: You are going to store your data in HBase, so expand the 'Hadoop' section of the Design palette and drag an 'HBase Output' node onto the transformation canvas. Your transformation should look like:
- Connect the Input and Output steps: Hover the mouse over the 'Text file input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'HBase Output' node. Your canvas should look like:
- Edit the HBase Output Step: Double-click on the 'HBase Output' node to edit its properties. Enter this information:
- Zookeeper host(s): A comma-separated list of your HBase Zookeeper hosts. For local single-node clusters use 'localhost'.
- Zookeeper port: The port for your Zookeeper hosts. By default this is '5181'.
When you are done your window should look like:
- Create an HBase Mapping: You need to tell Pentaho how to store the data in HBase, so switch to the 'Create/Edit mappings' tab and do the following:
- HBase table name: Select 'weblogs'.
- Mapping name: Enter 'pageviews'.
- Click the 'Get incoming fields' button.
- For the alias 'key', change the 'Key' column to 'Y', clear the 'Column family' and 'Column name' fields, and set the 'Type' field to 'String'.
- Click the 'Save mapping' button.
When you are done your window should look like:
- Finish Configuring the Connection: You need to tell the HBase output to use the mapping you just created, so switch back to the 'Configure connection' tab and do the following:
- Click the 'Get table names' button.
- HBase table name: Select 'weblogs'.
- Click the 'Get mappings for the specified table' button.
- Mapping name: Select 'pageviews'.
When you are done your window should look like:
Click 'OK' to close the window.
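Conceptually, the mapping splits each incoming row into a row key and a set of cells. The Python sketch below illustrates that idea only; it is not how PDI implements the step. It assumes each non-key field is stored in the 'pageviews' column family under a column named after the field, and the function name is hypothetical.

```python
def to_hbase_cells(row, key_field="key", family="pageviews"):
    """Split an incoming row into (row_key, {"family:column": value}).

    ASSUMPTION: every field except the key is written to `family`,
    using the field name as the column name.
    """
    if key_field not in row:
        raise KeyError(f"row has no '{key_field}' field")
    row_key = str(row[key_field])
    cells = {f"{family}:{name}": value
             for name, value in row.items() if name != key_field}
    return row_key, cells


if __name__ == "__main__":
    # Made-up row for illustration.
    print(to_hbase_cells({"key": "10.0.0.1|2012", "jan": 5, "feb": 7}))
```

The key field has no column family or column name because it becomes the row key, not a cell.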
- Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'load_hbase.ktr' into a folder of your choice.
- Run the Transformation: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the transformation toolbar. An 'Execute a transformation' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and it will show you the progress of the transformation as it runs. After several seconds the transformation should finish successfully:
If any errors occurred, the step that failed will be highlighted in red and you can use the 'Logging' tab to view error messages.
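A saved transformation can also be run without the graphical client by using Pan, PDI's command-line transformation runner. The Python sketch below only builds the argument list; the installation directory and file path in the demo are hypothetical placeholders, and the helper name is invented for this example.

```python
def pan_command(pdi_home, transformation, level="Basic"):
    """Build the argument list for running a .ktr file with pan.sh.

    ASSUMPTION: `pdi_home` is the data-integration directory that
    contains pan.sh (Linux or macOS layout).
    """
    return [f"{pdi_home.rstrip('/')}/pan.sh",
            f"-file={transformation}",
            f"-level={level}"]


if __name__ == "__main__":
    # Hypothetical paths; adjust them to your installation.
    print(pan_command("/opt/pentaho/data-integration",
                      "/home/user/load_hbase.ktr"))
```

Passing the resulting list to subprocess.run with check=True raises an exception if Pan exits with a non-zero status.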
Check HBase
- Open the HBase Shell: Open the HBase shell so you can check that your table loaded by entering 'hbase shell' at the command line.
- Scan the Table: You want to scan the table to ensure data loaded, so run the following command.
scan 'weblogs'
- Close the HBase Shell: You are done with the HBase Shell for now, so close it by entering 'quit' in the HBase Shell.
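A scan prints every cell, which gets long for a full table. If the shell output has been captured as text, the row total can be read from the summary line instead. This Python sketch assumes the scan output contains a line that begins with 'N row(s)', which may vary between HBase versions; the function name is hypothetical.

```python
import re

# ASSUMPTION: the shell reports the total as a line starting "N row(s)".
ROW_COUNT = re.compile(r"^(\d+) row\(s\)", re.MULTILINE)


def scanned_row_count(shell_output):
    """Return the row count from the last 'N row(s)' summary line."""
    matches = ROW_COUNT.findall(shell_output)
    if not matches:
        raise ValueError("no 'N row(s)' summary line found")
    return int(matches[-1])
```

A count of zero after a run that reported success suggests that the transformation wrote to a different table or cluster than the one being scanned.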
Summary
In this guide you learned how to load data into HBase using PDI. Other guides in this series cover how to get data out of HBase and how to report on data in HBase.