Reporting on HBase Data
How to create a report that sources data from HBase.
In order to report on HBase data in a parameterized fashion, you will need to perform the following steps:
- Create a PDI Transformation that sources the picklist of distinct IP Address to be used for a report parameter
- Create a PDI Transformation that sources a set of weblog data for a selected IP Address. This will be the primary data source for the report
- Create a Report that uses the PDI transformations for parameter list and report data.
Prerequisites
In order follow along with this how-to guide you will need the following:
- Hadoop
- Pentaho Data Integration
- HBase
- Pentaho Report Designer
- Loading Data into HBase guide completed
Sample Files
There are no sample files for this guide. The Loading Data into HBase Guide must be completed prior to starting this guide as it loads the sample HBase table data.
Step-By-Step Instructions
Setup
Start Hadoop if it is not already running.
Start HBase if it is not already running.
Create a HBase Parameter Picklist PDI Transformation
In this task you will create a PDI transformation to get a list of distinct IP Addresses from HBase. This transformation will later be used as the data source for a report parameter.
- Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
Speed Tip
You can download the Kettle Transform hbase_ip_list.ktr already completed
- Add a HBase Input Step: You are going to read data from a HBase table, so expand the 'Big Data' section of the Design palette and drag a 'HBase Input' node onto the transformation canvas. Your transformation should look like:
- Edit the HBase Input Step: Double-click on the 'HBase Input' node to edit its properties. Do the following:
- Zookeeper host(s) and Zookeeper port: Enter your Zookeeper connection information. For local single node clusters use 'localhost' and port '2181'.
- Create a Key Only Mapping: Switch to the 'Create/Edit mappings' tab and enter the following.
- HBase table name: Select 'weblogs'
- Mapping name: Enter 'key_only'
- Alias: Enter 'key'
- Key: Select 'Y'
- Type: Select 'String'
- Click Save Mapping
When you are done your window should look like this:
- Configure Query: Switch to the 'Configure Query' tab and do the following.
- HBase table name: Click the 'Get mapped table names' button, then select 'weblogs'.
- Mapping name: Click the 'Get mappings for the specified table' button, then select 'key_only'.
- Click the 'Get Key/Fields Info' button.
When you are done your window should look like:
Click 'OK' to close the window.
- Zookeeper host(s) and Zookeeper port: Enter your Zookeeper connection information. For local single node clusters use 'localhost' and port '2181'.
- Add a Split Fields Step: You need to split the key field currently formatted as client_ip|year into two separate fields, so expand the 'Transform' section of the Design palette and drag a 'Split Fields' node onto the transformation canvas. Your transformation should look like:
- Connect the Input and Split Fields steps: Hover the mouse over the 'HBase Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Split Fields' node. Your canvas should look like this:
- Edit the Split Fields Step: Double-click on the 'Split Fields' node to edit its properties. Do the following:
- Field to split: Select 'key'.
- Delimiter: Enter '|'
- Fields:
New field
Type
Client_ip
String
Year
Integer
When you are done your window should look like:
Click 'OK' to close the window.
- Add a Sort Rows Step: You need to sort the HBase data, so expand the 'Transform' section of the Design palette and drag a 'Sort rows' node onto the transformation canvas. Your transformation should look like:
- Connect the Split Fields and Sort steps: Hover the mouse over the 'Split Fields' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Sort rows' node. Your canvas should look like this:
- Edit the Sort Step: Double-click on the 'Sort rows' node to edit its properties. Enter this information:
- Check 'Only pass unique rows? (verifies keys only)'
- Fields: Add 'client_ip' sorted in ascending order.
When you are done your window should look like this:
Click 'OK' to close the window.
- Add a Dummy Step: You need a component for the report to select it's data from, so expand the 'Flow' section of the Design palette and drag a 'Dummy (do nothing)' node onto the transformation canvas. Your transformation should look like:
- Connect the Sort and Dummy steps: Hover the mouse over the 'Sort rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Dummy (do nothing)' node. Your canvas should look like this:
- Edit the Dummy Step: Double-click on the 'Dummy (do nothing)' node to edit its properties. Set the Step name to 'Output'. When you are done your window should look like:
Click 'OK' to close the window.
- Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'hbase_ip_list.ktr' into a folder of your choice.
Create a HBase Query PDI Transformation
In this task you will create a PDI transformation to get the pageviews per month for a particular IP address from HBase. This transformation will be used as the primary data source for the report.
- Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
Speed Tip
You can download the Kettle Transform hbase_to_report.ktr already completed
- Add a Parameter: Right click on the transformation and select 'Transformation settings'.
Then do the following:- Switch to the 'Parameters' tab.
- Parameter: 'paramIPAddress'
- Default Value: '127.0.0.1'
When you are done your window should look like:
Click 'OK' to close the window.
- Add a HBase Input Step: You are going to read data from a HBase table, so expand the 'Big Data' section of the Design palette and drag a 'HBase Input' node onto the transformation canvas. Your transformation should look like:
- Edit the HBase Input Step: Double-click on the 'HBase Input' node to edit its properties. Do the following:
- Zookeeper host(s) and Zookeeper port: Enter your Zookeeper connection information. For local single node clusters use 'localhost' and port '2181'.
- HBase table name: Click the 'Get mapped table names' button, then select 'weblogs'.
- Mapping name: Click the 'Get mappings for the specified table' button, then select 'pageviews'.
- Click the 'Get Key/Fields Info' button.
When you are done your window should look like:
Click 'OK' to close the window.
- Add a Split Fields Step: You need to split the key field currently formatted as client_ip|year into two separate fields, so expand the 'Transform' section of the Design palette and drag a 'Split Fields' node onto the transformation canvas. Your transformation should look like:
- Connect the Input and Split Fields steps: Hover the mouse over the 'HBase Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Split Fields' node. Your canvas should look like this:
- Edit the Split Fields Step: Double-click on the 'Split Fields' node to edit its properties. Do the following:
- Field to split: Select 'key'.
- Delimiter: Enter '|'
- Fields:
New field
Type
Client_ip
String
Year
Integer
When you are done your window should look like:
Click 'OK' to close the window.
- Add a Get Variables Step: You need to add the parameter you created earlier to your stream, so expand the 'Job' section of the Design palette and drag a 'Get Variables' node onto the transformation canvas. Your transformation should look like:
- Connect the Split Fields and Get Variables steps: Hover the mouse over the 'Split Fields' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Get Variables' node. Your canvas should look like this:
- Edit the Get Variables Step: Double-click on the 'Get Variables' node to edit its properties. Enter this information:
- Name: Enter 'selectedIP'
- Variable: Enter '${paramIPAddress}'
- Type: Select 'String'.
When you are done your window should look like this:
Click 'OK' to close the window.
- Add a Filter rows Step: You want to filter for only the rows that match the selected IP Address, so expand the 'Flow' section of the Design palette and drag a 'Filter rows' node onto the transformation canvas. Your transformation should look like:
- Connect the Get Variables and Filter Rows steps: Hover the mouse over the 'Get Variables' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Filter rows' node. Your canvas should look like this:
- Edit the Filter Rows Step: Double-click on the 'Filter Rows' node to edit its properties. Do the following:
- Click the <field> box to the left of the = and select 'client_ip'
- Click the <field> box to the right of the = and select 'selectedIP'
When you are done your window should look like this:
Click 'OK' to close the window.
- Add a Sort Rows Step: You want the rows in sorted order by year, so expand the 'Transform' section of the Design palette and drag a 'Sort rows' node onto the transformation canvas. Your transformation should look like:
- Connect the Filter Rows and Sort Rows steps: Hover the mouse over the 'Filter rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Sort rows' node. When you release the mouse and a tooltip appears select the 'Result is TRUE' option. Your canvas should look like this:
- Edit the Sort Rows Step: Double-click on the 'Sort rows' node to edit its properties. Do the following:
- Fieldname: Enter 'year'
- Ascending: Select 'Y'
When you are done your window should look like this:
Click 'OK' to close the window.
- Add two Dummy Steps: Expand the 'Flow' section of the Design palette and drag a 'Dummy (do nothing)' node onto the transformation canvas. Repeat to add a second 'Dummy (do nothing)' step. Your transformation should look like:
- Connect the Sort Rows and Dummy steps: Hover the mouse over the 'Sort rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the first 'Dummy (do nothing)' node. Your canvas should look like this:
- Edit the Dummy Step: Double-click on the 'Dummy (do nothing)' node you just connected to edit its properties. Change the Step name to 'Output'.
- Connect the Filter Rows and Dummy steps: Hover the mouse over the 'Filter rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the second 'Dummy (do nothing)' node. When you release the mouse and a tooltip appears select the 'Result is FALSE' option. Your canvas should look like this:
- Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'hbase_to_report.ktr' into a folder of your choice.
Create a Report
In this task you will create a report that uses PDI transformations that extract data from HBase as the source.
- Start Report Designer on your desktop. Once it is running choose 'File' -> 'Report Wizard' -> from the menu system.
Speed Tip
You can download the Report definition HBase to Report.prpt already completed
- Select a Template: Report Wizard will automatically layout and do some basic formatting of your report, so select a template of your choice in the select box. When you are done your screen should look like:
Click 'Next' to go to the next screen.
- Create a Data Source: You need to create the Hive query to select data for this report, so click the plus button and do the following:
- Choose Type 'Pentaho Data Integration'
- Create a New Query: Click the button to create a new query and do the following:
- Name: Enter 'Page Views'
- File: Select the hbase_to_report.ktr transformation you just created.
- Steps: Select 'Output'
When you are done your window should look like:
- Create a Second Query: Click the button to create a new query and do the following:
- Name: Enter 'IP List'
- File: Select the hbase_ip_list.ktr transformation you just created.
- Steps: Select 'Output'
When you are done your window should look like:
Click 'OK' to close the window.
- Highlight the 'Page Views' query in the Report Design Wizard window.
Your Report Design Wizard window should now look like:
Click 'Next' to go to the next screen.
- Choose Type 'Pentaho Data Integration'
- Set up Report Layout: You need to tell Report Design Wizard how to lay out your report, so do the following:
- Selected Items: Add the following in this order
- Year
- Jan
- Feb
- Mar
- Apr
- May
- Jun
- Jul
- Aug
- Sep
- Oct
- Nov
- Dec
When you are done your window should look like:
Click 'Finish' to complete to the design wizard.
- Selected Items: Add the following in this order
- Create a Parameter: You need to create a parameter so you will be able to select a specific IP address when running the report, so in the menu system select 'Data' -> 'Add Parameter' and do the following:
- Select the 'Pentaho Data Integration' data source.
- Name: Enter 'paramIPAddress'.
- Label: Enter 'IP Address'
- Value Type: Select 'String'
- Check 'Mandatory'
- Display Type: 'Drop Down'
- Query: Select 'IP List'
- Value: Select 'client_ip'
- Display Name: Select 'client_ip'
When you are done your window should look like:
Click 'OK' to close the window.
- Add Parameter to Query: You need to add the parameter you just created to your Page Views query, so in the 'Data' pane expand 'Pentaho Data Integration', right click on 'Page Views' and select 'Edit Query'.
Then do the following:- Click on the 'Edit Parameter' button.
- Under 'Transformation Parameter' click the .
- DataRow Column: Select 'paramIPAddress'
- Transfomration Parameter: Select 'paramIPAddress'
When you are done your window should look like:
Click 'OK' to close the window.
- Save the Report: Choose 'File' -> 'Save as...' from the menu system. Save the report as 'hbase_report.prpt' into a folder of your choice.
- Preview the Report: Choose 'File' -> 'Preview' -> 'Print Preview' from the menu system. A 'Print Preview' window will open. Select an 'IP Address' of your choice from the drop down. After a few seconds the report results will appear:
- Do any desired formatting to the report.
Summary
In this guide you learned how to construct a report that sources data from HBase.