How to use a PDI transformation that sources data from a flat file and writes to an HBase table.

Note: For brevity's sake, we will use a prepared dataset and a simple transformation. In practice, you would use the full power of PDI's transformation functionality to transform and prepare your data for HBase loads.

The data you will be loading contains pageview counts by IP Address, Year, and Month. The HBase table's key will be a concatenation of IP Address and Year. One column will exist per month, containing that month's pageview count for the key.
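To make the target layout concrete, here is a sketch of how one such row could be written from the HBase shell. The key format, column names, and counts are purely illustrative; the real rows come from the sample file loaded by the transformation.

    # Hypothetical row: key = IP Address + Year, one column per month in the 'pageviews' family
    put 'weblogs', '10.0.0.1 2012', 'pageviews:01', '526'
    put 'weblogs', '10.0.0.1 2012', 'pageviews:02', '489'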

Prerequisites

In order to follow along with this how-to guide you will need the following:

  • Hadoop
  • Pentaho Data Integration
  • HBase

Sample Files

The sample data file needed for this guide is:

| File Name | Content |
| weblogs_hbase.txt.zip | Prepared data for HBase load |

Step-By-Step Instructions

Setup

Start Hadoop if it is not already running.
Start HBase if it is not already running.

Create an HBase Table

  1. Open the HBase Shell: Connect to the HBase shell by entering 'hbase shell' in an SSH terminal on the HBase server.
  2. Create the Table in HBase: Enter the following in the HBase shell.

    create 'weblogs', 'pageviews'

    This creates the weblogs table with a single column family named pageviews (a quick verification sketch follows this list).
  3. Close the HBase Shell: Type "quit" to exit the HBase shell.
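As an optional check before quitting the shell, the standard HBase shell commands below confirm the table exists and show its column families (exact output varies by HBase version):

    list 'weblogs'
    describe 'weblogs'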


Create a Transformation to Load Data into HBase

In this task you will load a file into HBase.

Speed Tip: Downloading the Kettle transformation load_hbase.ktr will save time, as it is already configured to load the HBase data.

  1. Start PDI on your desktop. Once it is running, choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
  2. Add an Input Step: Expand the 'Input' section of the Design palette and drag a 'Text File Input' step onto the transformation canvas.

    [screenshot]
  3. Edit the Input Step: Double-click on the 'Text File Input' node to edit its properties. Enter this information:
    1. File or Directory: Browse to the weblogs_hbase.txt file.
    2. Click the 'Add' button.
      When you are done your window should look like this (click to enlarge):
      [screenshot]
  4. Configure File Content: Switch to the 'Content' tab and do the following:
    1. Separator: Clear the field, then click the 'Insert TAB' button.
    2. Header: Check the 'Header' checkbox.
    3. Format: Select 'Unix'.
      When you are done your window should look like this (click to enlarge):
      [screenshot]
  5. Configure the Input Fields: Switch to the 'Fields' tab and do the following:
    1. Click the 'Get Fields' button.
    2. When prompted for 'Number of sample lines', use 100 and click 'OK'.
    3. Change the 'Type' for the 'key' field to 'String' and the 'Length' to 20.
      When you are done your window should look like this (click to enlarge):
      [screenshot]
      Click 'OK' to close the window.
  6. Add an HBase Output Step: You are going to store your data in HBase, so expand the 'Big Data' section of the Design palette and drag an 'HBase Output' node onto the transformation canvas. Your transformation should look like this (click to enlarge):
    [screenshot]
  7. Connect the Input and Output Steps (if they are not already connected): Hover the mouse over the 'Text file input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'HBase Output' node. Your canvas should look like this:
    [screenshot]
  8. Edit the HBase Output Step: Double-click on the 'HBase Output' node to edit its properties.
    1. Select the cluster in the drop-down menu.
    2. Click 'Get table names' and select 'weblogs' from the drop-down.
      Note: If the table/mapping names are not present (the drop-down is empty), you will need to create them under the 'Create/Edit mappings' tab and save the mapping. It will then appear in the drop-down menu.
      [screenshot]


  9. Create an HBase Mapping: You need to tell Pentaho how to store the data in HBase, so switch to the 'Create/Edit mappings' tab and do the following:
    1. HBase table name: Select 'weblogs'.
    2. Mapping name: Enter 'pageviews'.
    3. Click the 'Get incoming fields' button.
    4. For the alias 'key', change the 'Key' column to 'Y', empty the 'Column family' and 'Column name' fields, and set the 'Type' field to 'String'.
    5. Click the 'Save mapping' button.
      When you are done your window should look like this:
      [screenshot]
  10. Finish Configuring the Connection: You need to tell the HBase output to use the mapping you just created, so switch back to the 'Configure connection' tab and do the following:
    1. Click the 'Get table names' button.
    2. HBase table name: Select 'weblogs'.
    3. Click the 'Get mappings for the specified table' button.
    4. Mapping name: Select 'pageviews'.
      When you are done your window should look like this:
      [screenshot]
      Click 'OK' to close the window.
  11. Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'load_hbase.ktr' into a folder of your choice.
  12. Run the Transformation: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the transformation toolbar. An 'Execute a transformation' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and show you the progress of the transformation as it runs. After several seconds the transformation should finish successfully:
    [screenshot]

If any errors occurred, the step that failed will be highlighted in red and you can use the 'Logging' tab to view error messages.

Check HBase

  1. Open the HBase Shell: Open the HBase shell by entering 'hbase shell' at the command line so you can check that your table loaded.
  2. Scan the Table: You want to scan the table to ensure data loaded, so run the following command.

    scan 'weblogs', {LIMIT => 10}

  3. Close the HBase Shell: You are done with the HBase Shell for now, so close it by entering 'quit' in the HBase Shell (an optional row-count check is sketched after this list).
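As an optional extra check before you quit, the HBase shell's count command returns the total number of rows in the table; note that it can be slow on very large tables:

    count 'weblogs'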

Summary

During this guide you learned how to load data into HBase using PDI. Other guides in this series cover how to get data out of HBase and how to report on data in HBase.
