How to use a PDI transformation that sources data from a flat file and writes to an HBase table.

Note: For brevity's sake, we will use a prepared dataset and a simple transform. In practice, you can and will use the full power of PDI's transformation semantics to transform and prepare your data for HBase loads.

The data we will be loading contains pageview counts by IP Address, Year and Month. The HBase table's key will be a concatenation of IP Address and Year. One column will exist per month in the year, containing the pageview count for the key.
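
To make this table layout concrete, here is a minimal Python sketch of how pageview records could be pivoted into one row per key. The sample records, the `|` separator between IP address and year, and the lowercase month column names are illustrative assumptions; the guide itself does not specify them.

```python
from collections import defaultdict

# Assumed month column names; the real sample file may use different labels.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def build_rows(records, separator="|"):
    """Pivot (ip, year, month, pageviews) records into one row per key.

    The row key is the IP address concatenated with the year. The separator
    is an assumption made for readability, not something the guide defines.
    """
    rows = defaultdict(dict)
    for ip, year, month, pageviews in records:
        key = f"{ip}{separator}{year}"
        column = MONTHS[month - 1]
        # Sum counts in case the same key and month appear more than once.
        rows[key][column] = rows[key].get(column, 0) + pageviews
    return dict(rows)


# Hypothetical records: (ip address, year, month number, pageview count).
sample = [
    ("10.0.0.1", 2012, 1, 5),
    ("10.0.0.1", 2012, 2, 7),
    ("10.0.0.1", 2011, 12, 3),
    ("10.0.0.2", 2012, 1, 1),
]

rows = build_rows(sample)
for key in sorted(rows):
    print(key, rows[key])
```

Each distinct combination of IP address and year becomes one row, and a month only gets a column in that row if there is a count for it.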

Prerequisites

In order to follow along with this how-to guide you will need the following:

  • MapR
  • Pentaho Data Integration
  • HBase

Sample Files

The sample data file needed for this guide is:

File Name                Content
weblogs_hbase.txt.zip    Prepared data for HBase load


Step-By-Step Instructions

Setup

Start MapR if it is not already running.

Create an HBase Table

  1. Open the HBase Shell: Open the HBase shell by entering 'hbase shell' at the command line.
  2. Create the Table in HBase: Enter the following in the HBase shell.
    Code Block
    create 'weblogs', 'pageviews'

    This creates the weblogs table with a single column family named pageviews.
  3. Close the HBase Shell: You are done with the HBase Shell for now, so close it by entering 'quit'.


Create a Transformation to Load Data into HBase

In this task you will create a transformation that loads the sample file into the HBase table.

Speed Tip: You can download the completed Kettle transformation, load_hbase.ktr.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.

  2. Add an Input Step: You need to tell PDI where to get the data from, so expand the 'Input' section of the Design palette and drag a 'Text File Input' step onto the transformation canvas.

    Image Added

  3. Edit the Input Step: Double-click on the 'Text File Input' node to edit its properties. Enter this information:
    1. File or Directory: Browse to the weblogs_hbase.txt file.
    2. Click the 'Add' button.
      When you are done your window should look like:
      Image Added

  4. Configure File Content: Switch to the 'Content' tab and do the following:
    1. Separator: Clear the field and click the 'Insert TAB' button.
    2. Header: Check the 'Header' checkbox.
    3. Format: Select 'Unix'
      When you are done your window should look like:
      Image Added

  5. Configure the Input Fields: Switch to the 'Fields' tab and do the following:
    1. Click the 'Get Fields' button.
    2. When prompted for 'Number of sample lines', use 100 and click 'OK'.
    3. Change the 'Type' for the 'key' field to 'String' and the 'Length' to 20.
      When you are done your window should look like:
      Image Added
      Click 'OK' to close the window.

  6. Add an HBase Output Step: You are going to store your data in HBase, so expand the 'Big Data' section of the Design palette and drag an 'HBase Output' node onto the transformation canvas. Your transformation should look like:
    Image Added

  7. Connect the Input and Output steps: Hover the mouse over the 'Text file input' node and a tooltip will appear.
    Image Added
    Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'HBase Output' node. Your canvas should look like:
    Image Added

  8. Edit the HBase Output Step: Double-click on the 'HBase Output' node to edit its properties. Enter this information:
    1. Zookeeper host(s): A comma-separated list of your HBase Zookeeper hosts. For local single-node clusters use 'localhost'.
    2. Zookeeper port: The port for your Zookeeper hosts. By default this is '5181'.
      When you are done your window should look like:
      Image Added

  9. Create an HBase Mapping: You need to tell Pentaho how to store the data in HBase, so switch to the 'Create/Edit mappings' tab and do the following:
    1. HBase table name: Select 'weblogs'.
    2. Mapping name: Enter 'pageviews'.
    3. Click the 'Get incoming fields' button.
    4. For the alias 'key', change the 'Key' column to 'Y', empty the 'Column family' and 'Column name' fields, and set the 'Type' field to 'String'.
    5. Click the 'Save mapping' button.
      When you are done your window should look like:
      Image Added

  10. Finish Configuring the Connection: You need to tell the HBase output to use the mapping you just created, so switch back to the 'Configure connection' tab and do the following:
    1. Click the 'Get table names' button.
    2. HBase table name: Select 'weblogs'.
    3. Click the 'Get mappings for the specified table' button.
    4. Mapping name: Select 'pageviews'.
      When you are done your window should look like:
      Image Added
      Click 'OK' to close the window.

  11. Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'load_hbase.ktr' into a folder of your choice.

  12. Run the Transformation: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the transformation toolbar. An 'Execute a transformation' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and show you the progress of the transformation as it runs. After several seconds the transformation should finish successfully:
    Image Added

If any errors occurred, the transformation step that failed will be highlighted in red and you can use the 'Logging' tab to view error messages.
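
For readers who want to see the logic of the transformation outside the PDI interface, the following Python sketch mimics it on in-memory data: it reads tab-separated text with a header row, treats the 'key' field as a string row key, and maps every other field to a column in the 'pageviews' family. The sample lines and the month field names are hypothetical, and the sketch assumes the non-key fields keep their field names as column names. It does not connect to HBase; it only shows the shape of the data that would be written.

```python
import csv
import io

COLUMN_FAMILY = "pageviews"  # column family created earlier in this guide


def rows_to_puts(text):
    """Turn tab-separated text with a header row into (row key, cells) pairs.

    The 'key' field becomes the row key as a string. Every other field is
    placed under the 'pageviews' column family, using the field name as the
    column name (an assumption about how the mapping is filled in).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    puts = []
    for record in reader:
        row_key = str(record.pop("key"))
        cells = {f"{COLUMN_FAMILY}:{name}": value
                 for name, value in record.items()}
        puts.append((row_key, cells))
    return puts


# Hypothetical two-row extract; the real file's columns may differ.
sample_text = "key\tjan\tfeb\n10.0.0.1|2012\t5\t7\n10.0.0.2|2012\t1\t0\n"

puts = rows_to_puts(sample_text)
for row_key, cells in puts:
    print(row_key, cells)
```

The key field carries no column family or column name because it identifies the row itself, which is why the mapping step empties those two fields for the 'key' alias.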

Check HBase

  1. Open the HBase Shell: Open the HBase shell so you can check that your table loaded by entering 'hbase shell' at the command line.
  2. Scan the Table: You want to scan the table to ensure data loaded, so run the following command.
    Code Block
    scan 'weblogs'
  3. Close the HBase Shell: You are done with the HBase Shell for now, so close it by entering 'quit' in the HBase Shell.
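
One way to sanity-check the load is to compare the row count that the HBase shell reports at the end of the scan with the number of distinct keys in the input file. The Python sketch below counts distinct keys; the sample lines are hypothetical, and for a real check you would pass an open file handle for weblogs_hbase.txt instead. It assumes the key field is named 'key', as in the 'Fields' tab above.

```python
import csv
import io


def count_distinct_keys(lines):
    """Count distinct values of the 'key' field in tab-separated input."""
    reader = csv.DictReader(lines, delimiter="\t")
    return len({record["key"] for record in reader})


# Hypothetical extract with one repeated key; for the real check, pass an
# open file handle for weblogs_hbase.txt instead of this StringIO object.
sample_lines = io.StringIO(
    "key\tjan\n"
    "10.0.0.1|2012\t5\n"
    "10.0.0.2|2012\t1\n"
    "10.0.0.1|2012\t9\n"
)

expected_rows = count_distinct_keys(sample_lines)
print(expected_rows)
```

Distinct keys are counted rather than lines because writing the same row key twice updates the existing row in HBase instead of adding a new one.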

Summary

In this guide you learned how to load data into HBase using PDI. Other guides in this series cover how to get data out of HBase and how to report on data in HBase.
