
How to use a PDI transformation to extract data from HDFS and load it into an RDBMS table. The new RDBMS table will contain the count of page views by IP address and month.
NOTE: For brevity's sake, this transformation will contain only two steps: a Hadoop File Input and a Table Output. In practice, the full expressiveness of PDI transformations is available. Further, PDI supports bulk loading into many RDBMSs, which would be a viable, and common, alternative to using a Table Output step.

Prerequisites

In order to follow along with this how-to guide you will need the following:

  • Hadoop
  • Pentaho Data Integration
  • RDBMS database (e.g. MySQL)

Sample Files

The sample data file needed for this guide is weblogs_aggregate.txt: tab-delimited, aggregated weblog data.

NOTE: If you have already completed the "Using Pentaho MapReduce to Parse Weblog Data" guide, the necessary files will already be in the proper location.


This file should be placed into HDFS at /user/pdi/weblogs/aggregate_mr using the following commands:

Code Block
hadoop fs -mkdir /user/pdi/weblogs
hadoop fs -mkdir /user/pdi/weblogs/aggregate_mr
hadoop fs -put weblogs_aggregate.txt /user/pdi/weblogs/aggregate_mr/part-00000
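After loading, a quick sanity check with the standard hadoop fs commands confirms the file landed where the transformation expects it:

Code Block
# List the target directory to verify the upload
hadoop fs -ls /user/pdi/weblogs/aggregate_mr
# Peek at the first few tab-delimited lines
hadoop fs -cat /user/pdi/weblogs/aggregate_mr/part-00000 | head -5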

Step-By-Step Instructions

Setup

Start Hadoop if it is not already running.
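If you need to start it manually, the exact commands depend on your Hadoop version and distribution; a minimal sketch for a plain Apache Hadoop single-node setup:

Code Block
# Start the HDFS daemons (script location varies by Hadoop version)
start-dfs.sh
# Confirm HDFS responds before continuing
hadoop fs -ls /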

Create RDBMS Connection

Follow the 'Create RDBMS Connection' guide to create a database connection named 'RDBMS'.

Create a Transformation to Extract Data from HDFS

In this task you will create a transformation to extract data from an HDFS file and load it into an RDBMS table.

Speed Tip: You can download the already-completed Kettle transformation hdfs_to_rdbms.ktr.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.

  2. Add a Hadoop File Input Step: You are going to extract data from HDFS, so expand the 'Big Data' section of the Design palette and drag a 'Hadoop File Input' node onto the transformation canvas. Your transformation should look like:
    [screenshot]

  3. Edit the Input Step: Double-click on the 'Hadoop File Input' node to edit its properties. Do the following:
    1. File or directory: Enter hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/aggregate_mr

    2. Regular Expression: Enter 'part.*'
    3. Click 'Add' to add the file to the list of input files.
      When you are done your window should look like:
      [screenshot]

  4. Configure Input File Content: You need to tell Pentaho how the input file is formatted, so switch to the 'Content' tab and enter the following:
    1. Separator: Clear and press the 'Insert TAB' button.
    2. Uncheck 'Header'.
    3. Format: Select 'Unix'.
      When you are done your window should look like:
      [screenshot]

  5. Define the Fields: You need to tell Pentaho what fields are in the file, so switch to the 'Fields' tab and enter the following:

      Name         Type      Length
      client_ip    String    20
      year         Integer
      month_num    Integer
      pageviews    Integer

    When you are done your window should look like:
    [screenshot]
    Click 'OK' to close the window.

  6. Add a Table Output Step: You want to write the values to an RDBMS, so expand the 'Output' section of the Design palette and drag a 'Table Output' node onto the transformation canvas. Your transformation should look like:
    [screenshot]

  7. Connect the Input and Table Output Steps: Hover the mouse over the 'Hadoop File Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Table Output' node. Your canvas should look like this:
    [screenshot]

  8. Edit the Table Output Step: Double-click on the 'Table Output' node to edit its properties. Do the following:
    1. Connection: Select 'RDBMS'.
    2. Target Table: Enter 'aggregate_hdfs'.
    3. Check 'Truncate table' so you can re-run this transformation.
    4. Click the 'SQL' button to create the table in your target database (a sketch of typical generated DDL appears after these steps).
      [screenshot]
    5. Click the 'Execute' button to run the SQL.
      [screenshot]
      The 'Results of the SQL statements' window will appear, telling you whether the SQL executed successfully or giving you troubleshooting information if it failed.
    6. Click 'OK' to close the window.
    7. Click 'Close' to close the 'Simple SQL editor' window.
      When you are done your window should look like:
      [screenshot]
      Click 'OK' to close the window.

  9. Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'hdfs_to_rdbms.ktr' into a folder of your choice.

  10. Run the Transformation: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the transformation toolbar. An 'Execute a transformation' window will open; click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and show you the progress of the transformation as it runs. After several seconds the transformation should finish successfully:
    [screenshot]

If any errors occurred, the step that failed will be highlighted in red, and you can use the 'Logging' tab to view error messages.
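For reference, the DDL generated by the 'SQL' button in step 8 will look something like the sketch below. This assumes a MySQL target; the generated column types and sizes follow the field definitions in step 5 and may differ for your database:

Code Block
-- Sketch of generated DDL (MySQL assumed; actual output may differ)
CREATE TABLE aggregate_hdfs (
  client_ip VARCHAR(20),
  year INT,
  month_num INT,
  pageviews INT
);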

Check RDBMS for Aggregated Table

  1. Explore the Database: Choose 'Tools' -> 'Database' -> 'Explore' from the menu system.

  2. Select the Connection: In the 'Make your selection' window select 'RDBMS' and click 'OK'.
    [screenshot]

  3. Preview the Table: Expand RDBMS -> Tables. Right-click on 'aggregate_hdfs' and select 'Preview first 100'.
    [screenshot]
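You can also verify the load from any SQL client; a minimal query sketch, assuming the MySQL example from the prerequisites:

Code Block
-- Inspect the most-viewed IP addresses in the newly loaded table
SELECT client_ip, year, month_num, pageviews
FROM aggregate_hdfs
ORDER BY pageviews DESC
LIMIT 10;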

Summary

During this guide you learned how to create a PDI transformation that loads an RDBMS table with data sourced from an HDFS file.
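If you later need to run the saved transformation outside the Spoon GUI, PDI ships with the Pan command-line runner; a minimal sketch, assuming a Linux install and the path where you saved the file:

Code Block
# Run the transformation headlessly with Pan (bundled with PDI)
./pan.sh -file=/path/to/hdfs_to_rdbms.ktr -level=Basic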
