How to use a PDI transformation to extract data from HDFS and load it into an RDBMS table. The new RDBMS table will contain the count of page views by IP address and month.
Note: For brevity's sake, this transformation will only contain two steps: a Hadoop File Input and a Table Output. In practice, the full expressiveness of the PDI transformation semantic is available. Further, PDI supports bulk loading into many RDBMSs, and that would be a viable, and common, alternative to the Table Output approach.
Prerequisites
In order to follow along with this how-to guide you will need the following:
- Hadoop
- Pentaho Data Integration
- RDBMS Database (e.g. MySQL)
Sample Files
The sample data file needed for this guide is:
| File Name | Content |
|---|---|
| weblogs_aggregate.txt | Tab-delimited, aggregated weblog data |
NOTE: If you have already completed the "Using Pentaho MapReduce to Parse Weblog Data" guide, the necessary files will already be in the proper location.

This file should be placed into HDFS at /user/pdi/weblogs/aggregate_mr using the following commands.
```shell
hadoop fs -mkdir /user/pdi/weblogs
hadoop fs -mkdir /user/pdi/weblogs/aggregate_mr
hadoop fs -put weblogs_aggregate.txt /user/pdi/weblogs/aggregate_mr/part-00000
```
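Before putting the file into HDFS, it can help to sanity-check the expected layout. The sketch below uses hypothetical rows in a temporary directory (substitute the real weblogs_aggregate.txt); it also illustrates why the 'part.*' regular expression used later in this guide matches MapReduce output files while skipping marker files such as _SUCCESS:

```shell
# Hypothetical rows in the expected layout:
# client_ip <TAB> year <TAB> month_num <TAB> pageviews
mkdir -p /tmp/aggregate_mr
printf '10.0.0.1\t2012\t01\t42\n10.0.0.2\t2012\t01\t7\n' > /tmp/aggregate_mr/part-00000
touch /tmp/aggregate_mr/_SUCCESS

# Every data row should have exactly four tab-separated fields
awk -F'\t' 'NF != 4 { bad++ } END { print bad+0, "malformed rows" }' /tmp/aggregate_mr/part-00000
# prints: 0 malformed rows

# Only the part-* files (not _SUCCESS) match the 'part.*' pattern
ls /tmp/aggregate_mr | grep -E '^part.*'
# prints: part-00000
```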
Step-By-Step Instructions
Setup
Start Hadoop if it is not already running.
Create a Transformation to Extract Data from HDFS
In this task you will create a transformation to extract data from an HDFS file and load it into an RDBMS table.
Tip: You can download the Kettle transformation hdfs_to_rdbms.ktr already completed.
- Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
- Add a Hadoop File Input Step: You are going to extract data from HDFS, so expand the 'Big Data' section of the Design palette and drag a 'Hadoop File Input' node onto the transformation canvas. Your transformation should look like:
- Edit the Input Step: Double-click on the 'Hadoop File Input' node to edit its properties. Do the following:
- File or directory: Enter hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/aggregate_mr
- Regular Expression: Enter 'part.*'
- Click 'Add' to add the file to the list of input files. When you are done your window should look like:
- Configure Input File Content: You need to tell Pentaho how the input file is formatted, so switch to the 'Content' tab and enter the following:
- Separator: Clear and press the 'Insert TAB' button.
- Uncheck 'Header'
- Format: Select 'Unix'
When you are done your window should look like:
- Define the Fields: You need to tell Pentaho what fields are in the file, so switch to the 'Fields' tab and enter the following:
| Name | Type | Length |
|---|---|---|
| client_ip | String | 20 |
| year | Integer | |
| month_num | Integer | |
| pageviews | Integer | |
When you are done your window should look like:
Click 'OK' to close the window.
- Add a Table Output Step: You want to write the values to an RDBMS, so expand the 'Output' section of the Design palette and drag a 'Table Output' node onto the transformation canvas. Your transformation should look like:
- Connect the Input and Table Output steps: Hover the mouse over the 'Hadoop File Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Table Output' node. Your canvas should look like this:
- Edit the Table Output Step: Double-click on the 'Table Output' node to edit its properties. Do the following:
- Connection: Select 'RDBMS'
- Target Table: Enter 'aggregate_hdfs'
- Check 'Truncate table' so you can re-run this transformation.
- Click the 'SQL' button to create the table in your target database.
- Click the 'Execute' button to run the SQL.
The 'Results of the SQL statements' window will appear, telling you whether the SQL was successfully executed or giving you troubleshooting information if the SQL failed.
- Click 'OK' to close the window.
- Click 'Close' to close the 'Simple SQL editor' window.
When you are done your window should look like:
Click 'OK' to close the window.
- Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'hdfs_to_rdbms.ktr' into a folder of your choice.
- Run the Transformation: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the transformation toolbar. An 'Execute a transformation' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and it will show you the progress of the transformation as it runs. After several seconds the transformation should finish successfully:
If any errors occurred, the step that failed will be highlighted in red and you can use the 'Logging' tab to view error messages.
Check RDBMS for Aggregated Table
- Explore the Database: Choose 'Tools' -> 'Database' -> 'Explore' from the menu system.
- Select the Connection: In the 'Make your selection' window select 'RDBMS' and click 'OK'.
- Preview the Table: Expand RDBMS -> Tables. Right click on 'aggregate_hdfs' and select 'Preview first 100'
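As a cross-check on the preview, the expected contents of the table can be reproduced straight from the input file outside the database. A minimal sketch with hypothetical rows (the real data lives in /user/pdi/weblogs/aggregate_mr), totalling page views per client_ip and month the way the aggregate_hdfs table presents them:

```shell
# Hypothetical input rows: client_ip, year, month_num, pageviews (tab-separated)
printf '10.0.0.1\t2012\t01\t42\n10.0.0.1\t2012\t01\t8\n10.0.0.2\t2012\t02\t5\n' > /tmp/agg_check.txt

# Sum page views by IP address and month, mirroring the table's contents
awk -F'\t' '{ sum[$1 "\t" $3] += $4 } END { for (k in sum) print k "\t" sum[k] }' /tmp/agg_check.txt | sort
```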
Summary
During this guide you learned how to create a PDI transformation to load an RDBMS table with data sourced from an HDFS file.