How to use a PDI transformation to extract data from HDFS and load it into an RDBMS table. The new RDBMS table will contain the count of page views by IP address and month.
Note: For brevity's sake, this transformation will only contain two steps: a Hadoop File Input and a Table Output. In practice, the full expressiveness of the PDI transformation semantic is available. Further, PDI supports bulk loading into many RDBMSs, and that would be a viable, and common, alternative to the Table Output approach.
Prerequisites
In order to follow along with this how-to guide you will need the following:
- Hadoop
- Pentaho Data Integration
- RDBMS Database (e.g. MySQL)
Sample Files
The sample data file needed for this guide is:
| File Name | Content |
|---|---|
| weblogs_aggregate.txt | Tab-delimited, aggregated weblog data |
NOTE: If you have already completed the "Using Pentaho MapReduce to Parse Weblog Data" guide, the necessary files will already be in the proper location.

This file should be placed into HDFS at /user/pdi/weblogs/aggregate_mr using the following commands.
```shell
hadoop fs -mkdir /user/pdi/weblogs
hadoop fs -mkdir /user/pdi/weblogs/aggregate_mr
hadoop fs -put weblogs_aggregate.txt /user/pdi/weblogs/aggregate_mr/part-00000
```
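Before putting the file into HDFS, it can help to sanity-check the expected layout. The sketch below uses hypothetical rows in a temporary directory (substitute the real weblogs_aggregate.txt); it also illustrates why the 'part.*' regular expression used later in this guide matches MapReduce output files while skipping marker files such as _SUCCESS:

```shell
# Hypothetical rows in the expected layout:
# client_ip <TAB> year <TAB> month_num <TAB> pageviews
mkdir -p /tmp/aggregate_mr
printf '10.0.0.1\t2012\t01\t42\n10.0.0.2\t2012\t01\t7\n' > /tmp/aggregate_mr/part-00000
touch /tmp/aggregate_mr/_SUCCESS

# Every data row should have exactly four tab-separated fields
awk -F'\t' 'NF != 4 { bad++ } END { print bad+0, "malformed rows" }' /tmp/aggregate_mr/part-00000
# prints: 0 malformed rows

# Only the part-* files (not _SUCCESS) match the 'part.*' pattern
ls /tmp/aggregate_mr | grep -E '^part.*'
# prints: part-00000
```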
Step-By-Step Instructions
Setup
Start Hadoop if it is not already running.
Create a Transformation to Extract Data from HDFS
In this task you will create a transformation to extract data from an HDFS file and load it into an RDBMS table.
Tip: You can download the Kettle transformation hdfs_to_rdbms.ktr already completed.
- Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
- Add a Hadoop File Input Step: You are going to extract data from HDFS, so expand the 'Big Data' section of the Design palette and drag a 'Hadoop File Input' node onto the transformation canvas. Your transformation should look like:
- Edit the Input Step: Double-click on the 'Hadoop File Input' node to edit its properties. Do the following:
- File or directory: Enter hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/aggregate_mr
- Regular Expression: Enter 'part.*'
- Click 'Add' to add the file to the list of input files. When you are done your window should look like:
- Configure Input File Content: You need to tell Pentaho how the input file is formatted, so switch to the 'Content' tab and enter the following:
- Separator: Clear and press the 'Insert TAB' button.
- Uncheck 'Header'
- Format: Select 'Unix'
When you are done your window should look like:
- Define the Fields: You need to tell Pentaho what fields are in the file, so switch to the 'Fields' tab and enter the following:
| Name | Type | Length |
|---|---|---|
| client_ip | String | 20 |
| year | Integer | |
| month_num | Integer | |
| pageviews | Integer | |
When you are done your window should look like:
Click 'OK' to close the window.
- Add a Table Output Step: You want to write the values to an RDBMS, so expand the 'Output' section of the Design palette and drag a 'Table Output' node onto the transformation canvas. Your transformation should look like:
- Connect the Input and Table Output steps: Hover the mouse over the 'Hadoop File Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Table Output' node. Your canvas should look like this:
- Edit the Table Output Step: Double-click on the 'Table Output' node to edit its properties. Do the following:
- Connection: Select 'RDBMS'
- Target Table: Enter 'aggregate_hdfs'
- Check 'Truncate table' so you can re-run this transformation.
- Click the 'SQL' button to create the table in your target database.
- Click the 'Execute' button to run the SQL.
The 'Results of the SQL statements' window will appear, telling you whether the SQL was successfully executed or giving you troubleshooting information if the SQL failed.
- Click 'OK' to close the window.
- Click 'Close' to close the 'Simple SQL editor' window.
When you are done your window should look like:
Click 'OK' to close the window.
- Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'hdfs_to_rdbms.ktr' into a folder of your choice.
- Run the Transformation: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the transformation toolbar. An 'Execute a transformation' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and it will show you the progress of the transformation as it runs. After several seconds the transformation should finish successfully:
If any errors occurred, the step that failed will be highlighted in red and you can use the 'Logging' tab to view error messages.
Check RDBMS for Aggregated Table
- Explore the Database: Choose 'Tools' -> 'Database' -> 'Explore' from the menu system.
- Select the Connection: In the 'Make your selection' window select 'RDBMS' and click 'OK'.
- Preview the Table: Expand RDBMS -> Tables. Right click on 'aggregate_hdfs' and select 'Preview first 100'
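As a cross-check on the preview, the expected contents of the table can be reproduced straight from the input file outside the database. A minimal sketch with hypothetical rows (the real data lives in /user/pdi/weblogs/aggregate_mr), totalling page views per client_ip and month the way the aggregate_hdfs table presents them:

```shell
# Hypothetical input rows: client_ip, year, month_num, pageviews (tab-separated)
printf '10.0.0.1\t2012\t01\t42\n10.0.0.1\t2012\t01\t8\n10.0.0.2\t2012\t02\t5\n' > /tmp/agg_check.txt

# Sum page views by IP address and month, mirroring the table's contents
awk -F'\t' '{ sum[$1 "\t" $3] += $4 } END { for (k in sum) print k "\t" sum[k] }' /tmp/agg_check.txt | sort
```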
Summary
During this guide you learned how to create a PDI transformation to load an RDBMS table with data sourced from an HDFS file.