This section contains a series of how-tos that demonstrate the integration between Pentaho and Hadoop using a sample weblog dataset. The how-tos are organized by topic, with each set explaining various techniques for loading, transforming, extracting and reporting on data within a Hadoop cluster. You are encouraged to perform the how-tos in order, as the output of one is sometimes used as the input of another. However, if you would like to jump to a how-to in the middle of the flow, instructions for preparing input data are provided.

Pre-Requisites

In order to perform all of the how-tos in this section, you will need the following. Since not every how-to uses every component (e.g. HBase, Hive, Report Designer), specific component requirements will be identified within each how-to. This section enumerates all of the components with some additional configuration and installation tips.

Hadoop

A single-node local cluster is sufficient for these exercises, but a larger and/or remote configuration will also work. You will need to know the addresses and ports for Hadoop.

A nice example of how to set up a single node: http://hadoop.apache.org/docs/stable/single_node_setup.html
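Before starting the how-tos, it can help to confirm that the Hadoop addresses and ports you collected are reachable from your workstation. The following is a minimal sketch, not part of the Pentaho tooling: the helper name `is_port_open`, the service labels, and the host and port values are illustrative assumptions, so replace them with the values from your own cluster configuration.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical endpoints: the actual hosts and ports depend on how your
    # cluster is configured, so treat these values as placeholders.
    endpoints = {
        "HDFS NameNode": ("localhost", 8020),
        "HiveServer": ("localhost", 10000),
    }
    for name, (host, port) in endpoints.items():
        status = "reachable" if is_port_open(host, port) else "NOT reachable"
        print(f"{name} at {host}:{port} is {status}")
```

A successful TCP connection only shows that something is listening on that port; it does not verify that the service is the expected Hadoop component or that it is correctly configured.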

Pentaho Data Integration

PDI will be the primary development environment for the how-tos. You will need version 4.3 or above.

You can find instructions to install PDI for Hadoop in the ... guide.

Pentaho Hadoop Distribution

A Hadoop node distribution of the Pentaho Data Integration (PDI) tool. Pentaho Hadoop Distribution (referred to as PHD from this point on) allows you to execute Pentaho MapReduce jobs on the Hadoop cluster.

You can find instructions to download and install the software here: Configure Pentaho for Cloudera and Other Hadoop Versions

Pentaho Report Designer

A desktop installation of the Pentaho Report Designer tool, with the PDI jars in the lib directory.

You can find instructions to download and install Report Designer in the Configure Pentaho for Cloudera and Other Hadoop Versions guide.

Hive

A supported version of Hive. Hive is a Map/Reduce abstraction layer that provides SQL-like access to Hadoop data.

You can find a Hive Getting Started guide here: https://cwiki.apache.org/confluence/display/Hive/GettingStarted

HBase

A MapR supported version of HBase. HBase is a NoSQL database that leverages Hadoop storage.

Sample Data

The how-tos in this guide were built with sample weblog data. The following files are used and/or generated by the how-tos in this guide. Each specific how-to will explain which file(s) it requires.

File Name                   Content
weblogs_rebuild.txt.zip     Unparsed, raw weblog data
weblogs_parse.txt.zip       Tab-delimited, parsed weblog data
weblogs_hive.txt.zip        Tab-delimited, aggregated weblog data for a Hive weblogs_agg table
weblogs_aggregate.txt.zip   Tab-delimited, aggregated weblog data
weblogs_hbase.txt.zip       Prepared data for HBase load
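To get a feel for one of the tab-delimited sample files before loading it into Hadoop, you can preview its first few rows locally. The sketch below is only an illustration: the helper name `preview_tab_delimited_zip` is hypothetical, and it assumes the archive contains a single UTF-8 text file, which this page does not confirm. The column layout is not documented here, so rows are returned as plain lists of strings.

```python
import csv
import io
import itertools
import os
import zipfile


def preview_tab_delimited_zip(zip_path, max_rows=5):
    """Return up to max_rows rows from the first file inside a zip archive.

    Each row is a list of strings split on tab characters.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive holds one text file; take the first entry.
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace", newline="")
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            return list(itertools.islice(reader, max_rows))


if __name__ == "__main__":
    sample = "weblogs_parse.txt.zip"  # one of the files listed in the table above
    if os.path.exists(sample):
        for row in preview_tab_delimited_zip(sample):
            print(row)
    else:
        print(f"{sample} not found in the current directory")
```

The same helper should work for the other tab-delimited archives in the table; weblogs_rebuild.txt.zip holds unparsed data, so each of its rows will most likely come back as a single field.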
