This section contains a series of how-tos that demonstrate the integration between Pentaho and Hadoop using a sample weblog dataset. The how-tos are organized by topic, with each set explaining various techniques for loading, transforming, extracting and reporting on data within a Hadoop cluster. You are encouraged to perform the how-tos in order, as the output of one is sometimes used as the input of another. However, if you would like to jump to a how-to in the middle of the flow, instructions for preparing input data are provided.

Hadoop Topics

Pre-Requisites

In order to perform all of the how-tos in this section, you will need the following. Since not every how-to uses every component (e.g. HBase, Hive, Report Designer), specific component requirements will be identified within each how-to. This section enumerates all of the components with some additional configuration and installation tips.

Hadoop

A single-node local cluster is sufficient for these exercises, but a larger and/or remote configuration will also work. You will need to know the addresses and ports for Hadoop.

For an example of how to set up a single node, see: http://hadoop.apache.org/docs/stable/single_node_setup.html
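Before starting the how-tos, it can help to confirm that the Hadoop addresses and ports you noted are reachable from the machine running PDI. The following is a minimal sketch and not part of the original guide. The labels, host name, and port numbers in the usage section are placeholders (assumptions); replace them with the values from your own cluster configuration.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Map each (label, host, port) tuple to a reachability flag."""
    return {
        label: is_port_open(host, port, timeout)
        for label, host, port in endpoints
    }


if __name__ == "__main__":
    # Hypothetical values: substitute the addresses and ports of your cluster.
    endpoints = [
        ("HDFS NameNode", "localhost", 9000),
        ("JobTracker", "localhost", 9001),
    ]
    for label, reachable in check_endpoints(endpoints).items():
        print(f"{label}: {'reachable' if reachable else 'NOT reachable'}")
```

A successful TCP connection only shows that something is listening on the port; it does not verify that the service is a correctly configured Hadoop daemon.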

Pentaho Data Integration

PDI will be the primary development environment for the how-tos. You will need version 4.3 or above.

You can find instructions to install PDI for Hadoop in the Configure Pentaho for Cloudera and Other Hadoop Versions guide.
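When checking the "4.3 or above" requirement in a script, note that comparing version strings as plain text gives wrong answers (for example, "4.10" sorts before "4.3"). The sketch below compares versions numerically. It assumes purely numeric, dot-separated version strings and does not read the version from a PDI installation; how you obtain the installed version is left to you.

```python
def parse_version(text):
    """Turn a dotted version string such as '4.3.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum="4.3"):
    """Return True if the installed version is at least the minimum version."""
    have, need = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so '4.3' and '4.3.0' compare as equal.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need
```

Version strings with suffixes (for example a build qualifier) would need to be stripped before calling `parse_version`.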

Pentaho Hadoop Distribution

A Hadoop node distribution of the Pentaho Data Integration (PDI) tool. Pentaho Hadoop Distribution (referred to as PHD from this point on) allows you to execute Pentaho MapReduce jobs on the Hadoop cluster.

You can find instructions to download and install the software here: Configure Pentaho for Cloudera and Other Hadoop Versions

Pentaho Report Designer

A desktop installation of the Pentaho Report Designer tool, with the PDI jars in the lib directory.

You can find instructions to download and install Report Designer in the Configure Pentaho for Cloudera and Other Hadoop Versions guide.

Hive

A supported version of Hive. Hive is a Map/Reduce abstraction layer that provides SQL-like access to Hadoop data.

You can find a Hive Getting Started guide here: https://cwiki.apache.org/confluence/display/Hive/GettingStarted
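To make the idea of SQL-like access on top of Map/Reduce more concrete, the sketch below mimics in plain Python the map, shuffle, and reduce phases that a grouped count conceptually breaks down into. It is an illustration only: it does not use Hive or Hadoop, the query in the comment refers to a hypothetical table and column, and the rows in the usage section are made up.

```python
from collections import defaultdict

# Conceptual equivalent of a query such as (hypothetical table and column):
#   SELECT client_ip, COUNT(*) FROM weblogs GROUP BY client_ip;


def map_phase(rows, key_field):
    """Emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row[key_field], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by(rows, key_field):
    """Run the three phases in sequence and return a count per key."""
    return reduce_phase(shuffle_phase(map_phase(rows, key_field)))


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [
        {"client_ip": "10.0.0.1", "url": "/a"},
        {"client_ip": "10.0.0.2", "url": "/b"},
        {"client_ip": "10.0.0.1", "url": "/c"},
    ]
    print(count_by(rows, "client_ip"))
```

With Hive, you would write only the query; the breakdown into phases like these is handled for you on the cluster.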

HBase

A MapR supported version of HBase. HBase is a NoSQL database that leverages Hadoop storage.

Sample Data

The how-tos in this guide were built with sample weblog data. The following files are used and/or generated by the how-tos in this guide. Each specific how-to will explain which file(s) it requires.

File Name    ...
...          Unparsed, raw weblog ...
...          Tab-delimited, parsed weblog ...
...          Tab-delimited, aggregated weblog data for a Hive weblogs_agg ...
...          Tab-delimited, aggregated weblog ...
...          Prepared data for HBase load ...
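To illustrate the difference between an unparsed, raw weblog and its tab-delimited, parsed form, here is a minimal sketch that converts one raw line into a tab-delimited record. It assumes the raw file uses the Apache Common Log Format, which this guide does not confirm; the field names and the pattern are illustrative and may not match the actual sample files or the parsing done in the how-tos.

```python
import re

# Assumed layout: Apache Common Log Format. The real sample files may differ.
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)

# Hypothetical column order for the tab-delimited output.
FIELDS = ("host", "ident", "user", "time", "request", "status", "size")


def parse_line(line):
    """Return the fields of one raw log line as a dict, or None if no match."""
    match = CLF_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def to_tab_delimited(line):
    """Convert one raw log line to a tab-delimited record, or None."""
    fields = parse_line(line)
    if fields is None:
        return None
    return "\t".join(fields[name] for name in FIELDS)
```

Lines that do not match the assumed pattern return `None` rather than raising, so a caller can count or log rejected lines separately.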