This section contains a series of how-tos that demonstrate the integration between Pentaho and Hadoop using a sample weblog dataset. The how-tos are organized by topic, with each set explaining various techniques for loading, transforming, extracting and reporting on data within a Hadoop cluster. You are encouraged to perform the how-tos in order, as the output of one is sometimes used as the input of another. However, if you would like to jump to a how-to in the middle of the flow, instructions for preparing input data are provided.
Hadoop Topics
Pre-Requisites
To perform all of the how-tos in this section, you will need the components listed below. Since not every how-to uses every component (e.g. HBase, Hive, Report Designer), specific component requirements will be identified within each how-to. This section enumerates all of the components with some additional configuration and installation tips.
Hadoop
A single-node local cluster is sufficient for these exercises, but a larger and/or remote configuration will also work. You will need to know the addresses and ports for Hadoop.
For an example of how to set up a single node, see: http://hadoop.apache.org/docs/stable/single_node_setup.html
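Before starting the how-tos, it can help to confirm that the Hadoop addresses and ports you noted are actually reachable from the machine running PDI. The following is a minimal sketch in Python, not part of the Pentaho tooling. The helper name `is_port_open` and the endpoint values (`localhost`, ports 9000 and 9001) are assumptions for illustration only; substitute the values from your own cluster configuration.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


if __name__ == "__main__":
    # Hypothetical endpoints: replace with the addresses and ports of your cluster.
    endpoints = {
        "HDFS": ("localhost", 9000),
        "MapReduce": ("localhost", 9001),
    }
    for name, (host, port) in endpoints.items():
        status = "reachable" if is_port_open(host, port) else "NOT reachable"
        print(f"{name} at {host}:{port} is {status}")
```

A successful TCP connection only shows that something is listening on the port; it does not verify that the service is a correctly configured Hadoop daemon.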
Pentaho Data Integration
PDI will be the primary development environment for the how-tos. You will need version 4.3 or above.
You can find instructions to install PDI for Hadoop in the ... guide.
Pentaho Hadoop Distribution
A Hadoop node distribution of the Pentaho Data Integration (PDI) tool. Pentaho Hadoop Distribution (referred to as PHD from this point on) allows you to execute Pentaho MapReduce jobs on the Hadoop cluster.
You can find instructions to download and install the software here: ...
Pentaho Report Designer
A desktop installation of the Pentaho Report Designer tool, with the PDI jars in the lib directory.
You can find instructions to download and install Report Designer in the Configure Pentaho for Cloudera and Other Hadoop Versions guide.
Hive
A supported version of Hive. Hive is a Map/Reduce abstraction layer that provides SQL-like access to Hadoop data.
You can find a Hive Getting Started guide here: https://cwiki.apache.org/confluence/display/Hive/GettingStarted
HBase
A MapR supported version of HBase. HBase is a NoSQL database that leverages Hadoop storage.
Sample Data
The how-tos in this guide were built with sample weblog data. The following files are used and/or generated by the how-tos in this guide. Each specific how-to will explain which file(s) it requires.
File Name | Content |
---|---|
... | Unparsed, raw weblog ... |
... | Tab-delimited, parsed weblog data |
... | Tab-delimited, aggregated weblog data for a Hive weblogs_agg ... |
... | Tab-delimited, aggregated weblog ... |
... | Prepared data for HBase load |
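To make the relationship between the parsed and aggregated files more concrete, here is a small Python sketch of the kind of aggregation involved. It is not how the how-tos produce their files (they use PDI and Hadoop), and the column layout is an assumption: the sketch supposes that the parsed file has the client IP, year, and month in the first three tab-delimited columns. The function names `aggregate_pageviews` and `to_tab_delimited` are hypothetical.

```python
from collections import Counter


def aggregate_pageviews(lines, ip_col=0, year_col=1, month_col=2):
    """Count rows per (client_ip, year, month) in tab-delimited input.

    The column positions are assumptions; adjust them to the real file layout.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        counts[(fields[ip_col], fields[year_col], fields[month_col])] += 1
    return counts


def to_tab_delimited(counts):
    """Render the counts as sorted, tab-delimited lines: ip, year, month, count."""
    rows = []
    for (ip, year, month), n in sorted(counts.items()):
        rows.append("\t".join([ip, year, month, str(n)]))
    return "".join(row + "\n" for row in rows)


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [
        "1.2.3.4\t2012\t01\t/index.html\n",
        "1.2.3.4\t2012\t01\t/about.html\n",
        "5.6.7.8\t2012\t02\t/index.html\n",
    ]
    print(to_tab_delimited(aggregate_pageviews(sample)), end="")
```

On a real cluster the same grouping would be done by a MapReduce job or a Hive query rather than in a single Python process; the sketch only illustrates the shape of the input and output.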