
This section contains a series of How-Tos that demonstrate the integration between Pentaho and Hadoop using a sample weblog dataset.

The how-tos are organized by topic with each set explaining various techniques for loading, transforming, extracting and reporting on data within a Hadoop cluster. You are encouraged to perform the how-tos in order as the output of one is sometimes used as the input of another. However, if you would like to jump to a how-to in the middle of the flow, instructions for preparing input data are provided.


The first three videos compare creating and executing a simple MapReduce job with Pentaho Kettle against solving the same problem in Java. The Kettle transformation shown here runs as a Mapper and a Reducer within the cluster.

Video: https://www.youtube.com/watch?v=KZe1UugxXcs
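The videos do not reproduce their job's code here, so as a stand-in, here is a minimal Hadoop Streaming style mapper and reducer sketched in Python. The choice of job (counting requests per client IP from tab-delimited weblog lines) and the field layout are hypothetical, picked only to illustrate the map/reduce shape that both the Kettle and Java versions implement:

```python
from collections import defaultdict

def mapper(lines):
    """Emit (key, 1) pairs; the key here is the first tab-delimited
    field of each weblog line (a hypothetical choice for illustration)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            yield fields[0], 1

def reducer(pairs):
    """Sum the counts for each key. A real Hadoop reducer receives pairs
    grouped by key; a dict is enough for this local sketch."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

if __name__ == "__main__":
    sample = [
        "1.2.3.4\t/index.html",
        "1.2.3.4\t/about.html",
        "5.6.7.8\t/index.html",
    ]
    # Counts requests per client IP in the sample lines.
    print(reducer(mapper(sample)))
```

In a Kettle Pentaho MapReduce job, the equivalent logic would be expressed as transformation steps rather than code; in Java it becomes Mapper and Reducer classes, which is what the next video walks through.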


What would the same task as "1) Pentaho MapReduce with Kettle" look like if you coded it in Java? At half an hour long, you may not want to watch the entire video...

Video: https://www.youtube.com/watch?v=cfFq1XB4kww


This is a quick summary of the previous two videos, "1) Pentaho MapReduce with Kettle" and "2) Straight Java", and why Pentaho Kettle boosts productivity and maintainability.

Video: https://www.youtube.com/watch?v=ZnyuTICOrhk


A quick example of loading into the Hadoop Distributed File System (HDFS) using Pentaho Kettle.

Video: https://www.youtube.com/watch?v=Ylekzmd6TAc


A quick example of extracting data from the Hadoop Distributed File System (HDFS) using Pentaho Kettle.

Video: https://www.youtube.com/watch?v=3Xew58LcMbg

Hadoop Topics

Prerequisites

In order to perform all of the how-tos in this section, you will need the following components. Since not every how-to uses every component (e.g. HBase, Hive, Report Designer), the specific requirements are identified within each how-to. This section enumerates all of the components along with some configuration and installation tips.

Hadoop

A single-node local cluster is sufficient for these exercises, but a larger and/or remote configuration will also work. You will need to know the addresses and ports of your Hadoop services. A good example of how to set up a single-node cluster: http://hadoop.apache.org/docs/stable/single_node_setup.html
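For reference, the single-node guide linked above runs Hadoop in pseudo-distributed mode, where clients connect to services on localhost. A minimal core-site.xml along those lines might look like this (the localhost address and port 9000 are the conventional defaults from that guide, not values specific to these how-tos; adjust them to your own cluster):

```xml
<configuration>
  <property>
    <!-- HDFS namenode address and port; clients such as PDI
         connect to HDFS through this value -->
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

This is the address/port pair you will be asked for when configuring HDFS steps in the how-tos.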

Pentaho Data Integration

PDI will be the primary development environment for the how-tos. You will need version 4.3 or above. You can find instructions to install PDI for Hadoop in the Configure Pentaho for Cloudera and Other Hadoop Versions guide.

Pentaho Hadoop Distribution

A Hadoop node distribution of the Pentaho Data Integration (PDI) tool. Pentaho Hadoop Distribution (referred to as PHD from this point on) allows you to execute Pentaho MapReduce jobs on the Hadoop cluster.

You can find instructions to download and install the software here: Configure Pentaho for Cloudera and Other Hadoop Versions

Pentaho Report Designer

A desktop installation of the Pentaho Report Designer tool with the PDI jars in its lib directory.

You can find instructions to download and install report designer in the Configure Pentaho for Cloudera and Other Hadoop Versions guide.

Hive

A supported version of Hive. Hive is a Map/Reduce abstraction layer that provides SQL-like access to Hadoop data.

You can find a Hive Getting Started guide here: https://cwiki.apache.org/confluence/display/Hive/GettingStarted

HBase

A MapR-supported version of HBase. HBase is a NoSQL database that leverages Hadoop storage.

Sample Data

The how-tos in this guide were built with sample weblog data. The following files are used and/or generated by the how-tos; each how-to explains which file(s) it requires.

File Name                  Content
weblogs_rebuild.txt.zip    Unparsed, raw weblog data
weblogs_parse.txt.zip      Tab-delimited, parsed weblog data
weblogs_hive.txt.zip       Tab-delimited, aggregated weblog data for a Hive weblogs_agg table
weblogs_aggregate.txt.zip  Tab-delimited, aggregated weblog data
weblogs_hbase.txt.zip      Prepared data for HBase load
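Several of these files are tab-delimited, so they are easy to inspect locally before loading them into the cluster. As a quick sketch, Python's csv module reads them with a tab delimiter; the three-column layout and the sample values below are invented for illustration, since the real column layouts are described in the individual how-tos:

```python
import csv
import io

# Stand-in for a few lines of a parsed, tab-delimited weblog file.
# The columns (client IP, date, requested path) are hypothetical.
sample = (
    "1.2.3.4\t2012-01-15\t/products/index.html\n"
    "5.6.7.8\t2012-01-15\t/index.html\n"
)

# csv handles tab-delimited files when given delimiter="\t";
# for a real file, replace io.StringIO(sample) with open(path).
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
for row in rows:
    print(row)
```

A quick look like this helps confirm field order and delimiters before wiring the file into a PDI transformation.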
