This section contains how-tos that will get you started with Pentaho if you are using the MapR distribution of Hadoop.

Overview

This wiki contains a series of how-tos that demonstrate the integration between Pentaho and MapR using a sample weblog dataset. The how-tos are organized by function, with each set explaining various techniques for loading, transforming, extracting and reporting on data within a MapR cluster. You are encouraged to perform the how-tos in order, as the output of one is often used as the input of another. However, if you would like to jump to a how-to in the middle of the flow, instructions for preparing input data are provided.

Pre-Requisites

In order to perform all of the how-tos in this guide, you will need the following components. Since not every how-to uses every component (e.g. HBase, Hive, Report Designer), specific component requirements are identified within each how-to. This section lists all of the components, with some additional configuration and installation tips.

MapR

A single-node local cluster is sufficient for these exercises but a larger and/or remote configuration will also work. You will need to know the addresses and ports for MapR.

These guides were developed using the MapR M3 distribution version 1.2. You can find MapR downloads here: http://mapr.com/download
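Before starting the how-tos, it can help to confirm that the MapR address and port you plan to use are reachable from your workstation. The following is a minimal sketch using only the Python standard library; the function name `is_port_open` is hypothetical, and it only checks that a TCP connection can be opened, not that a MapR service is answering on that port.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `is_port_open("localhost", 7222)` returns `True` only if something is listening at that address; both values here are placeholders to replace with the address and port of your own cluster.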

Pentaho Data Integration

PDI will be the primary development environment for the how-tos.    You will need version [TODO]. You can download the software here: [TODO]

Pentaho Hadoop Distribution

A Hadoop node distribution of the Pentaho Data Integration (PDI) tool.  Pentaho Hadoop Distribution (referred to as PHD from this point on) allows you to execute Pentaho MapReduce jobs on the MapR cluster.

You can find instructions to download and install the software here: [TODO]

Pentaho Report Designer

A desktop installation of the Pentaho Report Designer tool, with the PDI jars in its lib directory.

You must copy all jars from PDI's libext directory and its subfolders, with the exception of the JDBC folder, into Report Designer's lib directory.
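The jar copy described above can be scripted. The following is a minimal sketch, not an official installer step: the function name `copy_pdi_jars` and its parameters are hypothetical, and it assumes the jars should be placed flat in the lib directory rather than keeping the libext subfolder layout. If two jars share a file name, the later one overwrites the earlier one.

```python
import shutil
from pathlib import Path


def copy_pdi_jars(libext_dir, report_designer_lib, skip_dir="JDBC"):
    """Copy every .jar under libext_dir into report_designer_lib.

    Jars located inside a folder named skip_dir (case-insensitive) are skipped.
    Returns the list of copied file names.
    """
    src = Path(libext_dir)
    dest = Path(report_designer_lib)
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for jar in sorted(src.rglob("*.jar")):
        # Folder names between libext_dir and the jar itself.
        parent_folders = jar.relative_to(src).parts[:-1]
        if any(folder.lower() == skip_dir.lower() for folder in parent_folders):
            continue
        # Assumption: jars are flattened into the lib directory.
        shutil.copy2(jar, dest / jar.name)
        copied.append(jar.name)
    return copied
```

Call it with the paths of your own installations, for example `copy_pdi_jars("/path/to/pdi/libext", "/path/to/report-designer/lib")`; both paths are placeholders.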

Hive

A MapR-supported version of Hive. Hive is a MapReduce abstraction layer that provides SQL-like access to MapR data.

You can find instructions to install Hive for MapR here: http://mapr.com/doc/display/MapR/Hive

HBase

A MapR-supported version of HBase. HBase is a NoSQL database that leverages MapR's CLDB storage.

You can find instructions to install HBase for MapR here: http://mapr.com/doc/display/MapR/HBase

Sample Data

The how-tos in this guide were built with sample weblog data. The following files are used and/or generated by the how-tos in this guide. Each how-to explains which file(s) it requires.

File Name               Content
weblogs_rebuild.txt     Unparsed, raw weblog data
weblogs_parse.txt       Tab-delimited, parsed weblog data
weblogs_hive.txt        Tab-delimited, aggregated weblog data for a Hive weblogs_agg table
weblogs_aggregate.txt   Tab-delimited, aggregated weblog data
weblogs_hbase.txt       Prepared data for HBase load
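Several of the sample files are described as tab-delimited, so a quick look at the first few rows can confirm that a file is what a how-to expects. The following is a minimal sketch; the function name `preview_tab_delimited` is hypothetical, UTF-8 encoding is an assumption, and no column layout is assumed, so each row comes back as a plain list of field strings.

```python
import csv
from itertools import islice


def preview_tab_delimited(path, max_rows=5):
    """Return up to max_rows rows from a tab-delimited file as lists of fields."""
    # Assumption: the file is UTF-8; undecodable bytes are replaced, not fatal.
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        # QUOTE_NONE keeps quote characters in the data as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(islice(reader, max_rows))
```

For example, `preview_tab_delimited("weblogs_parse.txt")` reads the first five rows of that file from the current directory.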

<<< Probably need to add links to download these files >>>

Loading Data into a MapR Cluster

The how-tos in this section will demonstrate how to load data into CLDB (MapR's distributed file system), Hive and HBase.

Loading Data into CLDB – <<<link to content>>>

Loading Data into Hive – <<<link to content>>>

Loading Data into HBase – <<<link to content>>>

Transforming Data within a MapR Cluster

The how-tos in this section will demonstrate how to leverage the massively parallel, fault-tolerant MapR processing engine to transform resident cluster data.

Using Pentaho MapReduce to Parse Weblog Data – <<<link to content>>>

Using Pentaho MapReduce to Generate an Aggregate Dataset – <<<link to content>>>

Transforming Data with Pig – <<<link to content>>>

Transforming Data within Hive – <<<link to content>>>

Extracting Data from the MapR Cluster

The how-tos in this section will demonstrate how to extract data from the MapR cluster and load it into an RDBMS table.

Extracting data from CLDB to load an RDBMS – <<<link to content>>>

Extracting data from Hive to load an RDBMS – <<<link to content>>>

Extracting data from HBase to load an RDBMS – <<<link to content>>>

Reporting on Data in the MapR Cluster

The how-tos in this section will demonstrate how to report on data that is resident within the MapR cluster.

Reporting on CLDB file data – <<<link to content>>>

Reporting on Hive data – <<<link to content>>>

Reporting on HBase data – <<<link to content>>>
