The how-tos are organized by topic, with each set explaining various techniques for loading, transforming, extracting and reporting on data within a MapR cluster. You are encouraged to perform the how-tos in order, as the output of one is often used as the input of another. However, if you would like to jump to a how-to in the middle of the flow, instructions for preparing input data are provided.

MapR Topics


Pre-Requisites

To perform all of the how-tos in this section, you will need the components listed below. Not every how-to uses every component (e.g. HBase, Hive, Report Designer), so specific component requirements are identified within each how-to. This section enumerates all of the components, with some additional configuration and installation tips.

MapR

A single-node local cluster is sufficient for these exercises but a larger and/or remote configuration will also work. You will need to know the addresses and ports for MapR.

These guides were developed using the MapR M3 distribution version 1.2. You can find MapR downloads here: http://mapr.com/download

Pentaho Data Integration

...

Kettle

A desktop installation of the Kettle design tool, called 'Spoon'. Download here; configuration instructions are here.

Pentaho Hadoop Distribution

A Hadoop node distribution of the Pentaho Data Integration (PDI) tool. Pentaho Hadoop Distribution (referred to as PHD from this point on) allows you to execute Pentaho MapReduce jobs on the MapR cluster. Download here; configuration instructions are here.

Pentaho Report Designer

A desktop installation of Pentaho Report Designer (PRD) with the PDI jars in its lib directory. PRD is a desktop tool for creating highly formatted reports that can be exported to many popular formats. Reports created with PRD can be published to a Pentaho BI Server so they can be accessed using a browser. You must copy all jars from PDI's libext directory and its subfolders, with the exception of the JDBC folder, into Report Designer's lib directory. Download here; configuration instructions are here.
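The jar-copy step described above could be scripted roughly as follows. This is a minimal sketch, not an official installer: the function name `copy_pdi_jars` is hypothetical, the directory locations must be supplied by you, and the choice to place all jars directly in the lib directory (rather than preserving the subfolder layout) is an assumption based on the wording above.

```python
import shutil
from pathlib import Path
from typing import List


def copy_pdi_jars(pdi_libext: Path, prd_lib: Path, skip_dir: str = "JDBC") -> List[Path]:
    """Copy every .jar under pdi_libext (including subfolders) into prd_lib.

    Jars located inside a folder named skip_dir are left out.
    Assumption: jars are copied flat into prd_lib, not into matching subfolders.
    """
    prd_lib.mkdir(parents=True, exist_ok=True)
    copied = []
    for jar in sorted(pdi_libext.rglob("*.jar")):
        # Folder names between libext and the jar file itself
        parent_parts = jar.relative_to(pdi_libext).parts[:-1]
        # Skip anything inside the JDBC folder (case-insensitive match)
        if any(part.lower() == skip_dir.lower() for part in parent_parts):
            continue
        target = prd_lib / jar.name
        shutil.copy2(jar, target)
        copied.append(target)
    return copied
```

A call would look like `copy_pdi_jars(Path("<PDI install>/libext"), Path("<PRD install>/lib"))`, where both paths are placeholders for your own installation directories.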

Hive

A MapR-supported version of Hive. Hive is a MapReduce abstraction layer that provides SQL-like access to MapR data.

You can find instructions to install Hive for MapR here: http://mapr.com/doc/display/MapR/Hive

HBase

A MapR-supported version of HBase. HBase is a NoSQL database that leverages MapR's filesystem.

You can find instructions to install HBase for MapR here: http://mapr.com/doc/display/MapR/HBase

Sample Data

The how-tos in this guide were built with sample weblog data. The following files are used and/or generated by the how-tos in this guide. Each how-to explains which file(s) it requires.

File Name	Content
weblogs_rebuild.txt.zip	Unparsed, raw weblog data
weblogs_parse.txt.zip	Tab-delimited, parsed weblog data
weblogs_hive.txt.zip	Tab-delimited, aggregated weblog data for a Hive weblogs_agg table
weblogs_aggregate.txt.zip	Tab-delimited, aggregated weblog data
weblogs_hbase.txt.zip	Prepared data for HBase load
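Since the sample files are distributed as zip archives, they need to be unpacked before use. The sketch below shows one way to do that, assuming the archives have already been downloaded into a local directory. The function name `extract_sample_data` and both directory arguments are hypothetical, and archives that are not present are simply skipped, since not every how-to needs every file.

```python
import zipfile
from pathlib import Path
from typing import List

# Archive names as listed in the table above
SAMPLE_ARCHIVES = [
    "weblogs_rebuild.txt.zip",
    "weblogs_parse.txt.zip",
    "weblogs_hive.txt.zip",
    "weblogs_aggregate.txt.zip",
    "weblogs_hbase.txt.zip",
]


def extract_sample_data(archive_dir: Path, output_dir: Path) -> List[Path]:
    """Unpack whichever sample archives exist in archive_dir into output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for name in SAMPLE_ARCHIVES:
        archive = archive_dir / name
        if not archive.is_file():
            continue  # this archive was not downloaded; skip it
        with zipfile.ZipFile(archive) as zf:
            for member in zf.namelist():
                zf.extract(member, output_dir)
                extracted.append(output_dir / member)
    return extracted
```

The returned list contains the paths of the extracted files, which can then be loaded into the cluster as each how-to describes.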

...
