This section contains a series of How-Tos that demonstrate the integration between Pentaho and MapR using a sample weblog dataset.
The how-tos are organized by topic with each set explaining various techniques for loading, transforming, extracting and reporting on data within a MapR cluster. You are encouraged to perform the how-tos in order as the output of one is sometimes used as the input of another. However, if you would like to jump to a how-to in the middle of the flow, instructions for preparing input data are provided.
MapR Topics
Pre-Requisites
To perform all of the how-tos in this section, you will need the components listed below. Since not every how-to uses every component (e.g. HBase, Hive, Report Designer), specific component requirements are identified within each how-to. This section lists all of the components, with some additional configuration and installation tips.
MapR
A single-node local cluster is sufficient for these exercises but a larger and/or remote configuration will also work. You will need to know the addresses and ports for MapR.
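Before starting, it can help to confirm that the cluster addresses and ports you have noted are actually reachable from your workstation. The following is a minimal sketch, not part of the original how-tos: the host names are placeholders, and the ports 7222 (CLDB) and 9001 (JobTracker) are assumed defaults that may differ on your cluster.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


if __name__ == "__main__":
    # Hypothetical endpoints: replace with the addresses and ports of your cluster.
    # The port numbers are assumptions, not values taken from this guide.
    endpoints = {
        "CLDB": ("localhost", 7222),
        "JobTracker": ("localhost", 9001),
    }
    for name, (host, port) in endpoints.items():
        status = "reachable" if is_port_open(host, port) else "NOT reachable"
        print(f"{name} at {host}:{port} is {status}")
```

A successful TCP connection only shows that something is listening on the port; it does not verify that the service behind it is a correctly configured MapR component.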
These guides were developed using the MapR M3 distribution version 1.2. You can find MapR downloads here: http://mapr.com/download
Pentaho Data Integration
PDI will be the primary development environment for the how-tos. You will need version 4.3. You can find instructions to download and install PDI in the Configure Pentaho for MapR guide.
Pentaho Hadoop Distribution
A Hadoop node distribution of the Pentaho Data Integration (PDI) tool. Pentaho Hadoop Distribution (referred to as PHD from this point on) allows you to execute Pentaho MapReduce jobs on the MapR cluster.
You can find instructions to download and install the software in the Configure Pentaho for MapR guide.
Pentaho Report Designer
A desktop installation of the Pentaho Report Designer tool, with the PDI jars in the lib directory.
You can find instructions to download and install Report Designer in the Configure Pentaho for MapR guide.
Hive
A MapR-supported version of Hive. Hive is a Map/Reduce abstraction layer that provides SQL-like access to MapR data.
You can find instructions to install Hive for MapR here: http://mapr.com/doc/display/MapR/Hive
HBase
A MapR-supported version of HBase. HBase is a NoSQL database that leverages MapR's CLDB storage.
You can find instructions to install HBase for MapR here: http://mapr.com/doc/display/MapR/HBase
Sample Data
The how-tos in this guide were built with sample weblog data. The following files are used and/or generated by the how-tos in this guide. Each how-to explains which file(s) it requires.
File Name | Content |
---|---|
weblogs_rebuild.txt (attached as weblog_rebuild.txt.zip) | Unparsed, raw weblog data |
weblogs_parse.txt | Tab-delimited, parsed weblog data |
weblogs_hive.txt | Tab-delimited, aggregated weblog data for a Hive weblogs_agg table |
weblogs_aggregate.txt | Tab-delimited, aggregated weblog data |
weblogs_hbase.txt | Prepared data for HBase load |
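Several of the sample files are tab-delimited, so a quick preview can confirm that a file is intact before it is used as input to a how-to. The sketch below is an illustration only: the helper name `preview_tab_delimited` is hypothetical, and it makes no assumption about the number or meaning of the columns, which this page does not specify.

```python
import csv
from itertools import islice
from pathlib import Path
from typing import List


def preview_tab_delimited(path: str, limit: int = 5) -> List[List[str]]:
    """Return the first `limit` rows of a tab-delimited file as lists of fields."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters as ordinary data, which suits raw log fields.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(islice(reader, limit))


if __name__ == "__main__":
    # Assumes weblogs_parse.txt has been placed in the current directory.
    sample = Path("weblogs_parse.txt")
    if sample.exists():
        for row in preview_tab_delimited(str(sample)):
            print(len(row), "fields:", row)
    else:
        print(f"{sample} not found; copy the sample file here first.")
```

If the field count varies from row to row, the file may be truncated or may not be the tab-delimited variant you expected.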