- About Kettle and Big Data
- Hadoop Configurations
- Sqoop Import & Export
- Pentaho MapReduce — The Pentaho MapReduce job entry allows you to build MapReduce jobs using Kettle transformations as the Mapper, Combiner, and/or Reducer.
- MongoDB Blend — Sample Blend template for MongoDB
- Salesforce and Corporate Finance Blend — Sample Blend template for use with Salesforce.com
- Google Analytics Customer Service Blend — Sample Blend template for use with Google Analytics
- Hadoop
- Understanding How Pentaho works with Hadoop
- Configuring Pentaho for your Hadoop Distro and Version — How to set up and configure Pentaho (Kettle, Pentaho Data Integration, Pentaho Business Analytics Suite) for your specific Hadoop distribution.
- Configuring Pentaho for your Hadoop Distro and Version (Pentaho Suite Version 5.1) — How to set up and configure Pentaho (Kettle, Pentaho Data Integration, Pentaho Business Analytics Suite) for your specific Hadoop distribution.
- Upgrade Hadoop in Community Edition to 5.0.4 — There have been major bug fixes and some additional functionality introduced into the Big Data Plugin since the 5.0.1 Community Edition release. These instructions show how to upgrade CE to the 5.0.4 version of the Big Data components.
- Additional Configuration for using MR1 with CDH5 — Additional configuration required to allow access to a CDH 5.0 cluster configured for MapReduce 1. This feature was removed in CDH 5.1.
- Additional Configuration for MapR Shims — Additional configuration required to allow Pentaho to access MapR clusters.
- Additional Configuration for YARN Shims
- Install Hadoop Distribution Shim — Instructions for installing a new or downloaded shim.
- Helpful Commands for Working with Hadoop Configurations — Helpful scripts and commands for working with Hadoop configurations.
- Reporting on Data in Hadoop — How to report on data that is resident within the Hadoop cluster.
- Reporting on HDFS File Data — How to create a report that sources data from an HDFS file.
- Reporting on Hive Data — How to create a report that sources data from Hive.
- Reporting on HBase Data — How to create a report that sources data from HBase.
- Advanced Pentaho MapReduce — Advanced how-tos for developing Pentaho MapReduce.
- Using a Custom Input or Output Format in Pentaho MapReduce — How to use a custom Input or Output Format in Pentaho MapReduce.
- Processing HBase data in Pentaho MapReduce using TableInputFormat — How to use HBase TableInputFormat in Pentaho MapReduce.
- Using Compression with Pentaho MapReduce — How to use compression with Pentaho MapReduce.
- Using a Custom Partitioner in Pentaho MapReduce — How to use a custom partitioner in Pentaho MapReduce.
- Unit Test Pentaho MapReduce Transformation — How to unit test the mapper and reducer transformations that make up a Pentaho MapReduce job.
- Extracting Data from the Hadoop Cluster — How to extract data from Hadoop using HDFS, Hive, and HBase.
- Extracting Data from HDFS to Load an RDBMS — How to use a PDI transformation to extract data from HDFS and load it into an RDBMS table.
- Extracting Data from HBase to Load an RDBMS — How to use a PDI transformation to extract data from HBase and load it into an RDBMS table.
- Extracting Data from Hive to Load an RDBMS — How to use a PDI transformation to extract data from Hive and load it into an RDBMS table (see the Hive JDBC sketch after this list).
- Extracting Data from Snappy Compressed Files — How to configure client-side PDI so that files compressed using the Snappy codec can be decompressed using the Hadoop file input or Text file input step.
- Transforming Data within a Hadoop Cluster — How to transform data within the Hadoop cluster using Pentaho MapReduce, Hive, and Pig.
- Using Pentaho MapReduce to Parse Mainframe Data — How to use Pentaho to ingest a mainframe file into HDFS, then use Pentaho MapReduce to process it into delimited records.
- Using Pentaho MapReduce to Generate an Aggregate Dataset — How to use Pentaho MapReduce to transform and summarize detailed data into an aggregate dataset.
- Using Pentaho MapReduce to Parse Weblog Data — How to use Pentaho MapReduce to convert raw weblog data into parsed, delimited records.
- Transforming Data with Pig — How to invoke a Pig script from a PDI job.
- Transforming Data within Hive — How to read data from a Hive table, transform it, and write it to a Hive table within the workflow of a PDI job.
- Loading Data into a Hadoop Cluster — How to load data into HDFS (Hadoop's Distributed File System), Hive and HBase.
- Loading Data into HBase — How to use a PDI transformation that sources data from a flat file and writes it to an HBase table (see the HBase client sketch after this list).
- Loading Data into HDFS — How to use a PDI job to move a file into HDFS (see the HDFS FileSystem sketch after this list).
- Simple Chrome Extension to browse HDFS volumes — How to add a Chrome Omnibox extension to support HDFS browsing.
- Loading Data into Hive — How to use a PDI job to load a data file into a Hive table.
- MapR
- Extracting Data from the MapR Cluster — How to extract data from the MapR cluster and load it into an RDBMS table.
- Extracting Data from HBase to Load an RDBMS in MapR — How to use a PDI transformation to extract data from HBase and load it into an RDBMS table.
- Extracting Data from Hive to Load an RDBMS in MapR — How to use a PDI transformation to extract data from Hive and load it into an RDBMS table.
- Extracting Data from CLDB to Load an RDBMS — How to use a PDI transformation to extract data from MapR CLDB and load it into an RDBMS table.
- Loading Data into a MapR Cluster — How to load data into CLDB (MapR’s distributed file system), Hive and HBase.
- Loading Data into the MapR filesystem — How to use a PDI job to move a file into the MapR filesystem.
- Loading Data into MapR HBase — How to use a PDI transformation that sources data from a flat file and writes to an HBase table.
- Loading Data into MapR Hive — How to use a PDI job to load a data file into a Hive table.
- Reporting on Data in the MapR Cluster — How to report on data that is resident within the MapR cluster.
- Reporting on Hive Data in MapR — How to create a report that sources data from Hive.
- Reporting on CLDB File Data — How to create a report that sources data from a MapR CLDB file.
- Reporting on HBase Data in MapR — How to create a report that sources data from HBase.
- Transforming Data within a MapR Cluster — How to leverage the massively parallel, fault-tolerant MapR processing engine to transform resident cluster data.
- Using Pentaho MapReduce to Parse Weblog Data in MapR — How to use Pentaho MapReduce to convert raw weblog data into parsed, delimited records.
- Transforming Data within Hive in MapR — How to read data from a Hive table, transform it, and write it to a Hive table within the workflow of a PDI job.
- Transforming Data with Pig in MapR — How to invoke a Pig script from a PDI job.
- Using Pentaho MapReduce to Generate an Aggregate Dataset in MapR — How to use Pentaho MapReduce to transform and summarize detailed data into an aggregate dataset.
- Cassandra
- Write Data To Cassandra — How to read data from a data source (flat file) and write it to a column family in Cassandra using a graphic tool.
- How To Create a Report with Cassandra — How to create a report that uses data from a column family in Cassandra using graphic tools.
- How To Read Data From Cassandra — How to read data from a column family in Cassandra using a graphic tool (see the Cassandra driver sketch after this list).
- MongoDB
- Write Data To MongoDB — How to read data from a data source (flat file) and write it to a collection in MongoDB (see the MongoDB driver sketch after this list).
- Create a Report with MongoDB — How to create a report that uses data from a collection in MongoDB.
- Read Data From MongoDB — How to read data from a collection in MongoDB.
- Create a Parameterized Report with MongoDB — How to create a parameterized report that uses data from a collection in MongoDB.
- Kettle Execution on Storm — An experimental environment for executing a Kettle transformation as a Storm topology.
- Pentaho Map Reduce Vizor
- Kettle on Spark — An experimental environment for executing a Kettle transformation as a Spark Stream.
- Weka Execution in Hadoop — A recipe for executing Weka in Hadoop.
- MongoDB Development — MongoDB Development tasks and priorities
- Cassandra Development — Cassandra Development tasks and priorities
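The recipes above do all of this work with PDI jobs and transformations; the sketches that follow are plain-Java orientation only, showing roughly what equivalent operations look like against the same services. Every host name, port, path, table, and field in them is an illustrative assumption, not a value taken from the recipes. First, a minimal sketch of extracting a few rows from Hive over JDBC (HiveServer2), comparable to the kind of query the Hive extraction recipes issue:

{code:java}
// Minimal sketch: read a few rows from a Hive table over JDBC (HiveServer2).
// The connection URL, credentials, table, and columns are illustrative assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveReadSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "pdi", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT client_ip, year, month FROM weblogs LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t"
                        + rs.getString(2) + "\t" + rs.getString(3));
            }
        }
    }
}
{code}

A JDBC URL of the form jdbc:hive2://host:port/database is what a reporting tool or database connection pointed at HiveServer2 would typically use.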
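Next, a minimal sketch of writing one cell to HBase with the HBase Java client (the 1.x Connection/Table API); the ZooKeeper quorum, table, row key, column family, and qualifier are assumptions:

{code:java}
// Minimal sketch: write one cell to an HBase table with the HBase client API.
// The ZooKeeper quorum, table, row key, column family, and qualifier are
// illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zookeeper-host"); // assumed quorum host
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("weblogs"))) {
            Put put = new Put(Bytes.toBytes("0.308.86.81|2012-07-01")); // row key
            put.addColumn(Bytes.toBytes("pageviews"),                   // column family
                          Bytes.toBytes("count"),                       // qualifier
                          Bytes.toBytes("42"));                         // value
            table.put(put);
        }
    }
}
{code}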
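A minimal sketch of copying a local file into HDFS with the Hadoop FileSystem API, the kind of copy the HDFS loading recipe performs through a job entry; the namenode URI and both paths are assumptions:

{code:java}
// Minimal sketch: copy a local file into HDFS with the Hadoop FileSystem API.
// The namenode URI and both paths are illustrative assumptions.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)) {
            fs.copyFromLocalFile(
                    new Path("/tmp/weblogs_rebuild.txt"),                   // local source
                    new Path("/user/pdi/weblogs/raw/weblogs_rebuild.txt")); // HDFS target
        }
    }
}
{code}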
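A minimal sketch of reading rows from a Cassandra table (column family) with the DataStax Java driver 3.x; the contact point, keyspace, table, and columns are assumptions:

{code:java}
// Minimal sketch: read a few rows from a Cassandra table with the DataStax
// Java driver (3.x). Contact point, keyspace, table, and columns are
// illustrative assumptions.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraReadSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("logs")) {
            ResultSet rs = session.execute("SELECT page, hits FROM pageviews LIMIT 10");
            for (Row row : rs) {
                System.out.println(row.getString("page") + "\t" + row.getLong("hits"));
            }
        }
    }
}
{code}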
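Finally, a minimal sketch of inserting one document into a MongoDB collection with the MongoDB Java driver (the MongoClients API in driver 3.7 and later); the connection string, database, collection, and fields are assumptions:

{code:java}
// Minimal sketch: insert one document into a MongoDB collection using the
// MongoDB Java driver. Connection string, database, collection, and fields
// are illustrative assumptions.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MongoWriteSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> sales =
                    client.getDatabase("pentaho").getCollection("sales");
            sales.insertOne(new Document("customer", "Acme Corp")
                    .append("amount", 1250.00)
                    .append("region", "EMEA"));
        }
    }
}
{code}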