Using Pentaho MapReduce to Parse Mainframe Data

This guide shows how to use Pentaho to ingest a Mainframe file into HDFS, then use MapReduce to process it into delimited records.

The steps in this guide include:

  1. Installing a plugin from the Marketplace

  2. Ingesting a Mainframe file into HDFS

  3. Developing a PDI Transformation as a Mapper Task

  4. Developing a PDI Job to Orchestrate all of the components

  5. Executing and Reviewing Output

This how-to guide was built using Pentaho Data Integration 5.0.4.

 

Prerequisites

In order to follow along with this how-to guide you will need the following:

About the LegStar z/OS File reader Plugin

This community-developed plugin is provided by LegSem and is available through the PDI Marketplace. It uses the JDK to compile a client library for parsing your specific COBOL Copybook file. After you install the plugin you should copy tools.jar (from JDK_HOME/lib) into the data-integration/lib directory where you have Pentaho installed.

1. To install the plugin, launch Spoon and go to the menu Help -> Marketplace.

2. Search for z/os and you should see the LegStar z/OS File reader Plugin

3. Click the Install this Plugin button

4. Add tools.jar to your PDI classpath. A simple way to do this is to copy tools.jar from JDK_HOME/lib to data-integration/lib.

5. Restart Spoon

Sample Files

The sample data needed for this guide is:

File Name

Content



Contains a COBOL copybook file and Mainframe file in EBCDIC format

Step-by-Step Instructions

Setup

Start Hadoop if it is not already running.

Unzip Data.zip to a directory.

Create a Transformation to Convert z/OS File

This sample works by converting the existing Mainframe file that is in z/OS format into a text format for processing in Hadoop.
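To make the idea of this conversion concrete, here is a small sketch in Python. It is an illustration only and not how the LegStar plugin works internally: the record layout below is made up (it is not the sample copybook), and it assumes the IBM01140 code page, which Python exposes as the cp1140 codec. Real copybooks usually also contain binary and packed-decimal fields, which this sketch does not handle.

```python
# Hypothetical fixed-width layout: (field name, offset, length).
# This is NOT the layout of the sample copybook; it only illustrates the idea.
LAYOUT = [("customer_id", 0, 6), ("customer_name", 6, 10)]


def parse_record(raw: bytes, codepage: str = "cp1140") -> dict:
    """Decode one EBCDIC record and slice it into named text fields."""
    # cp1140 is a single-byte code page, so character offsets match byte offsets.
    text = raw.decode(codepage)
    return {name: text[start:start + length].rstrip()
            for name, start, length in LAYOUT}


def to_delimited(record: dict, separator: str = ";") -> str:
    """Join the fields in layout order with the chosen separator."""
    return separator.join(record[name] for name, _, _ in LAYOUT)


# Build a sample EBCDIC record; its bytes differ from the ASCII encoding.
sample = "000042JOHN SMITH".encode("cp1140")
record = parse_record(sample)
print(to_delimited(record))  # prints 000042;JOHN SMITH
```

The z/OS File Input step configured below performs the equivalent work for the real file, using the copybook to determine the fields.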

1. Start PDI on your desktop. Choose File -> New -> Transformation from the menu.

2. Add a z/OS File Input step (if you don't see this step, check the Prerequisites section for how to install it through the Marketplace). Your transformation should look like this:


3. Edit the z/OS File Input step. For z/OS filename, select your Mainframe file. If you are using the sample data, this is under mf-files/ZOS.FCUSTDAT.bin. This file contains messages of a variable length, so check that box as well. The sample file uses the IBM01140 character set. For your own files, you will need to know which code page your EBCDIC file uses: http://en.wikipedia.org/wiki/EBCDIC


4. Configure the COBOL copybook. Go to the COBOL tab to place your copybook. The copybook is used to "translate" the Mainframe file into fields. If you are using the sample files, you can select Import COBOL and browse to copybooks/CUSTDATCC. Note that in the file browser you must change the drop-down to All files so this will show up in the list. You should now see your copybook has been loaded.


5. Configure the Fields. Click the Get Fields button so that PDI can parse the copybook definition and determine what fields will be present.

If you see an error that it was unable to find tools.jar, this is because tools.jar was not found in the classpath. A simple fix is to copy JDK_HOME/lib/tools.jar to data-integration/lib.

The fields should be properly read in like this:

6. Preview Data. Hit the Preview button to make sure that all of the settings are correct.

7. Save to HDFS. Now that the Mainframe data is being converted correctly, you can save the stream to Hadoop using the Hadoop File Output step. Add the step and create a hop from the z/OS File Input step to the Hadoop File Output step.


8. Configure HDFS Location. Set the HDFS directory where you want your file to be stored. For this example we will store the file in the HDFS directory /demo/mainframe/input. Make sure to use the correct hostname and port for your Hadoop NameNode.


9. Configure HDFS Content. On the Content tab, make the following settings:

  • Choose a separator character of semicolon ;

  • Uncheck and change the format to Unix.

10. Configure HDFS Fields. On the Fields tab, select the Get Fields button to read in all of the fields that are created by the previous step. For every numeric field, you need to change the format to #.# to make sure that there are no extra spaces written to HDFS.


11. Save your transformation as zos-to-hdfs.ktr. The z/OS to HDFS transformation is now complete.

12. You can now run this transformation, and it should complete successfully. The sample data contains 10,000 rows, so the Step Metrics tab should show 10,000 rows after the transformation runs.
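Beyond the Step Metrics tab, the written file can be sanity-checked directly. The sketch below is one possible check and not part of PDI: the function name summarize_delimited is hypothetical, and the rows are made up. It counts rows, collects the number of fields per row, and counts fields that carry leading or trailing spaces, which is what the #.# numeric format is meant to prevent.

```python
import csv
import io


def summarize_delimited(lines, separator=";"):
    """Return (row count, set of field counts, number of padded fields).

    Hypothetical helper for illustration; 'lines' is any iterable of
    text lines, such as an open file or an io.StringIO.
    """
    row_count = 0
    field_counts = set()
    padded_fields = 0
    for row in csv.reader(lines, delimiter=separator):
        row_count += 1
        field_counts.add(len(row))
        # A field with surrounding whitespace suggests a padded number format.
        padded_fields += sum(1 for value in row if value != value.strip())
    return row_count, field_counts, padded_fields


# Made-up rows standing in for the real output; the second row has a padded value.
sample_output = io.StringIO("1;ALICE;10.5\n2;BOB; 7\n")
print(summarize_delimited(sample_output))  # prints (2, {3}, 1)
```

To run the same check on real output, one option is to copy the file out of HDFS first (for example with hadoop fs -get) and pass the opened local file to the function. For the sample data you would expect the row count to be consistent with the 10,000 source rows, a single field count, and no padded fields.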