
...



How to use Pentaho to ingest a Mainframe file into HDFS, then use MapReduce to process it into delimited records.

The steps in this guide include:

  1. Installing a plugin from the Marketplace
  2. Ingesting a Mainframe file into HDFS
  3. Developing a PDI Transformation as a Mapper Task
  4. Developing a PDI Job to Orchestrate all of the components
  5. Executing and Reviewing Output
Note

This how-to guide was built using Pentaho Data Integration 5.0.4


Prerequisites

To follow along with this how-to guide you will need the following:

...

1. To install the plugin, launch Spoon and go to the menu Help -> Marketplace

2. Search for z/os and you should see the LegStar z/OS File reader Plugin

...

3. Click the Install this Plugin button

4. You need to add tools.jar to your PDI classpath. A simple way to do this is to copy tools.jar from JDK_HOME/lib to data-integration/lib

5. Restart Spoon

Sample Files

...

Start Hadoop if not already running

Unzip the Data.zip to a directory

Create a Transformation to Convert z/OS File

Info

This sample works by converting the existing Mainframe file that is in z/OS format into a text format for processing in Hadoop.

1. Start PDI on your desktop. Choose File -> New -> Transformation from the menu.

2. Add a z/OS File Input step (if you don't see this step, check the Prerequisites section for how to install it through the Marketplace). Your transformation should look like this:

...


3. Edit the z/OS File Input step. For z/OS filename, select your Mainframe file. If you are using the sample data, this is under mf-files/ZOS.FCUSTDAT.bin. This file contains messages of variable length, so check that box as well. The file uses the IBM01140 character set; you will need to know which codepage your EBCDIC file uses: http://en.wikipedia.org/wiki/EBCDIC
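For character data, the conversion this step performs boils down to decoding EBCDIC bytes into your local encoding using the configured codepage. As a rough illustration only (this is not the plugin's code, and it assumes a full JDK/JRE that includes the extended IBM codepages), plain text can be decoded like this:

Code

import java.nio.charset.Charset;

public class EbcdicDecodeDemo {
    public static void main(String[] args) {
        // "HELLO" encoded in EBCDIC codepage IBM01140
        byte[] ebcdic = {(byte) 0xC8, (byte) 0xC5, (byte) 0xD3, (byte) 0xD3, (byte) 0xD6};

        // Decode with the same codepage the z/OS File Input step is configured to use
        String text = new String(ebcdic, Charset.forName("IBM01140"));
        System.out.println(text); // prints HELLO
    }
}

Packed-decimal (COMP-3) and binary fields are not plain character data, which is why the step needs the copybook rather than a simple charset decode.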



4. Configure the COBOL copybook. Go to the COBOL tab to place your copybook. The copybook is used to "translate" the Mainframe file into fields. If you are using the sample files, you can select Import COBOL and browse to copybooks/CUSTDATCC. Note that in the file browser you must change the drop-down to All files so this will show up in the list. You should now see your copybook has been loaded.



5. Configure the Fields. Click the Get Fields button so that PDI can parse the copybook definition and determine what fields will be present.

Note

If you see an error that tools.jar could not be found, it means tools.jar is not on the PDI classpath. A simple fix is to copy JDK_HOME/lib/tools.jar to data-integration/lib.

The fields should be properly read in like this:

6. Preview Data. Hit the Preview button to make sure that all of the settings are correct.

...

You should also see your file in the Hadoop file browser:




Note

If you see a space at the very front of the file, double-check the settings in Step 10 and make sure the Format is set to #.# for every Numeric value.
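The Format column takes a Java DecimalFormat-style mask, and #.# writes only the significant digits, which is what avoids the stray leading character described above. A quick standalone illustration of the mask (not PDI code):

Code

import java.text.DecimalFormat;

public class NumberMaskDemo {
    public static void main(String[] args) {
        // The #.# mask prints only significant digits: no padding and no trailing zeros
        DecimalFormat compact = new DecimalFormat("#.#");
        System.out.println("[" + compact.format(74)    + "]"); // [74]
        System.out.println("[" + compact.format(12.5)  + "]"); // [12.5]
        System.out.println("[" + compact.format(149.0) + "]"); // [149]
    }
}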

Create a Mapper Transformation to Process Converted File in Hadoop


Info

In this task you will create a Mapper Transformation that will parse the Mainframe file into a structured format, and pivot each transaction into its own row.

1. Choose File -> New -> Transformation from the menu.

...

5. Configure Split Fields. Double click the Split Fields step to configure the new fields that will be created. These are the same fields that were defined in the zos-to-hdfs transformation in part 1. Make sure that the field to split is "value" and that the delimiter matches what you configured in the zos-to-hdfs transformation.
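For reference, Split Fields is doing nothing more than a delimiter split of the single incoming value field, assigning each token to a named, typed field. A minimal sketch of the same operation in plain Java, assuming a pipe delimiter and made-up values (use whatever delimiter you actually configured):

Code

public class SplitFieldsDemo {
    public static void main(String[] args) {
        // One line of the converted file: the customer fields joined by the delimiter
        String value = "74|JOHN SMITH|120 MAIN ST|2|25.5|100.0";

        // The -1 limit keeps trailing empty tokens, which matters for records
        // that carry fewer than the maximum number of transactions
        String[] fields = value.split("\\|", -1);

        System.out.println("CustomerId   = " + fields[0]);
        System.out.println("CustomerName = " + fields[1]);
    }
}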

Tip

...

You can copy a table from one transformation to another, but the columns must match up. If you look at the Hadoop File Output step from zos-to-hdfs, you will see that its columns are Name, Type, Format, Length, etc., while the columns for the Split Fields step are New Field, ID, Remove ID, Type, Length, etc. You can still copy from Hadoop File Output to Split Fields by using Excel to add the two missing columns (ID and Remove ID): start a new worksheet in Excel, select all rows in Hadoop File Output, copy with CTRL-C, and paste into Excel with CTRL-V. Select Column B (Type) and hit Insert twice to create the two new columns. Then select all of your data in Excel, copy with CTRL-C, and paste into the Split Fields step with CTRL-V.



6. Normalize the Transactions. You may have noticed that there are 5 transactions in each row. The way this COBOL copybook is defined, a record can contain anywhere from 0 to 5 transactions. In a modern database, you would usually put each transaction in its own row. We can use the Row Normaliser step to pivot each individual transaction into its own row. Add a Row Normaliser step and create a hop from Split Fields to Row Normaliser.
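Conceptually, the Row Normaliser repeats the customer columns once per occupied transaction group and emits one output row per transaction. The sketch below shows that pivot with made-up field names and values; the real step is configured entirely through its dialog:

Code

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowNormaliserDemo {
    public static void main(String[] args) {
        // One denormalized record: a customer plus up to 5 transaction amounts,
        // where empty strings stand for unused transaction slots
        String customerId = "74";
        String[] txAmounts = {"25.5", "100.0", "", "", ""};

        // Pivot each occupied transaction slot into its own output row
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < txAmounts.length; i++) {
            if (!txAmounts[i].isEmpty()) {
                rows.add(new String[] {customerId, String.valueOf(i + 1), txAmounts[i]});
            }
        }

        for (String[] row : rows) {
            System.out.println(Arrays.toString(row)); // [74, 1, 25.5] then [74, 2, 100.0]
        }
    }
}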

...

13. Create Value. Double click to edit the Concat Fields step. For the step name, change to "Create value", make the Target Field Name "newvalue", and make the Separator a pipe "|". Then click the Get Fields button and you will see that all fields in your stream are added. Remove key and newkey, and you should be left with 9 fields total.
To remove extra spaces that come from the original Mainframe file, make the following adjustments (a small sketch of the equivalent concatenation follows this list):

  • CustomerId: Set Format to #.#
  • CustomerName: Set Trim Type to "both"
  • CustomerAddress: Set Trim Type to "both"
  • TransactionNbr: Set Format to #.#
  • Tx Amount: Set Format to #.#

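In plain Java terms, Concat Fields amounts to applying each field's trim and format options and then joining the remaining fields with the separator. A rough equivalent with made-up values (only a few of the nine fields shown):

Code

public class ConcatFieldsDemo {
    public static void main(String[] args) {
        // Field values after the Trim Type and #.# Format options have been applied
        String[] fields = {"74", "JOHN SMITH", "120 MAIN ST", "1", "25.5"};

        // Concat Fields joins the remaining stream fields with the configured separator
        String newvalue = String.join("|", fields);
        System.out.println(newvalue); // 74|JOHN SMITH|120 MAIN ST|1|25.5
    }
}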

14. Add MapReduce Output. Now that we have our new key and new value, we're ready for these to be returned by our mapper. Add the MapReduce Output step, and create a hop from Create value to MapReduce Output.
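If you were writing the same mapper by hand against the Hadoop API instead of as a PDI transformation, its skeleton would look roughly like the sketch below. This is illustrative only; Pentaho MapReduce runs the transformation itself as the mapper, and the key/value building logic is elided here:

Code

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hand-written equivalent of the mapper transformation: read one converted line,
// build the new key and the pipe-delimited new value, and emit the pair.
public class CustomerTxMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // ... split the line, normalise the transactions, trim and format the fields ...
        String newkey = "illustrative-key";
        String newvalue = "illustrative|pipe|delimited|value";

        // MapReduce Output corresponds to emitting the key/value pair here
        context.write(new Text(newkey), new Text(newvalue));
    }
}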

...

Create a Job to Orchestrate the z/OS to HDFS Conversion and Pentaho MapReduce

Info

In this task you will create a job that will convert the Mainframe file to CSV within HDFS, then run a "map only" MapReduce process to create an output file.
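"Map only" means the job runs with zero reducers, so whatever the mapper emits is written straight to HDFS as the job output. In the Pentaho MapReduce job entry you simply leave the reducer unset; for comparison, the same idea in the raw Hadoop API is a single setting on the Job object (a sketch, reusing the hypothetical mapper class from the previous section):

Code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJobSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "customer-tx map only");
        job.setJarByClass(MapOnlyJobSketch.class);
        job.setMapperClass(CustomerTxMapper.class);   // mapper sketched in the previous section
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Zero reducers is what "map only" means: mapper output is the job output
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the converted file in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // job output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}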

1. Within PDI, choose File -> New -> Job from the menu system.

...

Note

If you see an error that tools.jar could not be found, it means tools.jar is not on the PDI classpath. A simple fix is to copy JDK_HOME/lib/tools.jar to data-integration/lib.


Note

If you see an error that PDI is unable to write to HDFS correctly, it could be that you have not yet configured the Big Data Shim. Double-check the instructions for configuring the shim for your distribution here: Using Pentaho MapReduce to Parse Weblog Data

...