

  1. Start PDI on your desktop. Once it is running, choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
  2. Add a Hadoop File Input Step: You are going to read data from a file in the MapR file system, so expand the 'Big Data' section of the Design palette and drag a 'Hadoop File Input' node onto the transformation canvas. Your transformation should look like:
    Image Modified
  3. Edit the Hadoop File Input Step: Double-click on the 'Hadoop File Input' node to edit its properties. Enter this information:
    1. File or directory: Enter 'maprfs://<CLDB>:<PORT>/weblogs/aggregate_mr'
      For local single node clusters use 'maprfs:///weblogs/aggregate_mr'
      <CLDB> and <PORT> are your connection information to the CLDB
    2. Regular Expression: Enter 'part.*'
    3. Click the 'Add' button.
      When you are done your window should look like this:
      Image Modified
  4. Define the File Content: Switch to the 'Content' tab and enter the following:
    1. Clear the Separator field and click the 'Insert TAB' button.
    2. Uncheck 'Header'
    3. Format: Select 'Unix'

      When you are done your screen should look like:

      Image Modified
  5. Define the Fields: Switch to the 'Fields' tab and enter the following:
















    When you are done your window should look like:

    Click 'OK' to close the window.
  6. Add a Sort Rows Step: You need to sort the rows read from the file, so expand the 'Transform' section of the Design palette and drag a 'Sort rows' node onto the transformation canvas. Your transformation should look like:

    Image Modified
  7. Connect the Input and Sort Steps: Hover the mouse over the 'Hadoop File Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Sort rows' node. Your canvas should look like this:

    Image Modified
  8. Edit the Sort Step: Double-click on the 'Sort rows' node to edit its properties. Enter this information:
    1. Check 'Only pass unique rows? (verifies keys only)'
    2. Fields: Add 'client_ip' sorted in ascending order.
      When you are done your window should look like this:

      Image Modified
      Click 'OK' to close the window.
  9. Add a Dummy Step: You need a component for the report to select its data from, so expand the 'Flow' section of the Design palette and drag a 'Dummy (do nothing)' node onto the transformation canvas. Your transformation should look like:

    Image Modified
  10. Connect the Sort and Dummy steps: Hover the mouse over the 'Sort rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Dummy (do nothing)' node. Your canvas should look like this:

    Image Modified
  11. Edit the Dummy Step: Double-click on the 'Dummy (do nothing)' node to edit its properties. Set the Step name to 'Output'. When you are done your window should look like:

    Image Modified
    Click 'OK' to close the window.

  12. Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'cldb_ip_list.ktr' into a folder of your choice.
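
The transformation built above can be approximated outside PDI as a rough sketch. The sample rows, the column order (client_ip first), and the helper names `matching_files` and `sort_unique_by_ip` are assumptions made for illustration; they are not taken from the transformation itself, and the full-match behaviour of the file regular expression is assumed as well.

```python
import re

# Hypothetical tab-separated sample rows. The real layout is defined on the
# 'Fields' tab; client_ip is assumed here to be the first column.
SAMPLE_LINES = [
    "10.0.0.2\t2011\t1\t5",
    "10.0.0.1\t2011\t2\t7",
    "10.0.0.2\t2011\t2\t3",
]


def matching_files(names, pattern="part.*"):
    """Keep the file names that match the regular expression as a whole."""
    return [name for name in names if re.fullmatch(pattern, name)]


def sort_unique_by_ip(lines):
    """Sort rows by client_ip ascending and pass one row per distinct key."""
    rows = sorted((line.split("\t") for line in lines), key=lambda r: r[0])
    unique, last_key = [], None
    for row in rows:
        if row[0] != last_key:  # only the key is compared, not the whole row
            unique.append(row)
            last_key = row[0]
    return unique


if __name__ == "__main__":
    print(matching_files(["part-00000", "_SUCCESS", "part-00001"]))
    for row in sort_unique_by_ip(SAMPLE_LINES):
        print(row[0])
```

The final loop prints one client_ip per distinct address, which is the list the report later offers in its drop down.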


  1. Start PDI on your desktop. Once it is running, choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
  2. Add a Parameter: Right click on the transformation and select 'Transformation settings'.

    Image Modified
    Then do the following:
    1. Switch to the 'Parameters' tab.
    2. Parameter: 'paramIPAddress'
    3. Default Value: ''

      When you are done your window should look like:

      Image Modified
      Click 'OK' to close the window.

  3. Add a Hadoop File Input Step: You are going to read data from a file in the MapR file system, so expand the 'Big Data' section of the Design palette and drag a 'Hadoop File Input' node onto the transformation canvas. Your transformation should look like:

    Image Modified
  4. Edit the Hadoop File Input Step: Double-click on the 'Hadoop File Input' node to edit its properties. Enter this information:
    1. File or directory: Enter 'maprfs://<CLDB>:<PORT>/weblogs/aggregate_mr'
      For local single node clusters use 'maprfs:///weblogs/aggregate_mr'
      <CLDB> and <PORT> are your connection information to the CLDB
    2. Regular Expression: Enter 'part.*'
    3. Click the 'Add' button.
      When you are done your window should look like this:

      Image Modified
  5. Define the File Content: Switch to the 'Content' tab and enter the following:
    1. Clear the Separator field and click the 'Insert TAB' button.
    2. Uncheck 'Header'
    3. Format: Select 'Unix'

      When you are done your screen should look like:

      Image Modified
  6. Define the Fields: Switch to the 'Fields' tab and enter the following:
















    When you are done your window should look like:

    Click 'OK' to close the window.
  7. Add a Get Variables Step: You need to add the parameter you created earlier to your stream, so expand the 'Job' section of the Design palette and drag a 'Get Variables' node onto the transformation canvas. Your transformation should look like:

    Image Modified
  8. Connect the Input and Get Variables steps: Hover the mouse over the 'Hadoop File Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Get Variables' node. Your canvas should look like this:

    Image Modified
  9. Edit the Get Variables Step: Double-click on the 'Get Variables' node to edit its properties. Enter this information:
    1. Name: Enter 'selectedIP'
    2. Variable: Enter '${paramIPAddress}'
    3. Type: Select 'String'.
      When you are done your window should look like this:

      Image Modified
      Click 'OK' to close the window.
  10. Add a Filter rows Step: You want to filter for only the rows that match the selected IP Address, so expand the 'Flow' section of the Design palette and drag a 'Filter rows' node onto the transformation canvas. Your transformation should look like:

    Image Modified
  11. Connect the Get Variables and Filter Rows steps: Hover the mouse over the 'Get Variables' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Filter rows' node. Your canvas should look like this:

    Image Modified
  12. Edit the Filter Rows Step: Double-click on the 'Filter rows' node to edit its properties. Do the following:
    1. Click the <field> box to the left of the = and select 'client_ip'
    2. Click the <field> box to the right of the = and select 'selectedIP'
      When you are done your window should look like this:

      Image Modified
      Click 'OK' to close the window.

  13. Add a Sort Rows Step: You want the rows in sorted order by year, so expand the 'Transform' section of the Design palette and drag a 'Sort rows' node onto the transformation canvas. Your transformation should look like:

    Image Modified
  14. Connect the Filter Rows and Sort Rows steps: Hover the mouse over the 'Filter rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Sort rows' node. When you release the mouse and a tooltip appears, select the 'Result is TRUE' option. Your canvas should look like this:

    Image Modified
  15. Edit the Sort Rows Step: Double-click on the 'Sort rows' node to edit its properties. Do the following:
    1. Fieldname: Enter 'year'
    2. Ascending: Select 'Y'
      When you are done your window should look like this:

      Image Modified
      Click 'OK' to close the window.

  16. Add a Row Denormaliser Step: You want to roll up the records for each year into a single row with a field for every month, so expand the 'Transform' section of the Design palette and drag a 'Row Denormaliser' node onto the transformation canvas. Your transformation should look like:

    Image Modified
  17. Connect the Sort Rows and Denormaliser steps: Hover the mouse over the 'Sort rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Row Denormaliser' node. Your canvas should look like this:

    Image Modified
  18. Edit the Denormaliser Step: Double-click on the 'Row denormaliser' node to edit its properties. Do the following:
    1. The key field: Select 'month_num'
    2. The fields that make up the grouping: Add 'client_ip' and 'year'
    3. Target fields: Enter the following

      Target fieldname

      Value fieldname

      Key Value


















































      When you are done your window should look like this:

      Image Modified
      Click 'OK' to close the window.

  19. Add two Dummy Steps: Expand the 'Flow' section of the Design palette and drag a 'Dummy (do nothing)' node onto the transformation canvas. Repeat to add a second 'Dummy (do nothing)' step. Your transformation should look like:

    Image Modified
  20. Connect the Denormaliser and Dummy steps: Hover the mouse over the 'Row denormaliser' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the first 'Dummy (do nothing)' node. Your canvas should look like this:

    Image Modified
  21. Edit the Dummy Step: Double-click on the 'Dummy (do nothing)' node you just connected to edit its properties. Change the Step name to 'Output'.

    Image Modified
  22. Connect the Filter Rows and Dummy steps: Hover the mouse over the 'Filter rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the second 'Dummy (do nothing)' node. When you release the mouse and a tooltip appears, select the 'Result is FALSE' option. Your canvas should look like this:

    Image Modified
  23. Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'cldb_to_report.ktr' into a folder of your choice.
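
As a rough illustration of what this second transformation computes, the filter, sort, and denormalise steps can be sketched in plain Python. The value field name `pageviews`, the mapping of key values 1 to 12 onto month names, the sample rows, and the function name `page_views` are all assumptions for the example; the actual target fields are whatever you entered in the Row Denormaliser step.

```python
# Month names assumed as the denormalised target fields, keyed by month_num.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)

# Hypothetical input rows; 'pageviews' is an assumed value field name.
ROWS = [
    {"client_ip": "10.0.0.1", "year": 2011, "month_num": 2, "pageviews": 7},
    {"client_ip": "10.0.0.1", "year": 2010, "month_num": 1, "pageviews": 4},
    {"client_ip": "10.0.0.2", "year": 2011, "month_num": 1, "pageviews": 5},
    {"client_ip": "10.0.0.1", "year": 2011, "month_num": 1, "pageviews": 9},
]


def page_views(rows, param_ip_address):
    """Filter by IP, sort by year, then pivot month_num into month columns."""
    # Filter rows: keep only rows where client_ip equals the parameter value.
    selected = [r for r in rows if r["client_ip"] == param_ip_address]
    # Sort rows: ascending by year, so each group arrives contiguously.
    selected.sort(key=lambda r: r["year"])
    # Row denormaliser: one output row per (client_ip, year) group.
    grouped = {}
    for r in selected:
        key = (r["client_ip"], r["year"])
        if key not in grouped:
            grouped[key] = {"client_ip": r["client_ip"], "year": r["year"]}
            grouped[key].update({month: None for month in MONTHS})
        grouped[key][MONTHS[r["month_num"] - 1]] = r["pageviews"]
    return list(grouped.values())


if __name__ == "__main__":
    for row in page_views(ROWS, "10.0.0.1"):
        print(row["year"], row["January"], row["February"])
```

Months with no input row stay empty (`None`) in this sketch. Rows for other IP addresses are simply dropped, which corresponds to the 'Result is FALSE' branch ending in the unused Dummy step.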


  1. Start Report Designer on your desktop. Once it is running, choose 'File' -> 'Report Wizard' from the menu system.
  2. Select a Template: Report Wizard will automatically lay out and do some basic formatting of your report, so select a template of your choice in the select box. When you are done your screen should look like:

    Image Modified
    Click 'Next' to go to the next screen.

  3. Create a Data Source: You need to create the queries that select data for this report, so click the plus button and do the following:
    1. Choose Type 'Pentaho Data Integration'

      Image Modified
    2. Create a New Query: Click the button to create a new query and do the following:
      1. Name: Enter 'Page Views'
      2. File: Select the cldb_to_report.ktr transformation you just created.
      3. Steps: Select 'Output'

        When you are done your window should look like:

        Image Modified
    3. Create a Second Query: Click the button to create a new query and do the following:
      1. Name: Enter 'IP List'
      2. File: Select the cldb_ip_list.ktr transformation you just created.
      3. Steps: Select 'Output'

        When you are done your window should look like:

        Image Modified
        Click 'OK' to close the window.
    4. Highlight the 'Page Views' query in the Report Design Wizard window.
      Your Report Design Wizard window should now look like:

      Image Modified
      Click 'Next' to go to the next screen.
  4. Set up Report Layout: You need to tell Report Design Wizard how to lay out your report, so do the following:
    1. Selected Items: Add the following in this order
      1. Year
      2. January
      3. February
      4. March
      5. April
      6. May
      7. June
      8. July
      9. August
      10. September
      11. October
      12. November
      13. December

        When you are done your window should look like:

        Image Modified
        Click 'Finish' to complete the design wizard.

  5. Create a Parameter: You need to create a parameter so you will be able to select a specific IP address when running the report, so in the menu system select 'Data' -> 'Add Parameter' and do the following:
    1. Select the 'Pentaho Data Integration' data source.
    2. Name: Enter 'paramIPAddress'.
    3. Label: Enter 'IP Address'
    4. Value Type: Select 'String'
    5. Check 'Mandatory'
    6. Display Type: 'Drop Down'
    7. Query: Select 'IP List'
    8. Value: Select 'client_ip'
    9. Display Name: Select 'client_ip'

      When you are done your window should look like:

      Image Modified
      Click 'OK' to close the window.

  6. Add Parameter to Query: You need to add the parameter you just created to your Page Views query, so in the 'Data' pane expand 'Pentaho Data Integration', right click on 'Page Views' and select 'Edit Query'.

    Image Modified
    Then do the following:
    1. Click on the 'Edit Parameter' button.
    2. Under 'Transformation Parameter' click the .
    3. DataRow Column: Select 'paramIPAddress'
    4. Transformation Parameter: Select 'paramIPAddress'

      When you are done your window should look like:

      Image Modified
      Click 'OK' to close the window.

  7. Save the Report: Choose 'File' -> 'Save as...' from the menu system. Save the report as 'cldb_report.prpt' into a folder of your choice.
  8. Preview the Report: Choose 'File' -> 'Preview' -> 'Print Preview' from the menu system. A 'Print Preview' window will open. Select an 'IP Address' of your choice from the drop down. After a few seconds the report results will appear:

    Image Modified
  9. Do any desired formatting to the report.
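
The way the report parameter feeds the 'Page Views' query can be pictured with a small sketch. This is only a conceptual model, not Report Designer's implementation: the function name `transformation_parameters` is hypothetical, and the check against the 'IP List' values is an assumption about how a mandatory drop-down parameter would typically be validated.

```python
def transformation_parameters(selected_ip, ip_list):
    """Map the report parameter onto the transformation parameter of the same name."""
    # 'Mandatory' is checked on the parameter, so an empty value is rejected.
    if not selected_ip:
        raise ValueError("paramIPAddress is mandatory")
    # The drop down is populated from the 'IP List' query (client_ip values).
    if selected_ip not in ip_list:
        raise ValueError(f"{selected_ip!r} is not in the IP List query result")
    # DataRow column 'paramIPAddress' -> transformation parameter 'paramIPAddress'.
    return {"paramIPAddress": selected_ip}


if __name__ == "__main__":
    print(transformation_parameters("10.0.0.1", ["10.0.0.1", "10.0.0.2"]))
```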
