Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migration of unmigrated content due to installation of a new plugin

...

Wiki Markup
{scrollbar}

Excerpt

How to create a report that sources data from a MapR CLDB file.

You will use weblog data to build a report that displays a count of pageviews by month for an IP Address.

In order to report on CLDB file data in a parameterized fashion, you will need to perform the following steps:

...

The sample data file needed for this guide is:

File Name

Content

weblogs_aggregate.txt.zip

Tab-delimited, aggregated weblog data


NOTE: If you have already completed the "Using Pentaho MapReduce to Generate an Aggregate Dataset " in MapR guide, then the necessary files will already be in the proper location.
This file should be placed into the CLDB at /weblogs/aggregate_mr using the following commands.

Code Block

hadoop fs \-mkdir /weblogs

...


hadoop fs \-mkdir /weblogs/aggregate_mr

...


hadoop fs \-put weblogs_aggregate.txt /weblogs/aggregate_mr/part-00000

Step-By-Step Instructions

...

In this task you will create a PDI transformation to get a list of distinct IP Addresses from CLDB data. This transformation will later be used as the data source for a report parameter picklist.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
    Tip
    titleSpeed Tip

    You can download the Kettle Transform cldb_ip_list.ktr already completed

  2. Add a Hadoop File Input Step: You are going to read data from a CLDB file, so expand the 'HadoopBig Data' section of the Design palette and drag a 'Hadoop File Input' node onto the transformation canvas. Your transformation should look like:

...

  1. Image Added

  2. Edit the Hadoop File Input Step: Double-click on the 'Hadoop File Input' node to edit its properties. Enter this information:
    1. File or directory: Enter 'maprfs://<CLDB>:<PORT>/weblogs/aggregate_mr'
      For local single node clusters use 'maprfs:///weblogs/aggregate_mr'
      <CLDB> and <PORT> are your connection information to the CLDB
    2. Regular Expression: Enter 'part.*'
    3. Click the 'Add' button.
      When you are done your window should look like this:
      Image Modified

  3. Define the File Content: Switch to the 'Content' tab and enter the following:
    1. Clear the Separator field and click the 'Insert TAB' button.
    2. Uncheck 'Header'
    3. Format: Select 'Unix'
      When you are done your screen should look like:

...


    1. Image Added

  1. Define the Fields: Switch to the 'Fields' tab and enter the following:

    Name

    Type

    Length

    client_ip

    String

    15

    year

    Integer

     

    month_num

    Integer

     

    pageviews

    Integer

     


    When you are done your window should look like:
    Image Removed
    Image Added
    Click 'OK' to close the window.

  2. Add a Sort Rows Step: You need to sort the CLDB file, so expand the 'Transform' section of the Design palette and drag a 'Sort rows' node onto the transformation canvas. Your transformation should look like:

...

  1. Image Added

  2. Connect the Input and Sort Steps: Hover the mouse over the 'Hadoop File Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Sort rows' node. Your canvas should look like this:

...

  1. Image Added

  2. Edit the Sort Step: Double-click on the 'Sort rows' node to edit its properties. Enter this information:
    1. Check 'Only pass unique rows? (verifies keys only)'
    2. Fields: Add 'client_ip' sorted in ascending order.
      When you are done your window should look like this:

...

    1. Image Added
      Click 'OK' to close the window.
  1. Add a Dummy Step: You need a component for the report to select it's data from, so expand the 'Flow' section of the Design palette and drag a 'Dummy (do nothing)' node onto the transformation canvas. Your transformation should look like:
    Image Modified

  2. Connect the Sort and Dummy steps: Hover the mouse over the 'Sort rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Dummy (do nothing)' node. Your canvas should look like this:

...

  1. Image Added

  2. Edit the Dummy Step: Double-click on the 'Dummy (do nothing)' node to edit its properties. Set the Step name to 'Output'. When you are done your window should look like:

...

  1. Image Added
    Click 'OK' to close the window.

  2. Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'cldb_ip_list.ktr' into a folder of your choice.

...

In this task you will create a PDI transformation to get the data for your report from CLDB. This transformation will later be used as the primary datasource for the report.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.'Transformation' option.
    Tip
    titleSpeed Tip

    You can download the Kettle Transform cldb_to_report.ktr already completed

  2. Add a Parameter: Right click on the transformation and select 'Transformation settings'.

...


  1. Image Added
    Then do the following:
    1. Switch to the 'Parameters' tab.
    2. Parameter: 'paramIPAddress'
    3. Default Value: '127.0.0.1'

      When you are done your window should look like:

...


    1. Image Added
      Click 'OK' to close the window.

  1. Add a Hadoop File Input Step: You are going to read data from a CLDB file, so expand the 'HadoopBig Data' section of the Design palette and drag a 'Hadoop File Input' node onto the transformation canvas. Your transformation should look like:

...


  1. Image Added
  2. Edit the Hadoop File Input Step: Double-click on the 'Hadoop File Input' node to edit its properties. Enter this information:
    1. File or directory: Enter 'maprfs://<CLDB>:<PORT>/weblogs/aggregate_mr'
      For local single node clusters use 'maprfs:///weblogs/aggregate_mr'
      <CLDB> and <PORT> are your connection information to the CLDB
    2. Regular Expression: Enter 'part.*'
    3. Click the 'Add' button.
      When you are done your window should look like this:

      Image Modified
  3. Define the File Content: Switch to the 'Content' tab and enter the following:
    1. Clear the Separator field and click the 'Insert TAB' button.
    2. Uncheck 'Header'
    3. Format: Select 'Unix'

      When you are done your screen should look like:

      Image Modified
  4. Define the Fields: Switch to the 'Fields' tab and enter the following:

    Name

    Type

    Length

    client_ip

    String

    15

    year

    Integer

     

    month_num

    Integer

     

    pageviews

    Integer

     


    When you are done your window should look like:
    Image Removed
    Image Added
    Click 'OK' to close the window.
  5. Add a Get Variables Step: You need to add the parameter you created earlier to your stream, so expand the 'Job' section of the Design palette and drag a 'Get Variables' node onto the transformation canvas. Your transformation should look like:

    Image Modified
  6. Connect the Input and Get Variables steps: Hover the mouse over the 'Hadoop File Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Get Variables' node. Your canvas should look like this:

    Image Modified
  7. Edit the Get Variables Step: Double-click on the 'Get Variables' node to edit its properties. Enter this information:
    1. Name: Enter 'selectedIP'
    2. Variable: Enter '${paramIPAddress}'
    3. Type: Select 'String'.
      When you are done your window should look like this:

      Image Modified
      Click 'OK' to close the window.
  8. Add a Filter rows Step: You want to filter for only the rows that match the selected IP Address, so expand the 'Flow' section of the Design palette and drag a 'Filter rows' node onto the transformation canvas. Your transformation should look like:

    Image Modified
  9. Connect the Get Variables and Filter Rows steps: Hover the mouse over the 'Get Variables' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Filter rows' node. Your canvas should look like this:

    Image Modified
  10. Edit the Filter Rows Step: Double-click on the 'Filter Rows' node to edit its properties. Do the following:
    1. Click the <field> box to the left of the = and select 'client_ip'
    2. Click the <field> box to the right of the = and select 'selectedIP'
      When you are done your window should look like this:

      Image Modified
      Click 'OK' to close the window.

  11. Add a Sort Rows Step: You want the rows in sorted order by year, so expand the 'Transform' section of the Design palette and drag a 'Sort rows' node onto the transformation canvas. Your transformation should look like:

    Image Modified
  12. Connect the Filter Rows and Sort Rows steps: Hover the mouse over the 'Filter rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Sort rows' node. When you release the mouse and a tooltip appears select the 'Result is TRUE' option. Your canvas should look like this:

    Image Modified
  13. Edit the Sort Rows Step: Double-click on the 'Sort rows' node to edit its properties. Do the following:
    1. Fieldname: Enter 'year'
    2. Ascending: Select 'Y'
      When you are done your window should look like this:

      Image Modified
      Click 'OK' to close the window.

  14. Add a Row Denormaliser Step: You want to roll up the records for each year into a single row with a field for every month, so expand the 'Transform' section of the Design palette and drag a 'Row Denormaliser' node onto the transformation canvas. Your transformation should look like:

    Image Modified
  15. Connect the Sort Rows and Denormaliser steps: Hover the mouse over the 'Sort rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Row Denormaliser' node. Your canvas should look like this:

    Image Modified
  16. Edit the Denormaliser Step: Double-click on the 'Row denormaliser' node to edit its properties. Do the following:
    1. The key field: Select 'month_num'
    2. The fields that make up the grouping: Add 'client_ip' and 'year'
    3. Target fields: Enter the following

      Target fieldname

      Value fieldname

      Key Value

      Type

      January

      Pageviews

      1

      Integer

      February

      Pageviews

      2

      Integer

      March

      Pageviews

      3

      Integer

      April

      Pageviews

      4

      Integer

      May

      Pageviews

      5

      Integer

      June

      Pageviews

      6

      Integer

      July

      Pageviews

      7

      Integer

      August

      Pageviews

      8

      Integer

      September

      Pageviews

      9

      Integer

      October

      Pageviews

      10

      Integer

      November

      Pageviews

      11

      Integer

      December

      Pageviews

      12

      Integer


      When you are done your window should look like this:

      Image Modified
      Click 'OK' to close the window.

  17. Add two Dummy Steps: Expand the 'Flow' section of the Design palette and drag a 'Dummy (do nothing)' node onto the transformation canvas. Repeat to add a second 'Dummy (do nothing)' step. Your transformation should look like:

    Image Modified
  18. Connect the Denormaliser and Dummy steps: Hover the mouse over the 'Row denormaliser' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the first 'Dummy (do nothing)' node. Your canvas should look like this:

    Image Modified
  19. Edit the Dummy Step: Double-click on the 'Dummy (do nothing)' node you just connected to edit its properties. Change the Step name to 'Output'.

    Image Modified
  20. Connect the Filter Rows and Dummy steps: Hover the mouse over the 'Filter rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the second 'Dummy (do nothing)' node. When you release the mouse and a tooltip appears select the 'Result is FALSE' option. Your canvas should look like this:

    Image Modified
  21. Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'cldb_to_report.ktr' into a folder of your choice.

...

In this task you will create a report that uses PDI transformations to source data from CLDB files.

  1. Start Report Designer on your desktop. Once it is running choose 'File' -> 'Report Wizard' -> from the menu system.
    Tip
    titleSpeed Tip

    You can download the Report definition CLDB to Report.prpt already completed

  1. Select a Template: Report Wizard will automatically layout and do some basic formatting of your report, so select a template of your choice in the select box. When you are done your screen should look like:

    Image Modified
    Click 'Next' to go to the next screen.

  2. Create a Data Source: You need to create the Hive query to select data for this report, so click the plus button and do the following:
    1. Choose Type 'Pentaho Data Integration'

      Image Modified
    2. Create a New Query: Click the button to create a new query and do the following:
      1. Name: Enter 'Page Views'
      2. File: Select the cldb_to_report.ktr transformation you just created.
      3. Steps: Select 'Output'

        When you are done your window should look like:

        Image Modified
    3. Create a Second Query: Click the button to create a new query and do the following:
      1. Name: Enter 'IP List'
      2. File: Select the cldb_ip_list.ktr transformation you just created.
      3. Steps: Select 'Output'

        When you are done your window should look like:

        Image Modified
        Click 'OK' to close the window.
    4. Highlight the 'Page Views' query in the Report Design Wizard window.
      Your Report Design Wizard window should now look like:

      Image Modified
      Click 'Next' to go to the next screen.
  3. Set up Report Layout: You need to tell Report Design Wizard how to lay out your report, so do the following:
    1. Selected Items: Add the following in this order
      1. Year
      2. January
      3. February
      4. March
      5. April
      6. May
      7. June
      8. July
      9. August
      10. September
      11. October
      12. November
      13. December

        When you are done your window should look like:

        Image Modified
        Click 'Finish' to complete to the design wizard.

  4. Create a Parameter: You need to create a parameter so you will be able to select a specific IP address when running the report, so in the menu system select 'Data' -> 'Add Parameter' and do the following:
    1. Select the 'Pentaho Data Integration' data source.
    2. Name: Enter 'paramIPAddress'.
    3. Label: Enter 'IP Address'
    4. Value Type: Select 'String'
    5. Check 'Mandatory'
    6. Display Type: 'Drop Down'
    7. Query: Select 'IP List'
    8. Value: Select 'client_ip'
    9. Display Name: Select 'client_ip'

      When you are done your window should look like:

      Image Modified
      Click 'OK' to close the window.

  5. Add Parameter to Query: You need to add the parameter you just created to your Page Views query, so in the 'Data' pane expand 'Pentaho Data Integration', right click on 'Page Views' and select 'Edit Query'.

    Image Modified
    Then do the following:
    1. Click on the 'Edit Parameter' button.
    2. Under 'Transformation Parameter' click the .
    3. DataRow Column: Select 'paramIPAddress'
    4. Transfomration Parameter: Select 'paramIPAddress'

      When you are done your window should look like:

      Image Modified
      Click 'OK' to close the window.

  6. Save the Report: Choose 'File' -> 'Save as...' from the menu system. Save the report as 'cldb_report.prpt' into a folder of your choice.
  7. Preview the Report: Choose 'File' -> 'Preview' -> 'Print Preview' from the menu system. A 'Print Preview' window will open. Select an 'IP Address' of your choice from the drop down. After a few seconds the report results will appear:

    Image Modified
  8. Do any desired formatting to the report.

Summary

...

In this guide you learned how to construct a report that sources data from a CLDB file.

Wiki Markup
{scrollbar}