Create Mapper and Reducer for Aggregate Dataset

Create a Pentaho Mapper Transformation

In this task you will create a Pentaho Mapper transformation, which will be used to run a Pentaho MapReduce job on the Hadoop cluster. The transformation consumes a parsed, tab-delimited weblog record and emits intermediate data: a key composed of the Client IP, Year, and Month, and a constant value of 1. The value denotes a single pageview for the key. The summing will be done by the Reducer, which we will develop next.
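To make the key/value contract concrete, here is a minimal sketch in plain Java of what the mapper emits and why a constant 1 is enough. It is not PDI code: the class name `PageviewFlowSketch`, its methods, and the sample IP addresses are hypothetical, and `sumByKey` is only a stand-in for the Reducer that is built later.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: mimics the mapper's key/value contract outside of PDI.
class PageviewFlowSketch {

    // Each record contributes exactly one pageview.
    static final long ONE_PAGEVIEW = 1L;

    // Build the intermediate key: client IP, year, and month joined by tabs.
    static String mapperKey(String clientIp, String year, String month) {
        return clientIp + "\t" + year + "\t" + month;
    }

    // Stand-in for the reducer: add up the 1s emitted for each key.
    static Map<String, Long> sumByKey(List<String> emittedKeys) {
        Map<String, Long> totals = new TreeMap<>();
        for (String key : emittedKeys) {
            totals.merge(key, ONE_PAGEVIEW, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical mapper output for three weblog records.
        List<String> emitted = List.of(
            mapperKey("10.0.0.1", "2012", "1"),
            mapperKey("10.0.0.1", "2012", "1"),
            mapperKey("10.0.0.2", "2012", "1"));

        // Print one line per key with its pageview total.
        sumByKey(emitted).forEach((key, total) ->
            System.out.println(key.replace('\t', ' ') + " -> " + total));
    }
}
```

Because every record maps to the value 1, the total for a key is simply the number of records that share that client IP, year, and month.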

  1. Start PDI on your desktop. Once it is running, choose 'File' -> 'New' -> 'Transformation' from the menu system, or click the 'New file' icon on the toolbar and choose the 'Transformation' option.

  2. Add a MapReduce Input Step: You are going to read data into the transformation from MapReduce, so expand the 'Big Data' section of the Design palette and drag a 'MapReduce Input' node onto the transformation canvas. Your transformation should look like:

  3. Edit the MapReduce Input Step: Double-click on the 'MapReduce Input' node to edit its properties. Enter this information:

    1. Key Field Type: Enter String

    2. Value Field Type: Enter String
      When you are done your 'MapReduce Input' window should look like this:

      Click 'OK' to close the window.

  4. Add a Split Fields Step: You need to split the incoming records on tab to get the individual fields in the record, so expand the 'Transform' section of the Design palette and drag a 'Split Fields' node onto the transformation canvas. Your transformation should look like:

  5. Connect the Input and Split Fields Steps: Hover the mouse over the 'MapReduce Input' node and a tooltip will appear.
    Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Split Fields' node. Your canvas should look like this:

  6. Edit the Split Fields Step: Double-click on the 'Split Fields' node to edit its properties. Enter this information:

    1. Field to split: Select 'value'

    2. Delimiter: Enter '$[09]'. 09 is the hexadecimal representation of the ASCII tab character.

    3. Fields: Enter the following field names, each with 'Type' set to 'String':

      1. client_ip

      2. full_request_date

      3. day

      4. month

      5. month_num

      6. year

      7. hour

      8. minute

      9. second

      10. timezone

      11. http_verb

      12. uri

      13. http_status_code

      14. bytes_returned

      15. referrer

      16. user_agent
        When you are done, your 'Split Fields' window should look like this:

        Click 'OK' to close the window.

  7. Add a User Defined Java Expression Step: You need to concatenate the client_ip, year, and month_num fields to create the key field, so expand the 'Scripting' section of the Design palette and drag a 'User Defined Java Expression' node onto the transformation canvas. Your transformation should look like:

  8. Connect the Split Fields and User Defined Java Expression Steps: Hover the mouse over the 'Split Fields' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'User Defined Java Expression' node. Your canvas should look like this:

  9. Edit the User Defined Java Expression Step: Double-click on the 'User Defined Java Expression' node to edit its properties. Do the following:

    1. Create a new field 'new_key' with Type 'String' and the following Java expression:

      client_ip + '\t' + year + '\t' + month_num

      Note that the characters between each pair of single quotes are tabs. You will have to copy and paste tab characters into the Java Expression.

    2. Create a new field 'new_value' with Type 'Integer' and the Java expression '1'.
      When you are done your window should look like:

      Click 'OK' to close the window.

  10. Add a MapReduce Output Step: You need to write the new key and new value to the output, so expand the 'Big Data' section of the Design palette and drag a 'MapReduce Output' node onto the transformation canvas. Your transformation should look like:

  11. Connect the Java Expression and Output Steps: Hover the mouse over the 'User Defined Java Expression' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'MapReduce Output' node. Your canvas should look like this:

  12. Edit the Output Step: Double-click on the 'MapReduce Output' node to edit its properties. Enter the following information:

    1. Key field: Select 'new_key'

    2. Value field: Select 'new_value'
      When you are done your window should look like:

      Click 'OK' to close the window.

  13. Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'aggregate_mapper.ktr' into a folder of your choice.
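
The row-level work done by the 'Split Fields' and 'User Defined Java Expression' steps can be sketched in standalone Java, which may help when sanity-checking a sample record outside of PDI. This is an assumption-laden illustration, not how PDI implements these steps: the class `MapperStepsSketch`, its methods, and the sample record are hypothetical, and rejecting records that do not have exactly 16 fields is a choice made for this sketch only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: mirrors the Split Fields and Java Expression steps.
class MapperStepsSketch {

    // Field names in the order entered in the Split Fields step.
    static final String[] FIELDS = {
        "client_ip", "full_request_date", "day", "month", "month_num", "year",
        "hour", "minute", "second", "timezone", "http_verb", "uri",
        "http_status_code", "bytes_returned", "referrer", "user_agent"
    };

    // Hexadecimal 09 is the ASCII tab character, the delimiter used in step 6.
    static final String DELIMITER = String.valueOf((char) 0x09);

    // Split the incoming 'value' on tabs and name each part.
    static Map<String, String> splitFields(String value) {
        String[] parts = value.split(DELIMITER, -1); // -1 keeps trailing empty fields
        if (parts.length != FIELDS.length) {
            // Sketch-only behavior: fail loudly on malformed records.
            throw new IllegalArgumentException(
                "expected " + FIELDS.length + " fields, got " + parts.length);
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < FIELDS.length; i++) {
            row.put(FIELDS[i], parts[i]);
        }
        return row;
    }

    // Same concatenation as the 'new_key' expression in step 9.
    static String newKey(Map<String, String> row) {
        String client_ip = row.get("client_ip");
        String year = row.get("year");
        String month_num = row.get("month_num");
        return client_ip + '\t' + year + '\t' + month_num;
    }

    // The 'new_value' field is the constant 1.
    static final long NEW_VALUE = 1L;

    public static void main(String[] args) {
        // Hypothetical parsed weblog record with all 16 fields.
        String value = String.join(DELIMITER,
            "10.0.0.1", "01/Jan/2012:00:00:00 +0000", "01", "Jan", "1", "2012",
            "00", "00", "00", "+0000", "GET", "/index.html",
            "200", "512", "-", "-");
        Map<String, String> row = splitFields(value);
        System.out.println(newKey(row).replace('\t', '|') + " => " + NEW_VALUE);
    }
}
```

Looking fields up by name rather than by position keeps the sketch aligned with the field list in step 6, so a reordering there would only require changing the `FIELDS` array.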