Reporting on HDFS File Data
How to create a report that sources data from an HDFS file. You will use weblog data to build a report that displays a count of pageviews by month for an IP Address.
In order to report on HDFS file data in a parameterized fashion, you will need to perform the following steps:
- Create a PDI Transformation that sources the picklist of distinct IP Addresses to be used for a report parameter
- Create a PDI Transformation that sources a set of weblog data for a selected IP Address. This will be the primary data source for the report
- Create a Report that uses the PDI transformations for the parameter list and report data.
Prerequisites
In order to follow along with this how-to guide you will need the following:
- Hadoop
- Pentaho Data Integration
- Report Designer
Sample Files
The sample data file needed for this guide is:
| File Name | Content |
| --- | --- |
| weblogs_aggregate.txt | Tab-delimited, aggregated weblog data |
NOTE: If you have already completed the Using Pentaho MapReduce to Generate an Aggregate Dataset guide, then the necessary files will already be in the proper location.
This file should be placed into HDFS at /user/pdi/weblogs/aggregate_mr using the following commands.
```
hadoop fs -mkdir /user/pdi/weblogs
hadoop fs -mkdir /user/pdi/weblogs/aggregate_mr
hadoop fs -put weblogs_aggregate.txt /user/pdi/weblogs/aggregate_mr/part-00000
```
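To verify the load before moving on, you can list the directory and peek at the first few rows (this assumes your hadoop client is configured for the target cluster):

```
# Confirm the file landed where the transformations expect it
hadoop fs -ls /user/pdi/weblogs/aggregate_mr

# Inspect the first few tab-delimited rows (client_ip, year, month_num, pageviews)
hadoop fs -cat /user/pdi/weblogs/aggregate_mr/part-00000 | head -5
```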
Step-By-Step Instructions
Setup
Start Hadoop if it is not already running.
Create a Parameter Picklist PDI Transformation
In this task you will create a PDI transformation to get a list of distinct IP Addresses from HDFS data. This transformation will later be used as the data source for a report parameter picklist.
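As a quick sanity check of what this transformation should produce, the same distinct, sorted IP list can be pulled straight from HDFS on the command line. This is a minimal sketch assuming the file layout created above:

```
# Equivalent logic: first tab-delimited column, de-duplicated and sorted ascending
hadoop fs -cat /user/pdi/weblogs/aggregate_mr/part-* | cut -f1 | sort -u
```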
- Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
Speed Tip
You can download the completed Kettle transformation hdfs_ip_list.ktr.
- Add a Hadoop File Input Step: You are going to read data from an HDFS file, so expand the 'Big Data' section of the Design palette and drag a 'Hadoop File Input' node onto the transformation canvas. Your transformation should look like:
- Edit the Hadoop File Input Step: Double-click on the 'Hadoop File Input' node to edit its properties. Enter this information:
- File or directory: Enter 'hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/aggregate_mr'
- Regular Expression: Enter 'part.*'
- Click the 'Add' button.
When you are done your window should look like this:
- Define the File Content: Switch to the 'Content' tab and enter the following:
- Clear the Separator field and click the 'Insert TAB' button.
- Uncheck 'Header'
- Format: Select 'Unix'
When you are done your screen should look like:
- Define the Fields: Switch to the 'Fields' tab and enter the following:
| Name | Type | Length |
| --- | --- | --- |
| client_ip | String | 15 |
| year | Integer | |
| month_num | Integer | |
| pageviews | Integer | |
When you are done your window should look like:
Click 'OK' to close the window.
- Add a Sort Rows Step: You need to sort the HDFS file, so expand the 'Transform' section of the Design palette and drag a 'Sort rows' node onto the transformation canvas. Your transformation should look like:
- Connect the Input and Sort Steps: Hover the mouse over the 'Hadoop File Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Sort rows' node. Your canvas should look like this:
- Edit the Sort Step: Double-click on the 'Sort rows' node to edit its properties. Enter this information:
- Check 'Only pass unique rows? (verifies keys only)'
- Fields: Add 'client_ip' sorted in ascending order.
When you are done your window should look like this:
Click 'OK' to close the window.
- Add a Dummy Step: You need a component for the report to select its data from, so expand the 'Flow' section of the Design palette and drag a 'Dummy (do nothing)' node onto the transformation canvas. Your transformation should look like:
- Connect the Sort and Dummy steps: Hover the mouse over the 'Sort rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Dummy (do nothing)' node. Your canvas should look like this:
- Edit the Dummy Step: Double-click on the 'Dummy (do nothing)' node to edit its properties. Set the Step name to 'Output'. When you are done your window should look like:
Click 'OK' to close the window.
- Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'hdfs_ip_list.ktr' into a folder of your choice.
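If you want to test the transformation outside of Spoon, you can run it with PDI's Pan command-line tool; the install and file paths below are placeholders for your own:

```
# Execute the picklist transformation headless and report step metrics
./pan.sh -file=/path/to/hdfs_ip_list.ktr -level=Basic
```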
Create an HDFS Query PDI Transformation
In this task you will create a PDI transformation to get the data for your report from HDFS. This transformation will later be used as the primary data source for the report.
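At its core this transformation filters the weblog rows down to one IP address and sorts them, which you can approximate on the command line before building it (the IP address here is only an example):

```
# Rows for a single IP address, sorted by year then month
# Columns: client_ip, year, month_num, pageviews
hadoop fs -cat /user/pdi/weblogs/aggregate_mr/part-* \
  | awk -F'\t' -v ip="127.0.0.1" '$1 == ip' \
  | sort -t$'\t' -k2,2n -k3,3n
```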
- Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
Speed Tip
You can download the completed Kettle transformation hdfs_to_report.ktr.
- Add a Parameter: Right click on the transformation and select 'Transformation settings'.
Then do the following:
- Switch to the 'Parameters' tab.
- Parameter: 'paramIPAddress'
- Default Value: '127.0.0.1'
When you are done your window should look like:
Click 'OK' to close the window.
- Add a Hadoop File Input Step: You are going to read data from an HDFS file, so expand the 'Big Data' section of the Design palette and drag a 'Hadoop File Input' node onto the transformation canvas. Your transformation should look like:
- Edit the Hadoop File Input Step: Double-click on the 'Hadoop File Input' node to edit its properties. Enter this information:
- File or directory: Enter 'hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/aggregate_mr'
- Regular Expression: Enter 'part.*'
- Click the 'Add' button.
When you are done your window should look like this:
- Define the File Content: Switch to the 'Content' tab and enter the following:
- Clear the Separator field and click the 'Insert TAB' button.
- Uncheck 'Header'
- Format: Select 'Unix'
When you are done your screen should look like:
- Define the Fields: Switch to the 'Fields' tab and enter the following:
| Name | Type | Length |
| --- | --- | --- |
| client_ip | String | 15 |
| year | Integer | |
| month_num | Integer | |
| pageviews | Integer | |
When you are done your window should look like:
Click 'OK' to close the window.
- Add a Get Variables Step: You need to add the parameter you created earlier to your stream, so expand the 'Job' section of the Design palette and drag a 'Get Variables' node onto the transformation canvas. Your transformation should look like:
- Connect the Input and Get Variables steps: Hover the mouse over the 'Hadoop File Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Get Variables' node. Your canvas should look like this:
- Edit the Get Variables Step: Double-click on the 'Get Variables' node to edit its properties. Enter this information:
- Name: Enter 'selectedIP'
- Variable: Enter '${paramIPAddress}'
- Type: Select 'String'.
When you are done your window should look like this:
Click 'OK' to close the window.
- Add a Filter rows Step: You want to filter for only the rows that match the selected IP Address, so expand the 'Flow' section of the Design palette and drag a 'Filter rows' node onto the transformation canvas. Your transformation should look like:
- Connect the Get Variables and Filter Rows steps: Hover the mouse over the 'Get Variables' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Filter rows' node. Your canvas should look like this:
- Edit the Filter Rows Step: Double-click on the 'Filter Rows' node to edit its properties. Do the following:
- Click the <field> box to the left of the = and select 'client_ip'
- Click the <field> box to the right of the = and select 'selectedIP'
When you are done your window should look like this:
Click 'OK' to close the window.
- Add a Sort Rows Step: You want the rows in sorted order by year, so expand the 'Transform' section of the Design palette and drag a 'Sort rows' node onto the transformation canvas. Your transformation should look like:
- Connect the Filter Rows and Sort Rows steps: Hover the mouse over the 'Filter rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Sort rows' node. When you release the mouse and a tooltip appears select the 'Result is TRUE' option. Your canvas should look like this:
- Edit the Sort Rows Step: Double-click on the 'Sort rows' node to edit its properties. Do the following:
- Fieldname: Enter 'year'
- Ascending: Select 'Y'
When you are done your window should look like this:
Click 'OK' to close the window.
- Add a Row Denormaliser Step: You want to roll up the records for each year into a single row with a field for every month (a command-line sketch of this pivot appears after this step list), so expand the 'Transform' section of the Design palette and drag a 'Row Denormaliser' node onto the transformation canvas. Your transformation should look like:
- Connect the Sort Rows and Denormaliser steps: Hover the mouse over the 'Sort rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Row Denormaliser' node. Your canvas should look like this:
- Edit the Denormaliser Step: Double-click on the 'Row denormaliser' node to edit its properties. Do the following:
- The key field: Select 'month_num'
- The fields that make up the grouping: Add 'client_ip' and 'year'
- Target fields: Enter the following

| Target fieldname | Value fieldname | Key value | Type |
| --- | --- | --- | --- |
| January | pageviews | 1 | Integer |
| February | pageviews | 2 | Integer |
| March | pageviews | 3 | Integer |
| April | pageviews | 4 | Integer |
| May | pageviews | 5 | Integer |
| June | pageviews | 6 | Integer |
| July | pageviews | 7 | Integer |
| August | pageviews | 8 | Integer |
| September | pageviews | 9 | Integer |
| October | pageviews | 10 | Integer |
| November | pageviews | 11 | Integer |
| December | pageviews | 12 | Integer |
When you are done your window should look like this:
Click 'OK' to close the window.
- Add two Dummy Steps: Expand the 'Flow' section of the Design palette and drag a 'Dummy (do nothing)' node onto the transformation canvas. Repeat to add a second 'Dummy (do nothing)' step. Your transformation should look like:
- Connect the Denormaliser and Dummy steps: Hover the mouse over the 'Row denormaliser' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the first 'Dummy (do nothing)' node. Your canvas should look like this:
- Edit the Dummy Step: Double-click on the 'Dummy (do nothing)' node you just connected to edit its properties. Change the Step name to 'Output'.
- Connect the Filter Rows and Dummy steps: Hover the mouse over the 'Filter rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the second 'Dummy (do nothing)' node. When you release the mouse and a tooltip appears select the 'Result is FALSE' option. Your canvas should look like this:
- Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'hdfs_to_report.ktr' into a folder of your choice.
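To check the shape of the data this transformation hands to the report, here is a minimal command-line sketch of the filter-plus-pivot, again using 127.0.0.1 as an example IP address; months with no data are left blank, much as the Row Denormaliser leaves them null:

```
# One row per (client_ip, year); month_num pivoted into 12 pageview columns
hadoop fs -cat /user/pdi/weblogs/aggregate_mr/part-* \
  | awk -F'\t' -v ip="127.0.0.1" '
      # $3 + 0 normalizes month_num (e.g. "01" -> 1) before using it as a key
      $1 == ip { views[$2, $3 + 0] = $4; years[$2] = 1 }
      END {
        for (y in years) {
          printf "%s\t%s", ip, y
          for (m = 1; m <= 12; m++)
            printf "\t%s", ((y, m) in views ? views[y, m] : "")
          printf "\n"
        }
      }'
```

You can also run the saved transformation itself with Pan, passing the IP address the same way the report will (paths are placeholders):

```
# Run the report query with a specific value for paramIPAddress
./pan.sh -file=/path/to/hdfs_to_report.ktr -param:paramIPAddress=127.0.0.1
```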
Create a Report
In this task you will create a report that uses PDI transformations to source data from HDFS files.
Start Report Designer on your desktop. Once it is running choose 'File' -> 'Report Wizard' from the menu system.
Speed Tip
You can download the completed report definition HDFS to Report.prpt.
- Select a Template: Report Wizard will automatically lay out and apply some basic formatting to your report, so select a template of your choice in the select box. When you are done your screen should look like:
Click 'Next' to go to the next screen.
- Create a Data Source: You need to create the HDFS query to select data for this report, so click the plus button and do the following:
- Choose Type 'Pentaho Data Integration'
- Create a New Query: Click the button to create a new query and do the following:
- Name: Enter 'Page Views'
- File: Select the hdfs_to_report.ktr transformation you just created.
- Steps: Select 'Output'
When you are done your window should look like:
- Create a Second Query: Click the button to create a new query and do the following:
- Name: Enter 'IP List'
- File: Select the hdfs_ip_list.ktr transformation you just created.
- Steps: Select 'Output'
When you are done your window should look like:
Click 'OK' to close the window.
- Highlight the 'Page Views' query in the Report Design Wizard window.
Your Report Design Wizard window should now look like:
Click 'Next' to go to the next screen.
- Set up Report Layout: You need to tell Report Design Wizard how to lay out your report, so do the following:
- Selected Items: Add the following in this order
- Year
- January
- February
- March
- April
- May
- June
- July
- August
- September
- October
- November
- December
When you are done your window should look like:
Click 'Finish' to complete the design wizard.
- Create a Parameter: You need to create a parameter so you will be able to select a specific IP address when running the report, so in the menu system select 'Data' -> 'Add Parameter' and do the following:
- Select the 'Pentaho Data Integration' data source.
- Name: Enter 'paramIPAddress'.
- Label: Enter 'IP Address'
- Value Type: Select 'String'
- Check 'Mandatory'
- Display Type: 'Drop Down'
- Query: Select 'IP List'
- Value: Select 'client_ip'
- Display Name: Select 'client_ip'
When you are done your window should look like:
Click 'OK' to close the window.
- Add Parameter to Query: You need to add the parameter you just created to your Page Views query, so in the 'Data' pane expand 'Pentaho Data Integration', right click on 'Page Views' and select 'Edit Query'.
Then do the following:
- Click on the 'Edit Parameter' button.
- Under 'Transformation Parameter', click the add button.
- DataRow Column: Select 'paramIPAddress'
- Transformation Parameter: Select 'paramIPAddress'
When you are done your window should look like:
Click 'OK' to close the window.
- Save the Report: Choose 'File' -> 'Save as...' from the menu system. Save the report as 'hdfs_report.prpt' into a folder of your choice.
- Preview the Report: Choose 'File' -> 'Preview' -> 'Print Preview' from the menu system. A 'Print Preview' window will open. Select an 'IP Address' of your choice from the drop-down. After a few seconds the report results will appear:
- Do any desired formatting to the report.
Summary
In this guide you learned how to construct a report that sources data from an HDFS file.