How to create a report that sources data from Hive.
Prerequisites
In order follow along with this how-to guide you will need the following:
- MapR
- Pentaho Data Integration
- Hive
- Report Designer
- RDBMS Database
Sample Files
The sample data file needed for this guide is:
File Name |
Content |
Tab-delimited, aggregated weblog data for a Hive weblogs_agg table |
NOTE: If you have already completed the Transforming Data within Hive in MapR guide, then the necessary Hive table, weblog_aggs, will already be created and populated.
Step-By-Step Instructions
Setup
Start MapR if it is not already running.
Start Hive Server if it is not already running.
Create a Hive Table
NOTE: This task may be skipped if you have completed Transforming Data within Hive in MapR guide.
- Open the Hive Shell: Open the Hive shell so you can manually create a Hive table by entering 'hive' at the command line.
- Create the Table in Hive: You need a hive table to load the data to, so enter the following in the hive shell.
create table weblogs_agg ( client_ip string, year string, month string, month_num int, pageviews bigint ) row format delimited;
- Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.
- Load the Table: Load the Hive table by running the following commands:
hadoop fs -put weblog_hive.txt /user/hive/warehouse/weblogs_agg/
Create a Hive Report
In this task you will create a report that uses JDBC and HiveQL to report on data from Hive.
- Start Report Designer on your desktop. Once it is running choose 'File' -> 'Report Wizard' -> from the menu system.
Speed Tip
You can download the Report definition HiveReport.prpt already completed
- Select a Template: Report Wizard will automatically layout and do some basic formatting of your report, so select a template of your choice in the select box. When you are done your screen should look like:
Click 'Next' to go to the next screen.
- Create a Data Source: You need to create the Hive query to select data for this report, so click the plus
- Choose Type 'JDBC'
- Select 'Sample Query' in the Available Queries box and delete it by clicking the .
- Above the Connection box create a new connection by clicking the .
- Create a Database Connection: Click the
- Connection Name: Enter 'Hive'.
- Connection Type: Select 'Hadoop Hive'.
- Host Name and Port Number: Your Hive connection information. For local single node clusters use 'localhost' and port '10000'.
- Database Name: Enter 'default'
- Leave User Name and Password empty.
- Test the connection by pressing the 'Test' button.
When you are done your window should look like:
Click 'OK' to close the window.
above the connection box and do the following:
- In the Connections box select 'Hive'.
- Create a New Query: You need a query to select the data from Hive for the report, so click the
- Query Name: Enter 'Page Views'.
- Query: Enter the following
Select year, month,month_num, sum(pageviews) as pageviews From weblogs_agg Group by year, month,month_num Order by year, month_num
When you are done your window should look like:
above the 'Available Queries' box and do the following:
- Create a Second Query: You need a second query to get the list of unique IP Addresses so you can use them in your report parameters, so click the
- Query Name: Enter 'IP Addresses'.
- Query: Enter the following
When you are done your window should look like:
Select distinct client_ip from weblogs_agg
Click 'OK' to close the window.
above the 'Available Queries' box and do the following:
- Highlight the 'Page Views' query in the Report Design Wizard window.
Your Report Design Wizard window should now look like:
Click 'Next' to go to the next screen.
button and do the following:
- Choose Type 'JDBC'
- Set up Report Layout: You need to tell Report Design Wizard how to lay out your report, so do the following:
- Group Items By: Add 'year'
- Selected Items: Add 'month' and 'pageviews' in that order.
When you are done your window should look like:
Click 'Finish' to complete to the design wizard.
- Create a Parameter: You need to create a parameter so you will be able to select a specific IP address when running the report, so in the menu system select 'Data' -> 'Add Parameter' and do the following:
- Select the 'JDBC (Hive)' data source.
- Name: Enter 'paramIPAddress'.
- Label: Enter 'IP Address'
- Value Type: Select 'String'
- Check 'Mandatory'
- Display Type: 'Drop Down'
- Query: Select 'IP Addresses'
- Value: Select 'client_ip'
- Display Name: Select 'client_ip'
When you are done your window should look like:
Click 'OK' to close the window.
- Add Parameter to Query: You need to add the parameter you just created to your Page Views query, so in the 'Data' pane expand 'JDBC: Hive', right click on 'Page Views' and select 'Edit Query'.
Then do the following:- In the Query box add the following line between the From and Group by lines:
Where client_ip = ${paramIPAddress}
Note there must be a space before the $. - When you are done your entire query will look like:
Click 'OK' to close the window.Select year, month,month_num, sum(pageviews) as pageviews From weblogs_agg Where client_ip = ${paramIPAddress} Group by year, month,month_num Order by year, month_num
- In the Query box add the following line between the From and Group by lines:
- Save the Report: Choose 'File' -> 'Save as...' from the menu system. Save the report as 'hive_report.prpt' into a folder of your choice.
- Preview the Report: Choose 'File' -> 'Preview' -> 'Print Preview' from the menu system. A 'Print Preview' window will open. Select an 'IP Address' of your choice from the drop down. After a few seconds the report results will appear:
- Do any desired formatting to the report.
Summary
In this guide you learned how to construct a report that sources data from Hive .