Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 2 Next »

Reporting on Hive Data
This guide shows you how to create a report that sources data from Hive.

Prerequisites

In order follow along with this how-to guide you will need the following:

  • MapR
  • Pentaho Data Integration
  • Hive
  • Report Designer
  • RDBMS Database

Sample Files

The sample data file needed for this guide is:

File Name

Content

weblogs_hive.txt

Tab-delimited, aggregated weblog data for a Hive weblogs_agg table


NOTE: If you have already completed the "Transforming Data within Hive" guide, then the necessary Hive table, weblog_aggs, will already be created and populated.

Step-By-Step Instructions

Setup

Start MapR if it is not already running.
Start Hive Server if it is not already running.

Create a Hive Table

NOTE: This task may be skipped if you have completed "Transforming Data within Hive" guide.

  1. Open the Hive Shell: Open the Hive shell so you can manually create a Hive table by entering 'hive' at the command line.
  2. Create the Table in Hive: You need a hive table to load the data to, so enter the following in the hive shell.

create table weblogs_agg (
client_ip string,
year string,
month string,
month_num int,
pageviews bigint
)
row format delimited
fields terminated by '\t';

  1. Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.
  2. Load the Table: Load the Hive table by running the following commands:

hadoop fs -put weblog_hive.txt /user/hive/warehouse/weblogs_agg/

Create a Hive Report

In this task you will create a report that uses JDBC and HiveQL to report on data from MapR Hive.

  1. Start Report Designer on your desktop. Once it is running choose 'File' -> 'Report Wizard' -> from the menu system.
  2. Select a Template: Report Wizard will automatically layout and do some basic formatting of your report, so select a template of your choice in the select box. When you are done your screen should look like:


Click 'Next' to go to the next screen.

  1. Create a Data Source: You need to create the Hive query to select data for this report, so click the plus button and do the following:
    1. Choose Type 'JDBC'

    1. Select 'Sample Query' in the Available Queries box and delete it by clicking the .
    2. Above the Connection box create a new connection by clicking the .
    3. Create a Database Connection: Click the above the connection box and do the following:
      1. Connection Name: Enter 'Hive'.
      2. Connection Type: Select 'Hadoop Hive'.
      3. Host Name and Port Number: Your Hive connection information. For local single node clusters use 'localhost' and port '10000'.
      4. Database Name: Enter 'default'
      5. Leave User Name and Password empty.
      6. Test the connection by pressing the 'Test' button.


When you are done your window should look like:

Click 'OK' to close the window.

    1. In the Connections box select 'Hive'.
    2. Create a New Query: You need a query to select the data from Hive for the report, so click the above the 'Available Queries' box and do the following:
      1. Query Name: Enter 'Page Views'.
      2. Query: Enter the following

Select year, month,month_num, sum(pageviews) as pageviews
From weblogs_agg
Group by year, month,month_num
Order by year, month_num
When you are done your window should look like:

    1. Create a Second Query: You need a second query to get the list of unique IP Addresses so you can use them in your report parameters, so click the above the 'Available Queries' box and do the following:
      1. Query Name: Enter 'IP Addresses'.
      2. Query: Enter the following

Select distinct client_ip from weblogs_agg
When you are done your window should look like:

Click 'OK' to close the window.

    1. Highlight the 'Page Views' query in the Report Design Wizard window.

Your Report Design Wizard window should now look like:

Click 'Next' to go to the next screen.

  1. Set up Report Layout: You need to tell Report Design Wizard how to lay out your report, so do the following:
    1. Group Items By: Add 'year'
    2. Selected Items: Add 'month' and 'pageviews' in that order.


When you are done your window should look like:

Click 'Finish' to complete to the design wizard.

  1. Create a Parameter: You need to create a parameter so you will be able to select a specific IP address when running the report, so in the menu system select 'Data' -> 'Add Parameter' and do the following:
    1. Select the 'JDBC (Hive)' data source.
    2. Name: Enter 'paramIPAddress'.
    3. Label: Enter 'IP Address'
    4. Value Type: Select 'String'
    5. Check 'Mandatory'
    6. Display Type: 'Drop Down'
    7. Query: Select 'IP Addresses'
    8. Value: Select 'client_ip'
    9. Display Name: Select 'client_ip'


When you are done your window should look like:

Click 'OK' to close the window.

  1. Add Parameter to Query: You need to add the parameter you just created to your Page Views query, so in the 'Data' pane expand 'JDBC: Hive', right click on 'Page Views' and select 'Edit Query'.


Then do the following:

    1. In the Query box add the following line between the From and Group by lines:

Where client_ip = ${paramIPAddress}
Note there must be a space before the $.

    1. When you are done your entire query will look like:


Select year, month,month_num, sum(pageviews) as pageviews
From weblogs_agg
Where client_ip = ${paramIPAddress}
Group by year, month,month_num
Order by year, month_num
Click 'OK' to close the window.

  1. Save the Report: Choose 'File' -> 'Save as...' from the menu system. Save the report as 'hive_report.prpt' into a folder of your choice.
  2. Preview the Report: Choose 'File' -> 'Preview' -> 'Print Preview' from the menu system. A 'Print Preview' window will open. Select an 'IP Address' of your choice from the drop down. After a few seconds the report results will appear:


  1. Do any desired formatting to the report.


Summary

In this guide you learned how to construct a report that sources data from Hive .

  • No labels