Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migration of unmigrated content due to installation of a new plugin

...

Wiki Markup
{scrollbar}

Excerpt

How to create a report that sources data from Hive.

Prerequisites

In order follow along with this how-to guide you will need the following:

  • MapR
  • Pentaho Data Integration
  • Hive
  • Report Designer
  • RDBMS Database

Sample Files

The sample data file needed for this guide is:

File Name

Content

weblogs_hive.txtzip

Tab-delimited, aggregated weblog data for a Hive weblogs_agg table


NOTE: If you have already completed the "Transforming Data within Hive " in MapR guide, then the necessary Hive table, weblog_aggs, will already be created and populated.

Step-By-Step Instructions

Setup

Start MapR if it is not already running.
Start Hive Server if it is not already running.

Create a Hive Table

NOTE: This task may be skipped if you have completed "Transforming Data within Hive " in MapR guide.

  1. Open the Hive Shell: Open the Hive shell so you can manually create a Hive table by entering 'hive' at the command line.
  2. Create the Table in Hive: You need a hive table to load the data to, so enter the following in the hive shell.

create table weblogs_agg (
client_ip string,
year string,
month string,
month_num int,
pageviews bigint
)
row format delimited
fields terminated by '\t';

  1. Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.
  2. Load the Table: Load the Hive table by running the following commands:

hadoop fs -put weblog_hive.txt /user/hive/warehouse/weblogs_agg/

Create a Hive Report

In this task you will create a report that uses JDBC and HiveQL to report on data from MapR Hive.

  1. Start Report Designer on your desktop. Once it is running choose 'File' -> 'Report Wizard' -> from the menu system.
  2. Select a Template: Report Wizard will automatically layout and do some basic formatting of your report, so select a template of your choice in the select box. When you are done your screen should look like:

Image Removed
Click 'Next' to go to the next screen.

  1. Create a Data Source: You need to create the Hive query to select data for this report, so click the plus Image Removed button and do the following:
    1. Choose Type 'JDBC'

Image Removed

    1. Select 'Sample Query' in the Available Queries box and delete it by clicking the Image Removed .
    2. Above the Connection box create a new connection by clicking the Image Removed .
    3. Create a Database Connection: Click the Image Removed above the connection box and do the following:
      1. Connection Name: Enter 'Hive'.
      2. Connection Type: Select 'Hadoop Hive'.
      3. Host Name and Port Number: Your Hive connection information. For local single node clusters use 'localhost' and port '10000'.
      4. Database Name: Enter 'default'
      5. Leave User Name and Password empty.
      6. Test the connection by pressing the 'Test' button.

Image Removed
When you are done your window should look like:
Image Removed
Click 'OK' to close the window.

    1. In the Connections box select 'Hive'.
    2. Create a New Query: You need a query to select the data from Hive for the report, so click the Image Removed above the 'Available Queries' box and do the following:
      1. Query Name: Enter 'Page Views'.
      2. Query: Enter the following

Select year, month,month_num, sum(pageviews) as pageviews
From weblogs_agg
Group by year, month,month_num
Order by year, month_num
When you are done your window should look like:
Image Removed

    1. Create a Second Query: You need a second query to get the list of unique IP Addresses so you can use them in your report parameters, so click the Image Removed above the 'Available Queries' box and do the following:
      1. Query Name: Enter 'IP Addresses'.
      2. Query: Enter the following

Select distinct client_ip from weblogs_agg
When you are done your window should look like:
Image Removed
Click 'OK' to close the window.

    1. Highlight the 'Page Views' query in the Report Design Wizard window.

Your Report Design Wizard window should now look like:
Image Removed
Click 'Next' to go to the next screen.

  1. Set up Report Layout: You need to tell Report Design Wizard how to lay out your report, so do the following:
    1. Group Items By: Add 'year'
    2. Selected Items: Add 'month' and 'pageviews' in that order.

...

  1. Create a Parameter: You need to create a parameter so you will be able to select a specific IP address when running the report, so in the menu system select 'Data' -> 'Add Parameter' and do the following:
    1. Select the 'JDBC (Hive)' data source.
    2. Name: Enter 'paramIPAddress'.
    3. Label: Enter 'IP Address'
    4. Value Type: Select 'String'
    5. Check 'Mandatory'
    6. Display Type: 'Drop Down'
    7. Query: Select 'IP Addresses'
    8. Value: Select 'client_ip'
    9. Display Name: Select 'client_ip'

...

  1. Add Parameter to Query: You need to add the parameter you just created to your Page Views query, so in the 'Data' pane expand 'JDBC: Hive', right click on 'Page Views' and select 'Edit Query'.

Image Removed
Then do the following:

    1. In the Query box add the following line between the From and Group by lines:

Where client_ip = ${paramIPAddress}
Note there must be a space before the $.

    1. When you are done your entire query will look like:

...

  1. Save the Report: Choose 'File' -> 'Save as...' from the menu system. Save the report as 'hive_report.prpt' into a folder of your choice.
  2. Preview the Report: Choose 'File' -> 'Preview' -> 'Print Preview' from the menu system. A 'Print Preview' window will open. Select an 'IP Address' of your choice from the drop down. After a few seconds the report results will appear:

Image Removed

  1. Do any desired formatting to the report.

Summary

...

Include Page
Include Create Hive Report
Include Create Hive Report
Wiki Markup
{scrollbar}