Include Extracting Data from Hive to Load an RDBMS
Create a Hive Table
NOTE: This task may be skipped if you have completed the "Loading Data into Hive" guide.
Open the Hive Shell: Open the Hive shell so you can manually create a Hive table by entering 'hive' at the command line.
Create the Table in Hive: You need a hive table to load the data to, so enter the following in the hive shell.
create table weblogs ( client_ip string, full_request_date string, day string, month string, month_num int, year string, hour string, minute string, second string, timezone string, http_verb string, uri string, http_status_code string, bytes_returned string, referrer string, user_agent string) row format delimited fields terminated by '\t';Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.
Load the Table: Load the Hive table by running the following commands:
hadoop fs -put part-00000.txt /user/hive/warehouse/weblogs/
Create a Transformation to Extract Data from Hive
In this task you will create a transformation to extract data from Hive and load into a RDBMS table.
Speed Tip
You can download the Kettle Transformation hive_to_rdbms.ktr already completed
Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
Add a Table Input Step: You are going to extract data from Hive using HiveQL, so expand the 'Input' section of the Design palette and drag a 'Table Input' node onto the transformation canvas. Your transformation should look like:
Edit the Table Input Step: Double-click on the 'Table Input' node to edit its properties. Enter this information:
Connection: Select 'Hive'
SQL: Enter the following
select client_ip, year, month, month_num, count(*) as pageviews from weblogs group by client_ip, year, month, month_num
When you are done your 'Map/Reduce Input' window should look like this:
Click 'OK' to close the window.
Add a Table Output Step: You are going to write the data to a database, so expand the 'Output' section of the Design palette and drag a 'Table Output' node onto the transformation canvas. Your transformation should look like:
Notice that there are lots of other output formats you can use such as Text file, XML, and Microsoft Excel.
Connect the Input and Output steps: Hover the mouse over the 'Table Input' node and a tooltip will appear.
Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Table Output' node. Your canvas should look like this:
Edit the Table Output Step: Double-click on the 'Table Output' node to edit its properties. Do the following:
Connection: Select 'RDBMS'
Target Table: Enter 'aggregate_hive'
Check 'Truncate table' so you can re-run this transformation.
Click the 'SQL' button to create the table in your target database.
Click the 'Execute' button to run the SQL.
The 'Results of the SQL statements' window will appear telling you if the SQL was succesffully executed or give you troubleshooting information if the SQL failed.Click 'OK' to close the window.
Click 'Close' to close the 'Simple SQL editor' window.
When you are done your window should look like:
Click 'OK' to close the window.
Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'hive_to_rdbms.ktr' into a folder of your choice.
Run the Transformation: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the transformation toolbar. A 'Execute a transformation' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and it will show you the progress of the transformation as it runs. After several seconds the transformation should finish successfully:
If any errors occurred the job step that failed will be highlighted in red and you can use the 'Logging' tab to view error messages.
Check RDBMS for Hive Aggregated Table
Explore the Database: Choose 'Tools' -> 'Database' -> 'Explore' from the menu system.
Select the Connection: In the 'Make your selection' window select 'RDBMS' and click 'OK'.
Preview the Table: Expand RDBMS -> Tables. Right click on 'aggregate_hive' and select 'Preview first 100'
Summary
During this guide you learned how to create a PDI transformation that sources data from Hive and loads into an RDBMS table.