...
NOTE: If you have previously completed the "Using Pentaho MapReduce to Parse Weblog Data" guide, the necessary files will already be in the proper location.
This file should be placed in the /weblogs/parse directory of the CLDB using the following commands.
Code Block |
---|
hadoop fs -mkdir /weblogs |
...
hadoop fs -mkdir /weblogs/parse |
...
hadoop fs -put weblogs_parse.txt /weblogs/parse/part-00000 |
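The upload commands above can also be wrapped in a small shell function, so that a missing local file or a failed step stops the upload early. This is only a sketch: the function name `stage_weblogs` is hypothetical, and it assumes the `hadoop` client is on the PATH and accepts the single-dash `hadoop fs` options `-test -d`, `-mkdir`, `-put` and `-ls`.

```shell
# Sketch only: stage_weblogs is a hypothetical helper, not part of Hadoop or PDI.
stage_weblogs() {
    # $1 = local file to upload (weblogs_parse.txt in this guide)
    src="$1"
    if [ ! -f "$src" ]; then
        echo "missing local file: $src" >&2
        return 1
    fi
    # Create each directory only if it does not exist yet.
    for dir in /weblogs /weblogs/parse; do
        hadoop fs -test -d "$dir" || hadoop fs -mkdir "$dir" || return 1
    done
    # -put reports an error if part-00000 is already present.
    hadoop fs -put "$src" /weblogs/parse/part-00000 || return 1
    # List the target directory so the result can be checked by eye.
    hadoop fs -ls /weblogs/parse
}
```

A typical call would be `stage_weblogs weblogs_parse.txt`, run from the directory that holds the file.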
Step-By-Step Instructions
...
- Open the Hive Shell: Open the Hive shell so you can manually create a Hive table by entering 'hive' at the command line.
- Create the Table in Hive: You need a Hive table to load the data into, so enter the following in the Hive shell.
Code Block
create table weblogs (
...
client_ip string,
...
full_request_date string,
...
day string,
...
month string,
...
month_num int,
...
year string,
...
hour string,
...
minute string,
...
second string,
...
timezone string,
...
http_verb string,
...
uri string,
...
http_status_code string,
...
bytes_returned string,
...
referrer string,
...
user_agent string)
...
row format delimited
...
fields terminated by '\t';
- Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.
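The table defined above has 16 columns and expects tab-separated fields, so it can help to check the local file before loading it. The sketch below counts rows whose field count differs from the expected number. The function name `count_bad_rows` is hypothetical, and the check assumes the file is plain text with one record per line.

```shell
# Sketch only: count_bad_rows is a hypothetical helper for checking a delimited file.
count_bad_rows() {
    # $1 = path to the tab-delimited file
    # $2 = expected number of fields per line (16 for the weblogs table)
    # Prints the number of lines whose field count differs from $2.
    awk -F'\t' -v n="$2" 'NF != n { bad++ } END { print bad + 0 }' "$1"
}
```

For example, `count_bad_rows weblogs_parse.txt 16` should print 0 when every line matches the table definition. Rows with too few fields would typically show up as NULL values in the trailing columns when queried.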
...
In this task you will create a job that loads parsed, delimited weblog data into a Hive table. Once the data is loaded, you can run HiveQL statements to query it.
Tip |
---|
You can download the completed Kettle job load_hive.kjb. |
- Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Job' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Job' option.
- Add a Start Job Entry: You need to tell PDI where to start the job, so expand the 'General' section of the Design palette and drag a 'Start' job entry onto the job canvas. Your canvas should look like:
...
- Open the Hive Shell: Open the Hive shell so you can query the table by entering 'hive' at the command line.
- Query Hive for Data: Verify the data has been loaded into Hive by querying the weblogs table.
Code Block |
---|
select * from weblogs limit 10; |
- Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.
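The same verification can be run without opening the interactive shell. The sketch below compares the number of lines in the local source file with the row count reported by Hive. The function name `verify_row_count` is hypothetical, and the sketch assumes the `hive` client is on the PATH, that `hive -S -e` prints the count as its last output line, and that the source file ends with a newline.

```shell
# Sketch only: verify_row_count is a hypothetical helper, not part of Hive or PDI.
verify_row_count() {
    # $1 = local source file that was loaded into the weblogs table
    expected=$(wc -l < "$1" | tr -d ' ')
    # -S suppresses progress messages; -e runs the quoted statement and exits.
    actual=$(hive -S -e 'select count(*) from weblogs;' | tail -n 1 | tr -d ' ')
    if [ "$expected" = "$actual" ]; then
        echo "OK: $actual rows"
    else
        echo "MISMATCH: file has $expected lines, table has $actual rows" >&2
        return 1
    fi
}
```

Calling `verify_row_count weblogs_parse.txt` returns a non-zero status when the counts differ, which makes it usable from other scripts.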
...