...
The sample data file needed for this guide is:
File Name | Content
---|---
weblogs_parse.txt (weblogs_parse.txt.zip) | Unparsed, raw weblog data
...
Tip: You can download the Kettle job load_hive.kjb already completed.
- Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Job' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Job' option.
- Add a Start Job Entry: You need to tell PDI where to start the job, so expand the 'General' section of the Design palette and drag a 'Start' job entry onto the job canvas. Your canvas should look like:
- Add a Copy File Job Entry: You will need to copy the parsed file into the Hive table, so expand the 'File Management' section of the Design palette and drag a 'Copy Files' job entry onto the job canvas. Your canvas should look like:
- Connect the Start and Copy Files job entries: Hover the mouse over the 'Start' job entry and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Copy Files' node. Your canvas should look like:
- Edit the Copy Files Job Entry: Double-click on the 'Copy Files' job entry to edit its properties. Enter this information:
- File/Folder source: maprfs://<CLDB>:<PORT>/weblogs/parse
When running PDI on the same machine as the MapR cluster, use maprfs:///weblogs/parse; the CLDB and port are not required.
<CLDB> is the server name of the machine running the MapR CLDB.
<PORT> is the port the MapR CLDB is running on.
- File/Folder destination: maprfs://<CLDB>:<PORT>/user/hive/warehouse/weblogs
When running PDI on the same machine as the MapR cluster, use maprfs:///user/hive/warehouse/weblogs; the CLDB and port are not required.
<CLDB> is the server name of the machine running the MapR CLDB.
<PORT> is the port the MapR CLDB is running on.
- Wildcard (RegExp): Enter 'part-.*'
- Click the 'Add' button to add the files to the list of files to copy.
When you are done your window should look like (your folder path may be different):
Click 'OK' to close the window.
Notice that you could also load a local file into Hive using this step; the file does not have to already be in MapR.
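The 'Wildcard (RegExp)' field takes a regular expression, not a shell glob, so 'part-.*' matches the reducer output files (part-00000, part-00001, ...) while skipping housekeeping files such as _SUCCESS. A quick local sketch of the same match; the sample file names are assumptions modeled on typical MapReduce output:

```shell
# Hypothetical file names; only the part-* files pass the filter,
# which mirrors what the Copy Files job entry will select.
printf 'part-00000\npart-00001\n_SUCCESS\n_logs\n' | grep -E '^part-.*'
```

This prints only the two part-* names, which is exactly the set the 'Copy Files' job entry would copy into the Hive warehouse directory.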
...
- Open the Hive Shell: Open the Hive shell, so you can verify the load, by entering 'hive' at the command line.
- Query Hive for Data: Verify the data has been loaded to Hive by querying the weblogs table.
```
select * from weblogs limit 10;
```
- Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.
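The verification above can also be scripted rather than run interactively: the Hive CLI's -e option executes a single statement and exits, which is convenient for checking a load from a shell script. A minimal sketch, assuming a configured Hive client on the node (the output depends on the data you loaded, so none is shown):

```shell
# Run the verification query non-interactively and exit.
hive -e 'select * from weblogs limit 10;'
```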
...