...

The sample data file needed for this guide is:

File Name: weblogs_parse.txt (from the weblogs_parse.txt.zip attachment)

Content: Unparsed, raw weblog data

...

Speed Tip: You can download the already-completed Kettle job load_hive.kjb.

  1. Start PDI on your desktop. Once it is running, choose 'File' -> 'New' -> 'Job' from the menu, or click the 'New file' icon on the toolbar and choose the 'Job' option.
  2. Add a Start Job Entry: You need to tell PDI where to start the job, so expand the 'General' section of the Design palette and drag a 'Start' job entry onto the job canvas. Your canvas should look like:
    (Screenshot: job canvas with the 'Start' entry.)
  3. Add a Copy File Job Entry: You will need to copy the parsed file into the Hive table, so expand the 'File Management' section of the Design palette and drag a 'Copy Files' job entry onto the job canvas. Your canvas should look like:
    (Screenshot: job canvas with the 'Start' and 'Copy Files' entries.)
  4. Connect the Start and Copy Files job entries: Hover the mouse over the 'Start' job entry and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Copy Files' node. Your canvas should look like:
    (Screenshot: job canvas with the 'Start' and 'Copy Files' entries connected.)
  5. Edit the Copy Files Job Entry: Double-click on the 'Copy Files' job entry to edit its properties. Enter this information:
    1. File/Folder source: maprfs://<CLDB>:<PORT>/weblogs/parse
      When running PDI on the same machine as the MapR cluster, use maprfs:///weblogs/parse; the CLDB and port are not required.
      <CLDB> is the server name of the machine running the MapR CLDB.
      <PORT> is the port the MapR CLDB is running on.
    2. File/Folder destination: maprfs://<CLDB>:<PORT>/user/hive/warehouse/weblogs
      When running PDI on the same machine as the MapR cluster, use maprfs:///user/hive/warehouse/weblogs; the CLDB and port are not required.
      <CLDB> is the server name of the machine running the MapR CLDB.
      <PORT> is the port the MapR CLDB is running on.
    3. Wildcard (RegExp): Enter 'part-.*'
    4. Click the 'Add' button to add the files to the list of files to copy.

When you are done, your window should look like this (your folder path may be different):

Click 'OK' to close the window.
Note that you could also use this step to load a local file into Hive; the file does not already have to be in MapR.
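To see what the 'Copy Files' entry is doing, here is a rough command-line sketch. The grep line illustrates locally which files the 'part-.*' wildcard selects from a typical MapReduce output directory; the commented hadoop fs command is an assumed rough equivalent of the job entry (not part of the guide, and it presumes a configured client on a cluster node):

```shell
# MapReduce output directories typically hold part files plus a _SUCCESS
# marker; the 'part-.*' wildcard selects only the part files:
printf '%s\n' part-00000 part-00001 _SUCCESS | grep -E 'part-.*'
# prints part-00000 and part-00001; _SUCCESS is not matched

# Assumed rough equivalent of the Copy Files job entry, run from a cluster
# node with a configured 'hadoop' client (hypothetical, for illustration):
# hadoop fs -cp maprfs:///weblogs/parse/part-* maprfs:///user/hive/warehouse/weblogs/
```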

...

  1. Open the Hive Shell: Open the Hive shell so you can query the weblogs table by entering 'hive' at the command line.
  2. Query Hive for Data: Verify the data has been loaded to Hive by querying the weblogs table.
    select * from weblogs limit 10;
  3. Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.
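The three steps above can also be collapsed into a single non-interactive command. This is a sketch, not part of the guide; it assumes the 'hive' CLI is installed and configured on the node you run it from:

```shell
# Run the verification query without opening an interactive Hive shell
# (assumes a configured 'hive' CLI on a cluster node):
hive -e 'select * from weblogs limit 10;'
```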

...