This guide explains how to configure client-side PDI so that files compressed with the Snappy codec can be decompressed by the Hadoop file input or Text file input step.
Prerequisites
- Pentaho Data Integration
- Snappy compressed source data (either inside or outside of HDFS)
Step-By-Step Instructions
Configure PDI to Access Snappy Native Libraries
To use client-side PDI to decompress files encoded by hadoop-snappy (the Snappy implementation used in Hadoop), you must build and install both the hadoop-snappy JNI interface and the Snappy native libraries for your platform. Instructions for doing so can be found at:
http://code.google.com/p/hadoop-snappy/
In particular, follow the instructions under "Build Hadoop Snappy". Follow the "Install Hadoop Snappy in Hadoop" instructions only if both of the following are true:
- You want to decompress Snappy-encoded files within a Pentaho MapReduce job (see Using Compression with Pentaho MapReduce for more information), and
- Your Hadoop installation does not already have hadoop-snappy installed (recent Hadoop distributions from Cloudera and others are configured with hadoop-snappy out of the box)
Once you have built hadoop-snappy:
1. Extract the hadoop-snappy-x.y.z-SNAPSHOT.tar.gz archive that the build process creates to a location on your PDI client machine
2. Copy hadoop-snappy-x.y.z-SNAPSHOT/lib/hadoop-snappy-x.y.z-SNAPSHOT.jar to libext/bigdata in your client PDI installation
3. Set the java.library.path property to point to the subdirectory of hadoop-snappy-x.y.z-SNAPSHOT/lib/native that corresponds to your platform
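The extract-and-copy steps above can be sketched as a small script. This is a minimal illustration in Python and is not part of PDI: the function name install_hadoop_snappy and its parameters are hypothetical, and it assumes the archive unpacks into a single hadoop-snappy-x.y.z-SNAPSHOT directory with the lib and lib/native layout described above.

```python
import shutil
import tarfile
from pathlib import Path


def install_hadoop_snappy(archive, unpack_dir, pdi_home):
    """Unpack the build archive and copy the hadoop-snappy jar into PDI.

    Returns the copied jar path and the platform subdirectories found
    under lib/native; java.library.path should point at one of them.
    All names here are illustrative, not part of any Pentaho tool.
    """
    unpack_dir = Path(unpack_dir)
    unpack_dir.mkdir(parents=True, exist_ok=True)

    # Extract the archive (only do this with an archive you built yourself).
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(unpack_dir)

    # Assumption: the archive holds one top-level hadoop-snappy-* directory.
    root = next(
        p for p in unpack_dir.iterdir()
        if p.is_dir() and p.name.startswith("hadoop-snappy-")
    )

    # Copy the jar into libext/bigdata of the PDI installation.
    jar = next((root / "lib").glob("hadoop-snappy-*.jar"))
    target_dir = Path(pdi_home) / "libext" / "bigdata"
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = Path(shutil.copy2(jar, target_dir))

    # List the platform directories so the right one can be chosen by hand.
    native_root = root / "lib" / "native"
    platforms = sorted(p for p in native_root.iterdir() if p.is_dir())
    return copied, platforms
```

A call might look like install_hadoop_snappy("hadoop-snappy-x.y.z-SNAPSHOT.tar.gz", "/opt", "/opt/data-integration"), where both directory paths are placeholders for your own locations. The script only lists the platform subdirectories; choosing the one that matches your machine and setting java.library.path remains a manual step.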
Where to set java.library.path in Step 3 depends on your platform:
- Under Linux, edit "spoon.sh" in your PDI installation directory and add an entry to the LIBPATH variable
- Under Windows, edit "Spoon.bat" and add an entry to the LIBSPATH variable
- Under Mac OS X, edit "Data Integration 64-bit.app/Contents/Info.plist" and add "-Djava.library.path=<path to the subdirectory in Step 3>" to the string entry under the key "VMOptions"
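To make the shape of these values concrete, the sketch below composes the strings involved. It is an illustration only: the helper names and example directories are hypothetical, and it assumes the LIBPATH and LIBSPATH variables are passed to the JVM as the java.library.path value, so entries are joined with the Java path separator (":" on Linux, ";" on Windows).

```python
def add_library_path_entry(current_value, native_dir, separator):
    """Append native_dir to a path-list string unless it is already listed."""
    entries = [entry for entry in current_value.split(separator) if entry]
    if native_dir not in entries:
        entries.append(native_dir)
    return separator.join(entries)


def java_library_path_option(native_dir):
    """Build the JVM option named in the Mac OS X instruction."""
    return "-Djava.library.path=" + native_dir


# Hypothetical example values; substitute the directories from your own setup.
linux_value = add_library_path_entry(
    "libswt/linux/x86_64/", "/opt/hadoop-snappy/lib/native/Linux-amd64-64", ":"
)
mac_option = java_library_path_option("/opt/hadoop-snappy/lib/native/Mac_OS_X")
```

The resulting strings still have to be pasted into "spoon.sh", "Spoon.bat", or Info.plist by hand; check the existing contents of each file before editing, since the variable's current value differs between PDI versions.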
Verifying that Snappy Decompression is Available to PDI
After following the instructions in the previous section, restart PDI. If hadoop-snappy and the Snappy native libraries have been installed correctly on the PDI client machine, a "Hadoop-snappy" option will be available in the "Compression" drop-down box on the "Content" tab of the Hadoop file input and Text file input steps.
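If the option does not appear, a quick file-level check can help narrow down the cause. The following is a minimal sketch, not a Pentaho utility: the function name check_snappy_setup is hypothetical, and it only confirms that a hadoop-snappy jar is in libext/bigdata and that the chosen native directory contains files. It cannot confirm that the libraries actually load inside PDI.

```python
from pathlib import Path


def check_snappy_setup(pdi_home, native_dir):
    """Return a list of problems; an empty list means the files are in place."""
    problems = []

    # The jar copied during installation should be in libext/bigdata.
    jar_dir = Path(pdi_home) / "libext" / "bigdata"
    if not list(jar_dir.glob("hadoop-snappy-*.jar")):
        problems.append(f"no hadoop-snappy jar found in {jar_dir}")

    # The directory that java.library.path points to should hold the
    # native libraries built for this platform.
    native = Path(native_dir)
    if not native.is_dir():
        problems.append(f"native library directory does not exist: {native}")
    elif not any(p.is_file() for p in native.iterdir()):
        problems.append(f"native library directory is empty: {native}")

    return problems
```

An empty result means the files are where the instructions place them; if the "Hadoop-snappy" option is still missing, the java.library.path setting itself is the next thing to review.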