Hadoop Copy Files
PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.
This job entry copies files in a Hadoop cluster from one location to another.
General
Option | Definition |
---|---|
Include Subfolders | If selected, all subdirectories within the chosen directory will be copied as well |
Destination is a file | Determines whether the destination is a file or a directory |
Copy empty folders | If selected, copies all directories, even if they are empty; the Include Subfolders option must also be selected for this option to take effect |
Create destination folder | If selected, will create the specified destination directory if it does not currently exist |
Replace existing files | If selected, duplicate files in the destination directory will be overwritten |
Remove source files | If selected, removes the source files after they are copied (equivalent to a move) |
Copy previous results to args | If selected, uses the results of the previous step as the sources and destinations |
File/folder source | The file or directory to copy from; click Browse and select Hadoop to enter your Hadoop cluster connection details |
File/folder destination | The file or directory to copy to; click Browse and select Hadoop to enter your Hadoop cluster connection details |
Wildcard (RegExp) | Defines the files to copy as a regular expression (instead of static file names); for instance, .*\.txt matches any file with a .txt extension |
Files/folders | A list of selected sources and destinations |
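
To make the Wildcard (RegExp) option more concrete, the following is a minimal sketch of how the pattern .*\.txt selects file names. It uses grep -E as a stand-in for the regular expression engine in the product, which may differ in details. The helper name match_wildcard and the sample file names are hypothetical, and matching against the whole file name (the -x flag) is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: reads file names on stdin and prints those the
# wildcard would select. grep -E (POSIX extended regex) is only an
# approximation of the regular expression syntax used by the job entry.
# -x requires the whole name to match; this is an assumption.
match_wildcard() {
  grep -Ex -- "$1"
}

# Sample names (made up for illustration).
printf '%s\n' report.txt notes.txt.bak data.csv | match_wildcard '.*\.txt'
# prints: report.txt
```

Note that the dot before txt is escaped; an unescaped dot would match any character, so .*.txt would also select a name such as reportXtxt.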
Result files name
Option | Definition |
---|---|
Add files to result files name | If selected, every file that is copied is added to the result of this step, which then lists the files that were copied |
Notes
When Kerberos security is not in use, the Hadoop API used by this step sends the username of the logged-in user when copying files, regardless of the username entered in the connect field. To change the user, set HADOOP_USER_NAME, either as an environment variable or as a Java system property. For example, you can pass it as a system property by changing the OPT variable in spoon.bat or spoon.sh:
OPT="$OPT .... -DHADOOP_USER_NAME=HadoopNameToSpoof"
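
The environment variable approach mentioned above can be sketched as follows for a Unix-like shell. This is an illustration rather than a required procedure: hdfs_user is a placeholder account name, and the commented launch line assumes spoon.sh is in the current directory.

```shell
#!/bin/sh
# Placeholder account name; replace it with the Hadoop user to act as.
HADOOP_USER_NAME="hdfs_user"
export HADOOP_USER_NAME

# Confirm the value that child processes (such as Spoon) will inherit.
sh -c 'echo "HADOOP_USER_NAME=$HADOOP_USER_NAME"'

# Then start Spoon from this same shell, for example:
# ./spoon.sh
```

Because the variable is exported, it applies only to programs started from that shell session, which keeps the change from affecting other users or tools on the machine.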