Hadoop File Output
PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.
Description
The Hadoop File Output step exports data to text files stored on a Hadoop cluster. It is commonly used to generate comma-separated values (CSV) files that can be read by spreadsheet applications. You can also generate fixed-width files by setting lengths on the fields in the Fields tab.
Options
These tables describe all available Hadoop File Output options.
File Tab
The File tab is where you define basic properties of the file being created.
Option | Description |
---|---|
Step name | Optionally, you can change the name of this step to fit your needs. Every step in a transformation must have a unique name. |
Hadoop Cluster | Lets you create, edit, and select a Hadoop cluster configuration. Hadoop cluster configuration settings can be reused in transformation steps and job entries that support this feature. In a Hadoop cluster configuration, you can specify information such as host names and ports for HDFS, Job Tracker, and other big data cluster components. Click Edit to change an existing Hadoop cluster configuration, or New to add one. For more information on Hadoop clusters, see Pentaho Help. |
Folder/File | Specifies the location and/or name of the text file to which to write. Click Browse to launch the Open File window and to navigate to the file or folder. |
Create Parent Folder | Indicates whether a parent folder should be created for the file when it is written. |
Do not create file at start | Enable to avoid empty files when no rows are getting processed. |
Accept file name from field? | Enables you to specify the file name(s) in a field in the input stream. |
File name field | When the previous option is enabled, you can specify the field that contains the filename(s) at runtime. |
Extension | Adds a period and the extension to the end of the file name (for example, .txt). |
Include stepnr in filename | If you run the step in multiple copies (see Launching several copies of a step), the copy number is included in the file name before the extension (for example, _0). |
Include partition nr in file name? | Includes the data partition number in the file name. |
Include date in file name | Includes the system date in the file name (for example, _20101231). |
Include time in file name | Includes the system time in the file name (for example, _235959). |
Specify Date time format | Lets you choose the date and time format from the Date time format dropdown list. |
Date time format | Dropdown list of date format options. |
Show file name(s) | Displays a list of the files that are generated. This is a simulation and depends on the number of rows that go into each file. |
Add filenames to result | This adds the filename to the internal file result set. |
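
The way the naming options in the table above combine can be sketched in a few lines of Python. This is only an illustration, not the step's actual implementation: the function `build_output_name`, its parameters, and in particular the order in which the suffixes are appended are assumptions made for the example. The only facts taken from the table are the suffix shapes (_0, _20101231, _235959) and that the copy number precedes the extension.

```python
from datetime import datetime


def build_output_name(base, extension="txt", copy_nr=None, partition_nr=None,
                      include_date=False, include_time=False, when=None):
    """Hypothetical helper: assemble a file name from File tab style options."""
    when = when or datetime.now()
    name = base
    if include_date:
        name += "_" + when.strftime("%Y%m%d")   # e.g. _20101231
    if include_time:
        name += "_" + when.strftime("%H%M%S")   # e.g. _235959
    if copy_nr is not None:
        name += "_" + str(copy_nr)              # step copy number, e.g. _0
    if partition_nr is not None:
        name += "_" + str(partition_nr)         # data partition number
    if extension:
        name += "." + extension                 # period plus extension
    return name


# The base name "sales" is a made-up example value.
stamp = datetime(2010, 12, 31, 23, 59, 59)
print(build_output_name("sales", copy_nr=0, include_date=True,
                        include_time=True, when=stamp))
# prints: sales_20101231_235959_0.txt
```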
Open File
Option | Definition |
---|---|
Open from Folder | Indicates the path and name of the directory you want to browse. This directory becomes the active directory. |
Up One Level | Displays the parent directory of the active directory shown in the Open from Folder field. |
Delete | Deletes a folder from the active directory. |
Create Folder | Creates a new folder in the active directory. |
Name | Displays the active directory, which is the one that is listed in the Open from Folder field. |
Filter | Applies a filter to the results displayed in the active directory contents. |
Content Tab
The Content tab contains these options for describing the content being written.
Option | Description |
---|---|
Append | Enable to append lines to the end of the specified file. |
Separator | Specifies the character that separates the fields in a single line of text. Typically this is semicolon ( ; ) or a tab. |
Enclosure | A pair of strings can enclose some fields. This allows separator characters in fields. The enclosure string is optional. |
Force the enclosure around fields? | Forces all fields to be enclosed with the character specified in the Enclosure property above. |
Header | Enable this option if you want the text file to have a header row (first line in the file). |
Footer | Enable this option if you want the text file to have a footer row (last line in the file). |
Format | Can be either DOS or UNIX. UNIX files have lines separated by line feeds; DOS files have lines separated by carriage returns and line feeds. |
Encoding | Specify the text file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, Spoon searches your system for available encodings. |
Compression | Specify the type of compression (.zip or .gzip) to use when compressing the output. Only one file is placed in a single archive. |
Fast data dump (no formatting) | Improves the performance when dumping large amounts of data to a text file by not including any formatting information. |
Split every ... rows | If the number N is larger than zero, the resulting text file is split into multiple parts of N rows each. |
Add Ending line of file | Allows you to specify an alternate ending row to the output file. |
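
As a rough illustration of how the Separator, Enclosure, Force the enclosure, Format, and Split every ... rows options interact, here is a minimal Python sketch. The helper names `format_line` and `split_rows` are hypothetical, and the rule used here (enclose a value only when it contains the separator, unless enclosure is forced) is an assumption for the example rather than a statement of the step's exact behavior. Escaping of enclosure characters inside values is deliberately left out.

```python
def format_line(values, separator=";", enclosure='"',
                force_enclosure=False, file_format="UNIX"):
    """Hypothetical helper: render one row as a delimited text line."""
    # DOS lines end with carriage return + line feed, UNIX lines with line feed.
    newline = "\r\n" if file_format == "DOS" else "\n"
    cells = []
    for value in values:
        text = "" if value is None else str(value)
        # Assumption: enclose when forced, or when the value contains the separator.
        if enclosure and (force_enclosure or separator in text):
            text = enclosure + text + enclosure
        cells.append(text)
    return separator.join(cells) + newline


def split_rows(rows, n):
    """Hypothetical helper: split rows into parts of n rows when n > 0."""
    rows = list(rows)
    if n <= 0:
        return [rows]                      # no splitting requested
    return [rows[i:i + n] for i in range(0, len(rows), n)]


print(repr(format_line(["a;b", 1])))       # prints: '"a;b";1\n'
print(repr(format_line(["x", "y"], force_enclosure=True, file_format="DOS")))
# prints: '"x";"y"\r\n'
print(split_rows(range(5), 2))             # prints: [[0, 1], [2, 3], [4]]
```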
Fields Tab
The Fields tab is where you define properties for the fields being exported. The table below describes each option for configuring the field properties:
Option | Description |
---|---|
Name | The name of the field |
Type | Type of the field can be either String, Date or Number. |
Format | The format mask to convert with. See Number Formats for a complete description of format symbols. |
Length | The meaning of the length option depends on the field type. |
Precision | The meaning of the precision option depends on the field type. |
Currency | Symbol used to represent currencies, as in $10,000.00 or €5.000,00. |
Decimal | A decimal point can be a "." (10,000.00) or "," (5.000,00) |
Group | A grouping can be a "," (10,000.00) or "." (5.000,00) |
Trim type | The trimming method to apply to the string. Trimming works only when no field length is given. |
Null | If the value of the field is null, this string is inserted into the text file. |
Get | Click to retrieve the list of fields from the input stream(s). |
Minimal width | Changes the options in the Fields tab so that the resulting width of lines in the text file is minimal. For example, instead of saving 0000001, the step writes 1. String fields are no longer padded to their specified length. |
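
The Currency, Decimal, and Group symbols and the effect of Minimal width on padding can be illustrated with a short Python sketch. The helpers `format_number` and `render_field` are hypothetical and only mimic the examples given in the table ($10,000.00 and €5.000,00); they do not reproduce the step's format mask handling. Right-padding with spaces is also an assumption made for the example.

```python
def format_number(value, precision=2, decimal=".", group=",", currency=""):
    """Hypothetical helper: apply decimal, grouping, and currency symbols."""
    text = f"{value:,.{precision}f}"        # e.g. 10,000.00
    # Swap the symbols through a placeholder so they do not overwrite each other.
    text = text.replace(",", "\x00").replace(".", decimal).replace("\x00", group)
    return currency + text


def render_field(text, length=None, minimal_width=False):
    """Hypothetical helper: pad a string field unless minimal width is requested."""
    if minimal_width or length is None:
        return text                         # no padding
    return text.ljust(length)               # assumption: pad on the right with spaces


print(format_number(10000, currency="$"))                          # prints: $10,000.00
print(format_number(5000, decimal=",", group=".", currency="€"))   # prints: €5.000,00
print(repr(render_field("abc", length=6)))                         # prints: 'abc   '
print(repr(render_field("abc", length=6, minimal_width=True)))     # prints: 'abc'
```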
Metadata Injection Support (7.x and later)
All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.