Excel Input (XLS, XLSX) including OpenOffice Workbooks (ODS)
PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.
Description
The Excel Input step provides you with the ability to read data from one or more Excel and OpenOffice files. The following sections describe each of the available features for configuring the Excel Input step.
Note: The default Spread sheet type (engine) is set to Excel 97-2003 XLS. When you are reading other file types like OpenOffice, ODS, Excel 2007 and using special functions like protected worksheets, you need to change the Spread sheet type (engine) in the Content tab accordingly.
Files Tab
The files tab is where you define the location of the Excel files from which you want to read. The table below contains options associated with the Files tab:
Option | Description |
---|---|
Step Name | Name of the step; the name has to be unique in a single transformation. |
Spread sheet type (engine) | This field allows you to specify the spreadsheet type. (since version 4.1.0) Currently the following are supported:
|
File or directory | Specifies the location and/or name of the input text file. Note: Click Add to add the file/directory/wildcard combination to the list of selected files (grid) below. |
Regular expression | Specify the regular expression you want to use to select the files in the directory specified in the previous option. For example, you want to process all files that have a .txt extension. (For further details see the Text File Input step and the section "Selecting file using Regular Expressions") |
Exclude Regular Expression | Excludes all files (in a given location) that meet the criteria specified by this regular expression. |
Selected Files | Contains a list of selected files (or wildcard selections) and a property specifying if file is required or not. If a file is required and it is not found, an error is generated;otherwise, the file name is skipped. |
Accept filenames from previous steps | Allows you to read in file names from a previous step in the transformation. You must also specify which step you are importing from, and the input field in that step from which you will retrieve the filename data |
Show filenames(s)... | Displays a list of all files that will be loaded based on the current selected file definitions |
Preview rows | Click Preview to examine the contents of the specified Excel file |
Sheets
In this tab you can specify the names of the sheets in the Excel workbook to read. For each of the sheet names you can specify the row and column to start at. Note : the row and column numbers are zero (0) based. (start to number at 0).
IMPORTANT: it is also possible to read all sheets in the worksheet. To arrive there, simply clear the "List of sheets to read" table and type the start row and column in the first row. This will be used for all sheets. Again, if you want to read all sheets in the workbook(s), it's important not to specify any sheet name (blank it out). In this case, the field structure of each sheet needs to be the same.
Content
The content tab allows you to configure the following properties:
Option | Description |
---|---|
Header | Enable if the sheets specified contain a header row to skip |
No empty rows | Enable if you don't want empty rows in the output of this step |
Stop on empty row | Makes the step stop reading the current sheet of a file when a empty line is encountered |
Limit | Limits the number of rows to this number (zero (0) means all rows). |
Encoding | Specifies the text file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, Spoon searches your system for available encodings.) |
Error handling
The Error handling tab allows you to configure the following properties:
Option | Description |
---|---|
Strict types? | If checked, PDI will report data type errors in the input. |
Ignore errors? | Enable if you want to ignore errors during parsing |
Skip error lines? | If checked, PDI will skip lines that contain errors. These lines can be dumped to a separate file by specifying a path in the Failing line numbers files directory field below. If this is not checked, lines with errors will appear as NULL values in the output. |
Warnings file directory | When warnings are generated, they are placed in this directory. The name of that file is <warning dir>/filename.<date_time>.<warning extension> |
Error files directory | When errors occur, they are placed in this directory. The name of that file is <errorfile_dir>/filename.<date_time>.<errorfile_extension> |
Failing line numbers files directory | When a parsing error occurs on a line, the line number is placed in this directory. The name of that file is <errorline dir>/filename.<date_time>.<errorline extension> |
Fields tab
The fields tab is for specifying the fields that must be read from the Excel files. Use Get fields from header row to fill in the available fields if the sheets have a header row automatically.
The Type column performs type conversions for a given field. For example, if you want to read a date and you have a String value in the Excel file, specify the conversion mask. Note: In the case of Number to Date conversion (for example, 20051028--> October 28th, 2005) specify the conversion mask yyyyMMdd because there will be an implicit Number to String conversion taking place before doing the String to Date conversion.
Option | Description |
---|---|
Name | The name of the field. |
Type | The field's data type; String, Date or Number. |
Length | The length option depends on the field type. Number: total number of significant figures in a number; String: total length of a string; Date: determines how much of the date string is printed or recorded. |
Precision | The precision option depends on the field type, but only Number is supported; it returns the number of floating point digits. |
Trim type | Truncates the field (left, right, both) before processing. Useful for fields that have no static length. |
Repeat | If set to Y, will repeat this value if the field in the next row is empty. |
Format | The format mask (number type). See Text File Input step and the section "Number Formats" for a complete description of format symbols. |
Currency | Symbol used to represent currencies. |
Decimal | A decimal point; this is either a dot or a comma. |
Grouping | A method of separating units of thousands in numbers of four digits or larger. This is either a dot or a comma. |
Additional output fields tab
This tab retrieves custom metadata fields to add to the step's output. The purpose of each field is defined in its name, but you can use these fields for whatever you want. Each item defines an output field that will contain the following information. Some of these are missing.
Option | Description |
---|---|
Full filename field | The full file name plus the extension. |
Sheetname field | The worksheet name you're using. |
Sheet row nr field | The current sheet row number. |
Row nr written field | Number of rows written |
Short filename field | The field name that contains the filename without path information but with an extension. |
Extension field | The field name that contains the extension of the filename. |
Path field | The field name that contains the path in operating system format. |
Size field | The field name that contains the size of the file, in bytes. |
Is hidden field | The field name that contains if the file is hidden or not (boolean). |
Uri field | The field name that contains the URI. |
Root uri field | The field name that contains only the root part of the URI. |