Excel Input (XLS, XLSX) including OpenOffice Workbooks (ODS)

PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.

Description

The Excel Input step provides you with the ability to read data from one or more Excel and OpenOffice files. The following sections describe each of the available features for configuring the Excel Input step.

Note: The default Spread sheet type (engine) is set to Excel 97-2003 XLS. When you are reading other file types like OpenOffice, ODS, Excel 2007 and using special functions like protected worksheets, you need to change the Spread sheet type (engine) in the Content tab accordingly.

Files Tab

The files tab is where you define the location of the Excel files from which you want to read. The table below contains options associated with the Files tab:

Option	Description
Step Name	Name of the step; the name has to be unique in a single transformation.
Spread sheet type (engine)	This field allows you to specify the spreadsheet type. (since version 4.1.0) Currently the following are supported: - Excel 97-2003 XLS: this is the default, backward compatible type provided for by the JXL software backend. - Excel 2007 XLSX (Apache POI): If you select this spread sheet type you can read all known Excel file types. Functionality provided by the Apache POI project. - Excel 2007 XLSX (Apache POI Streaming): This spread sheet type allows to read in large Excel files (Experimental in 5.1). - Open Office ODS: By selecting this type you can read OpenOffice spreadsheet using the ODFDOM engine. The default Spread sheet type (engine) is set to Excel 97-2003 XLS. When you are reading other file types like OpenOffice, ODS, Excel 2007 and using special functions like protected worksheets, you need to change the Spread sheet type (engine) in the Content tab accordingly.
File or directory	Specifies the location and/or name of the input text file. *Note*: Click Add to add the file/directory/wildcard combination to the list of selected files (grid) below.
Regular expression	Specify the regular expression you want to use to select the files in the directory specified in the previous option. For example, you want to process all files that have a .txt extension. (For further details see the Text File Input step and the section "Selecting file using Regular Expressions")
Exclude Regular Expression	Excludes all files (in a given location) that meet the criteria specified by this regular expression.
Selected Files	Contains a list of selected files (or wildcard selections) and a property specifying if file is required or not. If a file is required and it is not found, an error is generated;otherwise, the file name is skipped.
Accept filenames from previous steps	Allows you to read in file names from a previous step in the transformation. You must also specify which step you are importing from, and the input field in that step from which you will retrieve the filename data
Show filenames(s)...	Displays a list of all files that will be loaded based on the current selected file definitions
Preview rows	Click Preview to examine the contents of the specified Excel file

Sheets

In this tab you can specify the names of the sheets in the Excel workbook to read. For each of the sheet names you can specify the row and column to start at. Note : the row and column numbers are zero (0) based. (start to number at 0).

IMPORTANT: it is also possible to read all sheets in the worksheet. To arrive there, simply clear the "List of sheets to read" table and type the start row and column in the first row. This will be used for all sheets. Again, if you want to read all sheets in the workbook(s), it's important not to specify any sheet name (blank it out). In this case, the field structure of each sheet needs to be the same.

Content

The content tab allows you to configure the following properties:

Option	Description
Header	Enable if the sheets specified contain a header row to skip
No empty rows	Enable if you don't want empty rows in the output of this step
Stop on empty row	Makes the step stop reading the current sheet of a file when a empty line is encountered
Limit	Limits the number of rows to this number (zero (0) means all rows).
Encoding	Specifies the text file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, Spoon searches your system for available encodings.)

Error handling

The Error handling tab allows you to configure the following properties:

Option	Description
Strict types?	If checked, PDI will report data type errors in the input.
Ignore errors?	Enable if you want to ignore errors during parsing
Skip error lines?	If checked, PDI will skip lines that contain errors. These lines can be dumped to a separate file by specifying a path in the Failing line numbers files directory field below. If this is not checked, lines with errors will appear as NULL values in the output.
Warnings file directory	When warnings are generated, they are placed in this directory. The name of that file is <warning dir>/filename.<date_time>.<warning extension>
Error files directory	When errors occur, they are placed in this directory. The name of that file is <errorfile_dir>/filename.<date_time>.<errorfile_extension>
Failing line numbers files directory	When a parsing error occurs on a line, the line number is placed in this directory. The name of that file is <errorline dir>/filename.<date_time>.<errorline extension>

Fields tab

The fields tab is for specifying the fields that must be read from the Excel files. Use Get fields from header row to fill in the available fields if the sheets have a header row automatically.

The Type column performs type conversions for a given field. For example, if you want to read a date and you have a String value in the Excel file, specify the conversion mask. Note: In the case of Number to Date conversion (for example, 20051028--> October 28th, 2005) specify the conversion mask yyyyMMdd because there will be an implicit Number to String conversion taking place before doing the String to Date conversion.

Option	Description
Name	The name of the field.
Type	The field's data type; String, Date or Number.
Length	The length option depends on the field type. Number: total number of significant figures in a number; String: total length of a string; Date: determines how much of the date string is printed or recorded.
Precision	The precision option depends on the field type, but only Number is supported; it returns the number of floating point digits.
Trim type	Truncates the field (left, right, both) before processing. Useful for fields that have no static length.
Repeat	If set to Y, will repeat this value if the field in the next row is empty.
Format	The format mask (number type). See Text File Input step and the section "Number Formats" for a complete description of format symbols.
Currency	Symbol used to represent currencies.
Decimal	A decimal point; this is either a dot or a comma.
Grouping	A method of separating units of thousands in numbers of four digits or larger. This is either a dot or a comma.

Additional output fields tab

This tab retrieves custom metadata fields to add to the step's output. The purpose of each field is defined in its name, but you can use these fields for whatever you want. Each item defines an output field that will contain the following information. Some of these are missing.

Option	Description
Full filename field	The full file name plus the extension.
Sheetname field	The worksheet name you're using.
Sheet row nr field	The current sheet row number.
Row nr written field	Number of rows written
Short filename field	The field name that contains the filename without path information but with an extension.
Extension field	The field name that contains the extension of the filename.
Path field	The field name that contains the path in operating system format.
Size field	The field name that contains the size of the file, in bytes.
Is hidden field	The field name that contains if the file is hidden or not (boolean).
Uri field	The field name that contains the URI.
Root uri field	The field name that contains only the root part of the URI.