Google Docs Input
The Google Docs Input step provides you with the ability to read data from one or more Google Docs spreadsheets.
The following sections describe each of the available features for configuring the Google Docs Input step. If necessary, refer to the Google Dimensions and Metrics Reference.
Prerequisites
You must have a Google Docs account, an installation of PDI 3.2, and the Google Docs plug-in.
Note: The Google Docs plug-in is available to Pentaho enterprise customers only. The plug-in can be downloaded from the Pentaho FTP site and must be installed in your
...\plugins\steps
directory.
Files
The Files tab is where you define the location of the Google Docs files that you want read. The table below contains options associated with the Files tab:
Option |
Description |
---|---|
Step Name |
Unique name for the step |
Username |
Your Google Docs account user name |
Password |
Your Google Docs account password |
Google Docs Object ID |
Key to the Google document from which you want to read data - Note: The key is included in the URL associated with the document; your entry must be in the following format spreadsheet%pBb5yoxtYzKEyXDB9eqsNVG. Click Lookup to display the list of available keys. |
Sheets
The options in the Sheets tab allow you to specify the names of the sheets in the Google Docs workbook to read. For each of the sheet names you can specify the row and column to start at.
Note: The row and column numbers are zero (0) based; start to number at 0.
IMPORTANT: It is also possible to read all sheets in the worksheet. To read all sheets disable the "List of sheets to read" table and type the start row and column in the first row. When you want to read all sheets in the workbook(s), it is important not to specify any specific sheet name.
Content
The content tab allows you to configure the following properties:
Option |
Description |
---|---|
Header |
Enable if the sheets specified contain a header row to skip |
No empty rows |
Enable if you don't want empty rows in the output of this step |
Stop on empty row |
Makes the step stop reading the current sheet of a file when a empty line is encountered |
Filename |
Specifies a field name to include the file name in the output of this step. |
Sheetname |
Specifies a field name to include the sheet name in the output of this step. |
Sheer row nr |
Specifies a field name to include the sheet row number in the output of the step. The sheet row number is the actual row number in the Google Docs sheet. |
Row nrwritten |
Specifies a field name to include the row number in the output of the step. "Row number written" is the number of rows processed, starting at 1 and counting indefinitely |
Limit |
Limits the number of rows to this number (zero (0) means all rows). |
Encoding |
Specifies the character encoding (such as UTF-8, ASCII) |
Error handling
The Error handling tab allows you to configure the following properties:
Option |
Description |
---|---|
Strict types? |
Certain columns in the Google Docs input step can be flagged as numbers, strings, dates, and so on. Once flagged, if a column does not contain the right data type; for example, the column was flagged as numeric but contains a string input, an error occurs. |
Ignore errors? |
Enable if you want to ignore errors during parsing |
Skip error lines? |
Enable if you want to skip those lines that contain errors. Note: you can generate an extra file that will contain the line numbers on which the errors occurred. If lines with errors are not skipped, the fields that did have parsing errors, will be empty (null). |
Warnings file directory |
When warnings are generated, they are placed in this directory. The name of that file is <warning dir>/filename.<date_time>.<warning extension> |
Error files directory |
When errors occur, they are placed in this directory. The name of that file is <errorfile_dir>/filename.<date_time>.<errorfile_extension> |
Failing line numbers files directory |
When a parsing error occurs on a line, the line number is placed in this directory. The name of that file is <errorline dir>/filename.<date_time>.<errorline extension> |
Fields
The fields tab is for specifying the fields that must be read from the Google Docs files. Use Get fields from header row to fill in the available fields if the sheets have a header row automatically.
The Type column performs type conversions for a given field. For example, if you want to read a date and you have a String value in the Google Docs file, specify the conversion mask.
Note: In the case of Number to Date conversion (for example, 20081028--> October 28th, 2008) specify the conversion mask yyyyMMdd because there will be an implicit Number to String conversion taking place before doing the String to Date conversion.