Pentaho Software Architecture
Introduction
This document provides a detailed view of the software components that together make up the Pentaho open source software suite as it exists today.
At a high level, the software components can be organized in two ways. By role, the detailed list below groups them into third-party libraries and components that Pentaho has needed to fork and maintain, common libraries and projects that are used in general ways, pillars that are the core business analytics or data integration elements, tools that allow access to the pillars, and plugins across the pillars that provide additional functionality. By architectural purpose, the same components fall into four general areas: information delivery, data management / integration, analytics / reporting, and platform services. Each project below is categorized in both ways to give a multi-faceted view of the overall Pentaho architecture.
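The two-way categorization can be pictured as a small data model. The following is a minimal sketch only: the type names (`ComponentCatalog`, `Kind`, `Area`, `Component`) are hypothetical and do not come from any Pentaho codebase, and the sample entries are a handful copied from the listing in this document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical names throughout; this is an illustration, not Pentaho code.
final class ComponentCatalog {

    // Organizational role of a project within the suite.
    enum Kind { THIRD_PARTY_FORK, COMMON_COMPONENT, PILLAR, TOOL, PLUGIN }

    // The four architectural areas named in the introduction.
    enum Area { INFORMATION_DELIVERY, DATA_MANAGEMENT_INTEGRATION, ANALYTICS_REPORTING, PLATFORM_SERVICES }

    static final class Component {
        final String name;
        final Kind kind;
        final Area area;

        Component(String name, Kind kind, Area area) {
            this.name = name;
            this.kind = kind;
            this.area = area;
        }
    }

    // A few entries taken from the detailed listing in this document.
    static final List<Component> SAMPLE = Arrays.asList(
        new Component("kettle-vfs", Kind.THIRD_PARTY_FORK, Area.DATA_MANAGEMENT_INTEGRATION),
        new Component("metastore", Kind.COMMON_COMPONENT, Area.PLATFORM_SERVICES),
        new Component("mondrian", Kind.PILLAR, Area.ANALYTICS_REPORTING),
        new Component("kettle", Kind.PILLAR, Area.DATA_MANAGEMENT_INTEGRATION),
        new Component("common-ui", Kind.PLUGIN, Area.INFORMATION_DELIVERY));

    // Returns the names of all sample components in the given architectural area.
    static List<String> namesInArea(Area area) {
        List<String> names = new ArrayList<>();
        for (Component c : SAMPLE) {
            if (c.area == area) {
                names.add(c.name);
            }
        }
        return names;
    }

    // Returns the names of all sample components with the given organizational role.
    static List<String> namesOfKind(Kind kind) {
        List<String> names = new ArrayList<>();
        for (Component c : SAMPLE) {
            if (c.kind == kind) {
                names.add(c.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesInArea(Area.DATA_MANAGEMENT_INTEGRATION));
        System.out.println(namesOfKind(Kind.PILLAR));
    }
}
```

Keeping the role and the area as two independent fields lets the same list be filtered either way, which mirrors how each project entry below carries both a section heading and an Architectural Area.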
Cross Cutting Architectures, Best Practices and Use Cases
This section discusses high-level, cross-cutting software architectures and use cases.
Configuration Management
At this time, Pentaho uses a combination of SVN and Git to manage its source code. Here are some related articles:
http://wiki.pentaho.com/display/PEOpen/Advanced+Git+Topics
Metadata Definitions
As we continue to build a community of projects, it is important that they share terminology and common metadata. Here is the beginning of an effort to capture shared metadata to be used across all Pentaho projects:
http://wiki.pentaho.com/display/COM/Standard+MetaStore+Element+types
Javascript Development Guidelines
Pentaho's core technology is developed on the Java platform, but rich browser-based applications are becoming increasingly critical. Pentaho has a number of components that are browser only, and it is important that these projects share a common approach. Here is the beginning of our Javascript Development Guidelines:
http://wiki.pentaho.com/display/ServerDoc2x/Javascript+Development+Guidelines
Pentaho Prompting API
This is a generally useful library used in a variety of contexts, and is part of the Common UI plugin to the Pentaho Platform.
http://wiki.pentaho.com/display/ServerDoc2x/Pentaho+Prompting+API
Pentaho Coding Standards
Cross-cutting coding standards for all modules of the Pentaho suite can be found in our GitHub project, which includes configurations for the most popular IDEs.
https://github.com/pentaho/pentaho-coding-standards
Additional content needed around:
Visualizations
Logging
Plugin Architectures
Platform / BA Server Related
Scheduling and Background Execution in Pentaho User Console: /wiki/spaces/PMOPEN/pages/1249389595
Intro for creating a REST service for the BA Server: How to create and register a new REST service from a plugin
Developing Plugins
Kettle Related
Extending Kettle (Infocenter SDK)
UI Technologies
Datasources
Detailed Software Listing
This detailed software listing is organized in the general order in which software components are dependent on one another, although it should not be used as the official build order of Pentaho.
Third Party Maintained Forks
Common Components
Pillars
Tools
Plugins
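One way to picture an ordering by dependency is a topological sort. The sketch below is a minimal illustration and not Pentaho's build tooling: the class `DependencyOrder` is hypothetical, and the only edges used are two stated in this document (the Aggregation Designer and the MQL Editor are both based on pentaho-xul).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; an illustration of dependency ordering, not a build tool.
final class DependencyOrder {

    // Depth-first topological sort: each project appears after everything it depends on.
    static List<String> order(Map<String, List<String>> dependsOn) {
        List<String> ordered = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String project : dependsOn.keySet()) {
            visit(project, dependsOn, done, inProgress, ordered);
        }
        return ordered;
    }

    private static void visit(String project, Map<String, List<String>> dependsOn,
                              Set<String> done, Set<String> inProgress, List<String> ordered) {
        if (done.contains(project)) {
            return;
        }
        // Seeing a project again while it is still being visited means a cycle.
        if (!inProgress.add(project)) {
            throw new IllegalStateException("Dependency cycle at " + project);
        }
        for (String dependency : dependsOn.getOrDefault(project, Collections.emptyList())) {
            visit(dependency, dependsOn, done, inProgress, ordered);
        }
        inProgress.remove(project);
        done.add(project);
        ordered.add(project);
    }

    public static void main(String[] args) {
        Map<String, List<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("pentaho-agg-designer", Arrays.asList("pentaho-xul"));
        dependsOn.put("pentaho-mql-editor", Arrays.asList("pentaho-xul"));
        dependsOn.put("pentaho-xul", Collections.emptyList());
        System.out.println(order(dependsOn));
    }
}
```

As the text notes, an ordering like this is only a guide to how components relate; it is not the official build order.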
Third Party Maintained Forks
Pentaho intends to avoid forking and maintaining third-party open source software, but on a few occasions it has been necessary. The following are the third-party forks that Pentaho currently maintains and includes in its product.
kettle-vfs
Kettle VFS is a maintained fork of Apache Commons VFS.
Source Path: svn://source.pentaho.org/svnkettleroot/kettle-vfs
Architectural Owner: Matt Casters
Architectural Area: Data Management / Integration
hive
Due to the dynamic nature of Hadoop, Pentaho currently maintains its own Hive JDBC Driver implementation.
Source Path: https://github.com/pentaho/hive
Architectural Owner: Will Gorman
Architectural Area: Data Management / Integration
pentaho-ofc4j
Pentaho ChartBeans Flash components, which are still used by Pentaho Dashboards and Action Sequences, are based on Open Flash Chart. OFC4J is a Java-to-JSON converter used on the server to generate the correct metadata for the charts. The project is no longer maintained by its creator.
Source Path: https://github.com/pentaho/pentaho-ofc4j
Architectural Owner: Will Gorman
Architectural Area: Information Delivery
Common Components
This is a list of all the common libraries that Pentaho maintains that are included as part of the Pentaho Suite of technologies. Each common component has a specific purpose, and may be used by one or more pillars.
subfloor
Subfloor is Pentaho's common build system. It is based on Apache Ant and is used by all projects for compilation, assembly, unit testing, and code coverage.
Source Path: https://code.google.com/p/subfloor/ (Note that this location is out of date and should be transitioned to GitHub)
Architectural Owner: Will Gorman
Architectural Area: Build
pentaho-commons-database
This commons project is a GWT thin client of the shared database dialog. The submodule pentaho-database-model was an attempt at a thin Kettle DatabaseMeta implementation, which includes a dialect and JDBC Metadata architecture.
Source Path: https://github.com/pentaho/pentaho-commons-database
Architectural Owner: Will Gorman
Architectural Area: Data Management / Integration
pentaho-connections
This commons project provides an API for interacting with platform connections, usually within the context of xactions. Pentaho Metadata also uses this API for providing access to metadata in result sets.
Source Path: https://github.com/pentaho/pentaho-connections
Architectural Owner: Will Gorman
Architectural Area: Platform Services
pentaho-hdfs-vfs
This commons project provides an Apache Commons VFS implementation for Hadoop HDFS. Kettle uses it to access HDFS in a number of contexts.
Source Path: https://github.com/pentaho/pentaho-hdfs-vfs
Architectural Owner: Will Gorman
Architectural Area: Data Management / Integration
pentaho-vfs
This is a single XML file project that allows the platform's VFS driver, which maps to the scheme "solution:", to be loaded.
Source Path: svn://source.pentaho.org/svnroot/pentaho-commons/pentaho-vfs
Architectural Owner: Will Gorman
Architectural Area: Platform Services
pentaho-versionchecker
This project hosts the APIs and basic logic for checking the version information of a currently running platform.
Source Path: https://github.com/pentaho/pentaho-versionchecker
Architectural Owner: Will Gorman
Architectural Area: Platform Services
pentaho-palo-core
This project is an abstraction layer that isolates the use of Palo within the product so that there is no compile-time dependency on it, which is necessary because of Palo's GPL licensing.
Source Path: https://github.com/pentaho/pdi-palo-core
Architectural Owner: Matt Casters
Architectural Area: Data Management / Integration
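A common way to avoid a compile-time dependency on an optional library is to refer to its classes only by name and instantiate them reflectively at runtime. The following is a minimal sketch of that general technique, not the actual pentaho-palo-core code: the class `OptionalBackendLoader` and its `load` method are hypothetical, and `java.util.ArrayList` merely stands in for a separately shipped implementation class.

```java
import java.util.List;

// Hypothetical sketch; this is not the pentaho-palo-core implementation.
final class OptionalBackendLoader {

    // Instantiates the named class reflectively and casts it to the expected type.
    // The caller refers to the implementation only by its string name, so the
    // calling code compiles without that implementation on the classpath.
    static <T> T load(String className, Class<T> expectedType) {
        try {
            Class<?> implementation = Class.forName(className);
            Object instance = implementation.getDeclaredConstructor().newInstance();
            return expectedType.cast(instance);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Optional backend not available: " + className, e);
        }
    }

    public static void main(String[] args) {
        // java.util.ArrayList stands in for a backend class shipped separately.
        List<?> backend = load("java.util.ArrayList", List.class);
        System.out.println(backend.getClass().getName());
    }
}
```

With this shape, the product code depends only on a neutral interface, and the restricted library is needed only at runtime when the feature is actually used.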
pentaho-registry
The Pentaho Registry project was developed in the context of Instaview to provide an API for a system wide metadata index for purposes of understanding relationships and lineage.
Source Path: https://github.com/pentaho/pentaho-registry
Architectural Owner: Will Gorman
Architectural Area: Platform Services
metastore
The Pentaho Metastore is an API that defines a simple way to interact with a global metadata repository. It will be used as the main API for managing and storing global metadata definitions, such as connections.
Source Path: https://github.com/pentaho/metastore
Architectural Owner: Matt Casters
Architectural Area: Platform Services
pentaho-cwm (deprecated)
This is a legacy project that defines the Pentaho to CWM object model API mapping. The code in the project is mostly generated, and it has been deprecated in favor of the new Pentaho Metadata thin API. This module can be completely removed once Pentaho Metadata Editor is replaced by a new editor.
Source Path: https://github.com/pentaho/pentaho-cwm
Architectural Owner: Will Gorman
Architectural Area: Analytics / Reporting (Pentaho Metadata)
pentaho-metadata
The Pentaho Metadata project defines a reporting metadata architecture and implementations. The core components of Pentaho Metadata API are the Business Query (MQL), Logical Model, and Physical Models (CSV and SQL). Long term plans are to merge these concepts into Pentaho Mondrian and the Metastore, but this will be an evolutionary approach and will take time before Pentaho Metadata is deprecated.
Source Path: https://github.com/pentaho/pentaho-metadata
Architectural Owner: Will Gorman
Architectural Area: Analytics / Reporting
pentaho-commons-gwt-modules
The gwt-modules project consists of the gwt-widgets and gwt-widgets-samples submodules. It is a common home for reusable GWT capabilities shared by our downstream GWT-based projects, currently including Pentaho User Console, the data source wizard, and others.
Source Path: https://github.com/pentaho/pentaho-commons-gwt-modules
Architectural Owner: Nick Baker
Architectural Area: Information Delivery
pentaho-xul
This is a general UI framework developed by Pentaho to address our cross-platform UI requirements. It includes implementations for Swing, GWT, and SWT. One of the key requirements of this framework is pluggability, which is supported via overlays. Major projects using this framework include the Aggregation Designer, the Modeler, PUC, Report Designer, Spoon, MQL Editor, and others.
Source Path: https://github.com/pentaho/pentaho-commons-xul
Architectural Owner: Nick Baker
Architectural Area: Information Delivery
modeler
This project provides an easy to use Pentaho Modeler, which is able to generate both Pentaho Metadata and Mondrian file formats. It was originally built within the context of Agile BI, but is also currently used in the Data Source Wizard. The plan is to eventually replace Mondrian Schema Workbench and Pentaho Metadata Editor with this UI.
Source Path: https://github.com/pentaho/modeler
Architectural Owner: Nick Baker
Architectural Area: Analytics / Reporting
pentaho-mql-editor
This project provides a simple UI for business users to define Pentaho Metadata Queries. It is based on XUL so it can be used in Swing, SWT and GWT.
Source Path: https://github.com/pentaho/mql-editor
Architectural Owner: Nick Baker
Architectural Area: Information Delivery
pentaho-actionsequence-dom
This project provides a document object model for core action sequence file parsing. It also manages validation of action sequence files (.xaction). In the long term Pentaho is replacing Action Sequence functionality with Kettle Transformations and Job capabilities.
Source Path: https://github.com/pentaho/pentaho-actionsequence-dom
Architectural Owner: Will Gorman
Architectural Area: Platform Services
pentaho-chartbeans (deprecated)
This project was Pentaho's early attempt at a common Chart definition and API. We have since begun a transition to the Visualization API, which is part of the Pentaho platform plugin common-ui. This project is still in use today by Dashboard Designer and Action Sequences, but will be phased out in a future release.
Source Path: https://github.com/pentaho/pentaho-chartbeans
Architectural Owner: Will Gorman
Architectural Area: Information Delivery
Pillars
These projects make up the core of Pentaho's architecture, and are referred to as the BI Pillars within Pentaho.
mondrian
Mondrian is a relational OLAP engine. It supports Multidimensional Expressions (MDX) and requires a JDBC based star schema. It includes a Swing based schema designer called Schema Workbench.
Source Location: https://github.com/pentaho/mondrian
Architectural Owner: Julian Hyde
Architectural Area: Analytics / Reporting
pentaho-reporting
Pentaho Reporting is a banded reporting engine that supports many different datasources (JDBC, Kettle, etc) and outputs (PDF, Excel, CSV, etc). It includes a Swing based designer called Report Designer.
Source Location: https://github.com/pentaho/pentaho-reporting
Architectural Owner: Thomas Morgner
Architectural Area: Analytics / Reporting
kettle
Kettle is a metadata-based data integration engine with two runtimes: data transformation and job orchestration. Kettle includes an SWT-based designer called Spoon for design and execution, a server called Carte for remote execution, and command line tools for execution.
Source Location: https://github.com/pentaho/pentaho-kettle
Architectural Owner: Matt Casters
Architectural Area: Data Management / Integration
pentaho-platform
The Pentaho Platform, a name sometimes used synonymously with the Business Analytics Server, is primarily a server runtime and user interface (referred to as Pentaho User Console) for hosting Business Analytics applications, such as report execution, OLAP, dashboards, and transformation execution. The platform contains a legacy workflow engine called Action Sequences, which is slowly being phased out and replaced with Kettle Jobs and Transformations.
Source Location: https://github.com/pentaho/pentaho-platform
Architectural Owner: Nick Baker
Architectural Area: Platform Services
Related Architectural Documentation:
- Pentaho ObjectFactory and Spring Enhancements
- Developing Plugins
- /wiki/spaces/PMOPEN/pages/1249389595
- How to create and register a new REST service from a plugin
- How to register a new action based security (ABS) permission from a plugin
weka
Weka is a suite of machine learning software written in Java.
Source Location: https://svn.cms.waikato.ac.nz/svn/weka/
Architectural Owner: Mark Hall
Architectural Area: Analytics / Reporting
Tools
Many of the applications that make up Pentaho are included within the pillars mentioned above. The remaining tools are not managed within a pillar and are therefore listed separately.
pentaho-agg-designer
The Aggregation Designer is a wizard that allows for manual and automatic creation of aggregates for use with Mondrian. This application is based on pentaho-xul, and by default runs within the Swing container. This functionality will eventually be merged into the Pentaho Modeler.
Source Location: https://github.com/pentaho/pentaho-aggdesigner
Architectural Owner: Will Gorman
Architectural Area: Analytics / Reporting
pentaho-metadata-editor
The Pentaho Metadata Editor allows for editing of Pentaho Metadata schemas. This application was developed using SWT. This functionality will eventually be merged into the Pentaho Modeler.
Source Location: https://github.com/pentaho/pentaho-metadata-editor
Architectural Owner: Will Gorman
Architectural Area: Analytics / Reporting
Plugins
Many of the pillars have pluggable entry points that allow for extensibility of the suite. This is a listing of the plugins that are bundled by default within the Pentaho community suite.
Pentaho Platform Plugins
common-ui
common-ui hosts a number of shared thin client and server components used by other plugins. Major components within common-ui include the thin client data API, the Visualization API, a shared client-side prompting library, general common UI components, and more.
Source Location: https://github.com/pentaho/pentaho-platform-plugin-common-ui
Architectural Owner: Nick Baker
Architectural Area: Information Delivery
data-access
data-access is a plugin that hosts the data source wizard and other common data access components such as the database connection dialog and the thin modeler. The client code is based on GWT-XUL, and the back-end web services are a mix of GWT-RPC and SOAP based web service calls.
The datasource wizard allows for creation of Mondrian and Pentaho Metadata based datasources. There is a CSV option that stages the data into a single table, and a database option where users can specify relationships between tables. The datasource wizard will automatically generate schemas that then can be customized later by the user.
Source Location: https://github.com/pentaho/data-access
Architectural Owner: Nick Baker
Architectural Area: Data Management / Integration
cdf
cdf is short for "community dashboard framework", which provides a general framework for dashboarding, including a pluggable component architecture. WebDetails primarily drives the direction of CDF.
Source Location: https://github.com/webdetails/cdf
Project Homepage: http://www.webdetails.pt/ctools/cdf.html
Architectural Owner: Pedro Alves
Architectural Area: Information Delivery
cda (not bundled)
Community Data Access (CDA) is a Pentaho plugin designed for accessing data with great flexibility. It is used in a number of contexts within the community, and also used as a library in Pentaho Dashboard Designer. CDA utilizes Pentaho Reporting's Data Factory and Kettle's Transformation Engines for data access.
Source Location: https://github.com/webdetails/cda
Project Homepage: http://www.webdetails.pt/ctools/cda.html
Architectural Owner: Pedro Alves
Architectural Area: Data Management / Integration
reporting
The reporting plugin contains the elements necessary for executing Pentaho Reporting within the context of the platform. Included are a thin-client report viewer based on CDF and backend execution components for running and scheduling reports.
Source Location: https://github.com/pentaho/pentaho-platform-plugin-reporting
Architectural Owner: Thomas Morgner
Architectural Area: Information Delivery
Architectural overview: DOC and PPT
pdi
The PDI plugin allows native execution of transformations and jobs within the BI Server. At this time it is not bundled by default in the BI Server.
Source Location: https://github.com/pentaho/pdi-platform-plugin
Architectural Owner: Nick Baker
Architectural Area: Data Management / Integration
Kettle Plugins
agile-bi
The Agile BI Plugin allows a Spoon user to quickly visualize a relational database table in a free embedded version of Pentaho Analyzer. Agile BI automatically generates a Mondrian schema and includes the pentaho-modeler for editing the schema after auto-creation.
Source Location: https://github.com/pentaho/pdi-agile-bi-plugin
Architectural Owner: Nick Baker
Architectural Area: Information Delivery
big-data-plugin
The Big Data Plugin contains all the related big data integration components, including HDFS, Map Reduce, Visual MapReduce, Pig, Oozie, Sqoop, Hive, Avro, HBase and CouchDB components.
Source Location: https://github.com/pentaho/big-data-plugin
Architectural Owner: Matt Casters
Architectural Area: Data Management / Integration
pentaho-hadoop-shims
The Hadoop Shims library is a sub-component of the big-data-plugin and contains the necessary adaptive layers between the big data Kettle integration components and specific Hadoop versions.
Source Location: https://github.com/pentaho/pentaho-hadoop-shims
Architectural Owner: Matt Casters
Architectural Area: Data Management / Integration
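An adaptive layer of this kind is often built as a version-neutral interface with one adapter per supported version, selected at runtime. The sketch below illustrates that general pattern under stated assumptions; it is not the pentaho-hadoop-shims API. The names `ShimRegistry`, `ClusterShim`, and the distribution identifiers are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of an adapter ("shim") layer; not the pentaho-hadoop-shims API.
final class ShimRegistry {

    // The version-neutral contract that integration components would code against.
    interface ClusterShim {
        String describe();
    }

    private final Map<String, Supplier<ClusterShim>> factories = new LinkedHashMap<>();

    // Registers a factory that creates the adapter for one specific distribution.
    void register(String distributionId, Supplier<ClusterShim> factory) {
        factories.put(distributionId, factory);
    }

    // Creates the adapter for the requested distribution, or fails if none is registered.
    ClusterShim activate(String distributionId) {
        Supplier<ClusterShim> factory = factories.get(distributionId);
        if (factory == null) {
            throw new IllegalArgumentException("No shim registered for " + distributionId);
        }
        return factory.get();
    }

    public static void main(String[] args) {
        ShimRegistry registry = new ShimRegistry();
        // Made-up distribution identifiers, used purely for illustration.
        registry.register("example-distro-1", () -> () -> "adapter for example-distro-1");
        registry.register("example-distro-2", () -> () -> "adapter for example-distro-2");
        System.out.println(registry.activate("example-distro-2").describe());
    }
}
```

The callers never reference a version-specific class directly, so supporting a new version means adding one adapter and one registration rather than changing the integration components.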
pentaho-mongodb-plugin
This plugin includes input and output steps for MongoDB.
Source Location: https://github.com/pentaho/pentaho-mongodb-plugin
Architectural Owner: Mark Hall
Architectural Area: Data Management / Integration
pentaho-cassandra-plugin
This plugin includes input and output steps for Cassandra.
Source Location: https://github.com/pentaho/pentaho-cassandra-plugin
Architectural Owner: Mark Hall
Architectural Area: Data Management / Integration
pentaho-vertica-bulkloader
This plugin includes an HP Vertica bulk loader.
Source Location: https://github.com/pentaho/pentaho-vertica-bulkloader
Architectural Owner: Matt Casters
Architectural Area: Data Management / Integration
ArffOutput
The ARFF output plugin is a tool that allows you to output data from Kettle to a file in WEKA's Attribute Relation File Format (ARFF).
Source Location: https://github.com/pentaho/pdi-weka-arff-output-plugin
Architectural Owner: Mark Hall
Architectural Area: Analytics / Reporting
Documentation Link: http://wiki.pentaho.com/display/DATAMINING/Using+the+ARFF+Output+Plugin
WekaScoring
The Weka scoring plugin is a tool that allows classification and clustering models created with Weka to be used to "score" new data as part of a Kettle transform.
Source Location: https://github.com/pentaho/pdi-weka-scoring-plugin
Architectural Owner: Mark Hall
Architectural Area: Analytics / Reporting
Documentation Link: http://wiki.pentaho.com/display/DATAMINING/Using+the+Weka+Scoring+Plugin
Schema Workbench Plugins
pentaho-mondrianschemaworkbench-plugins
This plugin to Mondrian Schema Workbench allows for publishing of a Mondrian schema to the Pentaho Business Analytics Server.
Source Location: https://github.com/pentaho/pentaho-mondrianschemaworkbench-plugins
Architectural Owner: Will Gorman
Architectural Area: Analytics / Reporting