AEL and Spark Library Conflicts
The Problem
During execution of a transformation in the AEL engine, JAR files are loaded from several different locations:
- The set of JARs in data-integration/lib
- JARs from spark-install/jars
- JARs from the Hadoop classpath
- AEL JARs in Karaf
- JARs from kettle plugins (OSGi and otherwise)
The library versions in these locations can conflict. This causes the general class of problems in which Spark libraries conflict with Hadoop libraries [1], and it can also create AEL-specific problems [2].
Library conflicts have already produced several bugs, both in AEL code and in Spark generally:
- http://jira.pentaho.com/browse/BACKLOG-19577
- http://jira.pentaho.com/browse/BACKLOG-19655
- http://jira.pentaho.com/browse/BACKLOG-19573
OSGi is valuable precisely because it addresses this kind of problem, and most AEL execution happens within Karaf. The vulnerable areas, however, are:
- Execution that occurs outside the engine and does not use Karaf (such as SparkWebSocketMain).
- The set of packages specified by org.osgi.framework.system.packages.extra (from karaf/etc/custom.properties). That is, the set of packages exposed from the framework classloader.
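To review which packages the framework classloader exposes, the value of org.osgi.framework.system.packages.extra can be read out of karaf/etc/custom.properties. The Python sketch below is a simplified illustration, not a full Java properties parser: it assumes "=" as the key separator and a single trailing backslash for line continuation, and it drops OSGi attributes such as version. All function names are hypothetical.

```python
def logical_lines(text):
    """Yield property lines with backslash-continued lines joined (simplified)."""
    parts = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments, but only outside a continuation.
        if not parts and (not line or line.startswith(("#", "!"))):
            continue
        if line.endswith("\\"):
            parts.append(line[:-1])
            continue
        parts.append(line)
        yield "".join(parts)
        parts = []
    if parts:
        yield "".join(parts)


def split_outside_quotes(value):
    """Split on commas, ignoring commas inside double quotes (e.g. version ranges)."""
    entries, current, quoted = [], [], False
    for ch in value:
        if ch == '"':
            quoted = not quoted
        if ch == "," and not quoted:
            entries.append("".join(current))
            current = []
        else:
            current.append(ch)
    entries.append("".join(current))
    return entries


def system_packages_extra(text, key="org.osgi.framework.system.packages.extra"):
    """Return the package names listed under `key`, without their attributes."""
    for line in logical_lines(text):
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return [
                entry.split(";", 1)[0].strip()
                for entry in split_outside_quotes(value)
                if entry.strip()
            ]
    return []
```

Passing the contents of custom.properties to system_packages_extra returns the exposed package names as a list, which can then be checked against the libraries known to overlap with Spark and Hadoop.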
As of Pentaho 8.0, running AEL with Spark 2.1.0, the following 28 libraries conflict between data-integration/lib and spark-install/jars:
PDI 8.0 (data-integration/lib) | Spark 2.1.0 (spark-install/jars) |
activation-1.1.jar | activation-1.1.1.jar |
antlr-complete-3.5.2.jar | antlr-2.7.7.jar |
commons-beanutils-1.9.3.jar | commons-beanutils-1.7.0.jar |
commons-configuration-1.9.jar | commons-configuration-1.6.jar |
commons-io-2.2.jar | commons-io-2.4.jar |
commons-lang3-3.0.jar | commons-lang3-3.5.jar |
commons-net-1.4.1.jar | commons-net-2.2.jar |
commons-pool-1.5.7.jar | commons-pool-1.5.4.jar |
derby-10.2.1.6.jar | derby-10.12.1.1.jar |
eigenbase-properties-1.1.2.jar | eigenbase-properties-1.1.5.jar |
httpclient-4.5.3.jar | httpclient-4.5.2.jar |
httpcore-4.4.6.jar | httpcore-4.4.4.jar |
jackson-annotations-2.3.3.jar | jackson-annotations-2.6.5.jar |
jackson-core-2.3.3.jar | jackson-core-2.6.5.jar |
jackson-core-asl-1.9.2.jar | jackson-core-asl-1.9.13.jar |
jackson-databind-2.3.3.jar | jackson-databind-2.6.5.jar |
jackson-jaxrs-1.9.2.jar | jackson-jaxrs-1.9.13.jar |
jackson-mapper-asl-1.9.2.jar | jackson-mapper-asl-1.9.13.jar |
jackson-xc-1.9.3.jar | jackson-xc-1.9.13.jar |
janino-2.5.16.jar | janino-3.0.0.jar |
jersey-client-1.19.1.jar | jersey-client-2.22.2.jar |
jersey-server-1.19.1.jar | jersey-server-2.22.2.jar |
jetty-util-8.1.15.v20140411.jar | jetty-util-6.1.26.jar |
joda-time-1.6.jar | joda-time-2.9.3.jar |
slf4j-api-1.7.7.jar | slf4j-api-1.7.16.jar |
slf4j-log4j12-1.7.7.jar | slf4j-log4j12-1.7.16.jar |
snappy-java-1.1.0.jar | snappy-java-1.1.2.6.jar |
validation-api-1.0.0.GA.jar | validation-api-1.1.0.Final.jar |
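A table like the one above can be regenerated for a new release by comparing JAR file names in the two directories. The following Python sketch is one way to do that and is not part of AEL. It assumes the version begins at the first hyphen that is followed by a digit, and it matches artifacts by exact name, so renamed artifacts (antlr-complete versus antlr in the table) still need to be paired by hand. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: "<artifact>-<version>.jar", where the version starts with a digit.
_JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d.*)\.jar$")


def split_jar_name(filename):
    """Return (artifact, version) for a JAR file name, or None if it does not match."""
    match = _JAR_NAME.match(filename)
    if match is None:
        return None
    return match.group("artifact"), match.group("version")


def find_version_conflicts(names_a, names_b):
    """Return {artifact: (version_a, version_b)} for artifacts whose versions differ."""
    def index(names):
        versions = {}
        for name in names:
            parsed = split_jar_name(name)
            if parsed is not None:
                versions[parsed[0]] = parsed[1]
        return versions

    a, b = index(names_a), index(names_b)
    return {
        artifact: (a[artifact], b[artifact])
        for artifact in sorted(a.keys() & b.keys())
        if a[artifact] != b[artifact]
    }


def jar_names(directory):
    """List the JAR file names found directly inside `directory`."""
    return [path.name for path in Path(directory).glob("*.jar")]
```

For example, find_version_conflicts(jar_names("data-integration/lib"), jar_names("spark-install/jars")) returns a mapping from each shared artifact name to the pair of versions found in the two directories.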
Of these libraries, the packages exposed from the framework classloader are:
- com.sun.jersey.api.client
- org.apache.commons.configuration
- org.apache.commons.pool
- org.apache.commons.pool.impl
- org.apache.http
- org.apache.http.client.utils
- org.slf4j
Because these packages are provided by the framework classloader and are loaded from indeterminate library versions, there is an inherent risk of undesired and unpredictable behavior.
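One way to see where a framework-exposed package could come from is to list every JAR that contains classes in it. The Python sketch below treats each JAR as a zip archive and reports those that hold at least one class directly in the given package. It is an illustrative diagnostic only: it shows which JARs could supply the package, not which one the classloader actually uses at runtime. The function name is hypothetical.

```python
import zipfile


def jars_providing(package, jar_paths):
    """Return the JAR paths that contain a .class file directly in `package`."""
    prefix = package.replace(".", "/") + "/"
    hits = []
    for jar in jar_paths:
        with zipfile.ZipFile(jar) as archive:
            for name in archive.namelist():
                remainder = name[len(prefix):]
                # Match classes in the package itself, not in its subpackages.
                if (
                    name.startswith(prefix)
                    and name.endswith(".class")
                    and "/" not in remainder
                ):
                    hits.append(str(jar))
                    break
    return hits
```

If a call such as jars_providing("org.apache.http", paths) returns JARs of different versions from more than one of the locations listed earlier, that package is a candidate for the kind of conflict described here.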
Risk Mitigation
To reduce risk, follow these practices:
- Test against specific Spark and Hadoop versions, and recommend that users stay on that tested set.
- Minimize usage of classes within the above packages. Update the list of potentially conflicting packages as new releases come out.
- Wherever possible, leverage classes injected via blueprint within AEL.
- Avoid usage of libraries that overlap with Hadoop / Spark libraries for any packages retrieved via the framework classloader.
References
[1] https://markobigdata.com/2016/08/01/apache-spark-2-0-0-installation-and-configuration
[2] https://www.hackingnote.com/en/spark/trouble-shooting/NoClassDefFoundError-ClientConfig/