PMML Support in Weka
News
04/22/10 - SupportVectorMachineModel is now supported!
06/22/09 - RuleSetModel is now supported.
02/26/09 - TreeModel is now supported.
02/08/09 - Feedback from the PMML testing web page has resulted in some bug fixes and improvements (e.g. derived fields can now reference other derived fields as long as the referred field is declared before the referring field). Get these latest improvements via the download link above.
09/15/08 - Neural network, TransformationDictionary, LocalTransformation and DerivedField are now supported.
Overview
What is PMML?
The Predictive Modeling Markup Language (PMML) is a vendor-agnostic XML-based standard for expressing statistical and data mining models. Applications can produce and consume PMML models, thus allowing a model created in one application to be consumed and used for scoring (prediction) in another. The PMML standard is maintained by the Data Mining Group (DMG).
What PMML model types are supported?
Support for importing PMML models into Weka is under development. Implementation of the PMML (v 3.2) model types Regression, GeneralRegression, NeuralNetwork, TreeModel, RuleSetModel and SupportVectorMachineModel is complete. Support for other model types will follow in the future. The current plan is to implement support for (in order): naive Bayes, association rules and clustering models. This wiki page will be updated with new information and new download archives as more features are implemented.
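As a rough illustration of the list above, the following sketch inspects a PMML document and reports whether its top-level model element is one of the six implemented types. It uses only the JDK's XML parser and is not how Weka itself loads models. The class name PmmlModelTypeCheck and the sample document are hypothetical, the sample is deliberately minimal rather than schema-valid, and the element names assume the usual PMML spelling (for example RegressionModel and GeneralRegressionModel for the Regression and GeneralRegression types).

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Weka: checks a PMML document for a supported model element.
class PmmlModelTypeCheck {

    // PMML element names assumed to correspond to the model types listed above.
    static final List<String> SUPPORTED = Arrays.asList(
        "RegressionModel", "GeneralRegressionModel", "NeuralNetwork",
        "TreeModel", "RuleSetModel", "SupportVectorMachineModel");

    // Parses the XML text and returns the root <PMML> element.
    static Element parseRoot(String pmmlXml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(pmmlXml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed XML document", e);
        }
    }

    // Returns the value of the root element's version attribute ("" if absent).
    static String version(String pmmlXml) {
        return parseRoot(pmmlXml).getAttribute("version");
    }

    // Returns the name of the first supported model element, or null if there is none.
    static String findSupportedModel(String pmmlXml) {
        NodeList children = parseRoot(pmmlXml).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && SUPPORTED.contains(child.getNodeName())) {
                return child.getNodeName();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Minimal made-up document, just enough structure for the check.
        String xml = "<PMML version=\"3.2\"><Header/><DataDictionary/>"
            + "<TreeModel functionName=\"classification\"/></PMML>";
        System.out.println("version: " + version(xml));
        System.out.println("supported model: " + findSupportedModel(xml));
    }
}
```

A document whose only model element is one of the planned but not yet implemented types (a clustering model, say) would yield null here.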
What are the current limitations of Weka's PMML support?
Only PMML Regression, GeneralRegression, NeuralNetwork, TreeModel, RuleSetModel and SupportVectorMachineModel are implemented so far. GeneralRegression supports a single Predictor-to-Parameter matrix (i.e. in the case of classification, each target class value shares the same PPMatrix). Aggregate and MapValues expressions are not supported yet. The first six of the eleven PMML built-in functions are supported so far. There is no support for exporting PMML models from Weka yet.
How can I use PMML models with Pentaho?
PMML models can be used in two contexts: 1) In the Weka GUIs (Explorer and KnowledgeFlow) or from the command line, a PMML model can be loaded and applied to test data to score it. Since Weka's implementation of PMML import renders a PMML model as a standard (albeit immutable) Weka Classifier, all the standard Weka evaluation metrics are available for evaluating performance on the test set (if it contains reference target values); 2) Using the Weka scoring plugin for Pentaho Data Integration (Kettle), PMML models can be deployed for scoring as part of an ETL job.
Integration of PMML support into the Weka scoring plugin and a new PMML classifier scoring plugin for the Weka KnowledgeFlow have been completed (see below for example usage and screenshots). From Weka 3.6.0, PMML models can be run from the Classify panel in Weka's Explorer user interface and from the command line.
Example Output
Below is some example output of Weka's implementation of PMML GeneralRegression (multinomial logistic in this case) and the first few predictions (probability distributions over the class values) for some test data for the famous Iris dataset:
    PMML version 3.2
    PMML Model: multinomialLogistic

    Mining schema:

    @attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
       usage: predicted
       outlier treatment: asIs
       missing value treatment: asIs
    @attribute sepal_length numeric
       usage: active
       outlier treatment: asIs
       missing value treatment: asIs
    @attribute sepal_width numeric
       usage: active
       outlier treatment: asIs
       missing value treatment: asIs
    @attribute petal_length numeric
       usage: active
       outlier treatment: asIs
       missing value treatment: asIs
    @attribute petal_width numeric
       usage: active
       outlier treatment: asIs
       missing value treatment: asIs

    Covariates:
       sepal_length
       sepal_width
       petal_length
       petal_width

    Predictor-to-Parameter matrix:
    Predictor Parameter
                   sepal_length  sepal_width  petal_length  petal_width
    Intercept
    sepal_length              1
    sepal_width                            1
    petal_length                                         1
    petal_width                                                       1

    Parameter estimates:
    class              Coeff.     df
    Iris-setosa
       Intercept        33.1503   1
       sepal_length     11.8531   1
       sepal_width      13.2994   1
       petal_length    -26.9143   1
       petal_width     -37.9972   1
    Iris-versicolor
       Intercept        42.6378   1
       sepal_length      2.4652   1
       sepal_width       6.6809   1
       petal_length     -9.4294   1
       petal_width     -18.2861   1

    Found class class in test data.
    Actual: Iris-setosa Predicted: 0.999999999999996 4.0732051602909886E-15 6.290640809163842E-42
    Actual: Iris-setosa Predicted: 0.9999999999992712 7.287039008004535E-13 5.20202200281695E-38
    Actual: Iris-setosa Predicted: 0.9999999999997793 2.2066439218458243E-13 2.640414975828411E-39
    Actual: Iris-setosa Predicted: 0.9999999999638924 3.610752987621063E-11 7.108436456076591E-36
    ...
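To make the connection between the parameter estimates and the predicted distributions more concrete, here is a minimal sketch of multinomial logistic scoring using the coefficients printed above. It is not Weka's code. It assumes that Iris-virginica, which has no coefficients in the output, acts as the reference class with a linear score of 0, and that the first test instance is the familiar Iris row (5.1, 3.5, 1.4, 0.2). Both are assumptions, and the class name MultinomialLogisticSketch is hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: scores one instance with the multinomial logistic
// coefficients shown in the example output.
class MultinomialLogisticSketch {

    // Coefficient order: Intercept, sepal_length, sepal_width, petal_length, petal_width.
    static final double[] SETOSA = {33.1503, 11.8531, 13.2994, -26.9143, -37.9972};
    static final double[] VERSICOLOR = {42.6378, 2.4652, 6.6809, -9.4294, -18.2861};

    // Intercept plus the weighted sum of the four covariates.
    static double linearScore(double[] coeffs, double[] x) {
        double sum = coeffs[0];
        for (int i = 0; i < x.length; i++) {
            sum += coeffs[i + 1] * x[i];
        }
        return sum;
    }

    // Returns probabilities in the order setosa, versicolor, virginica.
    static double[] distribution(double[] x) {
        // Assumption: the class without coefficients is the reference class (score 0).
        double[] scores = {linearScore(SETOSA, x), linearScore(VERSICOLOR, x), 0.0};
        double max = Math.max(scores[0], Math.max(scores[1], scores[2]));
        double[] probs = new double[scores.length];
        double total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            // Subtracting the maximum keeps the exponentials in a safe range.
            probs[i] = Math.exp(scores[i] - max);
            total += probs[i];
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= total;
        }
        return probs;
    }

    public static void main(String[] args) {
        double[] firstIrisRow = {5.1, 3.5, 1.4, 0.2}; // assumed test instance
        System.out.println(Arrays.toString(distribution(firstIrisRow)));
    }
}
```

Under these assumptions the result is dominated by Iris-setosa, with the other two probabilities on the order of 1e-15 and 1e-42, which is in line with the first prediction shown above. Small differences are to be expected because the printed coefficients are rounded to four decimal places.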
Here is another example. This shows the output from Weka's implementation of PMML Regression (polynomial regression in this case) and the first few predictions for some test data on the Elnino dataset:
    PMML version 3.0
    PMML Model: polynomialRegression

    Mining schema:

    @attribute buoy numeric
       usage: active
       outlier treatment: asIs
       missing value treatment: asValue (replacementValue = 37.5481)
    @attribute day numeric
       usage: active
       outlier treatment: asIs
       missing value treatment: asValue (replacementValue = 8.8283)
    @attribute latitude numeric
       usage: active
       outlier treatment: asIs
       missing value treatment: asValue (replacementValue = 5.0354)
    @attribute longitude numeric
       usage: active
       outlier treatment: asIs
       missing value treatment: asValue (replacementValue = -106.1912)
    @attribute zon_winds numeric
       usage: active
       outlier treatment: asIs
       missing value treatment: asValue (replacementValue = -4.8239)
    @attribute mer_winds numeric
       usage: active
       outlier treatment: asIs
       missing value treatment: asValue (replacementValue = 2.6773)
    @attribute humidity numeric
       usage: active
       outlier treatment: asIs
       missing value treatment: asValue (replacementValue = 84.5448)
    @attribute airtemp numeric
       usage: predicted
       outlier treatment: asIs
       missing value treatment: asIs
    @attribute s_s_temp numeric
       usage: active
       outlier treatment: asIs
       missing value treatment: asValue (replacementValue = 28.2222)

    Regression table:
    airtemp =

        0.0894 * buoy +
       -0.0107 * day +
        0.0178 * latitude +
        0.002  * longitude +
        0.0389 * zon_winds +
       -0.0643 * mer_winds +
       -0.0345 * humidity +
        0.7101 * s_s_temp +
       -0.0031 * buoy^2 +
       -0.0061 * day^2 +
        0.0038 * latitude^2 +
        0.0186 * zon_winds^2 +
       -0.0134 * mer_winds^2 +
        0      * buoy^3 +
        0.0004 * day^3 +
        0      * longitude^3 +
        0.0013 * zon_winds^3 +
       10.0055

    Found class airtemp in test data.
    Actual: 27.32 Predicted: 27.15551895992993
    Actual: 26.7 Predicted: 27.23171331397675
    Actual: 27.36 Predicted: 27.24651843122208
    Actual: 27.32 Predicted: 27.300579679426757
    Actual: 27.09 Predicted: 27.03963237958885
    Actual: 26.82 Predicted: 27.12223258541705
    ...
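The regression table and the asValue missing value treatment above can be sketched in a few lines. The following is an illustration only, not Weka's implementation: missing inputs are represented as NaN (an assumption of this sketch), each one is replaced by the replacementValue from the mining schema, and the polynomial terms are then summed. The class name PolynomialRegressionSketch is hypothetical. Coefficients printed as 0 are presumably rounded, so this sketch should not be expected to reproduce the predictions listed above.

```java
// Hypothetical sketch of scoring with the polynomial regression table shown above.
class PolynomialRegressionSketch {

    // Active fields in mining schema order:
    // 0 buoy, 1 day, 2 latitude, 3 longitude, 4 zon_winds, 5 mer_winds, 6 humidity, 7 s_s_temp
    static final double[] REPLACEMENT = {
        37.5481, 8.8283, 5.0354, -106.1912, -4.8239, 2.6773, 84.5448, 28.2222};

    // Each row is {coefficient, field index, exponent}, copied from the regression table.
    static final double[][] TERMS = {
        {0.0894, 0, 1}, {-0.0107, 1, 1}, {0.0178, 2, 1}, {0.002, 3, 1},
        {0.0389, 4, 1}, {-0.0643, 5, 1}, {-0.0345, 6, 1}, {0.7101, 7, 1},
        {-0.0031, 0, 2}, {-0.0061, 1, 2}, {0.0038, 2, 2}, {0.0186, 4, 2},
        {-0.0134, 5, 2}, {0.0, 0, 3}, {0.0004, 1, 3}, {0.0, 3, 3}, {0.0013, 4, 3}};

    static final double INTERCEPT = 10.0055;

    // Predicts airtemp; NaN entries in the input are treated as missing values.
    static double predict(double[] input) {
        double result = INTERCEPT;
        for (double[] term : TERMS) {
            int field = (int) term[1];
            // asValue treatment: substitute the schema's replacement value.
            double value = Double.isNaN(input[field]) ? REPLACEMENT[field] : input[field];
            result += term[0] * Math.pow(value, term[2]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up instance with humidity (index 6) missing.
        double[] instance = {40.0, 10.0, 5.0, -110.0, -5.0, 2.0, Double.NaN, 28.0};
        System.out.println("Predicted airtemp: " + predict(instance));
    }
}
```

Storing the terms as data rather than hard-coding the expression keeps the sketch close to how a regression table is laid out in the output.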
Weka's implementation of TreeModel for classification and regression trees implements Weka's Drawable interface, which allows the tree to be output in the Dot language used by the excellent Graphviz graph visualization software from AT&T Research. This enables the tree to be visualized by Weka's built-in TreeVisualizer or by other tools that support the Dot language. Here is a visualization of a PMML tree generated by SPSS Clementine from the Cleveland heart disease data.
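For readers unfamiliar with the Dot language, the sketch below shows how a small decision tree can be written out as a Dot digraph. It is a standalone illustration with a hypothetical Node class and made-up split labels. It does not use Weka's Drawable interface, and its output is not claimed to match what Weka's TreeModel produces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: renders a tiny tree as a Dot digraph.
class DotTreeSketch {

    static final class Node {
        final String label;
        final List<String> edgeLabels = new ArrayList<>();
        final List<Node> children = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }

        // Adds a child reached via an edge with the given label; returns this node for chaining.
        Node add(String edgeLabel, Node child) {
            edgeLabels.add(edgeLabel);
            children.add(child);
            return this;
        }
    }

    // Escapes double quotes so labels stay valid inside Dot string literals.
    static String escape(String s) {
        return s.replace("\"", "\\\"");
    }

    static String toDot(Node root) {
        StringBuilder sb = new StringBuilder("digraph Tree {\n");
        write(root, new int[] {0}, sb);
        return sb.append("}\n").toString();
    }

    // Writes the node, then its subtrees, numbering nodes in pre-order; returns this node's id.
    private static int write(Node node, int[] nextId, StringBuilder sb) {
        int id = nextId[0]++;
        sb.append("N").append(id)
          .append(" [label=\"").append(escape(node.label)).append("\"]\n");
        for (int i = 0; i < node.children.size(); i++) {
            int childId = write(node.children.get(i), nextId, sb);
            sb.append("N").append(id).append("->N").append(childId)
              .append(" [label=\"").append(escape(node.edgeLabels.get(i))).append("\"]\n");
        }
        return id;
    }

    public static void main(String[] args) {
        // Made-up two-level tree; the attribute names are illustrative only.
        Node root = new Node("age")
            .add("<= 54", new Node("healthy"))
            .add("> 54", new Node("chest_pain")
                .add("= typical", new Node("healthy"))
                .add("= asymptomatic", new Node("sick")));
        System.out.print(toDot(root));
    }
}
```

The resulting text can be passed to Graphviz's dot tool, or to any other viewer that accepts the Dot language, to render the tree.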
Using PMML Models in the Weka Scoring Kettle Plugin
Once the Weka PMML library is installed in the same directory as the Weka scoring plugin in your Kettle plugins directory, using PMML models is simple and follows the same procedure as using a standard serialized Weka model (for more information on using the Weka scoring plugin, see the documentation provided with the distribution).
The following screenshot shows browsing for PMML model files from the WekaScoring file browser.
The next screenshot shows the "HEART_NOMREG" PMML GeneralRegression model loaded into the Weka scoring plugin.
Scoring Data using the PMML Classifier Scoring KnowledgeFlow Plugin
The PMML classifier scoring plugin for the KnowledgeFlow allows PMML classification and regression models to be loaded and used to score incoming batches of instances or instance streams in the KnowledgeFlow. Below are some example screenshots showing the PMML classifier scoring plugin, with a PMML binomial logistic regression model loaded, accepting an instance stream from the UCI Cleveland heart disease dataset. Evaluation metrics are computed by the incremental classifier evaluator component and displayed in a text viewer. Predictions for the data are appended and saved to a new ARFF file via the prediction appender and the ARFF saver components.
Using the PMML Library Programmatically
    import weka.core.pmml.PMMLFactory;
    import weka.core.pmml.PMMLModel;
    import weka.classifiers.pmml.consumer.PMMLClassifier;
    ...
    PMMLModel model = PMMLFactory.getPMMLModel("<path to PMML xml file>");
    System.out.println(model);
    if (model instanceof PMMLClassifier) {
      PMMLClassifier classifier = (PMMLClassifier)model;
      // Since PMMLClassifier is a subclass of weka.classifiers.Classifier,
      // you can use it just like any other Weka Classifier. The only
      // exception is that calling buildClassifier() will raise an
      // Exception because PMML models are pre-built.
    }