Pentaho Data Mining Community Documentation
Quick Start and Overview
Pentaho Data Mining, based on the Weka project, is a comprehensive set of tools for machine learning and data mining. Its broad suite of classification, regression, association rule, and clustering algorithms can help you understand your business better and improve future performance through predictive analytics.
There are two versions of Weka:
- Weka 3.8 - current stable version. This branch receives bug fixes to core Weka; new features are released through packages that can be installed via the built-in package manager.
- Weka 3.9 - development branch. This is a continuation of the 3.8 code line that receives both bug fixes and new features/improvements to core Weka. It also takes advantage of new features released in packages.
- Pentaho Data Mining Home Page (News, Downloads, Forums, Bug tracking etc.)
- Pentaho Data Mining Forum
- FAQ
- Pentaho Data Mining Screenshots
- Official Weka MOOC site
- Weka MOOCs YouTube channel
- Rushdi Shams Weka tutorial YouTube channel
- Video Tutorial: Pentaho Data Mining Overview and Use Case
- A collection of Weka videos contributed by Bill Claster
- An introductory article on data mining with Weka at IBM developerWorks by Michael Abernethy
- Tutorial slides on Weka from dataminingtools.net
Documentation
Pentaho Data Mining (Weka)
- English documentation for Weka 3.6.14 (stable version accompanying the 3rd edition of the book)
- English documentation for Weka 3.8.0 (latest stable version)
- English documentation for Weka 3.9.0 (development version)
- Wiki at wikispaces.com
- What's new in Weka 3.8.3 and 3.9.3
- What's new in Weka 3.8.2 and 3.9.2
- What's new in Weka 3.8.1 and 3.9.1
- What's new in Weka 3.8.0 and 3.9.0
- What's new in Weka 3.7.13
- What's new in Weka 3.7.12
- What's new in Weka 3.7.11
- What's new in Weka 3.7.10
- What's new in Weka 3.7.8
- What's new in Weka 3.7.7
- What's new in Weka 3.7.6
- What's new in Weka 3.7.5
- What's new in Weka 3.7.4
- What's new in Weka 3.7.3
- What's new in Weka 3.7.2
- What's new in Weka 3.7.1
- What's new in Weka 3.7.0
- What's new in Weka 3.6.0
- What's new in Weka 3.5.8
- Data Mining Algorithms and Tools in Weka
- A white paper on deploying Weka models with Pentaho
- Time Series Analysis and Forecasting with Weka
- R-Project integration
- Handling large data sets with Weka
- Weka Execution in Hadoop
- Cost/Benefit tool for analysis of direct mail applications
- Running and using Weka server instances
The book Data Mining: Practical Machine Learning Tools and Techniques (Fourth Edition) was written to accompany Weka.
Plugins for Pentaho Data Integration (Kettle)
- Using the Weka Scoring Plugin (download)
- Using the Reservoir Sampling Plugin (included as a first class step in recent Kettle distributions)
- Using the ARFF Output Plugin (download)
- Using the Univariate Statistics Plugin (included as a first class step in recent Kettle distributions)
- Using the Knowledge Flow Plugin (enterprise edition)
- Time Series Analysis and Forecasting with Weka (available as a PDI Spoon perspective as well as a Weka plugin)
- Weka time series forecasting plugin for PDI 4 (enterprise edition)
- 3D Visualization Perspective for PDI 4 (download)
Developing with Weka
Awards and Publications
- Ian H. Witten, Eibe Frank, Mark A. Hall and Christopher J. Pal. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA, 4th edition, 2016.
- Remco R. Bouckaert, Eibe Frank, Mark A. Hall, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. WEKA - Experiences with a Java Open-Source Project. Journal of Machine Learning Research, 11:2533-2541, 2010.
- Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann and Ian H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1), 2009.
- ACM SIGKDD Service Award 2005
Under Development/Roadmap
- Complete Knowledge Flow rewrite - new engine, refactored UI etc.
- PMML Support in Weka
- Distributed Weka for Hadoop and for Spark
- Incremental dictionary creation and vectorisation (StringToWordVector filter) for text documents
Archived
- Support for parallelism in ensemble learning (Bagging, Vote, RandomCommittee etc.)
- Knowledge Flow plugin for Kettle (ETL + Data Mining)
- HotSpot algorithm for automatic segmentation/profiling
- Groovy scripting component for the Knowledge Flow
- Exporting visualizations from Knowledge Flow processes