Datumbox Machine Learning Framework 0.7.0 Released


I am really excited to announce that, after several months of development, the new version of Datumbox is out! The 0.7.0 version brings multi-threading support, fast disk-based training for datasets that don’t fit in memory, several algorithmic enhancements and a better architecture. Download it now from GitHub or the Maven Central Repository.

What is new?

The focus of version 0.7.0 is to finally bring multi-threading support to the framework and to make disk-based training ultra fast. Moreover, it brings several algorithmic enhancements to all the Regression-based algorithms, the Collaborative Filtering model and the N-grams extractor which is used in NLP applications. The architecture of the framework has been redesigned to separate the project into multiple modules (note that the artifactId of the main library is now datumbox-framework-lib) and to simplify its structure. Finally, the new version brings several code improvements, better documentation in the form of javadocs and improved test coverage.

The 0.7.0 version of the framework is not backwards compatible with the 0.6.x branch. This is because major redevelopment was necessary in order to add the new features and improve & simplify the architecture of the framework. Below I discuss in detail the new features:

Multi-threading support

The new framework is several times faster than the 0.6.x branch. This was achieved by using threads, by doing heavy profiling on the hot spots of the code and by rewriting core components to enable non-blocking concurrent reads/writes. Currently, threads are used in all the algorithms that can be parallelized, which covers the majority of the supported models of the framework. Parallel execution is supported both during training and during testing/predicting.

The project uses lots of Java 8 features in order to reduce the verbosity of the code, improve readability and modernize the code-base. Note that even though the framework makes heavy use of streams, all tasks are executed in their own ForkJoinPool rather than in the shared common pool, to ensure that they will not get stuck. The level of parallelism is controlled either by programmatically changing the ConcurrencyConfiguration object or by configuring the datumbox.config.properties file.
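
To illustrate the general technique of running a parallel stream in a dedicated ForkJoinPool, here is a minimal sketch. It is not the framework's actual code: the class name, the method and the parallelism argument are made up for the example, and it relies on the widely used (but not formally specified) behaviour that a parallel stream started from inside a ForkJoinPool task runs in that pool.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

public class DedicatedPoolDemo {

    // Sums the squares of 1..n with a parallel stream that runs in its own pool.
    // "parallelism" is a hypothetical stand-in for the configured number of threads.
    static long sumOfSquares(int parallelism, long n) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // The stream pipeline is wrapped in a task and submitted to the dedicated pool,
            // so its work is not scheduled on ForkJoinPool.commonPool().
            Callable<Long> task = () ->
                    LongStream.rangeClosed(1, n).parallel().map(i -> i * i).sum();
            return pool.submit(task).join();
        } finally {
            // Release the worker threads once the task has finished.
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(4, 1000));
    }
}
```

Because each task gets its own pool, a slow or blocked task elsewhere in the JVM cannot occupy the workers that this task depends on.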

Disk-based Training

Even though disk-based training (training models without loading the data in memory) has been possible since the 0.6.0 version, it was so slow that it made the feature practically unusable. In version 0.7.0, the Storage Engine mechanism was redeveloped to enable a hybrid approach of storing the hot (regularly accessed) records in memory and in an LRU cache while keeping the rest on disk. This approach makes disk-based training very fast and it should be preferred even in cases where the data barely fit in memory (obviously if the data fit easily in RAM, the default in-memory training should be preferred). As in the previous version, the memory storage configuration can be changed programmatically by changing the appropriate DatabaseConfiguration objects or by configuring the datumbox.config.properties file.
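
The hybrid idea can be sketched with nothing but the standard library. The example below is only an illustration under simplifying assumptions, not the framework's Storage Engine: the HybridStore class is hypothetical, a plain HashMap stands in for the disk-backed map, and null values are not supported.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class HybridStore<K, V> {

    // Stand-in for the disk-backed map; a real engine would persist these entries.
    private final Map<K, V> cold = new HashMap<>();

    // Hot records, kept in access order so the eldest entry is the least recently used.
    private final LinkedHashMap<K, V> hot;

    public HybridStore(int hotCapacity) {
        this.hot = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > hotCapacity) {
                    // Spill the least recently used record to the cold store before eviction.
                    cold.put(eldest.getKey(), eldest.getValue());
                    return true;
                }
                return false;
            }
        };
    }

    public void put(K key, V value) {
        cold.remove(key);    // keep a single copy of every record
        hot.put(key, value); // may spill the least recently used record
    }

    public V get(K key) {
        V value = hot.get(key); // a hit also marks the record as recently used
        if (value == null) {
            value = cold.remove(key);
            if (value != null) {
                hot.put(key, value); // promote the record back to the hot cache
            }
        }
        return value;
    }

    public boolean isHot(K key) {
        return hot.containsKey(key);
    }

    public int size() {
        return hot.size() + cold.size();
    }
}
```

With a capacity of 2, inserting three records leaves the first one in the cold store; reading it again promotes it and spills whichever record was used least recently.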

At this point I would like to point out that this feature would not have been possible without the amazing work done by Jan Kotek on MapDB. MapDB is an embedded Java database engine which provides concurrent Maps backed by disk storage and off-heap memory. Using his open-source library, I was able to develop a Storage Engine which enables Datumbox to handle several GB worth of training data on my laptop without loading them in memory.

Algorithmic Enhancements

The new version adds support for L1, L2 and ElasticNet regularization in the SoftMaxRegression (Multinomial Logistic Regression), OrdinalRegression and NLMS (Linear Regression) models. This means that by using the same standard classes one can perform Ridge Regression, Lasso Regression or make use of Elastic Nets. Moreover, in the new version the Collaborative Filtering algorithm was modified to support more generic User-user CF models. Finally, the NgramsExtractor algorithm was rewritten so that it can export more keywords and provide better scores.
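
To make the relation between the three regularization types concrete, here is a small sketch of the penalty term that is added to the loss. It is a generic textbook formulation and not the framework's implementation: the class, the method and the l1/l2 parameter names are assumptions, and some conventions scale the L2 term by 1/2.

```java
public class ElasticNetPenalty {

    // penalty = l1 * sum(|w_i|) + l2 * sum(w_i^2)
    // l1 > 0, l2 = 0 gives the Lasso (L1) penalty,
    // l1 = 0, l2 > 0 gives the Ridge (L2) penalty,
    // and both > 0 gives the ElasticNet penalty.
    static double penalty(double[] weights, double l1, double l2) {
        double absSum = 0.0;
        double squaredSum = 0.0;
        for (double w : weights) {
            absSum += Math.abs(w);
            squaredSum += w * w;
        }
        return l1 * absSum + l2 * squaredSum;
    }

    public static void main(String[] args) {
        double[] weights = {1.0, -2.0, 3.0};
        System.out.println("Lasso:      " + penalty(weights, 0.5, 0.0));
        System.out.println("Ridge:      " + penalty(weights, 0.0, 0.5));
        System.out.println("ElasticNet: " + penalty(weights, 0.5, 0.5));
    }
}
```

Seen this way, Ridge and Lasso are the two special cases of the same penalty, which is why a single model class can cover all three.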

Framework Architecture & Code Improvements

Another important update on the new framework is the fact that the project was split into multiple sub-modules. Below I list the currently supported modules named after their artifactIds:

datumbox-framework-common: It contains the most important interfaces, helper and utility classes, data structures and mechanisms of the framework. This module does not contain any algorithms but it is the base of the framework.
datumbox-framework-core: It consists of the 3 main layers of the framework (Machine Learning, Statistics and Mathematics) along with the utilities layer. This module contains all the algorithms, methods and statistical tests of the framework.
datumbox-framework-applications: It contains a list of classes which are built to offer off-the-shelf solutions for common machine learning problems such as Text Classification, Data Modelling etc. All the classes of the module are built on top of the core module.
datumbox-framework-lib: This is the Datumbox Machine Learning Framework! Note that the artifactId of the library changed from “datumbox-framework” to “datumbox-framework-lib” as a result of the restructuring.

In addition to the above modules, we have the “datumbox-framework” parent module which is no longer the Java library but simply groups together all the sub-modules under the same project. In order to use the new framework in Maven projects, add the following lines to your pom.xml:

<dependencies>
   ...
   <dependency>
       <groupId>com.datumbox</groupId>
       <artifactId>datumbox-framework-lib</artifactId>
       <version>0.7.0</version>
   </dependency>
   ...
</dependencies>

The new version brings major changes to the structure of the framework, its interfaces and its inheritance hierarchy, with the main goal of simplifying and improving its architecture. One of the breaking changes introduced in the new framework is the deprecation of the old Dataset class (which was used to store all the training and testing data in the framework) and the introduction of the Dataframe class. The Dataframe class implements the Collection interface, allows the modification and deletion of records and enables the processing of the records in parallel. Another important change is the fact that the BaseMLrecommender, which is the base class for all Recommender System algorithms, now inherits from BaseMLmodel.

In addition to the above changes, the framework includes some code enhancements and bug fixes: a serialVersionUID was added to every serializable class, and the Exceptions and error messages have been improved, as have the javadocs and the test coverage. For more information about the updates of the new version, have a look at the Changelog.

The new Roadmap

Datumbox 0.7.0 has completed several important milestones of the originally proposed roadmap. The development of the framework will continue in the following months to cover the following targets:

Access the Framework via Console or Python: The framework should become more accessible to non-Java developers. To achieve this it should provide access to the algorithms via the command line or by offering an API in other languages like Python.
New Machine Learning algorithms: As the architecture of the framework becomes more mature, it will be easier to increase the number of supported algorithms and include models such as Mixture of Gaussians, Gaussian Processes, k-NN, Decision Trees, Random Forests, Factor Analysis, SVD, Factorization Machines, Artificial Neural Networks etc.
More Storage Engines: More options should be offered to the users of the framework to store their models and train their algorithms without loading all the data in memory. Moreover, better tools should be provided to those who want to move a model from one storage engine to another.
Improve Documentation, Test coverage & Code examples: Even though the javadocs and test coverage improve in each release, the documentation of the framework is still poor. Next versions should provide better documentation, better test coverage and more examples on how to use the supported algorithms.

Given that I have a full-time job, I expect that the development of the framework will continue at the same rate, releasing a new version every 4-6 months. If you would like to propose a new milestone feel free to open an issue on the official Github repository. Last but not least, if you use the project please consider contributing. It does not matter if you are a ninja Java Developer, a rock-star Data Scientist or a power user of the library; I can use all the help I can get so feel free to get in touch with me.

Acknowledgements

Once again I would like to thank my friend and colleague Eleftherios Bampaletakis for helping me improve the architecture of the framework; his feedback was invaluable. Also I would like to thank Jan Kotek for offering free consulting on how to use MapDB efficiently and for open-sourcing such an amazing product. Moreover, lots of thanks to ej-technologies GmbH and JetBrains for providing licenses for their amazing tools JProfiler and IntelliJ IDEA; both helped the development of the framework a lot. Last but not least, I would like to thank the love of my life, Kyriaki, for supporting and putting up with me while writing the project.

Don’t forget to clone the code of Datumbox v0.7.0 from GitHub. The library is also available on the Maven Central Repository. Also have a look at the Detailed Installation Guide and at the Code Examples to find out more on how to use the framework.

I am looking forward to your comments and recommendations. Pull requests are always welcome! 🙂