GoML, primarily a learning project, is a collection of simple ML models and interfacing tools. At present, the package focuses explicitly on supervised regression algorithms (the scope of functionality will expand as development goes on), with three available models and ensemble capabilities. A CLI and a RESTful API are implemented to provide interfacing in local and remote access environments.
Linear Regression is a staple of entry-level ML algorithms, with very understandable operating logic and clear assumptions/limitations. Given a suitable regression task, it can be as powerful as complex models for a fraction of the resources. However, linear regression is a generally limited model and is rarely the correct choice for a real-life application, where the assumption of strict linearity is often shaky.
To make the most of linear regression, focusing on feature selection is essential. The model is not known for its robustness; reducing multicollinearity, isolating linear correlations, and re-scaling features are often very beneficial for model performance.
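As a concrete illustration of the re-scaling step mentioned above, the standalone sketch below standardizes each feature column to zero mean and unit variance (a z-score transform). It is a generic textbook example, not a function from GoML's API.

package main

import (
	"fmt"
	"math"
)

// standardize rescales each feature column to zero mean and unit
// variance (a z-score transform), a common preprocessing step for
// linear models. Generic sketch; not part of GoML's API.
func standardize(x [][]float64) [][]float64 {
	if len(x) == 0 {
		return x
	}
	rows, cols := len(x), len(x[0])
	out := make([][]float64, rows)
	for i := range out {
		out[i] = make([]float64, cols)
	}
	for j := 0; j < cols; j++ {
		var mean, variance float64
		for i := 0; i < rows; i++ {
			mean += x[i][j]
		}
		mean /= float64(rows)
		for i := 0; i < rows; i++ {
			d := x[i][j] - mean
			variance += d * d
		}
		std := math.Sqrt(variance / float64(rows))
		for i := 0; i < rows; i++ {
			if std > 0 {
				out[i][j] = (x[i][j] - mean) / std
			}
		}
	}
	return out
}

func main() {
	// Two features on wildly different scales end up comparable.
	x := [][]float64{{1, 100}, {2, 200}, {3, 300}}
	fmt.Println(standardize(x))
}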
GoML's implementation of linear regression fits one coefficient per feature, with no intercept term (the intercept is what distinguishes the OLS model described below). The model is defined as:
type LinReg struct {
	X       [][]float64
	Y       []float64
	Coefs   []float64
	metrics metrics.Metrics
}

Where Coefs stores the coefficients determined at fit time and metrics stores the model's error metrics.
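To make "determined at fit time" concrete, here is a minimal standalone sketch of a no-intercept least-squares fit for a single feature, where the coefficient reduces to β = Σxᵢyᵢ / Σxᵢ². This is an illustrative simplification, not GoML's actual fit routine.

package main

import "fmt"

// fitNoIntercept computes the least-squares coefficient for a single
// feature with no intercept term: beta = sum(x*y) / sum(x*x).
func fitNoIntercept(x, y []float64) float64 {
	var xy, xx float64
	for i := range x {
		xy += x[i] * y[i]
		xx += x[i] * x[i]
	}
	return xy / xx // assumes at least one non-zero x value
}

func main() {
	x := []float64{1, 2, 3, 4}
	y := []float64{2.1, 3.9, 6.2, 7.8} // roughly y = 2x
	fmt.Printf("beta = %.3f\n", fitNoIntercept(x, y))
}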
Ordinary Least Squares (OLS) is very closely related to linear regression. In GoML, OLS is differentiated from linear regression through the inclusion of an intercept parameter. It shares linear regression's limitations, but can fit better when the target is offset from the features by an additive constant. Through the added intercept coefficient, OLS introduces an extra column into all matrix operations compared to linear regression; the two are kept separate simply to keep linear regression as lean as possible and preserve minimum latency where possible. The fit equation of OLS can be written as:

ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

where β₀ is the intercept.
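For a single feature, this fit has a well-known closed form: slope = cov(x, y) / var(x) and intercept = ȳ − slope·x̄. The sketch below works through it; it is illustrative and not GoML's implementation.

package main

import "fmt"

// fitOLS computes the single-feature OLS solution in closed form:
// slope = cov(x, y) / var(x), intercept = mean(y) - slope*mean(x).
func fitOLS(x, y []float64) (slope, intercept float64) {
	n := float64(len(x))
	var mx, my float64
	for i := range x {
		mx += x[i]
		my += y[i]
	}
	mx, my = mx/n, my/n
	var cov, varx float64
	for i := range x {
		cov += (x[i] - mx) * (y[i] - my)
		varx += (x[i] - mx) * (x[i] - mx)
	}
	slope = cov / varx
	intercept = my - slope*mx
	return slope, intercept
}

func main() {
	x := []float64{1, 2, 3, 4}
	y := []float64{5.1, 6.9, 9.2, 10.8} // roughly y = 2x + 3
	s, b := fitOLS(x, y)
	fmt.Printf("slope = %.2f, intercept = %.2f\n", s, b)
}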
OLS's model definition is similar to linear regression's, with the added intercept parameter:
type OLS struct {
	X         [][]float64
	Y         []float64
	Coefs     []float64
	Intercept float64
	metrics   metrics.Metrics
}

The decision tree algorithm is a binary tree built from binary splits on feature values. A "split" in this context is a binary decision, made on the most "optimal" feature available at each node's depth. Intuitively, a Decision Tree is a mathematically grounded way of playing the "20 questions" game (a game where players ask 20 yes/no questions to guess something) to reach a final verdict. Traditionally, and in GoML's implementation, each split is made on a single feature, so traversing the tree from top to bottom yields the distinct route defined by the results of all feature splits. This methodology gives the model an inherent feature independence and makes it more robust to scale differences and multicollinearity.

By extension, there is no global minimization target in a Decision Tree; instead, minimization occurs at each split, targeting feature impurity (how uniform the target values within a split are). Variance Reduction methods (used for split-wise optimization in regression tasks) decrease impurity by decreasing the average difference between the target values. In short, optimization is based on a compilation of local minima rather than a "monolithic" global optimization target. Thanks to this robustness, Decision Trees (or Random Forests) are often the model of choice where feature behavior is ambiguous or varies over time. However, it's important to note that this process is often computationally inefficient, and a model selected specifically to suit the given task can usually produce similar results with less compute.
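To make variance reduction concrete, the sketch below scores a candidate split by how much it lowers the size-weighted variance of the target values. The function names are illustrative, not GoML's internals.

package main

import "fmt"

// variance returns the mean squared deviation of ys from their mean,
// the impurity measure used in regression trees.
func variance(ys []float64) float64 {
	if len(ys) == 0 {
		return 0
	}
	var mean float64
	for _, y := range ys {
		mean += y
	}
	mean /= float64(len(ys))
	var v float64
	for _, y := range ys {
		v += (y - mean) * (y - mean)
	}
	return v / float64(len(ys))
}

// varianceReduction scores a candidate split as the parent's variance
// minus the size-weighted variance of the two children; higher is better.
func varianceReduction(parent, left, right []float64) float64 {
	n := float64(len(parent))
	wl := float64(len(left)) / n
	wr := float64(len(right)) / n
	return variance(parent) - (wl*variance(left) + wr*variance(right))
}

func main() {
	parent := []float64{1.0, 1.2, 0.9, 10.0, 10.5, 9.8}
	// A split that cleanly separates the two clusters scores highly.
	fmt.Printf("gain = %.3f\n", varianceReduction(parent, parent[:3], parent[3:]))
}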
As there's no single equation being minimized in the global scope of a Decision Tree, the model is presented in a slightly different fashion in GoML. Feature importances (how often a feature appears in splits) and the tree structure are often used to distill the model into a digestible summary. GoML takes this route as well, exposing the tree structure as a string and feature importances as percent values in the DecTree struct:
type DecTree struct {
	X               [][]float64
	Y               []float64
	Metrics         metrics.Metrics
	root            *Node // Standard tree node struct for binary trees
	MaxDepth        int
	MinSamplesSplit int
	MinSamplesLeaf  int
	MaxFeatures     *int
	RandomSeed      *int64
	rng             *rand.Rand
}

Ensemble estimators are created by combining multiple instances of identical base estimators. GoML implements two primary methods of ensemble generation.
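Both ensemble structs shown below hold a slice of Estimator values. The interface itself is not reproduced in this section, so the following is a plausible minimal shape, stated purely as an assumption:

// Hypothetical sketch of the minimal behavior the ensembles below need
// from a base estimator; GoML's actual Estimator interface may differ.
type Estimator interface {
	Fit() error                  // fit the estimator to its stored data
	Predict(x []float64) float64 // predict the target for one feature vector
}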
Bagging is an ensemble method that takes advantage of a probabilistic law at the basis of modern statistics. The Law of Large Numbers (LLN) states that, as the number of samples in a set of observations increases, the sample mean converges to the expected value.
More formally, an over-simplified form of the law can be denoted as X̄ₙ → μ as n → ∞, where X̄ₙ is the mean of n observed samples and μ is the true expected value.
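A quick simulation makes the convergence visible. The standalone sketch below (not part of GoML) averages an increasing number of fair-die rolls, whose expected value is 3.5:

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	// Average an increasing number of fair-die rolls; the sample mean
	// drifts toward the expected value of 3.5 as n grows.
	rng := rand.New(rand.NewSource(42))
	for _, n := range []int{10, 1000, 100000} {
		var sum float64
		for i := 0; i < n; i++ {
			sum += float64(rng.Intn(6) + 1)
		}
		fmt.Printf("n = %6d  mean = %.3f\n", n, sum/float64(n))
	}
}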
In Bagging, LLN presents itself in the form of error compensation: multiple independent models are fitted to randomly selected sub-samples of the input data. Each model generates a prediction, and a weighted average of the individual predictions is calculated, either with equal weights (the common default) or with weights derived from an error metric (RMSE in GoML's case).
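The aggregation step itself is simple; the sketch below shows a weighted average of per-model predictions. The weights here are illustrative: equal weights correspond to the common default, while GoML derives its weights from RMSE (per the struct comment below).

package main

import "fmt"

// aggregate combines per-model predictions with a weighted average;
// with equal weights this reduces to a plain mean.
func aggregate(preds, weights []float64) float64 {
	var out float64
	for i, p := range preds {
		out += weights[i] * p
	}
	return out
}

func main() {
	preds := []float64{10.2, 9.8, 10.5}
	weights := []float64{1.0 / 3, 1.0 / 3, 1.0 / 3} // equal-weight default
	fmt.Printf("ensemble prediction: %.3f\n", aggregate(preds, weights))
}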
GoML's Bagging struct is defined as:
type Bagged struct {
	// Ensemble Components
	Estimators []Estimator
	Bags       []Sample
	weights    []float64 // RMSE based
	// Raw Data
	X [][]float64
	Y []float64
	// Metrics
	FitMetrics metrics.Metrics // Metrics at fit time
	OOBMetrics metrics.Metrics // Metrics calculated from the data left outside each bag's sample
	// Random State
	RandSeed *int64
	rng      *rand.Rand
}

If we define Bagging as a horizontal ensemble (increasing the count of identical models), Boosting is the exact opposite.
Boosting is built on the idea that one model's shortcomings can be compensated by another.
The Boosting process starts with a single estimator fitted to the data as usual. Of course, this initial estimator produces some error in the form of residuals (the differences between its predictions and the true target values). Each subsequent estimator is then fitted to the previous residuals, and its learning-rate-scaled predictions are folded back into the ensemble, iteratively shrinking the remaining error.
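A minimal sketch of that loop, reusing the hypothetical Estimator interface from earlier: each round fits a fresh estimator to the current residuals and shrinks its contribution by the learning rate. This is an illustrative reading of the Factory and LearningRate fields below, not GoML's actual implementation.

package ensemble

// Estimator repeats the hypothetical interface sketched earlier.
type Estimator interface {
	Fit() error
	Predict(x []float64) float64
}

// boost sketches the boosting loop: each round fits a fresh estimator
// (obtained from the factory) to the current residuals, then subtracts
// its learning-rate-scaled predictions from those residuals.
func boost(x [][]float64, y []float64, rounds int, lr float64,
	factory func(x [][]float64, y []float64) Estimator) []Estimator {

	residuals := append([]float64(nil), y...) // round one fits the raw targets
	ests := make([]Estimator, 0, rounds)
	for r := 0; r < rounds; r++ {
		est := factory(x, residuals)
		if err := est.Fit(); err != nil {
			break
		}
		for i := range residuals {
			residuals[i] -= lr * est.Predict(x[i])
		}
		ests = append(ests, est)
	}
	return ests
}

// predictBoosted sums the learning-rate-scaled predictions of all rounds.
func predictBoosted(ests []Estimator, lr float64, x []float64) float64 {
	var out float64
	for _, e := range ests {
		out += lr * e.Predict(x)
	}
	return out
}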
GoML defines its Boosting struct as:
type Boosted struct {
	// Ensemble Components
	Estimators   []Estimator
	Factory      func(x [][]float64, y []float64) Estimator
	LearningRate float64
	// Raw Data
	X [][]float64
	Y []float64
	// Metrics
	Metrics metrics.Metrics
}