SMO: Optimisation without Gram Matrix Inversion
Sequential Minimal Optimisation
Unlike other optimisation algorithms that rely on gradients and require inversion of the Gram matrix, SMO breaks the optimisation into a sequence of analytically solvable two-dimensional subproblems, an approach known as a decomposition method. This eliminates the need for computationally expensive matrix operations. A direct competitor is the family of interior point methods, which also solve the dual and come with nice theory, but are difficult to implement and use a lot of memory.
Problem Setup
For simplicity, our running problem is a soft-margin SVM with a Gaussian kernel.
The dual form, derived from setting up the Lagrangian, is shown below:

$$\max_{\alpha} \;\; \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

$$\text{subject to} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
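To make the quantities concrete, here is a minimal NumPy sketch of the Gaussian kernel matrix and the dual objective; the function names and the parameter `gamma` (the RBF width) are illustrative choices, not fixed by the text above.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    """Gram matrix with K[s, t] = exp(-gamma * ||x_s - x_t||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clamp tiny negatives from rounding

def dual_objective(alpha, y, K):
    """sum_i alpha_i - 1/2 * sum_{i,j} alpha_i alpha_j y_i y_j K[i, j]."""
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v
```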
Algorithm
SMO solves the dual problem without matrix inversion by selecting two Lagrange multipliers at a time and solving a reduced QP over that pair. Much of the algorithm's practical performance depends on the heuristic used to select the pair, known as the working set. Other heuristics exist in the literature, but here we focus on the one that selects the maximal violating pair, i.e. the pair that most strongly violates the KKT conditions. First, though, we describe the simplified version, in which the working pair is selected at random at each iteration.
At each iteration, SMO selects a working pair of Lagrange multipliers $(\alpha_i, \alpha_j)$. The equality constraint $\sum_k \alpha_k y_k = 0$ implies that the updates to $\alpha_i$ and $\alpha_j$ must satisfy

$$y_i \alpha_i + y_j \alpha_j = \zeta$$

for some constant $\zeta$ fixed by the remaining multipliers. For example, if $y_i = +1$ and $y_j = -1$, the difference $\alpha_i - \alpha_j$ must stay constant, so the pair moves along a diagonal line inside the box $[0, C]^2$.
Reduced Subproblem
The optimisation over the selected pair reduces to a constrained quadratic minimisation. Here we fix all $\alpha_k$ with $k \neq i, j$ and eliminate $\alpha_i$ through the equality constraint, leaving a problem in $\alpha_j$ alone. The update involves four ingredients: (1) $\eta = K_{ii} + K_{jj} - 2K_{ij}$, the second derivative of the reduced objective; (2) the prediction errors $E_k = f(x_k) - y_k$; (3) bounds $(L, H)$ derived from the box constraints and the equality constraint; and (4) clipping of the solution to the interval $[L, H]$.
After solving for $\alpha_j$, the new $\alpha_i$ is recovered from the equality constraint:

$$\alpha_i^{\text{new}} = \alpha_i + y_i y_j \left( \alpha_j - \alpha_j^{\text{new}} \right).$$
(1) and (2)
The result follows by substituting $\alpha_i = y_i(\zeta - y_j \alpha_j)$ into the dual objective. The full derivation, in slightly different notation, can be found in Platt's paper; it is just expansion and algebra, after which you get the following linear and quadratic terms. Here we drop everything that does not depend on $\alpha_j$:

$$W(\alpha_j) = \left( y_j (E_i - E_j) + \eta\, \alpha_j^{\text{old}} \right) \alpha_j - \frac{1}{2} \eta\, \alpha_j^2 + \text{const}.$$

The linear coefficient is $y_j(E_i - E_j) + \eta\, \alpha_j^{\text{old}}$, where $E_k = f(x_k) - y_k$ is the prediction error on example $k$, and the quadratic coefficient is $-\frac{1}{2}\eta$ with $\eta = K_{ii} + K_{jj} - 2K_{ij}$, the second derivative of the reduced objective. Setting $\partial W / \partial \alpha_j = 0$ gives the unconstrained maximiser used in (4).
(3)
The bounds $(L, H)$ keep both multipliers inside $[0, C]$ while preserving the equality constraint. If $y_i \neq y_j$:

$$L = \max(0, \alpha_j - \alpha_i) \quad \text{and} \quad H = \min(C, C + \alpha_j - \alpha_i),$$

and if $y_i = y_j$:

$$L = \max(0, \alpha_i + \alpha_j - C) \quad \text{and} \quad H = \min(C, \alpha_i + \alpha_j).$$
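As a quick numeric check (with assumed values $C = 1$, $\alpha_i = 0.3$, $\alpha_j = 0.8$): if $y_i = y_j$, then $L = \max(0, 1.1 - 1) = 0.1$ and $H = \min(1, 1.1) = 1$, so $\alpha_j$ can move anywhere in $[0.1, 1]$ while $\alpha_i = 1.1 - \alpha_j$ stays inside $[0, 1]$.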
(4)
The analytical update is as follows. The unbounded update for $\alpha_j$ is

$$\alpha_j^{\text{unc}} = \alpha_j + \frac{y_j (E_i - E_j)}{\eta},$$

which is then clipped to the feasible interval:

$$\alpha_j^{\text{new}} =
\begin{cases}
H & \text{if } \alpha_j^{\text{unc}} > H \\
\alpha_j^{\text{unc}} & \text{if } L \le \alpha_j^{\text{unc}} \le H \\
L & \text{if } \alpha_j^{\text{unc}} < L
\end{cases}$$
Once new values are obtained, the bias term $b$ is updated depending on whether $\alpha_i$ or $\alpha_j$ lies strictly inside the interval $(0, C)$.
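Putting ingredients (1)–(4) and the bias update together, the following is a minimal sketch of a single pair update, assuming a precomputed Gram matrix `K`, labels `y` in $\{-1, +1\}$ and penalty `C`; the name `smo_pair_update` and the choice to simply skip pairs with non-positive curvature are illustrative, not from Platt.

```python
import numpy as np

def smo_pair_update(i, j, alpha, b, y, K, C, eps=1e-12):
    """One SMO step on the pair (i, j); returns updated (alpha, b)."""
    if i == j:
        return alpha, b
    # Prediction errors E_k = f(x_k) - y_k, with f(x_k) = sum_t alpha_t y_t K[t, k] + b.
    f = (alpha * y) @ K + b
    E_i, E_j = f[i] - y[i], f[j] - y[j]

    # (3) Bounds from the box constraints and the equality constraint.
    if y[i] != y[j]:
        L = max(0.0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    else:
        L = max(0.0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])
    if H - L < eps:
        return alpha, b

    # (1) Second derivative of the reduced objective.
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta < eps:  # non-positive curvature: skip the pair in this sketch
        return alpha, b

    # (4) Unbounded update, then clip to [L, H]; recover alpha_i from the constraint.
    a_i_old, a_j_old = alpha[i], alpha[j]
    a_j = float(np.clip(a_j_old + y[j] * (E_i - E_j) / eta, L, H))
    a_i = a_i_old + y[i] * y[j] * (a_j_old - a_j)
    alpha = alpha.copy()
    alpha[i], alpha[j] = a_i, a_j

    # Bias update: b1 / b2 put example i / j exactly on the margin;
    # prefer whichever multiplier is strictly inside (0, C).
    b1 = b - E_i - y[i] * (a_i - a_i_old) * K[i, i] - y[j] * (a_j - a_j_old) * K[i, j]
    b2 = b - E_j - y[i] * (a_i - a_i_old) * K[i, j] - y[j] * (a_j - a_j_old) * K[j, j]
    if 0.0 < a_i < C:
        b = b1
    elif 0.0 < a_j < C:
        b = b2
    else:
        b = 0.5 * (b1 + b2)
    return alpha, b
```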
Termination and KKT Conditions
SMO iteration continues until all Lagrange multipliers satisfy the KKT conditions up to a tolerance $\varepsilon$:

$$\alpha_i = 0 \;\Rightarrow\; y_i f(x_i) \ge 1 - \varepsilon, \qquad 0 < \alpha_i < C \;\Rightarrow\; |y_i f(x_i) - 1| \le \varepsilon, \qquad \alpha_i = C \;\Rightarrow\; y_i f(x_i) \le 1 + \varepsilon.$$
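As a sketch of this stopping rule together with the simplified random-pair loop described earlier (reusing the hypothetical `smo_pair_update` from above; names and the tolerance default are illustrative, and a real implementation would cache the errors rather than recompute them every iteration):

```python
import numpy as np

def max_kkt_violation(alpha, b, y, K, C):
    """Largest violation of the KKT conditions above."""
    margins = y * ((alpha * y) @ K + b)  # y_k * f(x_k)
    viol = np.zeros_like(alpha)
    # Exact comparisons are fine here: updates clip to the exact bounds 0 and C.
    at_zero, at_C = alpha == 0.0, alpha == C
    free = ~at_zero & ~at_C
    viol[at_zero] = np.maximum(0.0, 1.0 - margins[at_zero])
    viol[free] = np.abs(margins[free] - 1.0)
    viol[at_C] = np.maximum(0.0, margins[at_C] - 1.0)
    return viol.max()

def train_random_smo(y, K, C, tol=1e-3, max_iter=10_000, seed=0):
    """Simplified SMO: random working pairs until the KKT tolerance holds."""
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    for _ in range(max_iter):
        if max_kkt_violation(alpha, b, y, K, C) < tol:
            break
        i, j = rng.choice(n, size=2, replace=False)
        alpha, b = smo_pair_update(i, j, alpha, b, y, K, C)
    return alpha, b
```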
Heuristic for Working Pair Selection: Maximal Violation
So far the working pair has been selected randomly. To accelerate convergence, we use a heuristic for selecting the working set known as the Maximal Violating Pair (WSS 1). Instead of sampling $(i, j)$ uniformly:

- Choose $i$ with the largest KKT violation,
- Pair it with $j$ such that the prediction error difference $|E_i - E_j|$ is maximised.

Formally, writing $\nabla f(\alpha)$ for the gradient of the dual objective,

$$i \in \arg\max_{t \in I_{\text{up}}(\alpha)} \; -y_t \nabla f(\alpha)_t, \qquad j \in \arg\min_{t \in I_{\text{low}}(\alpha)} \; -y_t \nabla f(\alpha)_t,$$

where:

$$I_{\text{up}}(\alpha) = \{ t \mid \alpha_t < C,\, y_t = 1 \;\text{or}\; \alpha_t > 0,\, y_t = -1 \}, \qquad I_{\text{low}}(\alpha) = \{ t \mid \alpha_t < C,\, y_t = -1 \;\text{or}\; \alpha_t > 0,\, y_t = 1 \}.$$
This heuristic ensures that at each iteration we target the most significant KKT violation, which tends to produce large reductions in the dual objective and speeds up convergence compared to random sampling.
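Here is a sketch of WSS 1, assuming the common convention of minimising the dual $\frac{1}{2}\alpha^{\top} Q \alpha - e^{\top}\alpha$ with $Q_{ts} = y_t y_s K(x_t, x_s)$, so that $\nabla f(\alpha)_t = (Q\alpha)_t - 1$; the function name is hypothetical:

```python
import numpy as np

def select_maximal_violating_pair(alpha, y, K, C):
    """WSS 1: i maximises -y_t grad_t over I_up, j minimises it over I_low."""
    grad = y * ((alpha * y) @ K) - 1.0   # gradient of the minimised dual
    score = -y * grad
    I_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    I_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    i = np.flatnonzero(I_up)[np.argmax(score[I_up])]
    j = np.flatnonzero(I_low)[np.argmin(score[I_low])]
    violation = score[i] - score[j]      # m(alpha) - M(alpha)
    return i, j, violation
```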
It is useful to know that these three working set selections (WSS 1, WSS 2 and WSS 3) are nested special cases of one another: WSS 1 is WSS 2 with $\sigma = 1$, and WSS 2 is WSS 3 with $h(x) = \sigma x$.
WSS 2 allows any pair that violates the KKT conditions by at least a constant factor $\sigma \in (0, 1]$ of the maximal violation:

$$- y_i \nabla f(\alpha)_i + y_j \nabla f(\alpha)_j \;\ge\; \sigma \left( m(\alpha) - M(\alpha) \right),$$

where $m(\alpha) = \max_{t \in I_{\text{up}}(\alpha)} -y_t \nabla f(\alpha)_t$ and $M(\alpha) = \min_{t \in I_{\text{low}}(\alpha)} -y_t \nabla f(\alpha)_t$.
For the WSS 3 heuristic, the pair $(i, j) \in I_{\text{up}}(\alpha) \times I_{\text{low}}(\alpha)$ only needs to satisfy

$$- y_i \nabla f(\alpha)_i + y_j \nabla f(\alpha)_j \;\ge\; h\left( m(\alpha) - M(\alpha) \right),$$

where $h$ is a function satisfying $0 < h(x) \le x$ for all $x > 0$, so that choosing $h(x) = \sigma x$ recovers WSS 2.
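WSS 2 only requires *some* $\sigma$-factor violating pair, so any construction that certifies the bound works; the sketch below fixes $j$ at the minimal element of $I_{\text{low}}$ and samples $i$ among the indices that guarantee the factor (an illustrative choice, assuming the current iterate still violates the KKT conditions, i.e. $m(\alpha) > M(\alpha)$):

```python
import numpy as np

def select_sigma_violating_pair(alpha, y, K, C, sigma=0.5, seed=0):
    """WSS 2 sketch: any i here gives score[i] - score[j] >= sigma * (m - M)."""
    rng = np.random.default_rng(seed)
    grad = y * ((alpha * y) @ K) - 1.0
    score = -y * grad
    I_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    I_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    m, M = score[I_up].max(), score[I_low].min()
    j = np.flatnonzero(I_low)[np.argmin(score[I_low])]
    candidates = np.flatnonzero(I_up & (score >= M + sigma * (m - M)))
    return int(rng.choice(candidates)), int(j)
```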