SMO: Optimisation without Gram Matrix Inversion

optimisation
Author

Passawis

Published

May 1, 2025

Sequential Minimal Optimisation

Unlike optimisation algorithms that use gradients and require inversion of the Gram matrix, SMO breaks the optimisation into a sequence of analytically solvable two-dimensional subproblems; this strategy is known as a decomposition method. It eliminates the need for computationally expensive matrix operations. A direct competitor is the family of interior point methods, which also solve the dual and come with a well-developed theory, but are difficult to implement and memory-hungry.

Problem Setup

For the sake of simplicity, our problem is a standard SVM with a Gaussian kernel.

$$K(x_k, x_l) = \exp\left(-\frac{\|x_k - x_l\|^2}{\sigma^2}\right)$$
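
As a concrete reference point, here is a minimal NumPy sketch of this kernel (the function name and the `sigma` default are illustrative, not taken from any particular library):

```python
import numpy as np

def gaussian_kernel(x_k, x_l, sigma=1.0):
    # K(x_k, x_l) = exp(-||x_k - x_l||^2 / sigma^2)
    diff = np.asarray(x_k, dtype=float) - np.asarray(x_l, dtype=float)
    return np.exp(-np.dot(diff, diff) / sigma**2)
```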

The dual form, derived from setting up the Lagrangian, is shown below:

$$\max_{\alpha \in \mathbb{R}^m} \; \sum_{k=1}^{m} \alpha_k - \frac{1}{2} \sum_{k=1}^{m} \sum_{l=1}^{m} \alpha_k \alpha_l y_k y_l K(x_k, x_l)$$

$$\text{subject to } 0 \le \alpha_k \le C \;\text{ for all } k = 1, \dots, m, \qquad \sum_{k=1}^{m} \alpha_k y_k = 0$$
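
To make the objective concrete, here is a small sketch that evaluates it from a precomputed kernel matrix (assuming `alpha` and `y` are NumPy arrays and `K[k, l]` holds $K(x_k, x_l)$; the function name is illustrative):

```python
import numpy as np

def dual_objective(alpha, y, K):
    """Dual objective: sum(alpha) - 0.5 * sum_k sum_l alpha_k alpha_l y_k y_l K_kl."""
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v
```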

Algorithm

SMO solves the dual problem without matrix inversion by selecting two Lagrange multipliers at a time and solving a reduced QP over that pair. Much of the algorithm's behaviour depends on the heuristic used to select this pair, the working set. Several heuristics exist in the literature; here we focus on the one that selects the maximal violating pair, i.e. the pair that most strongly violates the KKT conditions. But first we describe the simplified version, in which the working pair is selected at random at each iteration.

At each iteration, SMO selects a working pair of Lagrange multipliers $(\alpha_i, \alpha_j)$ and holds all other variables fixed. In the simplified version, $\alpha_i$ is chosen by looping over the indices and its partner $\alpha_j$ is sampled at random. Recall the dual problem's equality constraint from our formulation:

$$\sum_{k=1}^{m} y_k \alpha_k = 0$$

implies that the updates to $\alpha_i$ and $\alpha_j$ must lie along a line of the form:

$$y_i \alpha_i + y_j \alpha_j = \zeta,$$

for some constant $\zeta$. This follows directly from the equality constraint: $\zeta$ is simply the contribution of the remaining multipliers, which are kept fixed:

$$y_i \alpha_i + y_j \alpha_j + \sum_{k \neq i, j} \alpha_k y_k = 0 \;\;\Longrightarrow\;\; y_i \alpha_i + y_j \alpha_j = -\sum_{k \neq i, j} \alpha_k y_k = \zeta$$

Reduced Subproblem

The optimisation over the selected pair reduces to a constrained quadratic minimisation. Using the equality constraint to eliminate $\alpha_i$, the problem becomes a one-dimensional constrained optimisation over $\alpha_j$, with $\alpha_i$ adjusted afterwards to maintain feasibility:

$$\min_{\alpha_j \in [L, H]} \; \frac{1}{2} \eta \alpha_j^2 + G \alpha_j + \text{const},$$

where:

  • $\eta = K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j)$ is the second derivative of the reduced objective (1)

  • $G$ depends on the prediction errors, $G = y_j (E_i - E_j)$, where $E_k = f(x_k) - y_k$ and $f(x_t) = \sum_{k \neq i, j} \alpha_k y_k K(x_k, x_t) + b + \alpha_i y_i K(x_i, x_t) + \alpha_j y_j K(x_j, x_t)$ (2)

  • $L$ and $H$ are bounds derived from the box constraints and the equality constraint (3)

  • The solution is clipped to the interval $[L, H]$ (4)

After solving for $\alpha_j$, $\alpha_i$ is updated to maintain the constraint $y_i \alpha_i + y_j \alpha_j = \zeta$.
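
Putting items (1) and (2) into code, a hypothetical helper that computes $\eta$ and $G$ for a chosen pair might look like this (it assumes a precomputed kernel matrix `K` and the current bias `b`; the name `eta_and_G` is illustrative):

```python
def eta_and_G(i, j, alpha, y, K, b):
    """Second derivative eta and linear coefficient G of the reduced subproblem over alpha_j."""
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    # prediction errors E_k = f(x_k) - y_k, with f(x_t) = sum_k alpha_k y_k K(x_k, x_t) + b
    f_i = (alpha * y) @ K[:, i] + b
    f_j = (alpha * y) @ K[:, j] + b
    E_i, E_j = f_i - y[i], f_j - y[j]
    G = y[j] * (E_i - E_j)
    return eta, G
```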

(1) and (2)

The result follows by substituting $y_i \alpha_i + y_j \alpha_j = \zeta$, rearranged as $\alpha_i = \frac{\zeta - \alpha_j y_j}{y_i}$, into the dual objective function in place of $\alpha_i$.

$$\alpha_i + \alpha_j - \frac{1}{2}\left(\alpha_i^2 K(x_i, x_i) + \alpha_j^2 K(x_j, x_j) + 2 \alpha_i \alpha_j y_i y_j K(x_i, x_j)\right) - \sum_{\substack{k=1 \\ k \neq i, j}}^{m} \alpha_k y_k \left(\alpha_i y_i K(x_i, x_k) + \alpha_j y_j K(x_j, x_k)\right) + \text{const}.$$

The full derivation, in slightly different notation, can be found in Platt's paper; it is only expansion and algebra, and it yields the following linear and quadratic terms in $\alpha_j$. Everything not depending on $\alpha_j$ is absorbed into the constant.

$$\underbrace{\alpha_j \left[(1 - y_j y_i) + \zeta y_j \left(K(x_i, x_i) - K(x_i, x_j)\right) + y_j \sum_{k \neq i, j} \alpha_k y_k \left(K(x_i, x_k) - K(x_j, x_k)\right)\right]}_{\text{Linear Term}} \; - \; \underbrace{\frac{1}{2}\left(K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j)\right) \alpha_j^2}_{\text{Quadratic Term}}$$

The linear coefficient $G$ is $(1 - y_j y_i) + \zeta y_j \left(K(x_i, x_i) - K(x_i, x_j)\right) + y_j \sum_{k \neq i, j} \alpha_k y_k \left(K(x_i, x_k) - K(x_j, x_k)\right)$; in the SMO literature, however, this value is usually derived in the form $y_j (E_i - E_j)$.

(3)

The bounds $[L, H]$ come from the box constraint $0 \le \alpha_j \le C$ in our original formulation (a short code sketch follows the two cases):

  • $y_i = y_j$: $L = \max(0, \alpha_i + \alpha_j - C)$ and $H = \min(C, \alpha_i + \alpha_j)$
  • $y_i \neq y_j$: $L = \max(0, \alpha_j - \alpha_i)$ and $H = \min(C, C + \alpha_j - \alpha_i)$
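
In code, these two cases translate directly (a sketch; `alpha` holds the current multipliers and the function name is illustrative):

```python
def bounds(i, j, alpha, y, C):
    """Feasible interval [L, H] for alpha_j when optimising the pair (i, j)."""
    if y[i] == y[j]:
        L = max(0.0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])
    else:
        L = max(0.0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    return L, H
```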

(4)

The analytical update is as follows:

The unbounded update for αj is:

$$\alpha_j^{\text{new}} = \alpha_j^{\text{old}} + \frac{y_j (E_i - E_j)}{\eta}$$

and is then clipped to satisfy the bounds $[L, H]$, determined from $\alpha_i$, $\alpha_j$ and their respective labels. Having solved for $\alpha_j^{\text{new}}$ we update

$$\alpha_i^{\text{new}} = \alpha_i^{\text{old}} + y_i y_j \left(\alpha_j^{\text{old}} - \alpha_j^{\text{new}}\right)$$

Once the new values are obtained, the bias term $b$ is updated depending on whether $\alpha_i$ or $\alpha_j$ lies strictly inside the interval $(0, C)$.

$$b := \begin{cases} b_i, & 0 < \alpha_i < C \\ b_j, & 0 < \alpha_j < C \\ \dfrac{b_i + b_j}{2}, & \text{otherwise} \end{cases}$$

$$b_i = b^{\text{old}} - E_i - y_i\left(\alpha_i^{\text{new}} - \alpha_i^{\text{old}}\right) K(x_i, x_i) - y_j\left(\alpha_j^{\text{new, clipped}} - \alpha_j^{\text{old}}\right) K(x_i, x_j)$$

$$b_j = b^{\text{old}} - E_j - y_i\left(\alpha_i^{\text{new}} - \alpha_i^{\text{old}}\right) K(x_j, x_i) - y_j\left(\alpha_j^{\text{new, clipped}} - \alpha_j^{\text{old}}\right) K(x_j, x_j)$$
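
Pulling the clipped update, the $\alpha_i$ correction and the bias rule together, one SMO step over a chosen pair could be sketched as follows (assumes a precomputed kernel matrix `K`; the helper name `update_pair` and the degenerate-$\eta$ handling are illustrative):

```python
import numpy as np

def update_pair(i, j, alpha, y, K, b, C, tol=1e-12):
    """One SMO update over the pair (i, j); returns the updated alphas and bias."""
    f = (alpha * y) @ K + b              # f(x_t) for every training point
    E = f - y                            # prediction errors E_t
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta <= tol:                       # degenerate pair: skip in this sketch
        return alpha, b
    # bounds [L, H] for alpha_j
    if y[i] == y[j]:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    a_i_old, a_j_old = alpha[i], alpha[j]
    a_j = float(np.clip(a_j_old + y[j] * (E[i] - E[j]) / eta, L, H))  # unclipped step, then clip
    a_i = a_i_old + y[i] * y[j] * (a_j_old - a_j)                     # restore the equality constraint
    # bias candidates that zero the error at x_i (resp. x_j)
    b_i = b - E[i] - y[i] * (a_i - a_i_old) * K[i, i] - y[j] * (a_j - a_j_old) * K[i, j]
    b_j = b - E[j] - y[i] * (a_i - a_i_old) * K[j, i] - y[j] * (a_j - a_j_old) * K[j, j]
    if 0.0 < a_i < C:
        b_new = b_i
    elif 0.0 < a_j < C:
        b_new = b_j
    else:
        b_new = 0.5 * (b_i + b_j)
    alpha[i], alpha[j] = a_i, a_j
    return alpha, b_new
```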

Termination and KKT Conditions

SMO iteration continues until all Lagrange multipliers satisfy the KKT conditions up to a tolerance ϵ:

$$\begin{cases} \alpha_i = 0 & \Rightarrow \; y_i f(x_i) \ge 1 - \epsilon \\ 0 < \alpha_i < C & \Rightarrow \; |y_i f(x_i) - 1| \le \epsilon \\ \alpha_i = C & \Rightarrow \; y_i f(x_i) \le 1 + \epsilon \end{cases}$$
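
A sketch of the corresponding stopping check, given the function values $f(x_i)$ for all training points (array and function names are illustrative, and $\alpha_i \approx 0$ and $\alpha_i \approx C$ are detected with a small tolerance):

```python
import numpy as np

def kkt_satisfied(alpha, y, f_vals, C, eps=1e-3):
    """True if every multiplier satisfies the epsilon-relaxed KKT conditions."""
    margins = y * f_vals                      # y_i * f(x_i)
    at_zero = alpha <= eps                    # alpha_i ~ 0
    at_C = alpha >= C - eps                   # alpha_i ~ C
    ok_zero = ~at_zero | (margins >= 1 - eps)
    ok_C = ~at_C | (margins <= 1 + eps)
    ok_inside = at_zero | at_C | (np.abs(margins - 1) <= eps)
    return bool(np.all(ok_zero & ok_C & ok_inside))
```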

Heuristic for Working Pair Selection: Maximal Violation

So far the working pair has been selected randomly; to accelerate convergence, we use a heuristic for selecting the working set known as the Maximal Violating Pair (WSS 1). Instead of sampling $\alpha_j$ randomly:

  1. Choose $\alpha_i$ with the largest KKT violation,
  2. Pair it with $\alpha_j$ such that the prediction error difference $E_i - E_j$ is maximised.

$$i \in \arg\max_{t \in I_{\text{up}}(\alpha^k)} \; -y_t \nabla f(\alpha^k)_t \quad (1)$$

$$j \in \arg\min_{t \in I_{\text{low}}(\alpha^k)} \; -y_t \nabla f(\alpha^k)_t \quad (2)$$

where:

$$I_{\text{up}}(\alpha^k) = \{\, t \mid \alpha_t < C,\, y_t = 1 \;\text{ or }\; \alpha_t > 0,\, y_t = -1 \,\}, \qquad I_{\text{low}}(\alpha^k) = \{\, t \mid \alpha_t < C,\, y_t = -1 \;\text{ or }\; \alpha_t > 0,\, y_t = 1 \,\},$$

$$\nabla f(\alpha^k) = Q\alpha^k - e,$$

with $Q$ as the kernel matrix adjusted by labels, $Q_{ij} = y_i y_j K(x_i, x_j)$, and $e$ a vector of ones. The selected $\{i, j\}$ is also known as the maximal violating pair, and for WSS 1 is referred to as the two-variable working set.
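
A sketch of this selection rule, given the gradient $\nabla f(\alpha^k) = Q\alpha^k - e$ (the function name is illustrative and `alpha`, `y`, `grad` are assumed to be NumPy arrays):

```python
import numpy as np

def maximal_violating_pair(alpha, y, grad, C):
    """WSS 1: i maximises -y_t * grad_t over I_up, j minimises it over I_low."""
    viol = -y * grad
    I_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    I_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    i = np.where(I_up)[0][np.argmax(viol[I_up])]
    j = np.where(I_low)[0][np.argmin(viol[I_low])]
    return i, j
```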

This heuristic ensures that at each iteration we target the most significant KKT violations, so the updates lead to the largest reductions in the dual objective, speeding up convergence compared with random sampling.
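
For completeness, here is a toy driver tying the sketches above together: it builds the kernel matrix, selects pairs with WSS 1, and stops once no pair violates the KKT conditions by more than $\epsilon$. It reuses the hypothetical helpers `gaussian_kernel`, `maximal_violating_pair` and `update_pair` from the earlier sketches, assumes `X` is an array of training points and `y` a NumPy array of $\pm 1$ labels, and is not an optimised implementation:

```python
import numpy as np

def smo(X, y, C=1.0, sigma=1.0, eps=1e-3, max_iter=10_000):
    """Toy SMO loop with maximal-violating-pair selection (WSS 1)."""
    n = len(y)
    K = np.array([[gaussian_kernel(X[k], X[l], sigma) for l in range(n)] for k in range(n)])
    Q = (y[:, None] * y[None, :]) * K          # label-adjusted kernel matrix
    alpha, b = np.zeros(n), 0.0
    for _ in range(max_iter):
        grad = Q @ alpha - np.ones(n)          # gradient of the dual objective f(alpha)
        i, j = maximal_violating_pair(alpha, y, grad, C)
        if (-y[i] * grad[i]) - (-y[j] * grad[j]) <= eps:   # M(alpha) - m(alpha) small enough
            break
        alpha, b = update_pair(i, j, alpha, y, K, b, C)
    return alpha, b
```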

It is useful to know that these three selection rules are nested, WSS 1 $\subseteq$ WSS 2 $\subseteq$ WSS 3, so a theorem that holds for WSS 2 and WSS 3 also holds in the case of WSS 1. In what follows, $M(\alpha^k)$ denotes the max (rather than argmax) in (1) and $m(\alpha^k)$ the min (rather than argmin) in (2).

WSS 2 allows any pair that violates the KKT conditions by at least some factor $\sigma \in (0, 1]$ of the maximal violation. The pair $(\alpha_i, \alpha_j)$ is selected such that

$$-y_i \nabla f(\alpha^k)_i + y_j \nabla f(\alpha^k)_j \ge \sigma \left(M(\alpha^k) - m(\alpha^k)\right),$$

For the WSS 3 heuristic, the pair $(\alpha_i, \alpha_j)$ is selected such that:

$$-y_i \nabla f(\alpha^k)_i + y_j \nabla f(\alpha^k)_j \ge h\left(M(\alpha^k) - m(\alpha^k)\right),$$

where $h : [0, \infty) \to [0, \infty)$ is a function satisfying $h(x) > 0$ for $x > 0$, and $h$ is locally Lipschitz continuous at $0$ with $h(0) = 0$.
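
As a small illustration of the WSS 2 criterion, a hypothetical check for whether a candidate pair qualifies might look like this (it reuses the $I_{\text{up}}$/$I_{\text{low}}$ definitions above, assumes NumPy arrays, and the name `satisfies_wss2` and the default `sigma_frac` are illustrative):

```python
def satisfies_wss2(i, j, alpha, y, grad, C, sigma_frac=0.5):
    """True if the pair (i, j) violates the KKT conditions by at least a
    sigma fraction of the maximal violation M(alpha) - m(alpha)."""
    viol = -y * grad
    I_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    I_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    M, m = viol[I_up].max(), viol[I_low].min()
    return (viol[i] - viol[j]) >= sigma_frac * (M - m)
```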