SMO: Optimisation without Gram Matrix Inversion

optimisation
Author

Passawis

Published

May 1, 2025

Sequential Minimal Optimisation

Unlike optimisation algorithms that use gradients and require inversion of the Gram matrix, SMO breaks the optimisation into a sequence of analytically solvable two-dimensional subproblems; this strategy is known as a decomposition method. It eliminates the need for computationally expensive matrix operations. A direct competitor is the family of interior point methods, which also solve the dual and come with a well-developed theory, but are difficult to implement and memory-hungry.

Problem Setup

For the sake of simplicity, our problem is a standard SVM with a Gaussian kernel.

$$K(x_k, x_l) = \exp\left(-\frac{\|x_k - x_l\|^2}{\sigma^2}\right)$$
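
As a concrete reference point, here is a minimal NumPy sketch of this kernel (the function name and the `sigma` default are illustrative, not taken from any particular library):

```python
import numpy as np

def gaussian_kernel(x_k, x_l, sigma=1.0):
    # K(x_k, x_l) = exp(-||x_k - x_l||^2 / sigma^2)
    diff = np.asarray(x_k, dtype=float) - np.asarray(x_l, dtype=float)
    return np.exp(-np.dot(diff, diff) / sigma**2)
```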

The dual form, derived from setting up the Lagrangian, is shown below:

$$\max_{\alpha \in \mathbb{R}^m} \; \sum_{k=1}^{m} \alpha_k - \frac{1}{2} \sum_{k=1}^{m} \sum_{l=1}^{m} \alpha_k \alpha_l y_k y_l K(x_k, x_l)$$

$$\text{subject to } 0 \le \alpha_k \le C \;\text{ for all } k = 1, \dots, m, \qquad \sum_{k=1}^{m} \alpha_k y_k = 0$$
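
To make the objective concrete, here is a small sketch that evaluates it from a precomputed kernel matrix (assuming `alpha` and `y` are NumPy arrays and `K[k, l]` holds $K(x_k, x_l)$; the function name is illustrative):

```python
import numpy as np

def dual_objective(alpha, y, K):
    """Dual objective: sum(alpha) - 0.5 * sum_k sum_l alpha_k alpha_l y_k y_l K_kl."""
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v
```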

Algorithm

SMO solves the dual problem without matrix inversion by selecting two Lagrange multipliers at a time and solving a reduced QP over that pair. Much of the algorithm's behaviour depends on the heuristic used to select this pair, the working set. Several heuristics exist in the literature; here we focus on the one that selects the maximal violating pair, i.e. the pair that most strongly violates the KKT conditions. But first we describe the simplified version, in which the working pair is selected at random at each iteration.

At each iteration, SMO selects a working pair of Lagrange multipliers $(\alpha_i, \alpha_j)$ and holds all other variables fixed. In the simplified version, $\alpha_i$ is chosen by looping over the indices and its partner $\alpha_j$ is sampled at random. Recall the dual problem's equality constraint from our formulation:

$$\sum_{k=1}^{m} y_k \alpha_k = 0$$

implies that the updates to $\alpha_i$ and $\alpha_j$ must lie along a line of the form:

$$y_i \alpha_i + y_j \alpha_j = \zeta,$$

for some constant $\zeta$. This follows directly from the equality constraint: $\zeta$ is simply the contribution of the remaining multipliers, which are kept fixed:

$$y_i \alpha_i + y_j \alpha_j + \sum_{k \neq i, j} \alpha_k y_k = 0 \;\;\Longrightarrow\;\; y_i \alpha_i + y_j \alpha_j = -\sum_{k \neq i, j} \alpha_k y_k = \zeta$$

Reduced Subproblem

The optimisation over the selected pair reduces to a constrained quadratic minimisation. Using the equality constraint to eliminate $\alpha_i$, the problem becomes a one-dimensional constrained optimisation over $\alpha_j$, with $\alpha_i$ adjusted afterwards to maintain feasibility:

$$\min_{\alpha_j \in [L, H]} \; \frac{1}{2} \eta \alpha_j^2 + G \alpha_j + \text{const},$$

where:

  • $\eta = K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j)$ is the second derivative of the reduced objective (1)

  • $G$ depends on the prediction errors, $G = y_j (E_i - E_j)$, where $E_k = f(x_k) - y_k$ and $f(x_t) = \sum_{k \neq i, j} \alpha_k y_k K(x_k, x_t) + b + \alpha_i y_i K(x_i, x_t) + \alpha_j y_j K(x_j, x_t)$ (2)

  • $L$ and $H$ are bounds derived from the box constraints and the equality constraint (3)

  • The solution is clipped to the interval $[L, H]$ (4)

After solving for $\alpha_j$, $\alpha_i$ is updated to maintain the constraint $y_i \alpha_i + y_j \alpha_j = \zeta$.
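
Putting items (1) and (2) into code, a hypothetical helper that computes $\eta$ and $G$ for a chosen pair might look like this (it assumes a precomputed kernel matrix `K` and the current bias `b`; the name `eta_and_G` is illustrative):

```python
def eta_and_G(i, j, alpha, y, K, b):
    """Second derivative eta and linear coefficient G of the reduced subproblem over alpha_j."""
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    # prediction errors E_k = f(x_k) - y_k, with f(x_t) = sum_k alpha_k y_k K(x_k, x_t) + b
    f_i = (alpha * y) @ K[:, i] + b
    f_j = (alpha * y) @ K[:, j] + b
    E_i, E_j = f_i - y[i], f_j - y[j]
    G = y[j] * (E_i - E_j)
    return eta, G
```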

(1) and (2)

The result follows by substituting $y_i \alpha_i + y_j \alpha_j = \zeta$, rearranged as $\alpha_i = \frac{\zeta - \alpha_j y_j}{y_i}$, into the dual objective function in place of $\alpha_i$.

$$\alpha_i + \alpha_j - \frac{1}{2}\left(\alpha_i^2 K(x_i, x_i) + \alpha_j^2 K(x_j, x_j) + 2 \alpha_i \alpha_j y_i y_j K(x_i, x_j)\right) - \sum_{\substack{k=1 \\ k \neq i, j}}^{m} \alpha_k y_k \left(\alpha_i y_i K(x_i, x_k) + \alpha_j y_j K(x_j, x_k)\right) + \text{const}.$$

The full derivation, in slightly different notation, can be found in Platt's paper; it is only expansion and algebra, and it yields the following linear and quadratic terms in $\alpha_j$. Everything not depending on $\alpha_j$ is absorbed into the constant.

$$\underbrace{\alpha_j \left[(1 - y_j y_i) + \zeta y_j \left(K(x_i, x_i) - K(x_i, x_j)\right) + y_j \sum_{k \neq i, j} \alpha_k y_k \left(K(x_i, x_k) - K(x_j, x_k)\right)\right]}_{\text{Linear Term}} \; - \; \underbrace{\frac{1}{2}\left(K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j)\right) \alpha_j^2}_{\text{Quadratic Term}}$$

The linear coefficient $G$ is $(1 - y_j y_i) + \zeta y_j \left(K(x_i, x_i) - K(x_i, x_j)\right) + y_j \sum_{k \neq i, j} \alpha_k y_k \left(K(x_i, x_k) - K(x_j, x_k)\right)$; in the SMO literature, however, this value is usually derived in the form $y_j (E_i - E_j)$.

(3)

The bounds $[L, H]$ come from the box constraint $0 \le \alpha_j \le C$ in our original formulation (a short code sketch follows the two cases):

  • $y_i = y_j$: $L = \max(0, \alpha_i + \alpha_j - C)$ and $H = \min(C, \alpha_i + \alpha_j)$
  • $y_i \neq y_j$: $L = \max(0, \alpha_j - \alpha_i)$ and $H = \min(C, C + \alpha_j - \alpha_i)$
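
In code, these two cases translate directly (a sketch; `alpha` holds the current multipliers and the function name is illustrative):

```python
def bounds(i, j, alpha, y, C):
    """Feasible interval [L, H] for alpha_j when optimising the pair (i, j)."""
    if y[i] == y[j]:
        L = max(0.0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])
    else:
        L = max(0.0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    return L, H
```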

(4)

The analytical update is as follows:

The unbounded update for αj is:

$$\alpha_j^{\text{new}} = \alpha_j^{\text{old}} + \frac{y_j (E_i - E_j)}{\eta}$$

and is then clipped to satisfy the bounds $[L, H]$, determined from $\alpha_i$, $\alpha_j$ and their respective labels. Having solved for $\alpha_j^{\text{new}}$ we update

$$\alpha_i^{\text{new}} = \alpha_i^{\text{old}} + y_i y_j \left(\alpha_j^{\text{old}} - \alpha_j^{\text{new}}\right)$$

Once the new values are obtained, the bias term $b$ is updated depending on whether $\alpha_i$ or $\alpha_j$ lies strictly inside the interval $(0, C)$.

$$b := \begin{cases} b_i, & 0 < \alpha_i < C \\ b_j, & 0 < \alpha_j < C \\ \dfrac{b_i + b_j}{2}, & \text{otherwise} \end{cases}$$

$$b_i = b^{\text{old}} - E_i - y_i\left(\alpha_i^{\text{new}} - \alpha_i^{\text{old}}\right) K(x_i, x_i) - y_j\left(\alpha_j^{\text{new, clipped}} - \alpha_j^{\text{old}}\right) K(x_i, x_j)$$

$$b_j = b^{\text{old}} - E_j - y_i\left(\alpha_i^{\text{new}} - \alpha_i^{\text{old}}\right) K(x_j, x_i) - y_j\left(\alpha_j^{\text{new, clipped}} - \alpha_j^{\text{old}}\right) K(x_j, x_j)$$
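
Pulling the clipped update, the $\alpha_i$ correction and the bias rule together, one SMO step over a chosen pair could be sketched as follows (assumes a precomputed kernel matrix `K`; the helper name `update_pair` and the degenerate-$\eta$ handling are illustrative):

```python
import numpy as np

def update_pair(i, j, alpha, y, K, b, C, tol=1e-12):
    """One SMO update over the pair (i, j); returns the updated alphas and bias."""
    f = (alpha * y) @ K + b              # f(x_t) for every training point
    E = f - y                            # prediction errors E_t
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta <= tol:                       # degenerate pair: skip in this sketch
        return alpha, b
    # bounds [L, H] for alpha_j
    if y[i] == y[j]:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    a_i_old, a_j_old = alpha[i], alpha[j]
    a_j = float(np.clip(a_j_old + y[j] * (E[i] - E[j]) / eta, L, H))  # unclipped step, then clip
    a_i = a_i_old + y[i] * y[j] * (a_j_old - a_j)                     # restore the equality constraint
    # bias candidates that zero the error at x_i (resp. x_j)
    b_i = b - E[i] - y[i] * (a_i - a_i_old) * K[i, i] - y[j] * (a_j - a_j_old) * K[i, j]
    b_j = b - E[j] - y[i] * (a_i - a_i_old) * K[j, i] - y[j] * (a_j - a_j_old) * K[j, j]
    if 0.0 < a_i < C:
        b_new = b_i
    elif 0.0 < a_j < C:
        b_new = b_j
    else:
        b_new = 0.5 * (b_i + b_j)
    alpha[i], alpha[j] = a_i, a_j
    return alpha, b_new
```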

Termination and KKT Conditions

SMO iteration continues until all Lagrange multipliers satisfy the KKT conditions up to a tolerance ϵ:

$$\begin{cases} \alpha_i = 0 & \Rightarrow \; y_i f(x_i) \ge 1 - \epsilon \\ 0 < \alpha_i < C & \Rightarrow \; |y_i f(x_i) - 1| \le \epsilon \\ \alpha_i = C & \Rightarrow \; y_i f(x_i) \le 1 + \epsilon \end{cases}$$
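
A sketch of the corresponding stopping check, given the function values $f(x_i)$ for all training points (array and function names are illustrative, and $\alpha_i \approx 0$ and $\alpha_i \approx C$ are detected with a small tolerance):

```python
import numpy as np

def kkt_satisfied(alpha, y, f_vals, C, eps=1e-3):
    """True if every multiplier satisfies the epsilon-relaxed KKT conditions."""
    margins = y * f_vals                      # y_i * f(x_i)
    at_zero = alpha <= eps                    # alpha_i ~ 0
    at_C = alpha >= C - eps                   # alpha_i ~ C
    ok_zero = ~at_zero | (margins >= 1 - eps)
    ok_C = ~at_C | (margins <= 1 + eps)
    ok_inside = at_zero | at_C | (np.abs(margins - 1) <= eps)
    return bool(np.all(ok_zero & ok_C & ok_inside))
```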

Heuristic for Working Pair Selection: Maximal Violation

So far the working pair has been selected randomly; to accelerate convergence, we use a heuristic for selecting the working set known as the Maximal Violating Pair (WSS 1). Instead of sampling $\alpha_j$ randomly:

  1. Choose $\alpha_i$ with the largest KKT violation,
  2. Pair it with $\alpha_j$ such that the prediction error difference $E_i - E_j$ is maximised.

$$i \in \arg\max_{t \in I_{\text{up}}(\alpha^k)} \; -y_t \nabla f(\alpha^k)_t \quad (1)$$

$$j \in \arg\min_{t \in I_{\text{low}}(\alpha^k)} \; -y_t \nabla f(\alpha^k)_t \quad (2)$$

where:

$$I_{\text{up}}(\alpha^k) = \{\, t \mid \alpha_t < C,\, y_t = 1 \;\text{ or }\; \alpha_t > 0,\, y_t = -1 \,\}, \qquad I_{\text{low}}(\alpha^k) = \{\, t \mid \alpha_t < C,\, y_t = -1 \;\text{ or }\; \alpha_t > 0,\, y_t = 1 \,\},$$

$$\nabla f(\alpha^k) = Q\alpha^k - e,$$

with $Q$ as the kernel matrix adjusted by labels, $Q_{ij} = y_i y_j K(x_i, x_j)$, and $e$ a vector of ones. The selected $\{i, j\}$ is also known as the maximal violating pair, and for WSS 1 is referred to as the two-variable working set.
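
A sketch of this selection rule, given the gradient $\nabla f(\alpha^k) = Q\alpha^k - e$ (the function name is illustrative and `alpha`, `y`, `grad` are assumed to be NumPy arrays):

```python
import numpy as np

def maximal_violating_pair(alpha, y, grad, C):
    """WSS 1: i maximises -y_t * grad_t over I_up, j minimises it over I_low."""
    viol = -y * grad
    I_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    I_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    i = np.where(I_up)[0][np.argmax(viol[I_up])]
    j = np.where(I_low)[0][np.argmin(viol[I_low])]
    return i, j
```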

This heuristic ensures that at each iteration we target the most significant KKT violations, so the updates lead to the largest reductions in the dual objective, speeding up convergence compared with random sampling.
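
For completeness, here is a toy driver tying the sketches above together: it builds the kernel matrix, selects pairs with WSS 1, and stops once no pair violates the KKT conditions by more than $\epsilon$. It reuses the hypothetical helpers `gaussian_kernel`, `maximal_violating_pair` and `update_pair` from the earlier sketches, assumes `X` is an array of training points and `y` a NumPy array of $\pm 1$ labels, and is not an optimised implementation:

```python
import numpy as np

def smo(X, y, C=1.0, sigma=1.0, eps=1e-3, max_iter=10_000):
    """Toy SMO loop with maximal-violating-pair selection (WSS 1)."""
    n = len(y)
    K = np.array([[gaussian_kernel(X[k], X[l], sigma) for l in range(n)] for k in range(n)])
    Q = (y[:, None] * y[None, :]) * K          # label-adjusted kernel matrix
    alpha, b = np.zeros(n), 0.0
    for _ in range(max_iter):
        grad = Q @ alpha - np.ones(n)          # gradient of the dual objective f(alpha)
        i, j = maximal_violating_pair(alpha, y, grad, C)
        if (-y[i] * grad[i]) - (-y[j] * grad[j]) <= eps:   # M(alpha) - m(alpha) small enough
            break
        alpha, b = update_pair(i, j, alpha, y, K, b, C)
    return alpha, b
```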

It is useful to know that these three selection rules are nested, WSS 1 $\subseteq$ WSS 2 $\subseteq$ WSS 3, so a theorem that holds for WSS 2 and WSS 3 also holds in the case of WSS 1. In what follows, $M(\alpha^k)$ denotes the max (rather than argmax) in (1) and $m(\alpha^k)$ the min (rather than argmin) in (2).

WSS 2 allows any pair that violates the KKT conditions by at least some factor $\sigma \in (0, 1]$ of the maximal violation. The pair $(\alpha_i, \alpha_j)$ is selected such that

$$-y_i \nabla f(\alpha^k)_i + y_j \nabla f(\alpha^k)_j \ge \sigma \left(M(\alpha^k) - m(\alpha^k)\right),$$

For the WSS 3 heuristic, the pair $(\alpha_i, \alpha_j)$ is selected such that:

$$-y_i \nabla f(\alpha^k)_i + y_j \nabla f(\alpha^k)_j \ge h\left(M(\alpha^k) - m(\alpha^k)\right),$$

where $h : [0, \infty) \to [0, \infty)$ is a function satisfying $h(x) > 0$ for $x > 0$, and $h$ is locally Lipschitz continuous at $0$ with $h(0) = 0$.
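
As a small illustration of the WSS 2 criterion, a hypothetical check for whether a candidate pair qualifies might look like this (it reuses the $I_{\text{up}}$/$I_{\text{low}}$ definitions above, assumes NumPy arrays, and the name `satisfies_wss2` and the default `sigma_frac` are illustrative):

```python
def satisfies_wss2(i, j, alpha, y, grad, C, sigma_frac=0.5):
    """True if the pair (i, j) violates the KKT conditions by at least a
    sigma fraction of the maximal violation M(alpha) - m(alpha)."""
    viol = -y * grad
    I_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    I_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    M, m = viol[I_up].max(), viol[I_low].min()
    return (viol[i] - viol[j]) >= sigma_frac * (M - m)
```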