Optimization with curves instead of lines

Convex optimization is the broadest class of optimization that stays in the cheap category. Most machine learning training fits here.

Convex optimization extends linear programming with curved objectives, as long as the curve goes the right way. "Convex" means the objective looks like a bowl — every local minimum is the global minimum. This structure makes the problem solvable in polynomial time, even when the objective is not linear. Most regularized statistical models (ridge regression, lasso, support vector machines, logistic regression) are convex. Portfolio optimization with risk penalties is convex. Many control problems are convex. Rust's clarabel library is a production-grade convex solver. When the problem is convex, the engineering estimate is days. When the problem is non-convex (most deep learning training), the toolkit changes to heuristics with no global guarantees. Knowing whether your objective is convex changes the engineering plan.

Ridge regression is the easiest convex problem to recognize — a quadratic loss plus an L2 penalty. The closed-form solution is one linear-system solve; nalgebra handles it. For larger problems the same setup goes to argmin or clarabel without any change of shape.

use nalgebra::{DMatrix, DVector};

fn ridge(x: &DMatrix<f64>, y: &DVector<f64>, lambda: f64) -> DVector<f64> {
    let n = x.ncols();
    let xt = x.transpose();
    let a = &xt * x + DMatrix::identity(n, n) * lambda;
    let b = xt * y;
    a.lu().solve(&b).expect("ridge is always solvable for lambda > 0")
}

DMatrix<T> and DVector<T> come from the nalgebra crate. The leading "D" stands for dynamic — the dimensions (rows × columns for a matrix, length for a vector) are decided at runtime, and the storage lives on the heap. Both let you write matrix algebra with operator overloading (+, -, * on whole matrices) the way the math reads. nalgebra also ships statically-sized variants — Matrix3<T>, Vector3<T>, SMatrix<T, R, C> — where the dimensions are encoded in the type and the storage lives inline (no allocation, slightly faster). Use the D-variants when the size depends on input data (a regression whose feature count comes from the data); use the static variants when the dimensions are fixed up front (a 3D position, a 4×4 transform, a small kernel).

use nalgebra::{DMatrix, DVector};

// DMatrix — dynamically-sized matrix, heap-allocated:
let x: DMatrix<f64> = DMatrix::from_row_slice(3, 2, &[
    1.0, 2.0,
    3.0, 4.0,
    5.0, 6.0,
]);
//                                            ─  ─
//                                            │  └── 2 columns
//                                            └───── 3 rows
//
// Three samples (rows), two features per sample (columns).

// DVector — dynamically-sized column vector, heap-allocated:
let y: DVector<f64> = DVector::from_vec(vec![1.0, 2.0, 3.0]);
//                                            ─────────────
//                                            3 elements
//
// One target per sample, three samples total.

// Operations look like the math:
let xt:     DMatrix<f64> = x.transpose();          // X^T
let xtx:    DMatrix<f64> = &xt * &x;               // X^T X (matrix product)
let xty:    DVector<f64> = &xt * &y;               // X^T y (matrix · vector)
let scaled: DMatrix<f64> = &x * 2.0;               // scalar multiplication


// Statically-sized variants — dimensions in the type, no allocation:
use nalgebra::{Matrix3, Vector3};

let m: Matrix3<f64> = Matrix3::new(
    1.0, 2.0, 3.0,
    4.0, 5.0, 6.0,
    7.0, 8.0, 9.0,
);
let v: Vector3<f64> = Vector3::new(1.0, 2.0, 3.0);


// When to use which:
//
//   DMatrix / DVector       size depends on input data.
//                            regression, dynamic graphs, runtime-shaped maps.
//
//   Matrix3, Vector3,        size fixed at compile time.
//   SMatrix<T, R, C>         3D geometry, fixed transforms, small kernels.

DMatrix::identity(n, n) constructs the n×n identity matrix — an n×n matrix with 1.0 on the diagonal and 0.0 everywhere else. The name comes from its algebraic role: multiplying any matrix or vector by the identity leaves it unchanged. I × A = A, the same way multiplying a number by 1 leaves it unchanged. The identity is the matrix-algebra equivalent of the number 1 in ordinary arithmetic. Here it appears in the ridge-regression formula: adding λI to X^T X is the regularization term. Without it, X^T X can be singular (no inverse) and the linear system has no unique solution. Adding a small multiple of the identity bumps every diagonal entry up by λ, which guarantees a clean solve and simultaneously penalizes large coefficient values. Small λ ≈ ordinary least squares; large λ shrinks the solution toward zero.

// The 3×3 identity:
//
//                  ┌  1  0  0  ┐
//          I_3  =  │  0  1  0  │
//                  └  0  0  1  ┘
//
// In nalgebra:
let i: DMatrix<f64> = DMatrix::identity(3, 3);

// The defining property — multiplying by identity changes nothing:
let a: DMatrix<f64> = DMatrix::from_row_slice(3, 3, &[
    2.0, 1.0, 3.0,
    4.0, 0.0, 1.0,
    1.0, 1.0, 1.0,
]);
let same = &i * &a;          // == a
let same = &a * &i;          // == a

// On page 20, identity is the regularizer for X^T X:
//
//   without regularization:  a  =  X^T X
//                            X^T X can be singular — no inverse exists,
//                            and the linear system has no unique solution.
//
//   ridge regularization:    a  =  X^T X  +  λI
//                            every diagonal entry bumped up by λ.
//                            guarantees a clean solve, and the λ knob
//                            trades off fit against coefficient size.

// The "1" of matrix algebra.  Just as x * 1 == x in ordinary arithmetic,
// A · I == A in matrix algebra.  Identity is the multiplicative identity
// element for matrix multiplication.

In plain terms

Convex optimization is the broadest class of optimization that is still in the cheap category. Linear programming is the simplest convex case. Quadratic, second-order-cone, and semidefinite programming are progressively more expressive but stay in the convex world.

The key property is that the objective is shaped like a bowl — there is one bottom, and any local minimum is the global minimum. This property removes the search problem. Standard iterative algorithms (interior-point methods, projected gradient, Newton's method) converge reliably to the answer. No hyperparameter tuning, no restart strategies, no luck required.

Most regularized statistical models are convex. Ridge regression, lasso, support vector machines, logistic regression — all convex optimization with a one-line solver call in any decent library. Portfolio optimization with mean-variance trade-offs is convex. Many control problems (LQR, MPC) are convex.

The boundary that matters: when the problem is convex, the engineering estimate is days to a week. When the problem is non-convex — most deep learning training, most genuinely hard optimization — the toolkit shifts to heuristics (stochastic gradient descent, simulated annealing, evolutionary search) with no global guarantees and substantial engineering effort.

In Rust, the clarabel library is the production-grade convex solver. It handles linear, quadratic, and conic problems, all in polynomial time. For pure gradient-based optimization on more general (possibly non-convex) problems, the argmin crate provides a framework for gradient descent, L-BFGS, Newton, and several derivative-free methods.

The questions to ask when sizing an optimization feature: is the objective convex? Are the constraints linear or convex? If both, the work is cheap and a solver does the job. If not, you are in heuristic territory and the estimate triples.

Convexity is the structural property that buys you tractability. Confirm it before sizing the work.