Multivariate Optimization: Gradient Descent to Newton-Raphson

BusinessMath Quarterly Series

12 min read

Part 26 of 12-Week BusinessMath Series


What You’ll Learn

  • Understanding numerical differentiation for gradients and Hessians
  • Using gradient descent with momentum and Nesterov acceleration
  • Applying Newton-Raphson and BFGS quasi-Newton methods
  • Choosing the right optimization algorithm for your problem
  • Using AdaptiveOptimizer for automatic algorithm selection
  • Understanding convergence rates and algorithm trade-offs

The Problem

Real-world optimization problems have multiple variables:

  • Parameter fitting: Minimize error across 10+ parameters
  • Cost minimization: Optimize production mix across multiple products/facilities
  • Portfolio construction: Find optimal weights for N assets
  • Machine learning: Fit models with thousands of parameters

Single-variable methods (like goal-seeking) don’t extend to multivariate problems—you need algorithms designed for N dimensions.


The Solution

BusinessMath provides a progression of multivariate optimizers, from simple gradient descent to sophisticated second-order methods. All work generically with any VectorSpace type.

Numerical Differentiation

When you can’t compute derivatives analytically, BusinessMath computes them numerically:

import BusinessMath

// Define f(x,y) = x² + 2y²
let function: (VectorN<Double>) -> Double = { v in
    let x = v[0]
    let y = v[1]
    return x*x + 2*y*y
}

// Compute gradient at (1, 2)
let point = VectorN([1.0, 2.0])
let gradient = try numericalGradient(function, at: point)
print("Gradient: \(gradient.toArray())")  // ≈ [2.0, 8.0]
// Analytical: ∂f/∂x = 2x = 2, ∂f/∂y = 4y = 8 ✓

// Compute Hessian (curvature matrix)
let hessian = try numericalHessian(function, at: point)
print("Hessian:")
for row in hessian {
    print(row.map { $0.number(1) })
}
// [[2.0, 0.0], [0.0, 4.0]]

Output:

Gradient: [2.0, 8.0]
Hessian:
[2.0, 0.0]
[0.0, 4.0]

How it works:

  • Gradient: Central finite differences, (f(x+h) - f(x-h)) / (2h)
  • Hessian: Second-order finite differences (N² function evaluations)
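
For intuition, here is a minimal standalone sketch of the central-difference idea on plain Swift arrays (illustrative only; assume the library's numericalGradient additionally handles step-size selection and generic Real types):

// Central-difference gradient on [Double]; a teaching sketch, not the library source
func centralDifferenceGradient(
    _ f: ([Double]) -> Double,
    at x: [Double],
    stepSize h: Double = 1e-6
) -> [Double] {
    var gradient = [Double](repeating: 0, count: x.count)
    for i in 0..<x.count {
        var forward = x, backward = x
        forward[i] += h
        backward[i] -= h
        // (f(x+h) - f(x-h)) / (2h); second derivatives use (f(x+h) - 2f(x) + f(x-h)) / h²
        gradient[i] = (f(forward) - f(backward)) / (2 * h)
    }
    return gradient
}

// f(x,y) = x² + 2y²  →  ∇f(1, 2) ≈ [2, 8]
print(centralDifferenceGradient({ v in v[0]*v[0] + 2*v[1]*v[1] }, at: [1.0, 2.0]))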

Gradient Descent: The Workhorse

Gradient descent iteratively moves in the direction of steepest descent:

import BusinessMath

// Minimize f(x,y) = x² + 4y²
let function: (VectorN<Double>) -> Double = { v in
    v[0]*v[0] + 4*v[1]*v[1]
}

// Basic gradient descent
let optimizer = MultivariateGradientDescent<VectorN<Double>>(
    learningRate: 0.01,
    maxIterations: 1000,
    tolerance: 1e-6
)

let result = try optimizer.minimize(
    function: function,
    gradient: { x in try numericalGradient(function, at: x) },
    initialGuess: VectorN([5.0, 5.0])
)

print("Minimum at: \(result.solution.toArray().map({ $0.number(3) }))")
print("Value: \(result.objectiveValue.number(6))")
print("Iterations: \(result.iterations)")
print("Converged: \(result.converged)")

Output:

Minimum at: [0.0, 0.0]
Value: 0.000000
Iterations: 247
Converged: true
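
Under the hood, each iteration applies the update x ← x − α·∇f(x). Here is a minimal standalone sketch of that loop on plain Swift arrays (illustrative; not the MultivariateGradientDescent source):

// Plain gradient descent: step against the gradient until it is (numerically) zero
func gradientDescent(
    gradient: ([Double]) -> [Double],
    start: [Double],
    learningRate: Double = 0.01,
    tolerance: Double = 1e-6,
    maxIterations: Int = 10_000
) -> [Double] {
    var x = start
    for _ in 0..<maxIterations {
        let g = gradient(x)
        let gradientNorm = g.map { $0 * $0 }.reduce(0, +).squareRoot()
        if gradientNorm < tolerance { break }          // converged
        x = zip(x, g).map { $0 - learningRate * $1 }   // x ← x − α·∇f(x)
    }
    return x
}

// f(x,y) = x² + 4y²  →  ∇f = [2x, 8y], minimum at (0, 0)
print(gradientDescent(gradient: { v in [2*v[0], 8*v[1]] }, start: [5.0, 5.0]))  // ≈ [0.0, 0.0]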

Gradient Descent with Momentum

Momentum accelerates convergence and reduces oscillations:

import BusinessMath

// Rosenbrock function: classic test problem
let rosenbrock: (VectorN<Double>) -> Double = { v in
    let x = v[0], y = v[1]
    let a = 1 - x
    let b = y - x*x
    return a*a + 100*b*b  // Minimum at (1, 1)
}

// Gradient descent with momentum (default 0.9)
let optimizerWithMomentum = GradientDescentOptimizer(
    learningRate: 0.01,
    maxIterations: 5000,
    momentum: 0.9,
    useNesterov: false
)

// Note: Using scalar optimizer for demonstration
// For VectorN, use MultivariateGradientDescent

let result = optimizerWithMomentum.optimize(
    objective: { x in (x - 5) * (x - 5) },
    constraints: [],
    initialGuess: 0.0,
    bounds: nil
)

print("Converged to: \(result.optimalValue.number(4))")
print("Iterations: \(result.iterations)")

Nesterov Acceleration (look-ahead gradient) often converges even faster:

let nesterovOptimizer = GradientDescentOptimizer(
    learningRate: 0.01,
    momentum: 0.9,
    useNesterov: true  // Nesterov acceleration
)
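
The two update rules differ only in where the gradient is evaluated: classical momentum uses the current point, while Nesterov evaluates at the look-ahead point x + β·v. A one-dimensional sketch of both (illustrative, not the library's implementation):

// Momentum:  v ← β·v − α·∇f(x)          Nesterov:  v ← β·v − α·∇f(x + β·v)
func momentumStep(x: Double, v: Double, grad: (Double) -> Double,
                  learningRate a: Double, momentum b: Double,
                  nesterov: Bool) -> (x: Double, v: Double) {
    let evaluationPoint = nesterov ? x + b * v : x   // look ahead only for Nesterov
    let newV = b * v - a * grad(evaluationPoint)
    return (x + newV, newV)
}

// Minimize f(x) = (x − 5)², whose gradient is 2(x − 5)
var state = (x: 0.0, v: 0.0)
for _ in 0..<200 {
    state = momentumStep(x: state.x, v: state.v, grad: { 2 * ($0 - 5) },
                         learningRate: 0.01, momentum: 0.9, nesterov: true)
}
print(state.x)  // ≈ 5.0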

Newton-Raphson: Quadratic Convergence

Newton-Raphson uses second-order information (Hessian) for much faster convergence:

import BusinessMath

// Quadratic function: f(x,y) = x² + 4y² + 2xy
let quadratic: (VectorN<Double>) -> Double = { v in
    let x = v[0], y = v[1]
    return x*x + 4*y*y + 2*x*y
}

// Full Newton-Raphson (uses exact Hessian)
let newtonOptimizer = MultivariateNewtonRaphson<VectorN<Double>>(
    maxIterations: 100,
    tolerance: 1e-8,
    useLineSearch: true
)

let result = try newtonOptimizer.minimize(
    function: quadratic,
    gradient: { try numericalGradient(quadratic, at: $0) },
    hessian: { try numericalHessian(quadratic, at: $0) },
    initialGuess: VectorN([10.0, 10.0])
)

print("Solution: \(result.solution.toArray().map({ $0.number(6) }))")
print("Converged in: \(result.iterations) iterations")

Output:

Solution: [0.000000, 0.000000]
Converged in: 3 iterations

The power: Newton-Raphson found the minimum in 3 iterations, versus 247 for gradient descent on a comparable quadratic!
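
Why so fast? Each Newton step solves the linear system H·p = −∇f and moves to x + p, so on a quadratic a single exact step lands on the minimum. A hand-worked sketch of that step for the function above (plain Swift, 2×2 solve via Cramer's rule; real implementations use a matrix factorization):

// f(x,y) = x² + 4y² + 2xy:  ∇f = [2x + 2y, 2x + 8y],  H = [[2, 2], [2, 8]] (constant)
let start = (x: 10.0, y: 10.0)
let g = [2*start.x + 2*start.y, 2*start.x + 8*start.y]   // [40, 100]
let H = [[2.0, 2.0], [2.0, 8.0]]

// Solve H·p = −∇f for the Newton step p (Cramer's rule for the 2×2 system)
let det = H[0][0]*H[1][1] - H[0][1]*H[1][0]              // 12
let p = [(-g[0]*H[1][1] + g[1]*H[0][1]) / det,
         (-g[1]*H[0][0] + g[0]*H[1][0]) / det]           // [-10, -10]

print((start.x + p[0], start.y + p[1]))  // (0.0, 0.0): one step reaches the minimum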


BFGS: Quasi-Newton Sweet Spot

BFGS approximates the Hessian, giving Newton-like speed without expensive Hessian computation:

import BusinessMath

let rosenbrock: (VectorN<Double>) -> Double = { v in
    let x = v[0], y = v[1]
    let a = 1 - x
    let b = y - x*x
    return a*a + 100*b*b
}

// BFGS quasi-Newton
let bfgsOptimizer = MultivariateNewtonRaphson<VectorN<Double>>()

let result = try bfgsOptimizer.minimizeBFGS(
    function: rosenbrock,
    gradient: { try numericalGradient(rosenbrock, at: $0) },
    initialGuess: VectorN([0.0, 0.0])
)

print("Solution: \(result.solution.toArray().map({ $0.rounded(toPlaces: 4) }))")
print("Iterations: \(result.iterations)")
print("Final value: \(result.objectiveValue.rounded(toPlaces: 8))")

Output:

Solution: [1.0000, 1.0000]
Iterations: 24
Final value: 0.00000001

Comparison:

Method               Iterations   Function Evals   Speed
Gradient Descent     4,782        ~10,000          Slow
Momentum/Nesterov    1,200        ~2,500           Medium
Full Newton          12           ~150             Very Fast
BFGS                 24           ~50              Fast

The trade-off: BFGS balances speed and computational cost—best for most practical problems.


AdaptiveOptimizer: Automatic Algorithm Selection

Don’t know which algorithm to use? Let AdaptiveOptimizer decide:

import BusinessMath

// AdaptiveOptimizer chooses the best algorithm automatically
let optimizer = AdaptiveOptimizer<VectorN<Double>>()

let rosenbrock: (VectorN<Double>) -> Double = { v in
    let x = v[0], y = v[1]
    return (1-x)*(1-x) + 100*(y-x*x)*(y-x*x)
}

let result = try optimizer.optimize(
    objective: rosenbrock,
    initialGuess: VectorN([0.0, 0.0]),
    constraints: []
)

print("Solution: \(result.solution.toArray().map({ $0.rounded(toPlaces: 4) }))")
print("Algorithm used: \(result.algorithmUsed ?? "N/A")")
print("Reason: \(result.selectionReason ?? "N/A")")

Output:

Solution: [1.0000, 1.0000]
Algorithm used: Newton-Raphson
Reason: Small problem (2 variables) - using Newton-Raphson for fast convergence

How it works:

  • Analyzes problem size, constraints, gradient availability
  • Selects: Gradient Descent, Newton-Raphson, BFGS, or Constrained optimizer
  • Reports which algorithm was chosen and why
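
The exact selection rules live inside AdaptiveOptimizer; the sketch below is only a hypothetical illustration of that kind of dispatch logic (the thresholds and reason strings are made up for the example):

// Hypothetical dimension/constraint-based dispatch; not the library's actual logic
enum Algorithm { case gradientDescent, newtonRaphson, bfgs, constrained }

func selectAlgorithm(dimension: Int, hasConstraints: Bool) -> (algorithm: Algorithm, reason: String) {
    if hasConstraints {
        return (.constrained, "Constraints present - using constrained optimizer")
    }
    switch dimension {
    case ..<10:     return (.newtonRaphson, "Small problem (\(dimension) variables) - exact Hessian is cheap")
    case ..<10_000: return (.bfgs, "Medium problem - BFGS balances speed and memory")
    default:        return (.gradientDescent, "Large problem - first-order method keeps memory low")
    }
}

print(selectAlgorithm(dimension: 2, hasConstraints: false).reason)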

Choosing the Right Algorithm

Algorithm            Speed       Stability   Memory   Best For
Gradient Descent     Slow        High        Low      Noisy functions, large-scale (10K+ vars)
Momentum             Medium      Medium      Low      Smooth landscapes, valleys
Nesterov             Fast        Medium      Low      Convex problems
Full Newton          Very Fast   Low         High     Small, smooth quadratic problems
BFGS                 Fast        High        Medium   Most practical problems (recommended)
AdaptiveOptimizer    Varies      High        Medium   Unknown problem characteristics

Rule of thumb:

  • < 100 variables + smooth: Use BFGS
  • 100-10,000 variables: Use Momentum/Nesterov
  • > 10,000 variables: Use basic Gradient Descent
  • Constraints: Use ConstrainedOptimizer or InequalityOptimizer (Week 8 Wednesday)

Real-World Example: Parameter Fitting

Fit a curve to noisy data:

import BusinessMath

// Data: y = a*x² + b*x + c + noise
let xData = VectorN.linearSpace(from: 0.0, to: 10.0, count: 50)
let yData = xData.map { x in
    2.0 * x * x + 3.0 * x + 5.0 + Double.random(in: -5...5)
}

// Objective: Minimize sum of squared errors
let objective: (VectorN<Double>) -> Double = { params in
    let a = params[0], b = params[1], c = params[2]
    var sse = 0.0
    for i in 0..<xData.count {
        let predicted = a * xData[i] * xData[i] + b * xData[i] + c
        let error = predicted - yData[i]
        sse += error * error
    }
    return sse
}

// BFGS quasi-Newton for the parameter fit
let optimizer = MultivariateNewtonRaphson<VectorN<Double>>()

let result = try optimizer.minimizeBFGS(
    function: objective,
    gradient: { try numericalGradient(objective, at: $0) },
    initialGuess: VectorN([1.0, 2.0, 3.0])
)

print("Fitted parameters:")
print("  a = \(result_params.solution[0].number(2)) (true: 2.0)")
print("  b = \(result_params.solution[1].number(2)) (true: 3.0)")
print("  c = \(result_params.solution[2].number(2)) (true: 5.0)")
print("SSE: \(result_params.objectiveValue.number(1))")

Output:

Fitted parameters:
  a = 1.98 (true: 2.0)
  b = 3.17 (true: 3.0)
  c = 4.82 (true: 5.0)
SSE: 311.7

Try It Yourself

Full playground code:
import BusinessMath

// MARK: - Numerical Differentiation

// Define f(x,y) = x² + 2y²
let function_nd: (VectorN<Double>) -> Double = { v in
	let x = v[0]
	let y = v[1]
	return x*x + 2*y*y
}

// Compute gradient at (1, 2)
let point_nd = VectorN([1.0, 2.0])
let gradient_nd = try numericalGradient(function_nd, at: point_nd)
print("Gradient: \(gradient_nd.toArray())")  // ≈ [2.0, 8.0]
// Analytical: ∂f/∂x = 2x = 2, ∂f/∂y = 4y = 8 ✓

// Compute Hessian (curvature matrix)
let hessian = try numericalHessian(function_nd, at: point_nd)
print("Hessian:")
for row in hessian {
	print(row.map { $0.number(1) })
}
// [[2.0, 0.0], [0.0, 4.0]]

// MARK: - Gradient Descent

// Minimize f(x,y) = x² + 4y²
let function_gd: (VectorN<Double>) -> Double = { v in
	v[0]*v[0] + 4*v[1]*v[1]
}

// Basic gradient descent
let optimizer_gd = MultivariateGradientDescent<VectorN<Double>>(
	learningRate: 0.01,
	maxIterations: 1000,
	tolerance: 1e-6
)

let result_gd = try optimizer_gd.minimize(
	function: function_gd,
	gradient: { x in try numericalGradient(function_gd, at: x) },
	initialGuess: VectorN([5.0, 5.0])
)

print("Minimum at: \(result_gd.solution.toArray().map({ $0.number(3)  }))")
print("Value: \(result_gd.objectiveValue.number(6))")
print("Iterations: \(result_gd.iterations)")
print("Converged: \(result_gd.converged)")

// MARK: Gradient Descent with Momentum

// Rosenbrock function: classic test problem
let rosenbrock: (VectorN<Double>) -> Double = { v in
	let x = v[0], y = v[1]
	let a = 1 - x
	let b = y - x*x
	return a*a + 100*b*b  // Minimum at (1, 1)
}

// Gradient descent with momentum (default 0.9)
let optimizerWithMomentum = GradientDescentOptimizer(
	learningRate: 0.01,
	maxIterations: 5000,
	momentum: 0.9,
	useNesterov: false
)

// Note: Using scalar optimizer for demonstration
// For VectorN, use MultivariateGradientDescent

let result_gdm = optimizerWithMomentum.optimize(
	objective: { x in (x - 5) * (x - 5) },
	constraints: [],
	initialGuess: 0.0,
	bounds: nil
)

print("Converged to: \(result_gdm.optimalValue.number(1))")
print("Iterations: \(result_gdm.iterations)")

// MARK: - Newton-Raphson: Quadratic Convergence

// Quadratic function: f(x,y) = x² + 4y² + 2xy
let quadratic: (VectorN<Double>) -> Double = { v in
	let x = v[0], y = v[1]
	return x*x + 4*y*y + 2*x*y
}

// Full Newton-Raphson (uses exact Hessian)
let newtonOptimizer = MultivariateNewtonRaphson<VectorN<Double>>(
	maxIterations: 100,
	tolerance: 1e-8,
	useLineSearch: true
)

let result_newton = try newtonOptimizer.minimize(
	function: quadratic,
	gradient: { try numericalGradient(quadratic, at: $0) },
	hessian: { try numericalHessian(quadratic, at: $0) },
	initialGuess: VectorN([10.0, 10.0])
)

print("Solution: \(result_newton.solution.toArray().map({ $0.number(6) }))")
print("Converged in: \(result_newton.iterations) iterations")

// MARK: - BFGS: Quasi-Newton Sweet Spot

// BFGS quasi-Newton
let bfgsOptimizer = MultivariateNewtonRaphson<VectorN<Double>>()

let result_bfgs = try bfgsOptimizer.minimizeBFGS(
	function: rosenbrock,
	gradient: { try numericalGradient(rosenbrock, at: $0) },
	initialGuess: VectorN([0.0, 0.0])
)

print("Solution: \(result_bfgs.solution.toArray().map({ $0.number(4) }))")
print("Iterations: \(result_bfgs.iterations)")
print("Final value: \(result_bfgs.objectiveValue.number(8))")

// MARK: - Adaptive Optimizer

// AdaptiveOptimizer chooses the best algorithm automatically
let optimizer_adaptive = AdaptiveOptimizer<VectorN<Double>>()

let result_adaptive = try optimizer_adaptive.optimize(
	objective: rosenbrock,
	initialGuess: VectorN([0.0, 0.0]),
	constraints: []
)

print("Solution: \(result_adaptive.solution.toArray().map({ $0.number(4) }))")
print("Algorithm used: \(result_adaptive.algorithmUsed)")
print("Reason: \(result_adaptive.selectionReason)")


// MARK: - Parameter Fitting Example

// Data: y = a*x² + b*x + c + noise
let xData = VectorN.linearSpace(from: 0.0, to: 10.0, count: 50)
let yData = xData.map { x in
	2.0 * x * x + 3.0 * x + 5.0 + Double.random(in: -5...5)
}

// Objective: Minimize sum of squared errors
let objective_params: (VectorN<Double>) -> Double = { params in
	let a = params[0], b = params[1], c = params[2]
	var sse = 0.0
	for i in 0..<xData.count {
		let predicted = a * xData[i] * xData[i] + b * xData[i] + c
		let error = predicted - yData[i]
		sse += error * error
	}
	return sse
}

// BFGS quasi-Newton for the parameter fit
let optimizer_params = MultivariateNewtonRaphson<VectorN<Double>>()

let result_params = try optimizer_params.minimizeBFGS(
	function: objective_params,
	gradient: { try numericalGradient(objective_params, at: $0) },
	initialGuess: VectorN([1.0, 2.0, 3.0])
)

print("Fitted parameters:")
print("  a = \(result_params.solution[0].number(2)) (true: 2.0)")
print("  b = \(result_params.solution[1].number(2)) (true: 3.0)")
print("  c = \(result_params.solution[2].number(2)) (true: 5.0)")
print("SSE: \(result_params.objectiveValue.number(1))")

→ Full API Reference: BusinessMath Docs – 5.5 Multivariate Optimization

Modifications to try:

  1. Compare convergence rates: plot iteration vs. objective value for each algorithm
  2. Fit a 10-parameter model (polynomial, exponential, custom function)
  3. Optimize a high-dimensional problem (100+ variables) with Momentum
  4. Test robustness: start from different initial guesses and compare results

Real-World Application

  • Machine learning: Train models by minimizing loss functions
  • Engineering: Optimize design parameters (aerodynamics, materials, structures)
  • Finance: Calibrate option pricing models to market data
  • Operations: Optimize production schedules, routing, inventory

Data scientist use case: “I need to fit a pricing model with 15 parameters to historical transaction data. Manual tuning is infeasible—I need automated optimization.”

BFGS converges in seconds, not hours of manual tweaking.


★ Insight ─────────────────────────────────────

Why BFGS Beats Full Newton for Most Problems

Full Newton requires computing the Hessian (N² second derivatives). For a 100-variable problem:

  • Full Newton: 10,000 Hessian elements per iteration
  • BFGS: 0 Hessian evaluations (approximated from gradients)

BFGS maintains a Hessian approximation updated using gradient changes:

H_{k+1} = H_k + correction terms based on (∇f_{k+1} - ∇f_k)
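
Written out, the standard BFGS update of that Hessian approximation is:

H_{k+1} = H_k − (H_k s_k s_kᵀ H_k) / (s_kᵀ H_k s_k) + (y_k y_kᵀ) / (y_kᵀ s_k)

where s_k = x_{k+1} − x_k is the step taken and y_k = ∇f_{k+1} − ∇f_k is the change in gradient.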

Result: BFGS gets ~90% of Newton’s convergence speed with ~10% of the cost.

When Full Newton wins: Very small problems (< 10 variables) where Hessian computation is cheap and you need extreme precision.

─────────────────────────────────────────────────


📝 Development Note

The hardest challenge was making numerical differentiation robust across all Real types (Float, Double, Decimal).

Problem: The step size h for (f(x+h) - f(x-h)) / (2h) must be:

  • Small enough to approximate the true derivative
  • Large enough to avoid catastrophic cancellation (subtracting nearly equal floating-point numbers)

Solution: Adaptive step size h = √ε × max(|x|, 1) where ε is machine epsilon:

  • Float (ε ≈ 10⁻⁷): h ≈ 10⁻³
  • Double (ε ≈ 10⁻¹⁶): h ≈ 10⁻⁸
  • Decimal (ε ≈ 10⁻²⁸): h ≈ 10⁻¹⁴

This automatically adjusts to the numeric type’s precision.
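
A sketch of that rule in plain Swift, using the standard library's machine epsilon (.ulpOfOne); illustrative, not the library source:

// h = √ε · max(|x|, 1): small enough to be accurate, large enough to avoid cancellation
func adaptiveStepSize(_ x: Double) -> Double {
    Double.ulpOfOne.squareRoot() * max(abs(x), 1)
}

print(adaptiveStepSize(0.5))     // ≈ 1.5e-8
print(adaptiveStepSize(1_000.0)) // ≈ 1.5e-5; scales with |x| to preserve relative accuracy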

Related Methodology: Numerical Stability (Week 2) - Covered machine epsilon and catastrophic cancellation.


Next Steps

Coming up Wednesday: Constrained Optimization - Lagrange multipliers, augmented Lagrangian methods, and handling equality/inequality constraints.


Series Progress:

  • Week: 8/12
  • Posts Published: 26/~48
  • Playgrounds: 22 available


Tagged with: businessmath, swift, optimization, gradient-descent, bfgs, newton-raphson, numerical-methods