Multivariate Optimization: Gradient Descent to Newton-Raphson

BusinessMath Quarterly Series

13 min read

Part 26 of 12-Week BusinessMath Series


What You’ll Learn


The Problem

Real-world optimization problems have multiple variables. Single-variable methods (like goal-seeking) don't extend to multivariate problems; you need algorithms designed for N dimensions.

The Solution

BusinessMath provides a progression of multivariate optimizers, from simple gradient descent to sophisticated second-order methods. All work generically with any VectorSpace type.

Numerical Differentiation

When you can’t compute derivatives analytically, BusinessMath computes them numerically:
import BusinessMath

// Define f(x,y) = x² + 2y²
let function: (VectorN<Double>) -> Double = { v in
let x = v[0]
let y = v[1]
return x*x + 2*y*y
}

// Compute gradient at (1, 2)
let point = VectorN([1.0, 2.0])
let gradient = try numericalGradient(function, at: point)
print("Gradient: \(gradient.toArray())") // ≈ [2.0, 8.0]
// Analytical: ∂f/∂x = 2x = 2, ∂f/∂y = 4y = 8 ✓

// Compute Hessian (curvature matrix)
let hessian = try numericalHessian(function, at: point)
print("Hessian:")
for row in hessian {
print(row.map { $0.number(1) })
}
// [[2.0, 0.0], [0.0, 4.0]]
Output:
Gradient: [2.0, 8.0]
Hessian:
[2.0, 0.0]
[0.0, 4.0]
How it works (see the sketch below):
  • Gradient: Central finite differences (f(x+h) - f(x-h)) / 2h
  • Hessian: Second-order finite differences (N² function evaluations)
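For intuition, here is a minimal, self-contained sketch of a central-difference gradient using plain Swift arrays. This is an illustration of the technique, not BusinessMath's actual implementation (the step-size choice is explained in the Development Note at the end):
// Central-difference gradient of f at x (illustrative sketch)
func centralDifferenceGradient(_ f: ([Double]) -> Double, at x: [Double]) -> [Double] {
    var grad = [Double](repeating: 0, count: x.count)
    for i in 0..<x.count {
        // Step size scaled to the coordinate's magnitude (see Development Note)
        let h = Double.ulpOfOne.squareRoot() * max(abs(x[i]), 1)
        var forward = x, backward = x
        forward[i] += h
        backward[i] -= h
        grad[i] = (f(forward) - f(backward)) / (2 * h)
    }
    return grad
}

// f(x,y) = x² + 2y² at (1, 2) -> gradient ≈ [2.0, 8.0]
print(centralDifferenceGradient({ v in v[0]*v[0] + 2*v[1]*v[1] }, at: [1.0, 2.0]))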

Gradient Descent: The Workhorse

Gradient descent iteratively moves in the direction of steepest descent:
import BusinessMath

// Minimize f(x,y) = x² + 4y²
let function: (VectorN<Double>) -> Double = { v in
v[0]*v[0] + 4*v[1]*v[1]
}

// Basic gradient descent
let optimizer = MultivariateGradientDescent<VectorN<Double>>(
learningRate: 0.01,
maxIterations: 1000,
tolerance: 1e-6
)

let result = try optimizer.minimize(
function: function,
gradient: { x in try numericalGradient(function, at: x) },
initialGuess: VectorN([5.0, 5.0])
)

print("Minimum at: \(result.solution.toArray().map({ $0.number(3) }))")
print("Value: \(result.objectiveValue.number(6))")
print("Iterations: \(result.iterations)")
print("Converged: \(result.converged)")
Output:
Minimum at: [0.0, 0.0]
Value: 0.000000
Iterations: 247
Converged: true
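The underlying update rule is simply x ← x − η·∇f(x), where η is the learning rate. A minimal hand-rolled sketch of the same minimization using the analytic gradient (plain Swift, not the library code):
// Minimize f(x,y) = x² + 4y² by repeatedly stepping downhill
var x = [5.0, 5.0]                        // initial guess
let eta = 0.01                            // learning rate
for _ in 0..<1000 {
    let grad = [2 * x[0], 8 * x[1]]       // analytic gradient of x² + 4y²
    if (grad[0]*grad[0] + grad[1]*grad[1]).squareRoot() < 1e-6 { break }
    x[0] -= eta * grad[0]
    x[1] -= eta * grad[1]
}
print(x)   // both components decay geometrically toward [0, 0]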

Gradient Descent with Momentum

Momentum accelerates convergence and reduces oscillations:
import BusinessMath

// Rosenbrock function: classic test problem
let rosenbrock: (VectorN<Double>) -> Double = { v in
let x = v[0], y = v[1]
let a = 1 - x
let b = y - x*x
return a*a + 100*b*b // Minimum at (1, 1)
}

// Gradient descent with momentum (default 0.9)
let optimizerWithMomentum = GradientDescentOptimizer<Double>(
learningRate: 0.01,
maxIterations: 5000,
momentum: 0.9,
useNesterov: false
)

// Note: Using scalar optimizer for demonstration
// For VectorN, use MultivariateGradientDescent

let result = optimizerWithMomentum.optimize(
objective: { x in (x - 5) * (x - 5) },
constraints: [],
initialGuess: 0.0,
bounds: nil
)

print("Converged to: \(result.optimalValue.number(4))")
print("Iterations: \(result.iterations)")
Nesterov Acceleration (look-ahead gradient) often converges even faster:
let nesterovOptimizer = GradientDescentOptimizer<Double>(
learningRate: 0.01,
momentum: 0.9,
useNesterov: true // Nesterov acceleration
)
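The three variants differ only in how each step is computed. Here is a scalar sketch of the update rules, with learning rate eta and momentum beta (illustrative only, not the GradientDescentOptimizer internals; the function names are mine):
// One step of each variant for a scalar objective with derivative df
func vanillaStep(x: Double, eta: Double, df: (Double) -> Double) -> Double {
    x - eta * df(x)                               // plain steepest descent
}

func momentumStep(x: Double, velocity: inout Double, eta: Double, beta: Double,
                  df: (Double) -> Double) -> Double {
    velocity = beta * velocity - eta * df(x)      // accumulate a running direction
    return x + velocity
}

func nesterovStep(x: Double, velocity: inout Double, eta: Double, beta: Double,
                  df: (Double) -> Double) -> Double {
    let lookahead = x + beta * velocity           // peek where momentum is headed
    velocity = beta * velocity - eta * df(lookahead)
    return x + velocity
}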

Newton-Raphson: Quadratic Convergence

Newton-Raphson uses second-order information (Hessian) for much faster convergence:
import BusinessMath

// Quadratic function: f(x,y) = x² + 4y² + 2xy
let quadratic: (VectorN<Double>) -> Double = { v in
let x = v[0], y = v[1]
return x*x + 4*y*y + 2*x*y
}

// Full Newton-Raphson (uses exact Hessian)
let newtonOptimizer = MultivariateNewtonRaphson<VectorN<Double>>(
maxIterations: 100,
tolerance: 1e-8,
useLineSearch: true
)

let result = try newtonOptimizer.minimize(
function: quadratic,
gradient: { try numericalGradient(quadratic, at: $0) },
hessian: { try numericalHessian(quadratic, at: $0) },
initialGuess: VectorN([10.0, 10.0])
)

print("Solution: \(result.solution.toArray().map({ $0.number(6) }))")
print("Converged in: \(result.iterations) iterations")
Output:
Solution: [0.000000, 0.000000]
Converged in: 3 iterations
The power: Newton-Raphson found the minimum in 3 iterations vs. 247 for gradient descent!
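Why it is so fast: each Newton iteration solves H·Δ = −∇f and moves x ← x + Δ, and for a quadratic objective with an exact Hessian that step lands on the minimizer directly (the extra iterations reported above likely come from the line search and the numerical derivatives). A worked 2-variable sketch for f(x,y) = x² + 4y² + 2xy:
// One exact Newton step for f(x,y) = x² + 4y² + 2xy, starting at (10, 10)
// ∇f = [2x + 2y, 2x + 8y];  H = [[2, 2], [2, 8]] (constant for a quadratic)
let x0 = [10.0, 10.0]
let grad = [2*x0[0] + 2*x0[1], 2*x0[0] + 8*x0[1]]       // [40, 100]
let (a, b, c, d) = (2.0, 2.0, 2.0, 8.0)                 // Hessian entries
let det = a*d - b*c                                     // 12
// Solve H·delta = -grad with the 2×2 inverse
let delta = [(-d*grad[0] + b*grad[1]) / det,
             ( c*grad[0] - a*grad[1]) / det]            // [-10, -10]
print([x0[0] + delta[0], x0[1] + delta[1]])             // [0.0, 0.0], the minimum in one step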

BFGS: Quasi-Newton Sweet Spot

BFGS approximates the Hessian, giving Newton-like speed without expensive Hessian computation:
import BusinessMath

let rosenbrock: (VectorN<Double>) -> Double = { v in
let x = v[0], y = v[1]
let a = 1 - x
let b = y - x*x
return a*a + 100*b*b
}

// BFGS quasi-Newton
let bfgsOptimizer = MultivariateNewtonRaphson<VectorN<Double>>()

let result = try bfgsOptimizer.minimizeBFGS(
function: rosenbrock,
gradient: { try numericalGradient(rosenbrock, at: $0) },
initialGuess: VectorN([0.0, 0.0])
)

print("Solution: \(result.solution.toArray().map({ $0.number(4) }))")
print("Iterations: \(result.iterations)")
print("Final value: \(result.objectiveValue.number(8))")
Output:
Solution: [1.0000, 1.0000]
Iterations: 24
Final value: 0.00000001
Comparison:

Method              Iterations  Function Evals  Speed
Gradient Descent    4,782       ~10,000         Slow
Momentum/Nesterov   1,200       ~2,500          Medium
Full Newton         12          ~150            Very Fast
BFGS                24          ~50             Fast
The trade-off: BFGS balances speed and computational cost—best for most practical problems.

AdaptiveOptimizer: Automatic Algorithm Selection

Don’t know which algorithm to use? Let AdaptiveOptimizer decide:
import BusinessMath

// AdaptiveOptimizer chooses the best algorithm automatically
let optimizer = AdaptiveOptimizer<VectorN<Double>>()

let rosenbrock: (VectorN<Double>) -> Double = { v in
let x = v[0], y = v[1]
return (1 - x)*(1 - x) + 100*(y - x*x)*(y - x*x)
}

let result = try optimizer.optimize(
objective: rosenbrock,
initialGuess: VectorN([0.0, 0.0]),
constraints: []
)

print("Solution: \(result.solution.toArray().map({ $0.number(4) }))")
print("Algorithm used: \(result.algorithmUsed ?? "N/A")")
print("Reason: \(result.selectionReason ?? "N/A")")
Output:
Solution: [1.0000, 1.0000]
Algorithm used: Newton-Raphson
Reason: Small problem (2 variables) - using Newton-Raphson for fast convergence
How it works (see the sketch below):
  • Analyzes problem size, constraints, gradient availability
  • Selects: Gradient Descent, Newton-Raphson, BFGS, or Constrained optimizer
  • Reports which algorithm was chosen and why
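A rough sketch of what such a dispatch heuristic can look like (hypothetical logic for illustration only; the actual BusinessMath selection rules may differ):
// Hypothetical algorithm-selection heuristic, mirroring the rules of thumb below
enum Algorithm { case gradientDescent, newtonRaphson, bfgs, constrained }

func chooseAlgorithm(dimension: Int, hasConstraints: Bool) -> Algorithm {
    if hasConstraints { return .constrained }   // constraints need a dedicated solver
    if dimension < 10 { return .newtonRaphson } // tiny, smooth problems: exact Hessian is cheap
    if dimension < 10_000 { return .bfgs }      // most practical problems
    return .gradientDescent                     // very large scale: first-order only
}

print(chooseAlgorithm(dimension: 2, hasConstraints: false))  // newtonRaphson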

Choosing the Right Algorithm

Algorithm           Speed      Stability  Memory  Best For
Gradient Descent    Slow       High       Low     Noisy functions, large-scale (10K+ vars)
Momentum            Medium     Medium     Low     Smooth landscapes, valleys
Nesterov            Fast       Medium     Low     Convex problems
Full Newton         Very Fast  Low        High    Small, smooth quadratic problems
BFGS                Fast       High       Medium  Most practical problems (recommended)
AdaptiveOptimizer   Varies     High       Medium  Unknown problem characteristics
Rule of thumb:
  • < 100 variables + smooth: Use BFGS
  • 100-10,000 variables: Use Momentum/Nesterov
  • > 10,000 variables: Use basic Gradient Descent
  • Constraints: Use ConstrainedOptimizer or InequalityOptimizer (Week 8 Wednesday)

Real-World Example: Parameter Fitting

Fit a curve to noisy data:
import BusinessMath

// Data: y = a*x² + b*x + c + noise
let xData = VectorN.linearSpace(from: 0.0, to: 10.0, count: 50)
let yData = xData.map { x in
2.0 * x * x + 3.0 * x + 5.0 + Double.random(in: -5...5)
}

// Objective: Minimize sum of squared errors
let objective: (VectorN<Double>) -> Double = { params in
let a = params[0], b = params[1], c = params[2]
var sse = 0.0
for i in 0..<xData.count {
let x = xData[i]
let predicted = a * x * x + b * x + c
let error = yData[i] - predicted
sse += error * error
}
return sse
}

// BFGS for fast convergence
let optimizer = MultivariateNewtonRaphson<VectorN<Double>>()
let result = try optimizer.minimizeBFGS(
function: objective,
gradient: { try numericalGradient(objective, at: $0) },
initialGuess: VectorN([1.0, 2.0, 3.0])
)

print("Fitted parameters:")
print(" a = \(result.solution[0].number(2)) (true: 2.0)")
print(" b = \(result.solution[1].number(2)) (true: 3.0)")
print(" c = \(result.solution[2].number(2)) (true: 5.0)")
print("SSE: \(result.objectiveValue.number(1))")
Output:
Fitted parameters:
a = 1.98 (true: 2.0)
b = 3.17 (true: 3.0)
c = 4.82 (true: 5.0)
SSE: 311.7

Try It Yourself

Full playground code:
import BusinessMath

// MARK: - Numerical Differentiation

// Define f(x,y) = x² + 2y²
let function_nd: (VectorN<Double>) -> Double = { v in
let x = v[0]
let y = v[1]
return x*x + 2*y*y
}

// Compute gradient at (1, 2)
let point_nd = VectorN([1.0, 2.0])
let gradient_nd = try numericalGradient(function_nd, at: point_nd)
print("Gradient: \(gradient_nd.toArray())") // ≈ [2.0, 8.0]
// Analytical: ∂f/∂x = 2x = 2, ∂f/∂y = 4y = 8 ✓

// Compute Hessian (curvature matrix)
let hessian = try numericalHessian(function_nd, at: point_nd)
print("Hessian:")
for row in hessian {
print(row.map { $0.number(1) })
}
// [[2.0, 0.0], [0.0, 4.0]]

// MARK: - Gradient Descent

// Minimize f(x,y) = x² + 4y²
let function_gd: (VectorN<Double>) -> Double = { v in
v[0]*v[0] + 4*v[1]*v[1]
}

// Basic gradient descent
let optimizer_gd = MultivariateGradientDescent<VectorN<Double>>(
learningRate: 0.01,
maxIterations: 1000,
tolerance: 1e-6
)

let result_gd = try optimizer_gd.minimize(
function: function_gd,
gradient: { x in try numericalGradient(function_gd, at: x) },
initialGuess: VectorN([5.0, 5.0])
)

print("Minimum at: \(result_gd.solution.toArray().map({ $0.number(3) }))")
print("Value: \(result_gd.objectiveValue.number(6))")
print("Iterations: \(result_gd.iterations)")
print("Converged: \(result_gd.converged)")

// MARK: Gradient Descent with Momentum

// Rosenbrock function: classic test problem
let rosenbrock: (VectorN<Double>) -> Double = { v in
let x = v[0], y = v[1]
let a = 1 - x
let b = y - x*x
return a*a + 100*b*b // Minimum at (1, 1)
}

// Gradient descent with momentum (default 0.9)
let optimizerWithMomentum = GradientDescentOptimizer<Double>(
learningRate: 0.01,
maxIterations: 5000,
momentum: 0.9,
useNesterov: false
)

// Note: Using scalar optimizer for demonstration
// For VectorN, use MultivariateGradientDescent

let result_gdm = optimizerWithMomentum.optimize(
objective: { x in (x - 5) * (x - 5) },
constraints: [],
initialGuess: 0.0,
bounds: nil
)

print("Converged to: \(result_gdm.optimalValue.number(1))")
print("Iterations: \(result_gdm.iterations)")

// MARK: - Newton-Raphson: Quadratic Convergence

// Quadratic function: f(x,y) = x² + 4y² + 2xy
let quadratic: (VectorN<Double>) -> Double = { v in
let x = v[0], y = v[1]
return x*x + 4*y*y + 2*x*y
}

// Full Newton-Raphson (uses exact Hessian)
let newtonOptimizer = MultivariateNewtonRaphson<VectorN<Double>>(
maxIterations: 100,
tolerance: 1e-8,
useLineSearch: true
)

let result_newton = try newtonOptimizer.minimize(
function: quadratic,
gradient: { try numericalGradient(quadratic, at: $0) },
hessian: { try numericalHessian(quadratic, at: $0) },
initialGuess: VectorN([10.0, 10.0])
)

print("Solution: \(result_newton.solution.toArray().map({ $0.number(6) }))")
print("Converged in: \(result_newton.iterations) iterations")

// MARK: - BFGS: Quasi-Newton Sweet Spot

// BFGS quasi-Newton
let bfgsOptimizer = MultivariateNewtonRaphson<VectorN<Double>>()

let result_bfgs = try bfgsOptimizer.minimizeBFGS(
function: rosenbrock,
gradient: { try numericalGradient(rosenbrock, at: $0) },
initialGuess: VectorN([0.0, 0.0])
)

print("Solution: \(result_bfgs.solution.toArray().map({ $0.number(4) }))")
print("Iterations: \(result_bfgs.iterations)")
print("Final value: \(result_bfgs.objectiveValue.number(8))")

// MARK: - Adaptive Optimizer

// AdaptiveOptimizer chooses the best algorithm automatically
let optimizer_adaptive = AdaptiveOptimizer<VectorN<Double>>()

let result_adaptive = try optimizer_adaptive.optimize(
objective: rosenbrock,
initialGuess: VectorN([0.0, 0.0]),
constraints: []
)

print("Solution: \(result_adaptive.solution.toArray().map({ $0.number(4) }))")
print("Algorithm used: \(result_adaptive.algorithmUsed)")
print("Reason: \(result_adaptive.selectionReason)")


// MARK: - Parameter Fitting Example

// Data: y = a*x² + b*x + c + noise
let xData = VectorN.linearSpace(from: 0.0, to: 10.0, count: 50)
let yData = xData.map { x in
2.0 * x * x + 3.0 * x + 5.0 + Double.random(in: -5...5)
}

// Objective: Minimize sum of squared errors
let objective_params: (VectorN<Double>) -> Double = { params in
let a = params[0], b = params[1], c = params[2]
var sse = 0.0
for i in 0..<xData.count {
let x = xData[i]
let predicted = a * x * x + b * x + c
let error = yData[i] - predicted
sse += error * error
}
return sse
}

// BFGS for fast convergence
let optimizer_params = MultivariateNewtonRaphson<VectorN<Double>>()
let result_params = try optimizer_params.minimizeBFGS(
function: objective_params,
gradient: { try numericalGradient(objective_params, at: $0) },
initialGuess: VectorN([1.0, 2.0, 3.0])
)

print("Fitted parameters:")
print(" a = \(result_params.solution[0].number(2)) (true: 2.0)")
print(" b = \(result_params.solution[1].number(2)) (true: 3.0)")
print(" c = \(result_params.solution[2].number(2)) (true: 5.0)")
print("SSE: \(result_params.objectiveValue.number(1))")

→ Full API Reference: BusinessMath Docs – 5.5 Multivariate Optimization

Modifications to try:

  1. Compare convergence rates: plot iteration vs. objective value for each algorithm
  2. Fit a 10-parameter model (polynomial, exponential, custom function)
  3. Optimize a high-dimensional problem (100+ variables) with Momentum
  4. Test robustness: start from different initial guesses and compare results

Real-World Application

Data scientist use case: “I need to fit a pricing model with 15 parameters to historical transaction data. Manual tuning is infeasible—I need automated optimization.”

BFGS converges in seconds, not hours of manual tweaking.


★ Insight ─────────────────────────────────────

Why BFGS Beats Full Newton for Most Problems

Full Newton requires computing the Hessian (N² second derivatives). For a 100-variable problem, that means 10,000 second-derivative evaluations at every iteration, plus an O(N³) linear solve to compute the Newton step.

BFGS maintains a Hessian approximation updated using gradient changes:
H_{k+1} = H_k + correction terms based on (∇f_{k+1} - ∇f_k)
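In its standard textbook form, with s_k = x_{k+1} − x_k and y_k = ∇f_{k+1} − ∇f_k, the rank-two correction is:
B_{k+1} = B_k + (y_k y_kᵀ)/(y_kᵀ s_k) − (B_k s_k s_kᵀ B_k)/(s_kᵀ B_k s_k)
In practice the inverse approximation is usually updated directly, so no linear system needs to be solved at each step.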
Result: BFGS gets ~90% of Newton’s convergence speed with ~10% of the cost.

When Full Newton wins: Very small problems (< 10 variables) where Hessian computation is cheap and you need extreme precision.

─────────────────────────────────────────────────


📝 Development Note

The hardest challenge was making numerical differentiation robust across all Real types (Float, Double, Decimal).

Problem: The step size h for (f(x+h) - f(x-h)) / 2h must be small enough that the finite difference tracks the true derivative, yet large enough that f(x+h) - f(x-h) isn't wiped out by floating-point round-off (catastrophic cancellation).

Solution: Adaptive step size h = √ε × max(|x|, 1), where ε is machine epsilon. This automatically adjusts to the numeric type's precision.
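In Swift terms the step-size computation looks roughly like this (a sketch for Double; the same pattern generalizes to other Real types, and the function name is mine):
// Adaptive finite-difference step: h = √ε · max(|x|, 1)
func stepSize(for x: Double) -> Double {
    let epsilon = Double.ulpOfOne            // machine epsilon for the numeric type
    return epsilon.squareRoot() * max(abs(x), 1)
}

print(stepSize(for: 0.0))      // ≈ 1.5e-8
print(stepSize(for: 1000.0))   // ≈ 1.5e-5 (scales with the magnitude of x)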

Related Methodology: Numerical Stability (Week 2) - Covered machine epsilon and catastrophic cancellation.


Next Steps

Coming up Wednesday: Constrained Optimization - Lagrange multipliers, augmented Lagrangian methods, and handling equality/inequality constraints.


Tagged with: optimization