Multivariate Optimization: Gradient Descent to Newton-Raphson
BusinessMath Quarterly Series
13 min read
Part 26 of 12-Week BusinessMath Series
What You’ll Learn
- Understanding numerical differentiation for gradients and Hessians
- Using gradient descent with momentum and Nesterov acceleration
- Applying Newton-Raphson and BFGS quasi-Newton methods
- Choosing the right optimization algorithm for your problem
- Using AdaptiveOptimizer for automatic algorithm selection
- Understanding convergence rates and algorithm trade-offs
The Problem
Real-world optimization problems have multiple variables:
- Parameter fitting: Minimize error across 10+ parameters
- Cost minimization: Optimize production mix across multiple products/facilities
- Portfolio construction: Find optimal weights for N assets
- Machine learning: Fit models with thousands of parameters
The Solution
BusinessMath provides a progression of multivariate optimizers, from simple gradient descent to sophisticated second-order methods. All work generically with any VectorSpace type.
Numerical Differentiation
When you can't compute derivatives analytically, BusinessMath computes them numerically:

import BusinessMath

// Define f(x,y) = x² + 2y²
let function: (VectorN<Double>) -> Double = { v in
    let x = v[0]
    let y = v[1]
    return x*x + 2*y*y
}

// Compute gradient at (1, 2)
let point = VectorN([1.0, 2.0])
let gradient = try numericalGradient(function, at: point)
print("Gradient: \(gradient.toArray())") // ≈ [2.0, 8.0]
// Analytical: ∂f/∂x = 2x = 2, ∂f/∂y = 4y = 8 ✓

// Compute Hessian (curvature matrix)
let hessian = try numericalHessian(function, at: point)
print("Hessian:")
for row in hessian {
    print(row.map { $0.number(1) })
}
// [[2.0, 0.0], [0.0, 4.0]]
Output:
Gradient: [2.0, 8.0]
Hessian:
[2.0, 0.0]
[0.0, 4.0]
How it works:
- Gradient: Central finite differences, (f(x+h) - f(x-h)) / 2h
- Hessian: Second-order finite differences (N² function evaluations)
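To make the central-difference formula concrete, here is a bare-bones gradient in plain Swift. It is a sketch of the idea only, not BusinessMath's numericalGradient implementation; the step-size rule matches the one described in the Development Note at the end of this post.

// Central-difference gradient for f: [Double] -> Double (illustrative only).
// Each partial derivative perturbs one coordinate by ±h and applies
// (f(x + h·eᵢ) - f(x - h·eᵢ)) / 2h, with h scaled to the coordinate's magnitude.
func centralDifferenceGradient(_ f: ([Double]) -> Double, at x: [Double]) -> [Double] {
    var grad = [Double](repeating: 0, count: x.count)
    for i in 0..<x.count {
        let h = Double.ulpOfOne.squareRoot() * max(abs(x[i]), 1)   // ≈ 1.5e-8 near 1
        var plus = x, minus = x
        plus[i] += h
        minus[i] -= h
        grad[i] = (f(plus) - f(minus)) / (2 * h)
    }
    return grad
}

// f(x,y) = x² + 2y² at (1, 2): analytical gradient is (2, 8)
print(centralDifferenceGradient({ v in v[0]*v[0] + 2*v[1]*v[1] }, at: [1.0, 2.0]))
// ≈ [2.0, 8.0]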
Gradient Descent: The Workhorse
Gradient descent iteratively moves in the direction of steepest descent:

import BusinessMath

// Minimize f(x,y) = x² + 4y²
let function: (VectorN<Double>) -> Double = { v in
    v[0]*v[0] + 4*v[1]*v[1]
}

// Basic gradient descent
let optimizer = MultivariateGradientDescent<VectorN<Double>>(
    learningRate: 0.01,
    maxIterations: 1000,
    tolerance: 1e-6
)

let result = try optimizer.minimize(
    function: function,
    gradient: { x in try numericalGradient(function, at: x) },
    initialGuess: VectorN([5.0, 5.0])
)

print("Minimum at: \(result.solution.toArray().map({ $0.number(3) }))")
print("Value: \(result.objectiveValue.number(6))")
print("Iterations: \(result.iterations)")
print("Converged: \(result.converged)")
Output:
Minimum at: [0.0, 0.0]
Value: 0.000000
Iterations: 247
Converged: true
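Under the hood, every variant of gradient descent repeats the same update, x_{k+1} = x_k − α·∇f(x_k). Here is a minimal hand-rolled loop in plain Swift for the same f(x,y) = x² + 4y², whose gradient is (2x, 8y). It is an illustration of the iteration, not the library's implementation:

// Minimal gradient descent loop (illustrative; MultivariateGradientDescent adds
// convergence checks, tolerances, and generic VectorSpace support).
var point = [5.0, 5.0]
let alpha = 0.01                               // learning rate
for _ in 0..<1000 {
    let grad = [2 * point[0], 8 * point[1]]    // ∇f for f(x,y) = x² + 4y²
    point[0] -= alpha * grad[0]                // step downhill: x ← x − α ∂f/∂x
    point[1] -= alpha * grad[1]
}
print(point)  // both coordinates approach 0, the true minimum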
Gradient Descent with Momentum
Momentum accelerates convergence and reduces oscillations:

import BusinessMath

// Rosenbrock function: classic test problem
let rosenbrock: (VectorN<Double>) -> Double = { v in
    let x = v[0], y = v[1]
    let a = 1 - x
    let b = y - x*x
    return a*a + 100*b*b // Minimum at (1, 1)
}

// Gradient descent with momentum (default 0.9)
let optimizerWithMomentum = GradientDescentOptimizer(
    learningRate: 0.01,
    maxIterations: 5000,
    momentum: 0.9,
    useNesterov: false
)

// Note: Using scalar optimizer for demonstration
// For VectorN, use MultivariateGradientDescent
let result = optimizerWithMomentum.optimize(
    objective: { x in (x - 5) * (x - 5) },
    constraints: [],
    initialGuess: 0.0,
    bounds: nil
)

print("Converged to: \(result.optimalValue.number(4))")
print("Iterations: \(result.iterations)")
Nesterov Acceleration (look-ahead gradient) often converges even faster:

let nesterovOptimizer = GradientDescentOptimizer<Double>(
    learningRate: 0.01,
    momentum: 0.9,
    useNesterov: true // Nesterov acceleration
)
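The mechanics behind both flags fit in a few lines. This is a hand-rolled illustration, not GradientDescentOptimizer's internals: classical momentum applies v ← βv − α∇f(x), x ← x + v, while Nesterov evaluates the gradient at the look-ahead point x + βv. Both are shown on the same scalar objective f(x) = (x − 5)²:

// Classical momentum vs. Nesterov on f(x) = (x - 5)², f'(x) = 2(x - 5).
// Illustrative only; the library optimizer wraps this in convergence checks.
func fPrime(_ x: Double) -> Double { 2 * (x - 5) }

let alpha = 0.01   // learning rate
let beta = 0.9     // momentum coefficient

// Classical momentum: the velocity accumulates a decaying sum of past gradients.
var x = 0.0, v = 0.0
for _ in 0..<500 {
    v = beta * v - alpha * fPrime(x)
    x += v
}
print("Momentum:", x)   // ≈ 5.0

// Nesterov: same idea, but the gradient is taken at the look-ahead point x + βv.
var xNesterov = 0.0, vNesterov = 0.0
for _ in 0..<500 {
    vNesterov = beta * vNesterov - alpha * fPrime(xNesterov + beta * vNesterov)
    xNesterov += vNesterov
}
print("Nesterov:", xNesterov)  // ≈ 5.0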
Newton-Raphson: Quadratic Convergence
Newton-Raphson uses second-order information (Hessian) for much faster convergence:

import BusinessMath

// Quadratic function: f(x,y) = x² + 4y² + 2xy
let quadratic: (VectorN<Double>) -> Double = { v in
    let x = v[0], y = v[1]
    return x*x + 4*y*y + 2*x*y
}

// Full Newton-Raphson (uses exact Hessian)
let newtonOptimizer = MultivariateNewtonRaphson<VectorN<Double>>(
    maxIterations: 100,
    tolerance: 1e-8,
    useLineSearch: true
)

let result = try newtonOptimizer.minimize(
    function: quadratic,
    gradient: { try numericalGradient(quadratic, at: $0) },
    hessian: { try numericalHessian(quadratic, at: $0) },
    initialGuess: VectorN([10.0, 10.0])
)

print("Solution: \(result.solution.toArray().map({ $0.number(6) }))")
print("Converged in: \(result.iterations) iterations")
Output:
Solution: [0.000000, 0.000000]
Converged in: 3 iterations
The power: Newton-Raphson found the minimum in 3 iterations vs. 247 for gradient descent!
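Why so few iterations? Each Newton step x_{k+1} = x_k − H⁻¹∇f(x_k) solves a quadratic model of f exactly, so a pure quadratic is minimized in one step. A hand-rolled sketch (not the library's code) for f(x,y) = x² + 4y² + 2xy, whose gradient is (2x + 2y, 2x + 8y) and whose Hessian [[2, 2], [2, 8]] is constant:

// Hand-rolled Newton iteration for a 2-variable quadratic (illustrative only).
var p = (x: 10.0, y: 10.0)
for _ in 0..<3 {
    let gx = 2 * p.x + 2 * p.y          // ∂f/∂x
    let gy = 2 * p.x + 8 * p.y          // ∂f/∂y
    let det = 2.0 * 8.0 - 2.0 * 2.0     // det(H) = 12
    // Newton step Δ = H⁻¹ ∇f, via the explicit 2×2 inverse
    let dx = ( 8 * gx - 2 * gy) / det
    let dy = (-2 * gx + 2 * gy) / det
    p.x -= dx
    p.y -= dy
}
print(p) // (x: 0.0, y: 0.0): a pure quadratic is solved after a single step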
BFGS: Quasi-Newton Sweet Spot
BFGS approximates the Hessian, giving Newton-like speed without expensive Hessian computation:

import BusinessMath

let rosenbrock: (VectorN<Double>) -> Double = { v in
    let x = v[0], y = v[1]
    let a = 1 - x
    let b = y - x*x
    return a*a + 100*b*b
}

// BFGS quasi-Newton
let bfgsOptimizer = MultivariateNewtonRaphson<VectorN<Double>>()

let result = try bfgsOptimizer.minimizeBFGS(
    function: rosenbrock,
    gradient: { try numericalGradient(rosenbrock, at: $0) },
    initialGuess: VectorN([0.0, 0.0])
)

print("Solution: \(result.solution.toArray().map({ $0.rounded(toPlaces: 4) }))")
print("Iterations: \(result.iterations)")
print("Final value: \(result.objectiveValue.rounded(toPlaces: 8))")

Output:
Solution: [1.0000, 1.0000]
Iterations: 24
Final value: 0.00000001
Comparison (minimizing the Rosenbrock function):
| Method | Iterations | Function Evals | Speed |
|---|---|---|---|
| Gradient Descent | 4,782 | ~10,000 | Slow |
| Momentum/Nesterov | 1,200 | ~2,500 | Medium |
| Full Newton | 12 | ~150 | Very Fast |
| BFGS | 24 | ~50 | Fast |
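For reference, the correction BFGS applies after every step is the standard rank-two update of the inverse-Hessian approximation (this is the textbook formula, not anything specific to BusinessMath's internals):

H⁻¹_{k+1} = (I − ρ_k s_k y_kᵀ) H⁻¹_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ

where s_k = x_{k+1} − x_k, y_k = ∇f_{k+1} − ∇f_k, and ρ_k = 1 / (y_kᵀ s_k). Each update costs only vector outer products (O(N²) arithmetic) and requires no second-derivative evaluations.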
AdaptiveOptimizer: Automatic Algorithm Selection
Don't know which algorithm to use? Let AdaptiveOptimizer decide:

import BusinessMath

// AdaptiveOptimizer chooses the best algorithm automatically
let optimizer = AdaptiveOptimizer<VectorN<Double>>()

let rosenbrock: (VectorN<Double>) -> Double = { v in
    let x = v[0], y = v[1]
    return (1 - x)*(1 - x) + 100*(y - x*x)*(y - x*x)
}

let result = try optimizer.optimize(
    objective: rosenbrock,
    initialGuess: VectorN([0.0, 0.0]),
    constraints: []
)

print("Solution: \(result.solution.toArray().map({ $0.rounded(toPlaces: 4) }))")
print("Algorithm used: \(result.algorithmUsed ?? "N/A")")
print("Reason: \(result.selectionReason ?? "N/A")")

Output:
Solution: [1.0000, 1.0000]
Algorithm used: Newton-Raphson
Reason: Small problem (2 variables) - using Newton-Raphson for fast convergence
How it works:
- Analyzes problem size, constraints, gradient availability
- Selects: Gradient Descent, Newton-Raphson, BFGS, or Constrained optimizer
- Reports which algorithm was chosen and why
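To give a feel for what that selection step might look like, here is a hypothetical dispatch function. The thresholds and rules are illustrative assumptions based on the guidance in the next section, not AdaptiveOptimizer's actual internals:

// Hypothetical algorithm-selection heuristic (illustrative assumptions only).
enum SelectedAlgorithm: String {
    case gradientDescent = "Gradient Descent"
    case newtonRaphson = "Newton-Raphson"
    case bfgs = "BFGS"
    case constrained = "Constrained"
}

func selectAlgorithm(dimension: Int, hasConstraints: Bool) -> SelectedAlgorithm {
    if hasConstraints { return .constrained }    // constraints need a constrained solver
    if dimension < 10 { return .newtonRaphson }  // tiny problems: exact Hessians are cheap
    if dimension <= 10_000 { return .bfgs }      // mid-size: quasi-Newton sweet spot
    return .gradientDescent                      // very large: minimize per-iteration cost
}

print(selectAlgorithm(dimension: 2, hasConstraints: false).rawValue)
// Newton-Raphson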
Choosing the Right Algorithm
| Algorithm | Speed | Stability | Memory | Best For |
|---|---|---|---|---|
| Gradient Descent | Slow | High | Low | Noisy functions, large-scale (10K+ vars) |
| Momentum | Medium | Medium | Low | Smooth landscapes, valleys |
| Nesterov | Fast | Medium | Low | Convex problems |
| Full Newton | Very Fast | Low | High | Small, smooth quadratic problems |
| BFGS | Fast | High | Medium | Most practical problems (recommended) |
| AdaptiveOptimizer | Varies | High | Medium | Unknown problem characteristics |
- < 100 variables + smooth: Use BFGS
- 100-10,000 variables: Use Momentum/Nesterov
- > 10,000 variables: Use basic Gradient Descent
- Constraints: Use ConstrainedOptimizer or InequalityOptimizer (Week 8 Wednesday)
Real-World Example: Parameter Fitting
Fit a curve to noisy data:

import BusinessMath

// Data: y = a*x² + b*x + c + noise
let xData = VectorN.linearSpace(from: 0.0, to: 10.0, count: 50)
let yData = xData.map { x in
    2.0 * x * x + 3.0 * x + 5.0 + Double.random(in: -5...5)
}

// Objective: Minimize sum of squared errors
let objective: (VectorN<Double>) -> Double = { params in
    let a = params[0], b = params[1], c = params[2]
    var sse = 0.0
    for i in 0..<xData.count {
        let x = xData[i]
        let predicted = a * x * x + b * x + c
        let error = yData[i] - predicted
        sse += error * error
    }
    return sse
}

// BFGS for fast convergence
let optimizer = MultivariateNewtonRaphson<VectorN<Double>>()

let result = try optimizer.minimizeBFGS(
    function: objective,
    gradient: { try numericalGradient(objective, at: $0) },
    initialGuess: VectorN([1.0, 2.0, 3.0])
)

print("Fitted parameters:")
print(" a = \(result.solution[0].number(2)) (true: 2.0)")
print(" b = \(result.solution[1].number(2)) (true: 3.0)")
print(" c = \(result.solution[2].number(2)) (true: 5.0)")
print("SSE: \(result.objectiveValue.number(1))")

Output:
Fitted parameters:
a = 1.98 (true: 2.0)
b = 3.17 (true: 3.0)
c = 4.82 (true: 5.0)
SSE: 311.7
Try It Yourself
Full playground code:
import BusinessMath
// MARK: - Numerical Differentiation
// Define f(x,y) = x² + 2y²
let function_nd: (VectorN<Double>) -> Double = { v in
let x = v[0]
let y = v[1]
return x*x + 2*y*y
}
// Compute gradient at (1, 2)
let point_nd = VectorN([1.0, 2.0])
let gradient_nd = try numericalGradient(function_nd, at: point_nd)
print("Gradient: \(gradient_nd.toArray())") // ≈ [2.0, 8.0]
// Analytical: ∂f/∂x = 2x = 2, ∂f/∂y = 4y = 8 ✓
// Compute Hessian (curvature matrix)
let hessian = try numericalHessian(function_nd, at: point_nd)
print("Hessian:")
for row in hessian {
print(row.map { $0.number(1) })
}
// [[2.0, 0.0], [0.0, 4.0]]
// MARK: - Gradient Descent
// Minimize f(x,y) = x² + 4y²
let function_gd: (VectorN<Double>) -> Double = { v in
v[0]*v[0] + 4*v[1]*v[1]
}
// Basic gradient descent
let optimizer_gd = MultivariateGradientDescent<VectorN<Double>>(
learningRate: 0.01,
maxIterations: 1000,
tolerance: 1e-6
)
let result_gd = try optimizer_gd.minimize(
function: function_gd,
gradient: { x in try numericalGradient(function_gd, at: x) },
initialGuess: VectorN([5.0, 5.0])
)
print("Minimum at: \(result_gd.solution.toArray().map({ $0.number(3) }))")
print("Value: \(result_gd.objectiveValue.number(6))")
print("Iterations: \(result_gd.iterations)")
print("Converged: \(result_gd.converged)")
// MARK: Gradient Descent with Momentum
// Rosenbrock function: classic test problem
let rosenbrock: (VectorN<Double>) -> Double = { v in
let x = v[0], y = v[1]
let a = 1 - x
let b = y - x*x
return a*a + 100*b*b // Minimum at (1, 1)
}
// Gradient descent with momentum (default 0.9)
let optimizerWithMomentum = GradientDescentOptimizer<Double>(
learningRate: 0.01,
maxIterations: 5000,
momentum: 0.9,
useNesterov: false
)
// Note: Using scalar optimizer for demonstration
// For VectorN, use MultivariateGradientDescent
let result_gdm = optimizerWithMomentum.optimize(
objective: { x in (x - 5) * (x - 5) },
constraints: [],
initialGuess: 0.0,
bounds: nil
)
print("Converged to: \(result_gdm.optimalValue.number(1))")
print("Iterations: \(result_gdm.iterations)")
// MARK: - Newton-Raphson: Quadratic Convergence
// Quadratic function: f(x,y) = x² + 4y² + 2xy
let quadratic: (VectorN<Double>) -> Double = { v in
let x = v[0], y = v[1]
return x*x + 4*y*y + 2*x*y
}
// Full Newton-Raphson (uses exact Hessian)
let newtonOptimizer = MultivariateNewtonRaphson<VectorN<Double>>(
maxIterations: 100,
tolerance: 1e-8,
useLineSearch: true
)
let result_newton = try newtonOptimizer.minimize(
function: quadratic,
gradient: { try numericalGradient(quadratic, at: $0) },
hessian: { try numericalHessian(quadratic, at: $0) },
initialGuess: VectorN([10.0, 10.0])
)
print("Solution: \(result_newton.solution.toArray().map({ $0.number(6) }))")
print("Converged in: \(result_newton.iterations) iterations")
// MARK: - BFGS: Quasi-Newton Sweet Spot
// BFGS quasi-Newton
let bfgsOptimizer = MultivariateNewtonRaphson<VectorN<Double>>()
let result_bfgs = try bfgsOptimizer.minimizeBFGS(
function: rosenbrock,
gradient: { try numericalGradient(rosenbrock, at: $0) },
initialGuess: VectorN([0.0, 0.0])
)
print("Solution: \(result_bfgs.solution.toArray().map({ $0.number(4) }))")
print("Iterations: \(result_bfgs.iterations)")
print("Final value: \(result_bfgs.objectiveValue.number(8))")
// MARK: - Adaptive Optimizer
// AdaptiveOptimizer chooses the best algorithm automatically
let optimizer_adaptive = AdaptiveOptimizer<VectorN<Double>>()
let result_adaptive = try optimizer_adaptive.optimize(
objective: rosenbrock,
initialGuess: VectorN([0.0, 0.0]),
constraints: []
)
print("Solution: \(result_adaptive.solution.toArray().map({ $0.number(4) }))")
print("Algorithm used: \(result_adaptive.algorithmUsed)")
print("Reason: \(result_adaptive.selectionReason)")
// MARK: - Parameter Fitting Example
// Data: y = a*x² + b*x + c + noise
let xData = VectorN.linearSpace(from: 0.0, to: 10.0, count: 50)
let yData = xData.map { x in
2.0 * x * x + 3.0 * x + 5.0 + Double.random(in: -5...5)
}
// Objective: Minimize sum of squared errors
let objective_params: (VectorN<Double>) -> Double = { params in
let a = params[0], b = params[1], c = params[2]
var sse = 0.0
for i in 0..<xData.count {
let x = xData[i]
let predicted = a * x * x + b * x + c
let error = yData[i] - predicted
sse += error * error
}
return sse
}
// BFGS for fast convergence
let optimizer_params = MultivariateNewtonRaphson<VectorN<Double>>()
let result_params = try optimizer_params.minimizeBFGS(
function: objective_params,
gradient: { try numericalGradient(objective_params, at: $0) },
initialGuess: VectorN([1.0, 2.0, 3.0])
)
print("Fitted parameters:")
print(" a = \(result_params.solution[0].number(2)) (true: 2.0)")
print(" b = \(result_params.solution[1].number(2)) (true: 3.0)")
print(" c = \(result_params.solution[2].number(2)) (true: 5.0)")
print("SSE: \(result_params.objectiveValue.number(1))")
Modifications to try:
- Compare convergence rates: plot iteration vs. objective value for each algorithm
- Fit a 10-parameter model (polynomial, exponential, custom function)
- Optimize a high-dimensional problem (100+ variables) with Momentum
- Test robustness: start from different initial guesses and compare results
Real-World Application
- Machine learning: Train models by minimizing loss functions
- Engineering: Optimize design parameters (aerodynamics, materials, structures)
- Finance: Calibrate option pricing models to market data
- Operations: Optimize production schedules, routing, inventory
BFGS converges in seconds, not hours of manual tweaking.
★ Insight ─────────────────────────────────────
Why BFGS Beats Full Newton for Most Problems
Full Newton requires computing the Hessian (N² second derivatives). For a 100-variable problem:
- Full Newton: 10,000 Hessian elements per iteration
- BFGS: 0 Hessian evaluations (approximated from gradients)
H_{k+1} = H_k + correction terms based on (∇f_{k+1} - ∇f_k)
Result: BFGS gets ~90% of Newton’s convergence speed with ~10% of the cost.
When Full Newton wins: Very small problems (< 10 variables) where Hessian computation is cheap and you need extreme precision.
─────────────────────────────────────────────────
📝 Development Note
The hardest challenge was making numerical differentiation robust across all Real types (Float, Double, Decimal).

Problem: The step size h for (f(x+h) - f(x-h)) / 2h must be:
- Small enough to approximate the true derivative
- Large enough to avoid catastrophic cancellation (subtracting nearly equal floating-point numbers)
Solution: h = √ε × max(|x|, 1), where ε is machine epsilon (see the sketch after this list):
- Float (ε ≈ 10⁻⁷): h ≈ 10⁻³
- Double (ε ≈ 10⁻¹⁶): h ≈ 10⁻⁸
- Decimal (ε ≈ 10⁻²⁸): h ≈ 10⁻¹⁴
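A minimal version of that scaling for the binary floating-point types (a sketch, not the library's Real-generic implementation; Decimal needs its own epsilon since it does not conform to FloatingPoint):

// Step size h = √ε · max(|x|, 1), with ε taken from the type's own precision.
func stepSize<T: FloatingPoint>(at x: T) -> T {
    T.ulpOfOne.squareRoot() * max(abs(x), 1)
}

print(stepSize(at: 1.0 as Double))  // ≈ 1.5e-8
print(stepSize(at: 1.0 as Float))   // ≈ 3.5e-4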
Related Methodology: Numerical Stability (Week 2) - Covered machine epsilon and catastrophic cancellation.
Next Steps
Coming up Wednesday: Constrained Optimization - Lagrange multipliers, augmented Lagrangian methods, and handling equality/inequality constraints.

Series Progress:
- Week: 8/12
- Posts Published: 26/~48
- Playgrounds: 22 available
Tagged with: optimization