Multivariate Optimization: Gradient Descent to Newton-Raphson
BusinessMath Quarterly Series
13 min read
Part 26 of 12-Week BusinessMath Series
What You’ll Learn
- Understanding numerical differentiation for gradients and Hessians
- Using gradient descent with momentum and Nesterov acceleration
- Applying Newton-Raphson and BFGS quasi-Newton methods
- Choosing the right optimization algorithm for your problem
- Using AdaptiveOptimizer for automatic algorithm selection
- Understanding convergence rates and algorithm trade-offs
The Problem
Real-world optimization problems have multiple variables:
- Parameter fitting: Minimize error across 10+ parameters
- Cost minimization: Optimize production mix across multiple products/facilities
- Portfolio construction: Find optimal weights for N assets
- Machine learning: Fit models with thousands of parameters
The Solution
BusinessMath provides a progression of multivariate optimizers, from simple gradient descent to sophisticated second-order methods. All work generically with any VectorSpace type.
Numerical Differentiation
When you can't compute derivatives analytically, BusinessMath computes them numerically:

import BusinessMath

// Define f(x,y) = x² + 2y²
let function: (VectorN<Double>) -> Double = { v in
    let x = v[0]
    let y = v[1]
    return x*x + 2*y*y
}

// Compute gradient at (1, 2)
let point = VectorN([1.0, 2.0])
let gradient = try numericalGradient(function, at: point)
print("Gradient: \(gradient.toArray())") // ≈ [2.0, 8.0]
// Analytical: ∂f/∂x = 2x = 2, ∂f/∂y = 4y = 8 ✓

// Compute Hessian (curvature matrix)
let hessian = try numericalHessian(function, at: point)
print("Hessian:")
for row in hessian {
    print(row.map { $0.number(1) })
}
// [[2.0, 0.0], [0.0, 4.0]]
Output:
Gradient: [2.0, 8.0]
Hessian:
[2.0, 0.0]
[0.0, 4.0]
How it works:
- Gradient: Central finite differences, (f(x+h) - f(x-h)) / 2h
- Hessian: Second-order finite differences (N² function evaluations)
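To make the central-difference formula concrete, here is a bare-bones gradient in plain Swift. It is a sketch of the idea only, not BusinessMath's numericalGradient implementation; the step-size rule matches the one described in the Development Note at the end of this post.

// Central-difference gradient for f: [Double] -> Double (illustrative only).
// Each partial derivative perturbs one coordinate by ±h and applies
// (f(x + h·eᵢ) - f(x - h·eᵢ)) / 2h, with h scaled to the coordinate's magnitude.
func centralDifferenceGradient(_ f: ([Double]) -> Double, at x: [Double]) -> [Double] {
    var grad = [Double](repeating: 0, count: x.count)
    for i in 0..<x.count {
        let h = Double.ulpOfOne.squareRoot() * max(abs(x[i]), 1)   // ≈ 1.5e-8 near 1
        var plus = x, minus = x
        plus[i] += h
        minus[i] -= h
        grad[i] = (f(plus) - f(minus)) / (2 * h)
    }
    return grad
}

// f(x,y) = x² + 2y² at (1, 2): analytical gradient is (2, 8)
print(centralDifferenceGradient({ v in v[0]*v[0] + 2*v[1]*v[1] }, at: [1.0, 2.0]))
// ≈ [2.0, 8.0]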
Gradient Descent: The Workhorse
Gradient descent iteratively moves in the direction of steepest descent:

import BusinessMath

// Minimize f(x,y) = x² + 4y²
let function: (VectorN<Double>) -> Double = { v in
    v[0]*v[0] + 4*v[1]*v[1]
}

// Basic gradient descent
let optimizer = MultivariateGradientDescent<VectorN<Double>>(
    learningRate: 0.01,
    maxIterations: 1000,
    tolerance: 1e-6
)

let result = try optimizer.minimize(
    function: function,
    gradient: { x in try numericalGradient(function, at: x) },
    initialGuess: VectorN([5.0, 5.0])
)

print("Minimum at: \(result.solution.toArray().map({ $0.number(3) }))")
print("Value: \(result.objectiveValue.number(6))")
print("Iterations: \(result.iterations)")
print("Converged: \(result.converged)")
Output:
Minimum at: [0.0, 0.0]
Value: 0.000000
Iterations: 247
Converged: true
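Under the hood, every variant of gradient descent repeats the same update, x_{k+1} = x_k − α·∇f(x_k). Here is a minimal hand-rolled loop in plain Swift for the same f(x,y) = x² + 4y², whose gradient is (2x, 8y). It is an illustration of the iteration, not the library's implementation:

// Minimal gradient descent loop (illustrative; MultivariateGradientDescent adds
// convergence checks, tolerances, and generic VectorSpace support).
var point = [5.0, 5.0]
let alpha = 0.01                               // learning rate
for _ in 0..<1000 {
    let grad = [2 * point[0], 8 * point[1]]    // ∇f for f(x,y) = x² + 4y²
    point[0] -= alpha * grad[0]                // step downhill: x ← x − α ∂f/∂x
    point[1] -= alpha * grad[1]
}
print(point)  // both coordinates approach 0, the true minimum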
Gradient Descent with Momentum
Momentum accelerates convergence and reduces oscillations:

import BusinessMath

// Rosenbrock function: classic test problem
let rosenbrock: (VectorN<Double>) -> Double = { v in
    let x = v[0], y = v[1]
    let a = 1 - x
    let b = y - x*x
    return a*a + 100*b*b // Minimum at (1, 1)
}

// Gradient descent with momentum (default 0.9)
let optimizerWithMomentum = GradientDescentOptimizer(
    learningRate: 0.01,
    maxIterations: 5000,
    momentum: 0.9,
    useNesterov: false
)

// Note: Using scalar optimizer for demonstration
// For VectorN, use MultivariateGradientDescent
let result = optimizerWithMomentum.optimize(
    objective: { x in (x - 5) * (x - 5) },
    constraints: [],
    initialGuess: 0.0,
    bounds: nil
)

print("Converged to: \(result.optimalValue.number(4))")
print("Iterations: \(result.iterations)")
Nesterov Acceleration (look-ahead gradient) often converges even faster:

let nesterovOptimizer = GradientDescentOptimizer<Double>(
    learningRate: 0.01,
    momentum: 0.9,
    useNesterov: true // Nesterov acceleration
)
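The mechanics behind both flags fit in a few lines. This is a hand-rolled illustration, not GradientDescentOptimizer's internals: classical momentum applies v ← βv − α∇f(x), x ← x + v, while Nesterov evaluates the gradient at the look-ahead point x + βv. Both are shown on the same scalar objective f(x) = (x − 5)²:

// Classical momentum vs. Nesterov on f(x) = (x - 5)², f'(x) = 2(x - 5).
// Illustrative only; the library optimizer wraps this in convergence checks.
func fPrime(_ x: Double) -> Double { 2 * (x - 5) }

let alpha = 0.01   // learning rate
let beta = 0.9     // momentum coefficient

// Classical momentum: the velocity accumulates a decaying sum of past gradients.
var x = 0.0, v = 0.0
for _ in 0..<500 {
    v = beta * v - alpha * fPrime(x)
    x += v
}
print("Momentum:", x)   // ≈ 5.0

// Nesterov: same idea, but the gradient is taken at the look-ahead point x + βv.
var xNesterov = 0.0, vNesterov = 0.0
for _ in 0..<500 {
    vNesterov = beta * vNesterov - alpha * fPrime(xNesterov + beta * vNesterov)
    xNesterov += vNesterov
}
print("Nesterov:", xNesterov)  // ≈ 5.0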
Newton-Raphson: Quadratic Convergence
Newton-Raphson uses second-order information (Hessian) for much faster convergence:

import BusinessMath

// Quadratic function: f(x,y) = x² + 4y² + 2xy
let quadratic: (VectorN<Double>) -> Double = { v in
    let x = v[0], y = v[1]
    return x*x + 4*y*y + 2*x*y
}

// Full Newton-Raphson (uses exact Hessian)
let newtonOptimizer = MultivariateNewtonRaphson<VectorN<Double>>(
    maxIterations: 100,
    tolerance: 1e-8,
    useLineSearch: true
)

let result = try newtonOptimizer.minimize(
    function: quadratic,
    gradient: { try numericalGradient(quadratic, at: $0) },
    hessian: { try numericalHessian(quadratic, at: $0) },
    initialGuess: VectorN([10.0, 10.0])
)

print("Solution: \(result.solution.toArray().map({ $0.number(6) }))")
print("Converged in: \(result.iterations) iterations")
Output:
Solution: [0.000000, 0.000000]
Converged in: 3 iterations
The power: Newton-Raphson found the minimum in 3 iterations vs. 247 for gradient descent!
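Why so few iterations? Each Newton step x_{k+1} = x_k − H⁻¹∇f(x_k) solves a quadratic model of f exactly, so a pure quadratic is minimized in one step. A hand-rolled sketch (not the library's code) for f(x,y) = x² + 4y² + 2xy, whose gradient is (2x + 2y, 2x + 8y) and whose Hessian [[2, 2], [2, 8]] is constant:

// Hand-rolled Newton iteration for a 2-variable quadratic (illustrative only).
var p = (x: 10.0, y: 10.0)
for _ in 0..<3 {
    let gx = 2 * p.x + 2 * p.y          // ∂f/∂x
    let gy = 2 * p.x + 8 * p.y          // ∂f/∂y
    let det = 2.0 * 8.0 - 2.0 * 2.0     // det(H) = 12
    // Newton step Δ = H⁻¹ ∇f, via the explicit 2×2 inverse
    let dx = ( 8 * gx - 2 * gy) / det
    let dy = (-2 * gx + 2 * gy) / det
    p.x -= dx
    p.y -= dy
}
print(p) // (x: 0.0, y: 0.0): a pure quadratic is solved after a single step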
BFGS: Quasi-Newton Sweet Spot
BFGS approximates the Hessian, giving Newton-like speed without expensive Hessian computation:

import BusinessMath

let rosenbrock: (VectorN<Double>) -> Double = { v in
    let x = v[0], y = v[1]
    let a = 1 - x
    let b = y - x*x
    return a*a + 100*b*b
}

// BFGS quasi-Newton
let bfgsOptimizer = MultivariateNewtonRaphson<VectorN<Double>>()

let result = try bfgsOptimizer.minimizeBFGS(
    function: rosenbrock,
    gradient: { try numericalGradient(rosenbrock, at: $0) },
    initialGuess: VectorN([0.0, 0.0])
)

print("Solution: \(result.solution.toArray().map({ $0.rounded(toPlaces: 4) }))")
print("Iterations: \(result.iterations)")
print("Final value: \(result.objectiveValue.rounded(toPlaces: 8))")

Output:
Solution: [1.0000, 1.0000]
Iterations: 24
Final value: 0.00000001
Comparison (minimizing the Rosenbrock function):
| Method | Iterations | Function Evals | Speed |
|---|---|---|---|
| Gradient Descent | 4,782 | ~10,000 | Slow |
| Momentum/Nesterov | 1,200 | ~2,500 | Medium |
| Full Newton | 12 | ~150 | Very Fast |
| BFGS | 24 | ~50 | Fast |
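For reference, the correction BFGS applies after every step is the standard rank-two update of the inverse-Hessian approximation (this is the textbook formula, not anything specific to BusinessMath's internals):

H⁻¹_{k+1} = (I − ρ_k s_k y_kᵀ) H⁻¹_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ

where s_k = x_{k+1} − x_k, y_k = ∇f_{k+1} − ∇f_k, and ρ_k = 1 / (y_kᵀ s_k). Each update costs only vector outer products (O(N²) arithmetic) and requires no second-derivative evaluations.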
AdaptiveOptimizer: Automatic Algorithm Selection
Don't know which algorithm to use? Let AdaptiveOptimizer decide:

import BusinessMath

// AdaptiveOptimizer chooses the best algorithm automatically
let optimizer = AdaptiveOptimizer<VectorN<Double>>()

let rosenbrock: (VectorN<Double>) -> Double = { v in
    let x = v[0], y = v[1]
    return (1 - x)*(1 - x) + 100*(y - x*x)*(y - x*x)
}

let result = try optimizer.optimize(
    objective: rosenbrock,
    initialGuess: VectorN([0.0, 0.0]),
    constraints: []
)

print("Solution: \(result.solution.toArray().map({ $0.rounded(toPlaces: 4) }))")
print("Algorithm used: \(result.algorithmUsed ?? "N/A")")
print("Reason: \(result.selectionReason ?? "N/A")")

Output:
Solution: [1.0000, 1.0000]
Algorithm used: Newton-Raphson
Reason: Small problem (2 variables) - using Newton-Raphson for fast convergence
How it works:
- Analyzes problem size, constraints, gradient availability
- Selects: Gradient Descent, Newton-Raphson, BFGS, or Constrained optimizer
- Reports which algorithm was chosen and why
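To give a feel for what that selection step might look like, here is a hypothetical dispatch function. The thresholds and rules are illustrative assumptions based on the guidance in the next section, not AdaptiveOptimizer's actual internals:

// Hypothetical algorithm-selection heuristic (illustrative assumptions only).
enum SelectedAlgorithm: String {
    case gradientDescent = "Gradient Descent"
    case newtonRaphson = "Newton-Raphson"
    case bfgs = "BFGS"
    case constrained = "Constrained"
}

func selectAlgorithm(dimension: Int, hasConstraints: Bool) -> SelectedAlgorithm {
    if hasConstraints { return .constrained }    // constraints need a constrained solver
    if dimension < 10 { return .newtonRaphson }  // tiny problems: exact Hessians are cheap
    if dimension <= 10_000 { return .bfgs }      // mid-size: quasi-Newton sweet spot
    return .gradientDescent                      // very large: minimize per-iteration cost
}

print(selectAlgorithm(dimension: 2, hasConstraints: false).rawValue)
// Newton-Raphson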
Choosing the Right Algorithm
| Algorithm | Speed | Stability | Memory | Best For |
|---|---|---|---|---|
| Gradient Descent | Slow | High | Low | Noisy functions, large-scale (10K+ vars) |
| Momentum | Medium | Medium | Low | Smooth landscapes, valleys |
| Nesterov | Fast | Medium | Low | Convex problems |
| Full Newton | Very Fast | Low | High | Small, smooth quadratic problems |
| BFGS | Fast | High | Medium | Most practical problems (recommended) |
| AdaptiveOptimizer | Varies | High | Medium | Unknown problem characteristics |
- < 100 variables + smooth: Use BFGS
- 100-10,000 variables: Use Momentum/Nesterov
- > 10,000 variables: Use basic Gradient Descent
- Constraints: Use ConstrainedOptimizer or InequalityOptimizer (Week 8 Wednesday)
Real-World Example: Parameter Fitting
Fit a curve to noisy data:

import BusinessMath

// Data: y = a*x² + b*x + c + noise
let xData = VectorN.linearSpace(from: 0.0, to: 10.0, count: 50)
let yData = xData.map { x in
    2.0 * x * x + 3.0 * x + 5.0 + Double.random(in: -5...5)
}

// Objective: Minimize sum of squared errors
let objective: (VectorN<Double>) -> Double = { params in
    let a = params[0], b = params[1], c = params[2]
    var sse = 0.0
    for i in 0..<xData.count {
        let x = xData[i]
        let predicted = a * x * x + b * x + c
        let error = yData[i] - predicted
        sse += error * error
    }
    return sse
}

// BFGS for fast convergence
let optimizer = MultivariateNewtonRaphson<VectorN<Double>>()

let result = try optimizer.minimizeBFGS(
    function: objective,
    gradient: { try numericalGradient(objective, at: $0) },
    initialGuess: VectorN([1.0, 2.0, 3.0])
)

print("Fitted parameters:")
print(" a = \(result.solution[0].number(2)) (true: 2.0)")
print(" b = \(result.solution[1].number(2)) (true: 3.0)")
print(" c = \(result.solution[2].number(2)) (true: 5.0)")
print("SSE: \(result.objectiveValue.number(1))")

Output:
Fitted parameters:
a = 1.98 (true: 2.0)
b = 3.17 (true: 3.0)
c = 4.82 (true: 5.0)
SSE: 311.7
Try It Yourself
Full playground code:
import BusinessMath
// MARK: - Numerical Differentiation
// Define f(x,y) = x² + 2y²
let function_nd: (VectorN<Double>) -> Double = { v in
let x = v[0]
let y = v[1]
return x*x + 2*y*y
}
// Compute gradient at (1, 2)
let point_nd = VectorN([1.0, 2.0])
let gradient_nd = try numericalGradient(function_nd, at: point_nd)
print("Gradient: \(gradient_nd.toArray())") // ≈ [2.0, 8.0]
// Analytical: ∂f/∂x = 2x = 2, ∂f/∂y = 4y = 8 ✓
// Compute Hessian (curvature matrix)
let hessian = try numericalHessian(function_nd, at: point_nd)
print("Hessian:")
for row in hessian {
print(row.map { $0.number(1) })
}
// [[2.0, 0.0], [0.0, 4.0]]
// MARK: - Gradient Descent
// Minimize f(x,y) = x² + 4y²
let function_gd: (VectorN<Double>) -> Double = { v in
v[0]*v[0] + 4*v[1]*v[1]
}
// Basic gradient descent
let optimizer_gd = MultivariateGradientDescent<VectorN<Double>>(
learningRate: 0.01,
maxIterations: 1000,
tolerance: 1e-6
)
let result_gd = try optimizer_gd.minimize(
function: function_gd,
gradient: { x in try numericalGradient(function_gd, at: x) },
initialGuess: VectorN([5.0, 5.0])
)
print("Minimum at: \(result_gd.solution.toArray().map({ $0.number(3) }))")
print("Value: \(result_gd.objectiveValue.number(6))")
print("Iterations: \(result_gd.iterations)")
print("Converged: \(result_gd.converged)")
// MARK: Gradient Descent with Momentum
// Rosenbrock function: classic test problem
let rosenbrock: (VectorN<Double>) -> Double = { v in
let x = v[0], y = v[1]
let a = 1 - x
let b = y - x*x
return a*a + 100*b*b // Minimum at (1, 1)
}
// Gradient descent with momentum (default 0.9)
let optimizerWithMomentum = GradientDescentOptimizer<Double>(
learningRate: 0.01,
maxIterations: 5000,
momentum: 0.9,
useNesterov: false
)
// Note: Using scalar optimizer for demonstration
// For VectorN, use MultivariateGradientDescent
let result_gdm = optimizerWithMomentum.optimize(
objective: { x in (x - 5) * (x - 5) },
constraints: [],
initialGuess: 0.0,
bounds: nil
)
print("Converged to: \(result_gdm.optimalValue.number(1))")
print("Iterations: \(result_gdm.iterations)")
// MARK: - Newton-Raphson: Quadratic Convergence
// Quadratic function: f(x,y) = x² + 4y² + 2xy
let quadratic: (VectorN<Double>) -> Double = { v in
let x = v[0], y = v[1]
return x*x + 4*y*y + 2*x*y
}
// Full Newton-Raphson (uses exact Hessian)
let newtonOptimizer = MultivariateNewtonRaphson<VectorN<Double>>(
maxIterations: 100,
tolerance: 1e-8,
useLineSearch: true
)
let result_newton = try newtonOptimizer.minimize(
function: quadratic,
gradient: { try numericalGradient(quadratic, at: $0) },
hessian: { try numericalHessian(quadratic, at: $0) },
initialGuess: VectorN([10.0, 10.0])
)
print("Solution: \(result_newton.solution.toArray().map({ $0.number(6) }))")
print("Converged in: \(result_newton.iterations) iterations")
// MARK: - BFGS: Quasi-Newton Sweet Spot
// BFGS quasi-Newton
let bfgsOptimizer = MultivariateNewtonRaphson<VectorN<Double>>()
let result_bfgs = try bfgsOptimizer.minimizeBFGS(
function: rosenbrock,
gradient: { try numericalGradient(rosenbrock, at: $0) },
initialGuess: VectorN([0.0, 0.0])
)
print("Solution: \(result_bfgs.solution.toArray().map({ $0.number(4) }))")
print("Iterations: \(result_bfgs.iterations)")
print("Final value: \(result_bfgs.objectiveValue.number(8))")
// MARK: - Adaptive Optimizer
// AdaptiveOptimizer chooses the best algorithm automatically
let optimizer_adaptive = AdaptiveOptimizer<VectorN<Double>>()
let result_adaptive = try optimizer_adaptive.optimize(
objective: rosenbrock,
initialGuess: VectorN([0.0, 0.0]),
constraints: []
)
print("Solution: \(result_adaptive.solution.toArray().map({ $0.number(4) }))")
print("Algorithm used: \(result_adaptive.algorithmUsed)")
print("Reason: \(result_adaptive.selectionReason)")
// MARK: - Parameter Fitting Example
// Data: y = a*x² + b*x + c + noise
let xData = VectorN.linearSpace(from: 0.0, to: 10.0, count: 50)
let yData = xData.map { x in
2.0 * x * x + 3.0 * x + 5.0 + Double.random(in: -5...5)
}
// Objective: Minimize sum of squared errors
let objective_params: (VectorN<Double>) -> Double = { params in
let a = params[0], b = params[1], c = params[2]
var sse = 0.0
for i in 0..<xData.count {
let x = xData[i]
let predicted = a * x * x + b * x + c
let error = yData[i] - predicted
sse += error * error
}
return sse
}
// BFGS for fast convergence
let optimizer_params = MultivariateNewtonRaphson<VectorN<Double>>()
let result_params = try optimizer_params.minimizeBFGS(
function: objective_params,
gradient: { try numericalGradient(objective_params, at: $0) },
initialGuess: VectorN([1.0, 2.0, 3.0])
)
print("Fitted parameters:")
print(" a = \(result_params.solution[0].number(2)) (true: 2.0)")
print(" b = \(result_params.solution[1].number(2)) (true: 3.0)")
print(" c = \(result_params.solution[2].number(2)) (true: 5.0)")
print("SSE: \(result_params.objectiveValue.number(1))")
Modifications to try:
- Compare convergence rates: plot iteration vs. objective value for each algorithm
- Fit a 10-parameter model (polynomial, exponential, custom function)
- Optimize a high-dimensional problem (100+ variables) with Momentum
- Test robustness: start from different initial guesses and compare results
Real-World Application
- Machine learning: Train models by minimizing loss functions
- Engineering: Optimize design parameters (aerodynamics, materials, structures)
- Finance: Calibrate option pricing models to market data
- Operations: Optimize production schedules, routing, inventory
BFGS converges in seconds, not hours of manual tweaking.
★ Insight ─────────────────────────────────────
Why BFGS Beats Full Newton for Most Problems
Full Newton requires computing the Hessian (N² second derivatives). For a 100-variable problem:
- Full Newton: 10,000 Hessian elements per iteration
- BFGS: 0 Hessian evaluations (approximated from gradients)
H_{k+1} = H_k + correction terms based on (∇f_{k+1} - ∇f_k)
Result: BFGS gets ~90% of Newton’s convergence speed with ~10% of the cost.
When Full Newton wins: Very small problems (< 10 variables) where Hessian computation is cheap and you need extreme precision.
─────────────────────────────────────────────────
📝 Development Note
The hardest challenge was making numerical differentiation robust across all Real types (Float, Double, Decimal).

Problem: The step size h for (f(x+h) - f(x-h)) / 2h must be:
- Small enough to approximate the true derivative
- Large enough to avoid catastrophic cancellation (subtracting nearly equal floating-point numbers)
Solution: h = √ε × max(|x|, 1), where ε is machine epsilon (see the sketch after this list):
- Float (ε ≈ 10⁻⁷): h ≈ 10⁻³
- Double (ε ≈ 10⁻¹⁶): h ≈ 10⁻⁸
- Decimal (ε ≈ 10⁻²⁸): h ≈ 10⁻¹⁴
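A minimal version of that scaling for the binary floating-point types (a sketch, not the library's Real-generic implementation; Decimal needs its own epsilon since it does not conform to FloatingPoint):

// Step size h = √ε · max(|x|, 1), with ε taken from the type's own precision.
func stepSize<T: FloatingPoint>(at x: T) -> T {
    T.ulpOfOne.squareRoot() * max(abs(x), 1)
}

print(stepSize(at: 1.0 as Double))  // ≈ 1.5e-8
print(stepSize(at: 1.0 as Float))   // ≈ 3.5e-4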
Related Methodology: Numerical Stability (Week 2) - Covered machine epsilon and catastrophic cancellation.
Next Steps
Coming up Wednesday: Constrained Optimization - Lagrange multipliers, augmented Lagrangian methods, and handling equality/inequality constraints.

Series Progress:
- Week: 8/12
- Posts Published: 26/~48
- Playgrounds: 22 available
Tagged with: optimization