A simplified guide to brushing up on Mathematics for Artificial Intelligence, Machine Learning and Data Science: Optimization (important pointers only)
Module IV : Optimization
I. Convex Sets and Convex Functions
1. Convex Sets
A convex set is a subset of a vector space such that, for any two points in the set, the line segment connecting them also lies entirely within the set.
A set $C$ in a vector space is convex if, for any $x, y \in C$ and any $\theta$ such that $0 \le \theta \le 1$, the point $\theta x + (1 - \theta) y$ is also in $C$.
Properties of Convex Sets:
- Intersection: The intersection of convex sets is also convex.
- Convex Hull: The smallest convex set containing a given set of points is called the convex hull of those points.
- Half-spaces and Hyperplanes: Half-spaces and hyperplanes are examples of convex sets.
Eg: Line segments, Circles, Polyhedrons
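The defining property is easy to test numerically. A minimal Python sketch (the unit disc and the sample count are arbitrary choices for illustration):

```python
import numpy as np

# Sketch: numerically illustrate convexity of the unit disc.
# For random points x, y in the disc and random theta in [0, 1],
# the convex combination theta*x + (1 - theta)*y must stay in the disc.
rng = np.random.default_rng(0)

def sample_disc():
    """Rejection-sample a point from the unit disc."""
    while True:
        p = rng.uniform(-1, 1, size=2)
        if np.linalg.norm(p) <= 1:
            return p

for _ in range(1000):
    x, y = sample_disc(), sample_disc()
    theta = rng.uniform()
    z = theta * x + (1 - theta) * y
    assert np.linalg.norm(z) <= 1 + 1e-12  # never fires: the disc is convex
```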
2. Convex Functions
A convex function is a function $f$ whose domain is a convex set and which satisfies the following condition: for any $x, y$ in the domain and any $\theta$ such that $0 \le \theta \le 1$, $f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$.
This means that the line segment connecting any two points on the graph of the function lies above or on the graph.
Properties of Convex Functions:
- Epigraph: The set of points lying on or above the graph of a convex function forms a convex set, known as the epigraph.
- Local and Global Minima: Any local minimum of a convex function is also a global minimum.
- Second Derivative Test: If a function is twice differentiable, it is convex if and only if its Hessian matrix $\nabla^2 f(x)$ is positive semi-definite for all $x$ in its domain.
- First Order Condition: A differentiable function is convex if and only if $f(y) \ge f(x) + \nabla f(x)^T (y - x)$ for all $x, y$ in its domain.
Eg : Linear functions, exponential functions, quadratic functions with positive semidefinite Hessian
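The second-derivative test translates directly into a numerical check. A minimal sketch, assuming a hand-picked symmetric matrix $Q$ as the constant Hessian of a quadratic:

```python
import numpy as np

# Sketch: check convexity of f(x) = 0.5 * x^T Q x via the second-derivative
# test. f is convex iff its (constant) Hessian Q is positive semi-definite,
# i.e., all eigenvalues of Q are >= 0.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])  # example Hessian, chosen for illustration

eigenvalues = np.linalg.eigvalsh(Q)   # eigvalsh: eigenvalues of a symmetric matrix
is_convex = np.all(eigenvalues >= 0)
print(eigenvalues, is_convex)         # [~0.79, ~2.21] True -> f is convex
```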
II. Unconstrained Optimization.
Unconstrained optimization is the process of finding the minimum or maximum of an objective function without any restrictions on the values that the variables in the function can take.
In mathematical terms, the problem of unconstrained optimization can be stated as:
$\min_{x \in \mathbb{R}^n} f(x)$, where $f : \mathbb{R}^n \to \mathbb{R}$ is the objective function to be minimized.
1. Methods for Unconstrained Optimization
There are various methods to solve unconstrained optimization problems, depending on the properties of the objective function, such as whether it is differentiable, convex, or has a specific structure.
- Gradient Descent:
- Basic Idea: Move iteratively in the direction of the negative gradient of the function at the current point to find the minimum.
- Update Rule: $x_{k+1} = x_k - \alpha \nabla f(x_k)$, where $\alpha > 0$ is the step size (learning rate).
- Newton's Method:
- Basic Idea: Use second-order information (Hessian matrix) to make more informed updates compared to gradient descent.
- Update Rule: $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$, where $\nabla^2 f(x_k)$ is the Hessian matrix.
- Quasi-Newton Methods:
- Basic Idea: Approximate the Hessian matrix instead of computing it explicitly to save computational cost.
- Example: BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm.
- Conjugate Gradient Method:
- Basic Idea: Combine the current gradient with the previous direction to find the next search direction, improving convergence for large-scale problems.
- Trust-Region Methods:
- Basic Idea: Approximate the objective function within a region around the current point and choose the next point within this region.
- Derivative-Free Optimization:
- Basic Idea: Optimize functions for which derivatives are not available or are expensive to compute.
- Examples: Nelder-Mead simplex method, genetic algorithms.
Eg : Quadratic Function: $f(x) = \frac{1}{2} x^T Q x - b^T x$, where $Q$ is a positive definite matrix, $b$ is a vector, and $x$ is the variable vector.
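This quadratic makes a convenient test case for the methods listed above. A minimal sketch using SciPy's BFGS (a quasi-Newton method); the particular $Q$ and $b$ are arbitrary illustrative values:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: minimize the quadratic f(x) = 0.5 x^T Q x - b^T x with a
# quasi-Newton method (BFGS).
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # positive definite
b = np.array([1.0, 1.0])

f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b          # gradient of the quadratic

result = minimize(f, x0=np.zeros(2), jac=grad, method="BFGS")
print(result.x)                      # matches the exact solution Q^{-1} b
print(np.linalg.solve(Q, b))         # [0.2, 0.4]
```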
III. Newton's Method.
Given a twice-differentiable function $f : \mathbb{R}^n \to \mathbb{R}$, the goal is to find the point where $f$ reaches its minimum. Newton's method uses both the gradient (first derivative) and the Hessian (second derivative) of the function to find the optimal point.
Starting from an initial guess $x_0$, the method iteratively updates the guess using the following rule: $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$
Here:
- $x_k$ is the current point.
- $\nabla f(x_k)$ is the gradient of $f$ at $x_k$.
- $\nabla^2 f(x_k)$ is the Hessian matrix of $f$ at $x_k$.
- $[\nabla^2 f(x_k)]^{-1}$ is the inverse of the Hessian matrix.
Interpretation
- The gradient $\nabla f(x_k)$ gives the direction of steepest ascent.
- The Hessian provides information about the curvature of the function, which helps in adjusting the step size and direction more accurately than gradient descent.
Convergence
Newton's method converges quadratically if the function is well-behaved (i.e., if $f$ is twice continuously differentiable, and the Hessian is positive definite at the solution). Quadratic convergence means that the number of correct digits approximately doubles in each iteration, making Newton's method much faster than gradient descent near the optimum.
Steps of Newton's Method
- Initialization: Choose an initial guess $x_0$.
- Iteration: Update the current guess using the update rule $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$ until convergence.
- Convergence Check: Stop if $\|\nabla f(x_k)\|$ is less than a predefined threshold, indicating that a minimum has been reached.
Eg:
Consider the function $f(x) = x^2 - 4x + 5$. To find its minimum using Newton's method:
Gradient: $f'(x) = 2x - 4$
Hessian: $f''(x) = 2$
Update Rule: $x_{k+1} = x_k - \frac{2x_k - 4}{2} = 2$
Starting from any initial guess $x_0$, the method converges to $x^* = 2$ in one iteration since the function is quadratic and the Hessian is constant.
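A minimal Python sketch of this example (the function, starting point, and tolerance are the illustrative choices made above):

```python
# Sketch of Newton's method for f(x) = x^2 - 4x + 5 (the example above).
# grad f = 2x - 4, Hessian = 2 (constant), so one step reaches x* = 2.
def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)          # 1-D: Hessian inverse is 1/f''(x)
        x = x - step
        if abs(grad(x)) < tol:            # convergence check on the gradient
            break
    return x

x_star = newton(grad=lambda x: 2 * x - 4, hess=lambda x: 2.0, x0=10.0)
print(x_star)   # 2.0 after a single iteration
```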
IV. Gradient Descent and Its Variants.
Given a function $f : \mathbb{R}^n \to \mathbb{R}$, the goal is to find the minimum of $f$. Starting from an initial point $x_0$, the algorithm updates the point iteratively using the following rule: $x_{k+1} = x_k - \alpha \nabla f(x_k)$
Here:
- $x_k$ is the current point.
- $\alpha$ is the step size (also known as the learning rate).
- $\nabla f(x_k)$ is the gradient of $f$ at $x_k$.
Convergence
Gradient descent converges when the sequence of points $\{x_k\}$ approaches a minimum of the function $f$. The convergence rate depends on the choice of the step size $\alpha$. If $\alpha$ is too large, the algorithm may oscillate or diverge. If $\alpha$ is too small, convergence can be very slow.
Variants of Gradient Descent
Stochastic Gradient Descent (SGD):
- Instead of using the full gradient, SGD uses a randomly selected subset (mini-batch) of data points to compute the gradient.
- Update Rule: $x_{k+1} = x_k - \alpha \nabla f_i(x_k)$, where $\nabla f_i(x_k)$ is the gradient with respect to a randomly chosen data point $i$.
Mini-Batch Gradient Descent:
- A compromise between full gradient descent and SGD, using a small subset of data points (mini-batch) to compute the gradient.
- Update Rule: $x_{k+1} = x_k - \frac{\alpha}{m} \sum_{i \in B_k} \nabla f_i(x_k)$, where $B_k$ is the mini-batch and $m$ is the mini-batch size.
Momentum:
- Adds a fraction of the previous update to the current update to accelerate convergence, especially in the presence of high curvature or noisy gradients.
- Update Rule: $v_{k+1} = \gamma v_k + \alpha \nabla f(x_k)$, $x_{k+1} = x_k - v_{k+1}$, where $v$ is the velocity and $\gamma$ is the momentum parameter.
Nesterov Accelerated Gradient (NAG):
- A variant of momentum that looks ahead to the next position before computing the gradient.
- Update Rule: $v_{k+1} = \gamma v_k + \alpha \nabla f(x_k - \gamma v_k)$, $x_{k+1} = x_k - v_{k+1}$.
Adam (Adaptive Moment Estimation):
- Combines the ideas of momentum and RMSprop by maintaining moving averages of both the gradients and the squared gradients.
- Update Rule: $m_{k+1} = \beta_1 m_k + (1 - \beta_1) \nabla f(x_k)$, $v_{k+1} = \beta_2 v_k + (1 - \beta_2) (\nabla f(x_k))^2$, $\hat{m}_{k+1} = \frac{m_{k+1}}{1 - \beta_1^{k+1}}$, $\hat{v}_{k+1} = \frac{v_{k+1}}{1 - \beta_2^{k+1}}$, $x_{k+1} = x_k - \frac{\alpha \, \hat{m}_{k+1}}{\sqrt{\hat{v}_{k+1}} + \epsilon}$
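The update rules above can be compared side by side in a few lines of code. A minimal sketch contrasting plain gradient descent with Adam on an ill-conditioned quadratic (the objective and hyperparameters are arbitrary illustrative choices):

```python
import numpy as np

# Sketch: plain gradient descent vs. Adam on f(x) = x1^2 + 10*x2^2.
grad = lambda x: np.array([2 * x[0], 20 * x[1]])

def gradient_descent(x, alpha=0.05, steps=200):
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

def adam(x, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    m = np.zeros_like(x)   # first moment (running mean of gradients)
    v = np.zeros_like(x)   # second moment (running mean of squared gradients)
    for k in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**k)     # bias correction
        v_hat = v / (1 - beta2**k)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

x0 = np.array([5.0, 5.0])
print(gradient_descent(x0), adam(x0))   # both approach the minimum at (0, 0)
```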
Choosing the Right Variant
The choice of gradient descent variant depends on the specific problem, data size and computational resources. For large-scale machine learning problems, mini-batch gradient descent and adaptive methods like Adam are popular due to their efficiency and robustness.
V. Linear Programming and the Simplex Method.
1. Linear Programming
Linear programming is a mathematical method for determining the best outcome in a model whose requirements are represented by linear relationships.
A linear programming problem is typically formulated in standard form as follows: $\max_x \; c^T x$ subject to $Ax \le b$, $x \ge 0$
Here:
- $x$ is the vector of decision variables.
- $c$ is the vector of coefficients for the objective function.
- $A$ is the matrix of coefficients for the constraints.
- $b$ is the vector of constants on the right-hand side of the constraints.
2. Simplex Method
The simplex method is an iterative algorithm for solving linear programming problems.
It starts at a vertex (feasible solution) of the polytope defined by the constraints and moves along the edges of the polytope to find the optimal solution.
Steps of the Simplex Method
Initialization:
- Convert the linear programming problem to standard form (if it is not already).
- Identify a feasible starting vertex. If no obvious starting point exists, use the two-phase simplex method.
Iterative Process:
- Choose Entering Variable: Select a non-basic variable to enter the basis, which increases the objective function value for maximization (or decreases for minimization).
- Choose Leaving Variable: Determine which basic variable will leave the basis to maintain feasibility, ensuring all constraints are still satisfied.
- Pivot: Update the basis by pivoting, which involves row operations to maintain feasibility and improve the objective function value.
Termination:
- The algorithm terminates when no improving direction can be found, indicating that the current solution is optimal.
- If the feasible region is unbounded, the objective function can be improved indefinitely, leading to an unbounded solution.
- If no feasible solution exists, the problem is infeasible.
Eg:
Consider the following linear programming problem: maximize $z = 3x_1 + 2x_2$ subject to $x_1 + x_2 \le 4$, $x_1 + 3x_2 \le 6$, $x_1, x_2 \ge 0$.
Convert to Standard Form: maximize $z = 3x_1 + 2x_2$ subject to $x_1 + x_2 + s_1 = 4$, $x_1 + 3x_2 + s_2 = 6$, $x_1, x_2, s_1, s_2 \ge 0$.
Here, $s_1$ and $s_2$ are slack variables.
Initial Basic Feasible Solution:
- Set $x_1 = 0$, $x_2 = 0$, $s_1 = 4$, $s_2 = 6$.
Iterate Using Simplex Method:
- Choose entering variable: $x_1$ (largest coefficient in the objective function).
- Determine the leaving variable: $s_1$ (smallest ratio of right-hand side to the coefficient of the entering variable in each constraint: $4/1 < 6/1$).
Perform the pivot operation to update the tableau and continue the iterations until no further improvement is possible.
Optimal Solution:
- The iterations continue until the optimal solution is found. In this example, the optimal solution is $x_1 = 4$, $x_2 = 0$, with the maximum value of $z = 12$.
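The same example can be checked with an off-the-shelf solver. A minimal sketch using scipy.optimize.linprog, which minimizes by convention, so the objective coefficients are negated:

```python
from scipy.optimize import linprog

# Sketch: solve the example LP (maximize 3*x1 + 2*x2) by minimizing its negation.
c = [-3, -2]
A_ub = [[1, 1],
        [1, 3]]
b_ub = [4, 6]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)],
              method="highs")
print(res.x, -res.fun)   # optimal x = [4, 0], maximum value 12
```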
VI. Karush-Kuhn-Tucker (KKT) Conditions.
The Karush-Kuhn-Tucker (KKT) conditions are necessary conditions for a solution to be optimal in certain types of optimization problems. These conditions generalize the method of Lagrange multipliers to handle inequality constraints as well.
Consider the following nonlinear optimization problem: $\min_x f(x)$ subject to $g_i(x) \le 0$, $i = 1, \ldots, m$, and $h_j(x) = 0$, $j = 1, \ldots, p$.
Here:
- $f(x)$ is the objective function.
- $g_i(x)$ are the inequality constraint functions.
- $h_j(x)$ are the equality constraint functions.
- $x$ is the vector of decision variables.
The KKT conditions are:
Primal Feasibility: $g_i(x^*) \le 0$ for all $i$, and $h_j(x^*) = 0$ for all $j$
Dual Feasibility: $\mu_i \ge 0$ for all $i$
Stationarity: $\nabla f(x^*) + \sum_{i=1}^m \mu_i \nabla g_i(x^*) + \sum_{j=1}^p \lambda_j \nabla h_j(x^*) = 0$
Complementary Slackness: $\mu_i \, g_i(x^*) = 0$ for all $i$
Lagrange Multipliers:
- $\mu_i$ are the Lagrange multipliers associated with the inequality constraints.
- $\lambda_j$ are the Lagrange multipliers associated with the equality constraints.
Eg:
Consider the following problem: minimize $f(x) = x^2$ subject to $g(x) = 1 - x \le 0$ (i.e., $x \ge 1$).
Primal Feasibility: $1 - x \le 0$
Dual Feasibility: $\mu \ge 0$
Stationarity: $2x - \mu = 0$
Complementary Slackness: $\mu (1 - x) = 0$
To solve this, note that:
- From stationarity: $\mu = 2x$, thus $\mu > 0$ for any feasible $x$.
- From primal feasibility: $x \ge 1$.
Combining these, we get $\mu = 2x \ge 2 > 0$, hence complementary slackness forces $1 - x = 0$, i.e., $x = 1$.
Finally, check the remaining conditions: $\mu = 2 \ge 0$ satisfies dual feasibility, and $\mu (1 - x) = 2 \cdot 0 = 0$, so complementary slackness holds.
Thus, the optimal solution is $x^* = 1$, $\mu^* = 2$.
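A minimal sketch verifying this example numerically with SciPy's SLSQP solver (the starting point $x_0 = 5$ is an arbitrary feasible choice):

```python
from scipy.optimize import minimize

# Sketch: verify the KKT example (min x^2 subject to x >= 1) numerically.
# SciPy's 'ineq' convention is fun(x) >= 0, so the constraint is x - 1 >= 0.
res = minimize(lambda x: x[0] ** 2,
               x0=[5.0],
               constraints=[{"type": "ineq", "fun": lambda x: x[0] - 1}],
               method="SLSQP")
print(res.x)   # [1.0], matching the KKT solution x* = 1
```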
VII. Lagrange Multipliers.
Lagrange multipliers are a method used in calculus to find the local maxima and minima of a function subject to equality constraints.
Consider the optimization problem: $\min_x f(x)$ subject to $h_j(x) = 0$, $j = 1, \ldots, p$
where:
- $f(x)$ is the objective function.
- $h_j(x)$ are the equality constraint functions.
- $x$ is the vector of decision variables.
Lagrangian Function
To incorporate the constraints into the objective function, we define the Lagrangian function $\mathcal{L}(x, \lambda)$: $\mathcal{L}(x, \lambda) = f(x) + \sum_{j=1}^p \lambda_j h_j(x)$
Here:
- $\lambda_j$ are the Lagrange multipliers associated with each constraint $h_j(x) = 0$.
Lagrange Multiplier Method
The method of Lagrange multipliers states that to find the local maxima and minima of $f$ subject to the constraints $h_j(x) = 0$, we need to find the points where the gradient of $f$ is a linear combination of the gradients of the constraints. This is achieved by solving the system of equations given by:
Stationarity: $\nabla f(x) + \sum_{j=1}^p \lambda_j \nabla h_j(x) = 0$
Primal Feasibility: $h_j(x) = 0$ for all $j = 1, \ldots, p$
Eg:
Consider the problem of finding the minimum of $f(x, y) = x^2 + y^2$ subject to the constraint $x + y = 1$.
Formulate the Lagrangian: $\mathcal{L}(x, y, \lambda) = x^2 + y^2 + \lambda (x + y - 1)$
Set Up the System of Equations:
- Compute the partial derivatives and set them to zero: $\frac{\partial \mathcal{L}}{\partial x} = 2x + \lambda = 0$, $\frac{\partial \mathcal{L}}{\partial y} = 2y + \lambda = 0$, $\frac{\partial \mathcal{L}}{\partial \lambda} = x + y - 1 = 0$
Solve the System:
- From the first two equations: $x = -\lambda/2$ and $y = -\lambda/2$. Thus, $x = y$.
- Substitute into the constraint: $x + y = 1$ gives $2x = 1$, or $x = 1/2$.
- Hence, $y = 1/2$ and $\lambda = -1$.
So, the solution is $x = 1/2$, $y = 1/2$, and the minimum value of $f$ is $1/2$.
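A minimal sketch solving the same stationarity system symbolically with SymPy:

```python
import sympy as sp

# Sketch: solve the Lagrange system for min x^2 + y^2 s.t. x + y = 1.
x, y, lam = sp.symbols("x y lam", real=True)
L = x**2 + y**2 + lam * (x + y - 1)   # the Lagrangian

equations = [sp.diff(L, v) for v in (x, y, lam)]   # dL/dx, dL/dy, dL/dlam
solution = sp.solve(equations, [x, y, lam], dict=True)
print(solution)   # [{x: 1/2, y: 1/2, lam: -1}]
```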
VIII. Quadratic Programming (QP).
Quadratic programming is a type of mathematical optimization problem where the objective function is quadratic, and the constraints are linear.
A standard quadratic programming problem can be formulated as follows: $\min_x \; \frac{1}{2} x^T Q x + c^T x$ subject to $Ax \le b$, $x \ge 0$
Here:
- $x$ is the vector of decision variables.
- $Q$ is a symmetric positive semi-definite matrix, representing the quadratic part of the objective function.
- $c$ is the vector representing the linear part of the objective function.
- $A$ is the matrix of coefficients for the linear inequality constraints.
- $b$ is the vector of constants on the right-hand side of the constraints.
KKT Conditions for Quadratic Programming
The Karush-Kuhn-Tucker (KKT) conditions provide necessary and sufficient conditions for optimality in quadratic programming problems when $Q$ is positive semi-definite. For a quadratic programming problem in standard form, the KKT conditions are:
Primal Feasibility: $Ax \le b$, $x \ge 0$
Dual Feasibility: $\mu \ge 0$, $\nu \ge 0$
Stationarity: $Qx + c + A^T \mu - \nu = 0$
Complementary Slackness: $\mu^T (Ax - b) = 0$, $\nu^T x = 0$
Here:
- $\mu$ are the Lagrange multipliers for the inequality constraints $Ax \le b$.
- $\nu$ are the Lagrange multipliers for the non-negativity constraints $x \ge 0$.
Eg:
Consider the following quadratic programming problem: minimize $f(x) = x_1^2 + x_2^2$ subject to $x_1 + x_2 \ge 1$ and $x_1, x_2 \ge 0$.
Formulate the Matrices: $Q = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$, $c = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $A = \begin{pmatrix} -1 & -1 \end{pmatrix}$, $b = \begin{pmatrix} -1 \end{pmatrix}$ (writing $x_1 + x_2 \ge 1$ as $-x_1 - x_2 \le -1$).
Write the Lagrangian: $\mathcal{L}(x, \mu, \nu) = x_1^2 + x_2^2 + \mu (1 - x_1 - x_2) - \nu_1 x_1 - \nu_2 x_2$
Set Up the KKT Conditions: stationarity $2x_1 - \mu - \nu_1 = 0$ and $2x_2 - \mu - \nu_2 = 0$; primal feasibility $x_1 + x_2 \ge 1$, $x_1, x_2 \ge 0$; dual feasibility $\mu, \nu_1, \nu_2 \ge 0$; complementary slackness $\mu (1 - x_1 - x_2) = 0$, $\nu_1 x_1 = 0$, $\nu_2 x_2 = 0$.
Solve the System:
From the first two equations, we get: $\nu_1 = 2x_1 - \mu$ and $\nu_2 = 2x_2 - \mu$
Since $\nu_1 x_1 = 0$ and $\nu_2 x_2 = 0$, if $x_1 > 0$, then $\nu_1 = 0$, and if $x_2 > 0$, then $\nu_2 = 0$. Assume both $x_1$ and $x_2$ are positive; then $\nu_1 = \nu_2 = 0$.
Thus, we have: $\mu = 2x_1$ and $\mu = 2x_2$
Equating the two expressions for $\mu$, we get: $x_1 = x_2$
Substituting into the constraint $x_1 + x_2 = 1$ (active, since $\mu = 2x_1 > 0$ forces equality by complementary slackness): $x_1 = x_2 = 1/2$
Since $x_1 = x_2 = 1/2 > 0$ and $\mu = 1 \ge 0$, we conclude that all KKT conditions are satisfied. Therefore, the optimal solution is $x^* = (1/2, 1/2)^T$ with $\mu^* = 1$.
Optimal Solution:
The minimum value of $f$ is $f(x^*) = (1/2)^2 + (1/2)^2 = 1/2$.
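A minimal sketch confirming the worked solution: once the non-negativity multipliers are set to zero and the constraint is taken as active, the KKT conditions reduce to a linear system that NumPy can solve directly:

```python
import numpy as np

# Sketch: with nu1 = nu2 = 0 and the constraint x1 + x2 >= 1 active, the
# example's KKT conditions become the linear system
#   2*x1 - mu = 0,  2*x2 - mu = 0,  x1 + x2 = 1.
kkt = np.array([[2.0, 0.0, -1.0],
                [0.0, 2.0, -1.0],
                [1.0, 1.0,  0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x1, x2, mu = np.linalg.solve(kkt, rhs)
print(x1, x2, mu)   # 0.5 0.5 1.0, matching the worked solution
```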