A simplified guide to brushing up on Mathematics for Artificial Intelligence, Machine Learning and Data Science: Optimization (important pointers only)
Module IV : Optimization
I. Convex Sets and Convex Functions
1. Convex Sets
A convex set is a subset of a vector space such that, for any two points in the set, the line segment connecting them also lies entirely within the set.
A set $C$ in a vector space is convex if, for any $x, y \in C$ and any $\theta$ such that $0 \le \theta \le 1$, the point $\theta x + (1 - \theta) y$ is also in $C$.
Properties of Convex Sets:
- Intersection: The intersection of convex sets is also convex.
- Convex Hull: The smallest convex set containing a given set of points is called the convex hull of those points.
- Half-spaces and Hyperplanes: Half-spaces and hyperplanes are examples of convex sets.
Eg: Line segments, Circles, Polyhedrons
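The defining property is easy to test numerically. A minimal Python sketch (the unit disc and the sample count are arbitrary choices for illustration):

```python
import numpy as np

# Sketch: numerically illustrate convexity of the unit disc.
# For random points x, y in the disc and random theta in [0, 1],
# the convex combination theta*x + (1 - theta)*y must stay in the disc.
rng = np.random.default_rng(0)

def sample_disc():
    """Rejection-sample a point from the unit disc."""
    while True:
        p = rng.uniform(-1, 1, size=2)
        if np.linalg.norm(p) <= 1:
            return p

for _ in range(1000):
    x, y = sample_disc(), sample_disc()
    theta = rng.uniform()
    z = theta * x + (1 - theta) * y
    assert np.linalg.norm(z) <= 1 + 1e-12  # never fires: the disc is convex
```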
2. Convex Functions
A convex function is a function $f$ whose domain is a convex set and which satisfies the following condition: for any $x, y$ in the domain and any $\theta$ such that $0 \le \theta \le 1$, $f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$.
This means that the line segment connecting any two points on the graph of the function lies above or on the graph.
Properties of Convex Functions:
- Epigraph: The set of points lying on or above the graph of a convex function forms a convex set, known as the epigraph.
- Local and Global Minima: Any local minimum of a convex function is also a global minimum.
- Second Derivative Test: If a function is twice differentiable, it is convex if and only if its Hessian matrix $\nabla^2 f(x)$ is positive semi-definite for all $x$ in its domain.
- First Order Condition: A differentiable function is convex if and only if $f(y) \ge f(x) + \nabla f(x)^T (y - x)$ for all $x, y$ in its domain.
Eg : Linear functions, exponential functions, quadratic functions with positive semidefinite Hessian
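The second-derivative test translates directly into a numerical check. A minimal sketch, assuming a hand-picked symmetric matrix $Q$ as the constant Hessian of a quadratic:

```python
import numpy as np

# Sketch: check convexity of f(x) = 0.5 * x^T Q x via the second-derivative
# test. f is convex iff its (constant) Hessian Q is positive semi-definite,
# i.e., all eigenvalues of Q are >= 0.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])  # example Hessian, chosen for illustration

eigenvalues = np.linalg.eigvalsh(Q)   # eigvalsh: eigenvalues of a symmetric matrix
is_convex = np.all(eigenvalues >= 0)
print(eigenvalues, is_convex)         # [~0.79, ~2.21] True -> f is convex
```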
II. Unconstrained Optimization.
Unconstrained optimization is the process of finding the minimum or maximum of an objective function without any restrictions on the values that the variables in the function can take.
In mathematical terms, the problem of unconstrained optimization can be stated as:
$\min_{x \in \mathbb{R}^n} f(x)$, where $f : \mathbb{R}^n \to \mathbb{R}$ is the objective function to be minimized.
1. Methods for Unconstrained Optimization
There are various methods to solve unconstrained optimization problems, depending on the properties of the objective function, such as whether it is differentiable, convex, or has a specific structure.
- Gradient Descent:
- Basic Idea: Move iteratively in the direction of the negative gradient of the function at the current point to find the minimum.
- Update Rule: $x_{k+1} = x_k - \alpha \nabla f(x_k)$, where $\alpha > 0$ is the step size (learning rate).
- Newton's Method:
- Basic Idea: Use second-order information (Hessian matrix) to make more informed updates compared to gradient descent.
- Update Rule: $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$, where $\nabla^2 f(x_k)$ is the Hessian matrix.
- Quasi-Newton Methods:
- Basic Idea: Approximate the Hessian matrix instead of computing it explicitly to save computational cost.
- Example: BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm.
- Conjugate Gradient Method:
- Basic Idea: Combine the current gradient with the previous direction to find the next search direction, improving convergence for large-scale problems.
- Trust-Region Methods:
- Basic Idea: Approximate the objective function within a region around the current point and choose the next point within this region.
- Derivative-Free Optimization:
- Basic Idea: Optimize functions for which derivatives are not available or are expensive to compute.
- Examples: Nelder-Mead simplex method, genetic algorithms.
Eg : Quadratic Function: $f(x) = \frac{1}{2} x^T Q x - b^T x$, where $Q$ is a positive definite matrix, $b$ is a vector, and $x$ is the variable vector.
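This quadratic makes a convenient test case for the methods listed above. A minimal sketch using SciPy's BFGS (a quasi-Newton method); the particular $Q$ and $b$ are arbitrary illustrative values:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: minimize the quadratic f(x) = 0.5 x^T Q x - b^T x with a
# quasi-Newton method (BFGS).
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # positive definite
b = np.array([1.0, 1.0])

f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b          # gradient of the quadratic

result = minimize(f, x0=np.zeros(2), jac=grad, method="BFGS")
print(result.x)                      # matches the exact solution Q^{-1} b
print(np.linalg.solve(Q, b))         # [0.2, 0.4]
```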
III. Newton's Method.
Given a twice-differentiable function $f : \mathbb{R}^n \to \mathbb{R}$, the goal is to find the point where $f$ reaches its minimum. Newton's method uses both the gradient (first derivative) and the Hessian (second derivative) of the function to find the optimal point.
Starting from an initial guess $x_0$, the method iteratively updates the guess using the following rule: $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$
Here:
- $x_k$ is the current point.
- $\nabla f(x_k)$ is the gradient of $f$ at $x_k$.
- $\nabla^2 f(x_k)$ is the Hessian matrix of $f$ at $x_k$.
- $[\nabla^2 f(x_k)]^{-1}$ is the inverse of the Hessian matrix.
Interpretation
- The gradient $\nabla f(x_k)$ gives the direction of steepest ascent.
- The Hessian provides information about the curvature of the function, which helps in adjusting the step size and direction more accurately than gradient descent.
Convergence
Newton's method converges quadratically if the function is well-behaved (i.e., if $f$ is twice continuously differentiable, and the Hessian is positive definite at the solution). Quadratic convergence means that the number of correct digits approximately doubles in each iteration, making Newton's method much faster than gradient descent near the optimum.
Steps of Newton's Method
- Initialization: Choose an initial guess $x_0$.
- Iteration: Update the current guess using the update rule $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$ until convergence.
- Convergence Check: Stop if $\|\nabla f(x_k)\|$ is less than a predefined threshold, indicating that a minimum has been reached.
Eg:
Consider the function $f(x) = x^2 - 4x + 5$. To find its minimum using Newton's method:
Gradient: $f'(x) = 2x - 4$
Hessian: $f''(x) = 2$
Update Rule: $x_{k+1} = x_k - \frac{2x_k - 4}{2} = 2$
Starting from any initial guess $x_0$, the method converges to $x^* = 2$ in one iteration since the function is quadratic and the Hessian is constant.
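A minimal Python sketch of this example (the function, starting point, and tolerance are the illustrative choices made above):

```python
# Sketch of Newton's method for f(x) = x^2 - 4x + 5 (the example above).
# grad f = 2x - 4, Hessian = 2 (constant), so one step reaches x* = 2.
def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)          # 1-D: Hessian inverse is 1/f''(x)
        x = x - step
        if abs(grad(x)) < tol:            # convergence check on the gradient
            break
    return x

x_star = newton(grad=lambda x: 2 * x - 4, hess=lambda x: 2.0, x0=10.0)
print(x_star)   # 2.0 after a single iteration
```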
IV. Gradient Descent and Its Variants.
Given a function $f : \mathbb{R}^n \to \mathbb{R}$, the goal is to find the minimum of $f$. Starting from an initial point $x_0$, the algorithm updates the point iteratively using the following rule: $x_{k+1} = x_k - \alpha \nabla f(x_k)$
Here:
- $x_k$ is the current point.
- $\alpha$ is the step size (also known as the learning rate).
- $\nabla f(x_k)$ is the gradient of $f$ at $x_k$.
Convergence
Gradient descent converges when the sequence of points $\{x_k\}$ approaches a minimum of the function $f$. The convergence rate depends on the choice of the step size $\alpha$. If $\alpha$ is too large, the algorithm may oscillate or diverge. If $\alpha$ is too small, convergence can be very slow.
Variants of Gradient Descent
Stochastic Gradient Descent (SGD):
- Instead of using the full gradient, SGD uses a randomly selected subset (mini-batch) of data points to compute the gradient.
- Update Rule: $x_{k+1} = x_k - \alpha \nabla f_i(x_k)$, where $\nabla f_i(x_k)$ is the gradient with respect to a randomly chosen data point $i$.
Mini-Batch Gradient Descent:
- A compromise between full gradient descent and SGD, using a small subset of data points (mini-batch) to compute the gradient.
- Update Rule: $x_{k+1} = x_k - \frac{\alpha}{m} \sum_{i \in B_k} \nabla f_i(x_k)$, where $B_k$ is the mini-batch and $m$ is the mini-batch size.
Momentum:
- Adds a fraction of the previous update to the current update to accelerate convergence, especially in the presence of high curvature or noisy gradients.
- Update Rule: $v_{k+1} = \gamma v_k + \alpha \nabla f(x_k)$, $x_{k+1} = x_k - v_{k+1}$, where $v$ is the velocity and $\gamma$ is the momentum parameter.
Nesterov Accelerated Gradient (NAG):
- A variant of momentum that looks ahead to the next position before computing the gradient.
- Update Rule: $v_{k+1} = \gamma v_k + \alpha \nabla f(x_k - \gamma v_k)$, $x_{k+1} = x_k - v_{k+1}$.
Adam (Adaptive Moment Estimation):
- Combines the ideas of momentum and RMSprop by maintaining moving averages of both the gradients and the squared gradients.
- Update Rule: $m_{k+1} = \beta_1 m_k + (1 - \beta_1) \nabla f(x_k)$, $v_{k+1} = \beta_2 v_k + (1 - \beta_2) (\nabla f(x_k))^2$, $\hat{m}_{k+1} = \frac{m_{k+1}}{1 - \beta_1^{k+1}}$, $\hat{v}_{k+1} = \frac{v_{k+1}}{1 - \beta_2^{k+1}}$, $x_{k+1} = x_k - \frac{\alpha \, \hat{m}_{k+1}}{\sqrt{\hat{v}_{k+1}} + \epsilon}$
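The update rules above can be compared side by side in a few lines of code. A minimal sketch contrasting plain gradient descent with Adam on an ill-conditioned quadratic (the objective and hyperparameters are arbitrary illustrative choices):

```python
import numpy as np

# Sketch: plain gradient descent vs. Adam on f(x) = x1^2 + 10*x2^2.
grad = lambda x: np.array([2 * x[0], 20 * x[1]])

def gradient_descent(x, alpha=0.05, steps=200):
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

def adam(x, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    m = np.zeros_like(x)   # first moment (running mean of gradients)
    v = np.zeros_like(x)   # second moment (running mean of squared gradients)
    for k in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**k)     # bias correction
        v_hat = v / (1 - beta2**k)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

x0 = np.array([5.0, 5.0])
print(gradient_descent(x0), adam(x0))   # both approach the minimum at (0, 0)
```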
Choosing the Right Variant
The choice of gradient descent variant depends on the specific problem, data size and computational resources. For large-scale machine learning problems, mini-batch gradient descent and adaptive methods like Adam are popular due to their efficiency and robustness.
V. Linear Programming and the Simplex Method.
1. Linear Programming
Linear programming is a mathematical method for determining the best outcome in a model whose requirements are represented by linear relationships.
A linear programming problem is typically formulated in standard form as follows: $\max_x \; c^T x$ subject to $Ax \le b$, $x \ge 0$
Here:
- $x$ is the vector of decision variables.
- $c$ is the vector of coefficients for the objective function.
- $A$ is the matrix of coefficients for the constraints.
- $b$ is the vector of constants on the right-hand side of the constraints.
2. Simplex Method
The simplex method is an iterative algorithm for solving linear programming problems.
It starts at a vertex (feasible solution) of the polytope defined by the constraints and moves along the edges of the polytope to find the optimal solution.
Steps of the Simplex Method
Initialization:
- Convert the linear programming problem to standard form (if it is not already).
- Identify a feasible starting vertex. If no obvious starting point exists, use the two-phase simplex method.
Iterative Process:
- Choose Entering Variable: Select a non-basic variable to enter the basis, which increases the objective function value for maximization (or decreases for minimization).
- Choose Leaving Variable: Determine which basic variable will leave the basis to maintain feasibility, ensuring all constraints are still satisfied.
- Pivot: Update the basis by pivoting, which involves row operations to maintain feasibility and improve the objective function value.
Termination:
- The algorithm terminates when no improving direction can be found, indicating that the current solution is optimal.
- If the feasible region is unbounded, the objective function can be improved indefinitely, leading to an unbounded solution.
- If no feasible solution exists, the problem is infeasible.
Eg:
Consider the following linear programming problem: maximize $z = 3x_1 + 2x_2$ subject to $x_1 + x_2 \le 4$, $x_1 + 3x_2 \le 6$, $x_1, x_2 \ge 0$.
Convert to Standard Form: maximize $z = 3x_1 + 2x_2$ subject to $x_1 + x_2 + s_1 = 4$, $x_1 + 3x_2 + s_2 = 6$, $x_1, x_2, s_1, s_2 \ge 0$.
Here, $s_1$ and $s_2$ are slack variables.
Initial Basic Feasible Solution:
- Set $x_1 = 0$, $x_2 = 0$, $s_1 = 4$, $s_2 = 6$.
Iterate Using Simplex Method:
- Choose entering variable: $x_1$ (largest coefficient in the objective function).
- Determine the leaving variable: $s_1$ (smallest ratio of right-hand side to the coefficient of the entering variable in each constraint: $4/1 < 6/1$).
Perform the pivot operation to update the tableau and continue the iterations until no further improvement is possible.
Optimal Solution:
- The iterations continue until the optimal solution is found. In this example, the optimal solution is $x_1 = 4$, $x_2 = 0$, with the maximum value of $z = 12$.
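The same example can be checked with an off-the-shelf solver. A minimal sketch using scipy.optimize.linprog, which minimizes by convention, so the objective coefficients are negated:

```python
from scipy.optimize import linprog

# Sketch: solve the example LP (maximize 3*x1 + 2*x2) by minimizing its negation.
c = [-3, -2]
A_ub = [[1, 1],
        [1, 3]]
b_ub = [4, 6]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)],
              method="highs")
print(res.x, -res.fun)   # optimal x = [4, 0], maximum value 12
```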
VI. Karush-Kuhn-Tucker (KKT) Conditions.
The Karush-Kuhn-Tucker (KKT) conditions are necessary conditions for a solution to be optimal in certain types of optimization problems. These conditions generalize the method of Lagrange multipliers to handle inequality constraints as well.
Consider the following nonlinear optimization problem: $\min_x f(x)$ subject to $g_i(x) \le 0$, $i = 1, \ldots, m$, and $h_j(x) = 0$, $j = 1, \ldots, p$.
Here:
- $f(x)$ is the objective function.
- $g_i(x)$ are the inequality constraint functions.
- $h_j(x)$ are the equality constraint functions.
- $x$ is the vector of decision variables.
The KKT conditions are:
Primal Feasibility: $g_i(x^*) \le 0$ for all $i$, and $h_j(x^*) = 0$ for all $j$
Dual Feasibility: $\mu_i \ge 0$ for all $i$
Stationarity: $\nabla f(x^*) + \sum_{i=1}^m \mu_i \nabla g_i(x^*) + \sum_{j=1}^p \lambda_j \nabla h_j(x^*) = 0$
Complementary Slackness: $\mu_i \, g_i(x^*) = 0$ for all $i$
Lagrange Multipliers:
- $\mu_i$ are the Lagrange multipliers associated with the inequality constraints.
- $\lambda_j$ are the Lagrange multipliers associated with the equality constraints.
Eg:
Consider the following problem: minimize $f(x) = x^2$ subject to $g(x) = 1 - x \le 0$ (i.e., $x \ge 1$).
Primal Feasibility: $1 - x \le 0$
Dual Feasibility: $\mu \ge 0$
Stationarity: $2x - \mu = 0$
Complementary Slackness: $\mu (1 - x) = 0$
To solve this, note that:
- From stationarity: $\mu = 2x$, thus $\mu > 0$ for any feasible $x$.
- From primal feasibility: $x \ge 1$.
Combining these, we get $\mu = 2x \ge 2 > 0$, hence complementary slackness forces $1 - x = 0$, i.e., $x = 1$.
Finally, check the remaining conditions: $\mu = 2 \ge 0$ satisfies dual feasibility, and $\mu (1 - x) = 2 \cdot 0 = 0$, so complementary slackness holds.
Thus, the optimal solution is $x^* = 1$, $\mu^* = 2$.
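A minimal sketch verifying this example numerically with SciPy's SLSQP solver (the starting point $x_0 = 5$ is an arbitrary feasible choice):

```python
from scipy.optimize import minimize

# Sketch: verify the KKT example (min x^2 subject to x >= 1) numerically.
# SciPy's 'ineq' convention is fun(x) >= 0, so the constraint is x - 1 >= 0.
res = minimize(lambda x: x[0] ** 2,
               x0=[5.0],
               constraints=[{"type": "ineq", "fun": lambda x: x[0] - 1}],
               method="SLSQP")
print(res.x)   # [1.0], matching the KKT solution x* = 1
```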
VII. Lagrange Multipliers.
Lagrange multipliers are a method used in calculus to find the local maxima and minima of a function subject to equality constraints.
Consider the optimization problem: $\min_x f(x)$ subject to $h_j(x) = 0$, $j = 1, \ldots, p$
where:
- $f(x)$ is the objective function.
- $h_j(x)$ are the equality constraint functions.
- $x$ is the vector of decision variables.
Lagrangian Function
To incorporate the constraints into the objective function, we define the Lagrangian function $\mathcal{L}(x, \lambda)$: $\mathcal{L}(x, \lambda) = f(x) + \sum_{j=1}^p \lambda_j h_j(x)$
Here:
- $\lambda_j$ are the Lagrange multipliers associated with each constraint $h_j(x) = 0$.
Lagrange Multiplier Method
The method of Lagrange multipliers states that to find the local maxima and minima of $f$ subject to the constraints $h_j(x) = 0$, we need to find the points where the gradient of $f$ is a linear combination of the gradients of the constraints. This is achieved by solving the system of equations given by:
Stationarity: $\nabla f(x) + \sum_{j=1}^p \lambda_j \nabla h_j(x) = 0$
Primal Feasibility: $h_j(x) = 0$ for all $j = 1, \ldots, p$
Eg:
Consider the problem of finding the minimum of $f(x, y) = x^2 + y^2$ subject to the constraint $x + y = 1$.
Formulate the Lagrangian: $\mathcal{L}(x, y, \lambda) = x^2 + y^2 + \lambda (x + y - 1)$
Set Up the System of Equations:
- Compute the partial derivatives and set them to zero: $\frac{\partial \mathcal{L}}{\partial x} = 2x + \lambda = 0$, $\frac{\partial \mathcal{L}}{\partial y} = 2y + \lambda = 0$, $\frac{\partial \mathcal{L}}{\partial \lambda} = x + y - 1 = 0$
Solve the System:
- From the first two equations: $x = -\lambda/2$ and $y = -\lambda/2$. Thus, $x = y$.
- Substitute into the constraint: $x + y = 1$ gives $2x = 1$, or $x = 1/2$.
- Hence, $y = 1/2$ and $\lambda = -1$.
So, the solution is $x = 1/2$, $y = 1/2$, and the minimum value of $f$ is $1/2$.
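A minimal sketch solving the same stationarity system symbolically with SymPy:

```python
import sympy as sp

# Sketch: solve the Lagrange system for min x^2 + y^2 s.t. x + y = 1.
x, y, lam = sp.symbols("x y lam", real=True)
L = x**2 + y**2 + lam * (x + y - 1)   # the Lagrangian

equations = [sp.diff(L, v) for v in (x, y, lam)]   # dL/dx, dL/dy, dL/dlam
solution = sp.solve(equations, [x, y, lam], dict=True)
print(solution)   # [{x: 1/2, y: 1/2, lam: -1}]
```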
VIII. Quadratic Programming (QP).
Quadratic programming is a type of mathematical optimization problem where the objective function is quadratic, and the constraints are linear.
A standard quadratic programming problem can be formulated as follows: $\min_x \; \frac{1}{2} x^T Q x + c^T x$ subject to $Ax \le b$, $x \ge 0$
Here:
- $x$ is the vector of decision variables.
- $Q$ is a symmetric positive semi-definite matrix, representing the quadratic part of the objective function.
- $c$ is the vector representing the linear part of the objective function.
- $A$ is the matrix of coefficients for the linear inequality constraints.
- $b$ is the vector of constants on the right-hand side of the constraints.
KKT Conditions for Quadratic Programming
The Karush-Kuhn-Tucker (KKT) conditions provide necessary and sufficient conditions for optimality in quadratic programming problems when $Q$ is positive semi-definite. For a quadratic programming problem in standard form, the KKT conditions are:
Primal Feasibility: $Ax \le b$, $x \ge 0$
Dual Feasibility: $\mu \ge 0$, $\nu \ge 0$
Stationarity: $Qx + c + A^T \mu - \nu = 0$
Complementary Slackness: $\mu^T (Ax - b) = 0$, $\nu^T x = 0$
Here:
- $\mu$ are the Lagrange multipliers for the inequality constraints $Ax \le b$.
- $\nu$ are the Lagrange multipliers for the non-negativity constraints $x \ge 0$.
Eg:
Consider the following quadratic programming problem: minimize $f(x) = x_1^2 + x_2^2$ subject to $x_1 + x_2 \ge 1$ and $x_1, x_2 \ge 0$.
Formulate the Matrices: $Q = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$, $c = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $A = \begin{pmatrix} -1 & -1 \end{pmatrix}$, $b = \begin{pmatrix} -1 \end{pmatrix}$ (writing $x_1 + x_2 \ge 1$ as $-x_1 - x_2 \le -1$).
Write the Lagrangian: $\mathcal{L}(x, \mu, \nu) = x_1^2 + x_2^2 + \mu (1 - x_1 - x_2) - \nu_1 x_1 - \nu_2 x_2$
Set Up the KKT Conditions: stationarity $2x_1 - \mu - \nu_1 = 0$ and $2x_2 - \mu - \nu_2 = 0$; primal feasibility $x_1 + x_2 \ge 1$, $x_1, x_2 \ge 0$; dual feasibility $\mu, \nu_1, \nu_2 \ge 0$; complementary slackness $\mu (1 - x_1 - x_2) = 0$, $\nu_1 x_1 = 0$, $\nu_2 x_2 = 0$.
Solve the System:
From the first two equations, we get: $\nu_1 = 2x_1 - \mu$ and $\nu_2 = 2x_2 - \mu$
Since $\nu_1 x_1 = 0$ and $\nu_2 x_2 = 0$, if $x_1 > 0$, then $\nu_1 = 0$, and if $x_2 > 0$, then $\nu_2 = 0$. Assume both $x_1$ and $x_2$ are positive; then $\nu_1 = \nu_2 = 0$.
Thus, we have: $\mu = 2x_1$ and $\mu = 2x_2$
Equating the two expressions for $\mu$, we get: $x_1 = x_2$
Substituting into the constraint $x_1 + x_2 = 1$ (active, since $\mu = 2x_1 > 0$ forces equality by complementary slackness): $x_1 = x_2 = 1/2$
Since $x_1 = x_2 = 1/2 > 0$ and $\mu = 1 \ge 0$, we conclude that all KKT conditions are satisfied. Therefore, the optimal solution is $x^* = (1/2, 1/2)^T$ with $\mu^* = 1$.
Optimal Solution:
The minimum value of $f$ is $f(x^*) = (1/2)^2 + (1/2)^2 = 1/2$.
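A minimal sketch confirming the worked solution: once the non-negativity multipliers are set to zero and the constraint is taken as active, the KKT conditions reduce to a linear system that NumPy can solve directly:

```python
import numpy as np

# Sketch: with nu1 = nu2 = 0 and the constraint x1 + x2 >= 1 active, the
# example's KKT conditions become the linear system
#   2*x1 - mu = 0,  2*x2 - mu = 0,  x1 + x2 = 1.
kkt = np.array([[2.0, 0.0, -1.0],
                [0.0, 2.0, -1.0],
                [1.0, 1.0,  0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x1, x2, mu = np.linalg.solve(kkt, rhs)
print(x1, x2, mu)   # 0.5 0.5 1.0, matching the worked solution
```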