Optimization or mathematical programming considers the problem $$ \begin{array}{ll} \text{minimize} & f(\mathbf{x}) \\ \text{subject to} & \mathbf{x} \in C \end{array} $$
Possible confusion:
Why is optimization important in statistics?
Global optimization
Local optimization
Our major goal (or learning objectives) is to
Problem $$ \begin{array}{ll} \text{minimize} & f(\mathbf{x}) \\ \text{subject to} & \mathbf{x} \in C \end{array} $$
More familiar formulation: take $V=\mathbb{R}^d$, $$ \begin{array}{ll} \text{minimize} & f_0(\mathbf{x}) \\ \text{subject to} & f_i(\mathbf{x}) \le b_i, \quad i=1, 2, \dotsc, m, \end{array} $$ where $f_i: \mathbb{R}^d \to \mathbb{R}$ are convex functions.
Why do we care about convex optimization?
Fact. Any local solution of a convex optimization problem is a global solution.
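A minimal numerical illustration of this fact (assuming NumPy and SciPy, which are not part of these notes): running a generic local optimizer on a convex least-squares objective from several random starting points returns essentially the same solution every time.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative convex objective (an arbitrary choice): f(x) = ||Ax - b||_2^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
f = lambda x: np.sum((A @ x - b) ** 2)

# A local optimizer started from several random points; since the problem is
# convex, every local solution is global, so all runs agree up to tolerance.
solutions = [minimize(f, rng.standard_normal(5)).x for _ in range(5)]
print(max(np.linalg.norm(s - solutions[0]) for s in solutions))  # ~ 0
```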
Role of convex optimization
The underlying space is a vector space $V$ unless otherwise stated.
Line segments: for given $\mathbf{x}$, $\mathbf{y}$, $$ \{\alpha \mathbf{x} + (1-\alpha)\mathbf{y}: 0 \le \alpha \le 1\}. $$
A set $C$ is convex if for every pair of points $\mathbf{x}$ and $\mathbf{y}$ lying in $C$ the entire line segment connecting them also lies in $C$.
Examples
Norm ball: $B_r(\mathbf{c})=\{\mathbf{x}: \|\mathbf{x}-\mathbf{c}\| \le r\}$ for any proper norm $\|\cdot\|$.
$\ell_p$ "norm" $\|\mathbf{x}\|_p = (\sum_{i=1}^d |x_i|^p)^{1/p}$, $0<p<1$, in $\mathbb{R}^d$ is not a proper norm.
Hyperplane: $\{\mathbf{x}: \langle \mathbf{a}, \mathbf{x} \rangle = c \}$.
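A quick numerical check of the non-convexity of the $\ell_{1/2}$ "ball" mentioned above (the two test points are ad hoc choices): the midpoint of two points on the unit "ball" falls outside it.

```python
import numpy as np

def lp_quasinorm(x, p):
    # (sum_i |x_i|^p)^(1/p); not a proper norm when 0 < p < 1
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

p = 0.5
x = np.array([1.0, 0.0])     # ||x||_p = 1, on the unit "ball"
y = np.array([0.0, 1.0])     # ||y||_p = 1, on the unit "ball"
mid = 0.5 * x + 0.5 * y      # a convex combination of x and y
print(lp_quasinorm(mid, p))  # 2.0 > 1: the midpoint leaves the "ball"
```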
A set $C$ is a cone if for each $\mathbf{x}$ in $C$, the set $\{\alpha\mathbf{x}: \alpha > 0\} \subset C$.
A cone $C$ is a convex cone if it is also convex. Equivalently, a set $C$ is a convex cone if for any $\mathbf{x}, \mathbf{y}\in C$, $\{\alpha\mathbf{x} + \beta\mathbf{y}: \alpha, \beta > 0\} \subset C$.
Examples of convex cone
Second-order cone (Lorentz cone): $\{(\mathbf{x}, t): \|\mathbf{x}\|_2 \le t \}$.
Norm cone: $\{(\mathbf{x}, t): \|\mathbf{x}\| \le t \}$, where $\|\cdot\|$ is any norm.
Example of a nonconvex cone?
The intersection of an arbitrary collection of convex, affine, or conical sets is, respectively, convex, affine, or conical.
Convex combination: $\{\sum_{i=1}^m \alpha_i \mathbf{x}_i: \alpha_i \ge 0, ~\sum_{i=1}^m\alpha_i = 1\}$
Convex sets are closed under convex combination: any convex combination of points from a convex set $C$ belongs to $C$
Conic combination and affine combination are defined similarly; analogous closure properties also hold.
Convex hull: the convex hull of a nonempty set $C$ is the set of all convex combinations of points in $C$:
$$
\text{conv}C = \{\sum_{i=1}^m \alpha_i \mathbf{x}_i: \mathbf{x}_i\in C,~ \alpha_i \ge 0,~ i=1,\dotsc,m,~ \sum_{i=1}^m\alpha_i = 1, \text{ for some } m \}
$$
which is the smallest convex set containing $C$.
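For a finite point set in $\mathbb{R}^2$ the convex hull is a polytope and can be computed directly; a small sketch using scipy.spatial.ConvexHull (the point cloud is arbitrary):

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
pts = rng.standard_normal((30, 2))   # an arbitrary finite set C in R^2
hull = ConvexHull(pts)               # conv C is the polytope spanned by the extreme points
print(hull.vertices)                 # indices of the points that are extreme points of conv C
print(hull.volume)                   # area of conv C (the 2-D "volume")
```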
Conic hull $\text{cone}C$ drops the sum-to-one condition, and affine hull $\text{aff}C$ drops the nonnegativity conditions on the $\alpha_i$'s.
Affine dimension: $\text{dim}(C) \triangleq \text{dim}(\text{aff}C)$.
Example: simplex $$ S = \text{conv}\{\mathbf{v}_0, \mathbf{v}_1, \dotsc, \mathbf{v}_k\} $$ when $\mathbf{v}_0, \mathbf{v}_1, \dotsc, \mathbf{v}_k$ are affinely independent, i.e., $\mathbf{v}_1-\mathbf{v}_0, \dotsc, \mathbf{v}_k-\mathbf{v}_0$ are linearly independent.
Most constraint sets in optimization do not have an interior, e.g., the probability simplex. It is useful to define the interior relative to the affine hull: $$ \text{relint}C = \{\mathbf{x}\in C: (\exists r>0)~ B(\mathbf{x}, r) \cap \text{aff}C \subset C \} $$
Recall that a real-valued function $f$ is convex if $$ f(\alpha\mathbf{x} + (1-\alpha)\mathbf{y}) \le \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y}), \quad \forall\mathbf{x}, \mathbf{y} \in \text{dom}f, \forall\alpha \in [0, 1], $$ where $\text{dom}f$ is a convex set.
Extended-value functions: it is often convenient to extend the domain of $f$ to the whole $V$ and allow it to take the value $\infty$. Then $f:V \to \mathbb{R}\cup\{\infty\}$ and $$ \text{dom} f = \{\mathbf{x}: f(\mathbf{x}) < \infty \} $$ is the essential domain of $f$.
This extension allows us to consider the indicator function of a set: $$ \iota_C(\mathbf{x}) = \begin{cases} 0 & \mathbf{x} \in C \\ \infty & \mathbf{x} \notin C \end{cases} $$ so that a constrained optimization problem is converted to an unconstrained problem: $$ \min_{\mathbf{x}\in C} f(\mathbf{x}) = \min_{\mathbf{x}} f(\mathbf{x}) + \iota_C(\mathbf{x}) $$
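A minimal one-dimensional sketch of this conversion (the objective and constraint set are arbitrary illustrations; the minimization is done by brute force on a grid): the constrained minimizer of $f$ over $C$ and the unconstrained minimizer of $f + \iota_C$ coincide.

```python
import numpy as np

f = lambda x: (x - 2.0) ** 2            # illustrative convex objective
in_C = lambda x: np.abs(x) <= 1.0       # C = [-1, 1]

xs = np.linspace(-3.0, 3.0, 10001)
iota = np.where(in_C(xs), 0.0, np.inf)  # indicator function of C

constrained = xs[in_C(xs)][np.argmin(f(xs[in_C(xs)]))]  # min over C only
unconstrained = xs[np.argmin(f(xs) + iota)]             # min of f + iota_C over all x
print(constrained, unconstrained)                       # both ~ 1.0
```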
Properness: a function $f$ is proper if $\text{dom}f \neq \emptyset$ and $f(\mathbf{x}) > -\infty$ for all $\mathbf{x}$.
Examples
If $f$ is differentiable (i.e., its gradient $\nabla f$ exists at each point in $\text{dom}f$, which is open), then $f$ is convex if and only if $\text{dom} f$ is convex and $$ f(\mathbf{y}) \ge f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x} \rangle $$ for all $\mathbf{x}, \mathbf{y} \in \text{dom}f$.
$f$ is strictly convex if and only if strict inequality holds for all $\mathbf{y} \neq \mathbf{x}$.
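A numerical spot-check of the first-order condition (the choice $f(\mathbf{x}) = \log\sum_i e^{x_i}$ and its gradient are illustrative, not from the notes): at random pairs of points, $f$ lies above its tangent plane.

```python
import numpy as np

def f(x):
    return np.log(np.sum(np.exp(x)))    # log-sum-exp, a convex function

def grad_f(x):
    e = np.exp(x)
    return e / np.sum(e)                # its gradient (the softmax vector)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    # check f(y) >= f(x) + <grad f(x), y - x>
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-12
print("no violations found")
```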
The epigraph of a function $f$ is the set $$ \text{epi}f = \{(\mathbf{x}, t): \mathbf{x}\in\text{dom}f,~ f(\mathbf{x}) \le t \}. $$
A function $f$ is convex if and only if $\text{epi}f$ is convex.
If $(\mathbf{y}, t)\in\text{epi}f$, then from the supporting hyperplane inequality, $$ t \ge f(\mathbf{y}) \ge f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle $$ or $$ \left\langle (\nabla f(\mathbf{x}), -1), (\mathbf{y}, t) - (\mathbf{x}, f(\mathbf{x})) \right\rangle \le 0. $$ This means that the hyperplane defined by $(\nabla f(\mathbf{x}), -1)$ supports $\text{epi}f$ at the boundary point $(\mathbf{x}, f(\mathbf{x}))$.
An extended-value function $f$ is called closed if $\text{epi}f$ is closed.
$\alpha$-sublevel set of an extended-value function $f$ is $$ S_{\alpha} = \{\mathbf{x}\in\text{dom}f : f(\mathbf{x}) \le \alpha \}. $$
If $f$ is convex, then $S_{\alpha}$ is convex for all $\alpha$.
Further, if $f$ is closed (in particular, if $f$ is continuous and $\text{dom}f$ is closed), then all sublevel sets are closed.
(Nonnegative weighted sums) If $f$ and $g$ are convex and $\alpha$ and $\beta$ are nonnegative constants, then $\alpha f + \beta g$ is convex, with domain $\text{dom}f \cap \text{dom}g$.
(Composition with affine mapping) If $f$ is convex, then the composition $g(\mathbf{x}) = f(\mathbf{A}\mathbf{x} + \mathbf{b})$ of $f$ with an affine map $\mathbf{x}\mapsto \mathbf{A}\mathbf{x} + \mathbf{b}$ is convex, with $\text{dom}g=\{\mathbf{x}:\mathbf{Ax}+\mathbf{b}\in\text{dom}f\}$.
(Pointwise maximum and supremum) If $f_i$ is convex for each $i=1,\dotsc,m$, then $g(\mathbf{x}) = \max_{i=1,\dotsc,m} f_i(\mathbf{x})$ is convex, with $\text{dom}g = \cap_{i=1}^m\text{dom}f_i$. Likewise, if $f(\mathbf{x}, \mathbf{y})$ is convex in $\mathbf{x}$ for each $\mathbf{y}\in A$, then $g(\mathbf{x}) = \sup_{\mathbf{y}\in A} f(\mathbf{x}, \mathbf{y})$ is convex.
(Composition) For $h:\mathbb{R}\to\mathbb{R}\cup\{\infty\}$ and $g:V\to\mathbb{R}\cup\{\infty\}$, if $h$ is convex and nondecreasing, and $g$ is convex, then $f = h \circ g$ is convex, with $\text{dom}f = \{\mathbf{x}\in\text{dom}g: g(\mathbf{x})\in\text{dom}h\}$.
(Partial minimization) If $f(\mathbf{x},\mathbf{y})$ is jointly convex in $(\mathbf{x},\mathbf{y})$, then $g(\mathbf{x}) = \inf_{\mathbf{y}\in C} f(\mathbf{x},\mathbf{y})$ is convex provided it is proper and $C$ is nonempty and convex, with $\text{dom}g = \{\mathbf{x}: (\exists\mathbf{y}\in C)~(\mathbf{x},\mathbf{y})\in\text{dom}f\}$ (the projection of $\text{dom}f$ onto the $\mathbf{x}$-coordinates).
(Perspective) If $f(\mathbf{x})$ is convex and finite, then its perspective $g(\mathbf{x}, t)=t f(t^{-1}\mathbf{x})$ is convex, with $\text{dom}g=\{(\mathbf{x},t): t^{-1}\mathbf{x}\in\text{dom}f,~ t>0\}$.
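A quick randomized spot-check of two of these rules (the test functions and the NumPy-based check are illustrative choices, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)

def violates_convexity(g, sample, trials=5000):
    # Search for a violation of g(a x + (1-a) y) <= a g(x) + (1-a) g(y).
    for _ in range(trials):
        x, y, a = sample(), sample(), rng.uniform()
        if g(a * x + (1 - a) * y) > a * g(x) + (1 - a) * g(y) + 1e-9:
            return True
    return False

# Pointwise maximum of affine (hence convex) functions
A, b = rng.standard_normal((7, 3)), rng.standard_normal(7)
g_max = lambda x: np.max(A @ x + b)
print(violates_convexity(g_max, lambda: rng.standard_normal(3)))   # False

# Perspective of f(x) = ||x||_2^2, i.e., g(x, t) = ||x||_2^2 / t on t > 0
g_persp = lambda z: np.sum(z[:-1] ** 2) / z[-1]
sample_pos_t = lambda: np.append(rng.standard_normal(3), rng.uniform(0.1, 2.0))
print(violates_convexity(g_persp, sample_pos_t))                   # False
```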
A sum of convex functions is convex.
$f(\mathbf{x})=x_{(1)}+x_{(2)}+\dotsb+x_{(k)}$, the sum of the $k$ largest components of $\mathbf{x}\in\mathbb{R}^d$, is convex, because $$ f(\mathbf{x})=\max\{x_{i_1} +\dotsb + x_{i_k} : 1\le i_1 <i_2 < \dotsb < i_k \le d\}. $$
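A quick numerical check of this example (random points, not a proof), computing $f$ by sorting:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 8
f = lambda x: np.sum(np.sort(x)[-k:])   # sum of the k largest components

for _ in range(10000):
    x, y, a = rng.standard_normal(d), rng.standard_normal(d), rng.uniform()
    assert f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + 1e-9
print("no violations found")
```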
Maximum eigenvalue of a symmetric matrix $\lambda_{\max}: \mathbb{S}^d \to \mathbb{R}$ is convex, because $$ \lambda_{\max}(\mathbf{X}) = \max_{\mathbf{v}^T\mathbf{v}=1} \mathbf{v}^T\mathbf{X}\mathbf{v}. $$ (Rayleigh quotient). The maximand is linear (hence convex) in $\mathbf{X}$ for each $\mathbf{v}$.
Sum of $k$ largest eigenvalues of a symmetric matrix is convex, because $$ \sum_{i=1}^k\lambda_{i}(\mathbf{X}) = \max_{\mathbf{V}^T\mathbf{V}=\mathbf{I}_k, \mathbf{V}\in\mathbb{R}^{d\times k}} \text{tr}(\mathbf{V}^T\mathbf{X}\mathbf{V}) $$ (Ky Fan, 1949).
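The same kind of randomized check for the sum of the $k$ largest eigenvalues (again illustrative, not a proof); np.linalg.eigvalsh returns eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 2

def random_symmetric():
    M = rng.standard_normal((d, d))
    return (M + M.T) / 2

f = lambda X: np.sum(np.linalg.eigvalsh(X)[-k:])   # sum of the k largest eigenvalues

for _ in range(2000):
    X, Y, a = random_symmetric(), random_symmetric(), rng.uniform()
    assert f(a * X + (1 - a) * Y) <= a * f(X) + (1 - a) * f(Y) + 1e-9
print("no violations found")
```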
Support function of a set: $\sigma_C(\mathbf{x}) = \sup_{\mathbf{y}\in C} \langle \mathbf{x}, \mathbf{y} \rangle$ is convex (with an obvious domain), because $\langle \mathbf{x}, \mathbf{y} \rangle$ is convex (linear) in $\mathbf{x}$ for each $\mathbf{y}\in C$.
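For a concrete case, the support function of the unit $\ell_\infty$ ball $C=\{\mathbf{y}:\|\mathbf{y}\|_\infty\le 1\}$ is $\sigma_C(\mathbf{x})=\|\mathbf{x}\|_1$, attained at $\mathbf{y}=\mathrm{sign}(\mathbf{x})$; a small sketch comparing this closed form with a brute-force maximum over sampled points of $C$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal(d)

ys = rng.uniform(-1.0, 1.0, size=(200_000, d))  # random points of C = unit l_inf ball
print(np.max(ys @ x))                           # brute-force estimate, approaches sigma_C(x) from below
print(np.sum(np.abs(x)))                        # exact value sigma_C(x) = ||x||_1
```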
Matrix perspective function $f(\mathbf{x},\mathbf{Y}) = \mathbf{x}^T\mathbf{Y}^{-1}\mathbf{x}$ is convex with $\text{dom}f=\mathbb{R}^d\times\mathbb{S}_{++}^d$ because $$ \mathbf{x}^T\mathbf{Y}^{-1}\mathbf{x} = \sup_{\mathbf{z}\in\mathbb{R}^d}\{2\mathbf{x}^T\mathbf{z} - \mathbf{z}^T\mathbf{Y}\mathbf{z}\}. $$
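The supremum in this identity is attained at $\mathbf{z}^\star = \mathbf{Y}^{-1}\mathbf{x}$ (set the gradient of the concave quadratic in $\mathbf{z}$ to zero); a small numerical confirmation on a random positive definite $\mathbf{Y}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
Y = A @ A.T + d * np.eye(d)        # a positive definite Y
x = rng.standard_normal(d)

lhs = x @ np.linalg.solve(Y, x)    # x^T Y^{-1} x
z_star = np.linalg.solve(Y, x)     # maximizer of 2 x^T z - z^T Y z
rhs = 2 * x @ z_star - z_star @ Y @ z_star
print(np.isclose(lhs, rhs))        # True
```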
Kullback-Leibler divergence of two $d$-dimensional probability vectors $$ D(\mathbf{x} || \mathbf{y}) = \sum_{i=1}^d x_i\log(x_i/y_i) $$ (taking $0\log 0= 0$ and $\log 0=\infty$ by continuity) is convex on $\Delta_{d-1}\times\Delta_{d-1}$, because each term $x_i\log(x_i/y_i) = x_i\,h(y_i/x_i)$ with $h(u) = -\log u$ is the perspective of the convex function $h$ (with $x_i$ playing the role of $t$), hence jointly convex in $(x_i, y_i)$, and a sum of convex functions is convex.
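A randomized spot-check of this joint convexity on strictly positive probability vectors (illustrative, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

def kl(x, y):
    return np.sum(x * np.log(x / y))     # safe here: sampled vectors are strictly positive

def simplex_point():
    p = rng.uniform(0.01, 1.0, d)
    return p / p.sum()

for _ in range(5000):
    x1, y1, x2, y2 = simplex_point(), simplex_point(), simplex_point(), simplex_point()
    a = rng.uniform()
    lhs = kl(a * x1 + (1 - a) * x2, a * y1 + (1 - a) * y2)
    rhs = a * kl(x1, y1) + (1 - a) * kl(x2, y2)
    assert lhs <= rhs + 1e-9
print("no violations found")
```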