Smoothed Analysis for the Conjugate Gradient Algorithm

The purpose of this paper is to establish bounds on the rate of convergence of the conjugate gradient algorithm when the underlying matrix is a random positive definite perturbation of a deterministic positive definite matrix. We estimate all finite moments of a natural halting time when the random perturbation is drawn from the Laguerre unitary ensemble in a critical scaling regime explored in Deift et al. (2016). These estimates are used to analyze the expected iteration count in the framework of smoothed analysis, introduced by Spielman and Teng (2001). The rigorous results are compared with numerical calculations in several cases of interest.


Introduction
It is conventional in numerical analysis to study the worst-case behavior of algorithms, though it is often the case that worst-case behavior is far from typical. A fundamental example of this nature is the behavior of LU factorization with partial pivoting. While the worst-case behavior of the growth factor in LU factorization with partial pivoting is exponential, the algorithm works much better in practice on 'typical' problems. The notion of smoothed analysis was introduced by Spielman and his co-workers to distinguish between the typical-case and worstcase performance for numerical algorithms in such situations (see [16,18]). In recent work, also motivated by the distinction between typical and worst case behavior, the authors (along with P. Deift, S. Olver and C. Pfrang) investigated the behavior of several numerical algorithms with random input [2,14] (see also [15]). These papers differ from smoothed analysis in the sense that 'typical performance' was investigated by viewing the algorithms as dynamical systems acting on random input. The main observation in our numerical experiments was an empirical universality of fluctuations for halting times. Rigorous results on universality have now been established in two cases -the conjugate gradient algorithm and certain eigenvalue algorithms [3,4]. The work of [3] presents a true universality theorem. The work [4] revealed the unexpected emergence of Tracy-Widom fluctuations around the smallest eigenvalue of LUE matrices in a regime where universality emerges. Thus, the analysis of a question in probabilistic numerical analysis led to a new discovery in random matrix theory.
Our purpose in this paper is to explore the connections between our work and smoothed analysis. We show that the results of [4] extend to a smoothed analysis of the conjugate gradient algorithm over strictly positive definite random perturbations (which constitute a natural class of perturbations for the conjugate gradient algorithm). To the best of our knowledge, this is the first instance of smoothed analysis for the conjugate gradient algorithm. More precisely, the results of [4] are used to establish rigorous bounds on the expected value of a halting time (Theorem 1.1 below). These bounds are combined with numerical experiments that show an interesting improvement of the conjugate gradient algorithm when it is subjected to random perturbations. In what follows, we briefly review the conjugate gradient algorithm, smoothed analysis and the Laguerre unitary ensemble, before stating the main theorem, and illustrating it with numerical experiments.

The conjugate gradient algorithm
The conjugate gradient algorithm is a Krylov subspace method to solve the linear system Ax = b when A > 0 is a positive definite matrix. In this article, we focus on Hermitian positive definite matrices acting on C N , though the ideas extend to real, symmetric positive definite matrices.
We use the 2 inner product on C N , u, v 2 = N i=1 u ivi , and A > 0 means that A is Hermitian and u, Au 2 > 0 for all u = 0. When A > 0, its inverse A −1 > 0, and we may define the norms u 2 w = u, Au 2 and u 2 w −1 = u, A −1 u 2 .
In this setting, the simplest formulation of the conjugate gradient algorithm is as follows [8,10]. In order to solve Ax = b, we define the increasing sequence of Krylov subspaces K k = span b, Ab, . . . , A k−1 b , and choose the iterates {x k } ∞ k=1 , x 0 = 0, to minimize the residual r k = b − Ax k in the w −1 norm: Since A > 0, x ∈ K k for some k ≤ N (in our random setting it follows that K N = C N with probability 1), so that the method takes at most N steps in exact artithmetic 1 . However, the residual decays exponentially fast, and a useful approximation is obtained in much fewer than N steps. Let λ max and λ min denote the largest and smallest eigenvalues of A and κ = λ max /λ min the condition number. Then the rate of convergence in the 2 and w −1 norms is [7, Theorem 10.2.6] Since A is positive definite, we have Applying this estimate to (1.1) we find the rate of convergence of the residual in the 2 norm These rates of convergence provide upper bounds on the following -dependent run times, which we call halting times: Note that we have set x 0 = 0 so that r 0 = b (the estimates above hold for arbitrary x 0 ). In what follows, we will also assume that b 2 = 1, so that the definitions above simplify further.

Smoothed analysis
Our main results are a theorem (Theorem 1.1) along with numerical evidence to demonstrate that the above worst-case estimates, can be used to obtain bounds on average-case behavior in the sense of smoothed analysis. In order to state the main result, we first review two basic examples of smoothed analysis [18], since these examples clarify the context of our work. Roughly speaking, the smoothed analysis of a deterministic algorithm proceeds as follows. Given a deterministic problem, we perturb it randomly, compute the expectation of the run-time for the randomly perturbed problem and then take the maximum over all deterministic problems within a fixed class. Subjecting a deterministic problem to random perturbations provides a realistic model of 'typical performance', and by taking the maximum over all deterministic problems within a natural class, we retain an important aspect of worst-case analysis. A parameter σ 2 (the variance in our examples) controls the magnitude of the random perturbation. The final estimate of averaged run-time should depend explicitly on σ 2 in way that demonstrates that the average run-time is much better than the worst-case. Let us illustrate this idea with examples.

Smoothed analysis: The simplex algorithm
AssumeĀ = (ā 1 ,ā 2 , . . . ,ā d ) is a deterministic matrix of size N ×d, andȳ and z are deterministic vectors of size N and d, respectively. Let T (Ā,ȳ, z) be the number of simplex steps required to solve the linear program maximize z T x, subject toĀx ≤ȳ, with the two-phase shadow-vertex simplex algorithm.
We subject the dataĀ andȳ to a random perturbation σA, and σy, where A and y have iid normal entries with mean zero and standard deviation max i (ȳ i ,ā i ) . It is then shown in [19] that the expected number of simplex steps is controlled by where P (a, b, c) is a polynomial. Thus, problems of polynomial complexity occupy a region of high probability.

Smoothed analysis: LU factorization without pivoting
LetĀ be an N × N non-singular matrix and consider computing its LU factorization,Ā = LU , without partial pivoting. The growth factor ofĀ, defined by may be exponentially large in the size of the matrix, as seen in the following classical example: Generalizing this example to all N , we see that ρ(Ā) = 2 N −1 /N . This is close to the worst-case estimate of Wilkinson [22]. Now consider instead ρ(Ā+σA) where the random perturbation A is an N × N matrix consisting of iid standard normal random variables. One of the results of [16] is Hence the probability that ρ(Ā + σA) ≥ 2 N −1 /N is exponentially small! The above estimate relies on a tail bound on the condition number 14.1n 1 + 2(log t)/9n tσ .
The example above may also be used to demonstrate exponential growth with partial pivoting. However, to the best of our knowledge, there are no smoothed analysis bounds analogous to (1.4) that include the effect of pivoting.

The main result
We now formulate a notion of smoothed analysis for the halting time of the conjugate gradient algorithm. In order to do so, we must choose a matrix ensemble over which to take averages. Since the conjugate gradient algorithm is restricted to positive definite matrices it is natural to choose random perturbations that are also positive definite. The fundamental probability measure on Hermitian positive definite matrices is the Laguerre unitary ensemble (LUE), or Wishart ensemble, defined as follows. Assume N is a positive integer and α > −N is another integer. Let X be an N × (N + α) matrix of iid standard complex normal random variables 2 . The Hermitian matrix W = XX * is an LUE matrix with parameter 1 + α/N .
The parameter α plays an important role in our work. The case α = 0 is critical in the following sense. When −N < α < 0, the random matrix W = XX * is positive semi-definite and 0 is an eigenvalue of multiplicity −α with probability 1. In particular, the condition number of W is infinite almost surely. On the other hand, when α ≥ 0, the random matrix W is almost surely strictly positive definite. When α = 0, Edelman [5] showed that the condition number of W is heavy-tailed, and does not have a finite mean (see also [16] and the previous examples). On the other hand, if α grows linearly with N , say lim N →∞ α/N = p > 0, the leading-order 2 A standard complex normal random variable is given by Z = X + iY where X and Y are independent real normal random variables with mean zero and variance 1/2. asymptotics of the smallest eigenvalue of W , and thus the condition number, are described by the Marcenko-Pastur distribution with parameter p. In particular, as N → ∞ the smallest eigenvalue of W (N )/(N + α) remains strictly separated from 0. In recent work with P. Deift, we explored an intermediate regime α ∼ 4cN 1/2 , and established Tracy-Widom [21] fluctuations of the smallest eigenvalue and the condition number (see [4, Theorems 1.1 and 1.3]). We further showed numerically that the nontrivial fluctuations of the condition number are reflected in the performance of the conjugate gradient algorithm on Wishart matrices in this regime. In this article, we broaden our exploration of this intermediate regime, choosing In order to formulate a notion of smoothed analysis for the conjugate gradient algorithm, we must subject a deterministic positive-definite matrix A with A ≤ 1 to a random perturbation of the form σ 2 H, where H = O(1), and then take the supremum over all A with A ≤ 1. It turns out that the largest eigenvalue of W = XX * is approximately ν = 4N + 2α + 2. Thus, our implementation of smoothed analysis for the conjugate gradient algorithm involves estimating with explicit dependence on σ and γ. The factor σ 2 is used here so that σ represents the scaling of the variance of the entries of X. Our main result, proved in Section 3.2, is the following.
and X is an N × (N + α) matrix of iid standard complex normal random variables. Then with we have the following estimates.
(1) Halting time with the 2 norm: (2) Halting time with the weighted norm: (3) Successive residuals: For r k = r k (A + σ 2 H, b), the kth residual in the solution of (A + σ 2 H)x = b with the conjugate gradient algorithm where · is either · 2 or · w −1 .
Remark 1.2. The parameters (γ, σ) control the effect of the random perturbation in very different ways. In Lemma 3.4 we precisely describe how increasing γ leads to better conditioned problems. For all 0 < γ ≤ 1/2 (and conjecturally for γ > 1/2) the asymptotic size of the expectation and the standard deviation of τ is O(N 1−γ log N ), meaning that the conjugate gradient algorithm will terminate before its maximum of N iterations with high probability. For instance, assume α satisfies (1.5) with c = 2. We use Markov's inequality and Theorem 1.1 for j > 0 and sufficiently large 3 N to obtain Hence for λ < γ and j large, this probability decays rapidly. Remark 1.3. We only prove Theorem 1.1 for LUE perturbations in the range 0 < γ ≤ 1/2. However, we expect Theorem 1.1 to hold for all 0 ≤ γ ≤ 1, as illustrated in the numerical experiments below. In order to establish Theorem 1.1 in the range 1 2 < γ ≤ 1 it is only necessary to establish Lemma 3.4 for these values of γ. This will be the focus of future work. Remark 1.4. Theorem 1.1 provides aymptotic control on the jth moments of halting times for each j. This formally suggests that one may obtain a bound on an exponential generating function of the halting times above. However, we cannot establish this because the condition number κ has only O(α) moments at any finite N . Remark 1.5. By restricting attention to positive definite perturbations we ensure that the conjugate gradient scheme is always well-defined for the perturbed matrix A + σ 2 H. This also allows the following simple, but crucial lower bound, on the lowest eigenvalue of the perturbed matrix which then yields an upper bound on the condition number of the perturbed matrix κ(A+σ 2 H). We have not considered the question of random perturbations of A that are Hermitian, but not necessarily positive definite. Such perturbations are more subtle since they must be scaled according to the smallest eigenvalue of A. Nor have we considered the question of whether such perturbations provide good 'real-life' models of a smoothed analysis of the conjugate gradient scheme. Nevertheless, the above framework shares important features with [16] in that the problem is "easier" for large values of σ and the worst case of the supremum over the set {A ≥ 0, A ≤ 1} can be realized at singular A.

Numerical simulations and the accuracy of the estimates
In this section we investigate how close our estimates on E[τ (A + σ 2 H, b)] are to the true value of the expectation. We present numerical evidence that in the "σ = ∞" (also obtained by A = 0 and σ = 1) case the estimates are better for larger values of γ, and continue to hold beyond the γ = 1/2 threshold of Theorem 1.1. We also give examples for specific choices of A and demonstrate that, as expected, the actual behavior of the conjugate gradient algorithm is much more complicated for A = 0.
Because the conjugate gradient algorithm is notoriously affected by round-off error, we adopt the following approach to simulating τ (M, b), M = A + σ 2 H with finite-precision arithmetic: • In exact arithmetic, the conjugate gradient algorithm applied to M x = b, M = U * ΛU , with initial guess x 0 = 0, has the same residuals as the algorithm applied to Λy = U * b.
This is an exact characterization of the iterates of the conjugate gradient algorithm applied to Λy = U * b.
Sample a vector b with iid Gaussian entries and normalize 4 it, so that b 2 = 1. Prior to normalization the entries of b are iid Gaussian, thus b is uniformly distributed on the unit is also uniformly distributed on the unit sphere in (C N , 2 ). That is, b and U * b have the same law.
• Applying the diagonal matrix Λ to a vector is much less prone to round-off error since it involves only N multiplications, as opposed to N 2 multiplications for the dense matrix H. Thus, to minimize round-off error we compute the iterates of the conjugate gradient algorithm applied to Λy = b with Λ as above, and b uniformly distributed on the unit sphere in (C N , 2 ). As noted above when A = 0, these iterates have the same law as those of Hx =b whenb and b have the same law, and when H is a Wishart matrix. By computing the number of iterations necessary (in high-precision arithmetic) so that y k+1 2 ≤ , we obtain one sample of the halting time τ (M, b) without significant round-off errors. (1) Tail estimates on the condition number derived from tail estimates of the extreme eigenvalues, can be used to obtain near optimal, and in some cases optimal, estimates for the expected moments of the condition number.
(2) In light of rigorous results and heuristic expectations of universality in random matrix theory, we find it reasonable to expect Lemma 3.4 and Theorem 1.1 to hold for more general real and complex sample covariance matrices, not just LUE. importance of this observation is that these bounds are known to be sub-optimal. Thus, our results show that the matrices for which these estimates are sub-optimal have a small probability of occurrence.

Perturbed discrete Laplacian
The numerical examples of the previous section are dominated by noise. In this subsection and the next, we investigate the effect of small LUE perturbations on structured matrices A. This is a more subtle problem since it is hard to conjecture the growth rate of E[τ (A + σ 2 H, b)] as σ and γ vary for a given A. We present numerical experiments on random perturbations of two examples that have been studied in the literature on the conjugate gradient algorithm -   discrete Laplacians and singular matrices with clusters of eigenvalues. In both these examples, we numerically estimate the growth with N of the halting time τ (A + σ 2 H, b) for a range of σ and γ. These numerical computations are compared with the unperturbed (σ = 0) and noise dominated ("σ = ∞") cases. Broadly, we observe that finite noise gives faster convergence (smaller halting time) with a different scaling than what is expected with no noise. We also find that when A = 0, the halting time is not strongly affected by γ. At present, these are numerical observations, not theorems. We hope to investigate the accelerated convergence provided by noise in future work. In our first example, A is the mk × mk 2D discrete Laplacian defined by the Kronecker product where D 2,m is the m × m symmetric tridiagonal matrix with −2 on the diagonal and 1 on the off-diagonals. We choose m = k = √ N in the computations. Some results of numerical experiments with this choice of A are shown in Fig. 4. The scaling of the sample mean of the halting time, τ , is O( √ N ) in the extreme cases when σ = 0 or σ = ∞ (see + and 2 in Fig. 4). However, when σ is O(1), we find that τ ∼ N 1/4 (see • and × in Fig. 4). Further, this result is not sensitive to γ. Therefore there is a complicated relationship between the deterministic matrix A, the random perturbation H and the halting time that is not captured by Theorem 1.1.

Perturbed eigenvalue "clusters"
In our second example, we consider random perturbations of a singular matrix with clusters of eigenvalues. This construction is motivated by [11, Section 5.6.5] and [9].
We define A to be the mk ×mk diagonal matrix obtained by sampling the Marchenko-Pastur law as follows 5 . Let ζ j,k , j = 1, . . . , k be defined by This produces a mk × mk diagonal matrix with m zero eigenvalues, and k(m − 1) eigenvalues that are each clustered at quantiles of the Marchenko-Pastur law. Note that and so ζ 1,k = O(k −2 ). We divide by k 2 in (1.6) to ensure we maintain a positive semi-definite matrix. As in the previous section we set m = k = √ N in the computations and plot similar sample mean results in Fig. 5. Despite the fact that M m,k is a singular matrix, the conjugate gradient algorithm converges rapidly for the perturbed matrix M m,k + σ 2 H. In particular, Fig. 5(b) shows a rate of growth that is only O( √ log N ). Finally, in the construction of M m,k we imposed the condition that m eigenvalues are zero. If we considered a case where more and more eigenvalues are set to zero as m, k → ∞ we would expect a transition to the σ = ∞ case.
2 Estimating the halting time
Note that log θ(κ) < 0 and that as κ → ∞, Thus, basic convergence properties of the conjugate gradient algorithm may be obtained from tail bounds on the condition number. Finally, the condition number is estimated as follows.
Since κ = λ max /λ min it is clear that upper bounds of the form P(λ max > t) and P(λ −1 min > t) for arbitrary t ∈ (0, ∞) may be combined to yield an upper bound on P(κ > a) by suitably choosing t(a). As noted in the first three lines of the proof of Theorem 1.1 below, estimates of upper and lower eigenvalues for Wishart ensembles established in [4] immediately extend to estimates for matrices of the form A + σ 2 H.

A general suf f icient condition
The abstract property we use to establish Theorem 1.1 is the following. Condition 2.1. Given a random positive-definite matrix H, assume there exist positive constants c 1 and δ, constants C 1 , C 2 , and a that are greater than 1, and a positive function f : (0, ∞) → (0, ∞) such that Assume further that T max / min are strictly monotone functions of t and lim N →∞ While the conditions above seem arbitrary at first sight, we will show how they emerge naturally for the LUE ensemble in the next section. In particular, we show that these conditions are satisfied by a class of LUE matrices in Lemmas 3.3 and 3.4.
Lemma 2.2. Assume Condition 2.1 and that α grows with N as in (1.5). Then there exists a constant C > 0 such that Proof . First, if xy > ab and all these numbers are positive, then either x > a or y > b. Thus for 0 ≤ s ≤ 1, the tails bounds of Condition 2.1 imply This bound may be 'inverted' in the following way. If we define T −1 (s) = T −1 max (s)T −1 max (s), then Our goal is to obtain an upper bound on T (t) using the upper boundsT max andT min . If T max (t 1 ) = s andT max (t 2 ) = s then t 2 ≥ t 1 . Therefore, Since (1 + log t/n) n ≤ t when t > 1 and n ≥ 1, we can estimate . Hence Inverting this expression, we find Let us examine this lower bound more carefully. We increase C 1 so that C 1 ≤ C 2 , if necessary, so that T −1 (T max (a)) = aT −1 min (T max (a)) ≤ af (N ) where C is a suitable constant. Then using the assumption (1.5) and is bounded by a constant, say, C/2 > 0. This establishes the lemma.
Let Σ N denote the set of N × N strictly positive definite complex matrices and recall that the constant δ N is defined in Lemma 2.2. The following lemma is applied to control the halting time in terms of the condition number, and the reader may turn to the lemmas that follow to see instances of functions g.
Proof . Using Lemma 2.2 for b N = af (N )(1 + δ N ) and N sufficiently large This last inequality follows from the scaling (1.5): and hence α/2 − e N = O(α). For any fixed there exists a constant K = K We apply this lemma to the following functions.
Further, for every η > 0 and > 0 there exists a constant C ,η such that for s ∈ [1, ∞) (2) Halting time with the weighted norm: This function g satisfies the following estimates and · stands for either · 2 or · A .
Proof . All the bounds follow from (1.1) and (1.2) as explained in the introduction to this section. The estimates on the functions g(s), each of which satisfies g(1) = 0, may be obtained by elementary manipulations.
(1) Halting time with the 2 norm: As H satisfies Condition 2.1 we can apply Lemma 2.3 with the estimates in Lemma 2.4(1) for j > 0 and η > 0. We use that in this case d ds g(s) j = jg (s)g(s) j−1 ≤ jC j ,η s (j−3)/2+jη , and hence = (j − 3)/2 + jη in Lemma 2.3. Therefore, for some ζ > 0, we choose η > 0 such that jζη < γ. Thus, (2) Halting time with the weighted norm: We follow the same calculations as (1) with the estimates in Lemma 2.4(2) for j > 0. Here Finally, The constant δ N > 0 is used in the above theorem to make precise the fact that if we integrate the tail of the condition number distribution just beyond af (N ) the error term is exponentially small as α → ∞.

The Laguerre unitary ensemble
The following is well-known and may be found in [6, Section 2], for example. This discussion is modified from [4,Section 2]. Let W = XX * where X is an N × (N + α) matrix of iid standard complex Gaussian random variables. Recall that the (matrix-valued) random variable W is the Laguerre unitary ensemble (LUE). Then it is known that the eigenvalues 0 ≤ λ min = λ 1 ≤ λ 2 ≤ · · · ≤ λ N = λ max of W have the joint probability density Recall that the Laguerre polynomials, {L , are a family of orthogonal polynomials on [0, ∞), orthogonal with respect to the weight e −x x α . We normalize them as follows [13] L (α) Then the following are orthonormal with respect to Lebesgue measure on [0, ∞), Define the correlation kernel The kernel K N defines a positive, finite-rank and hence trace-class operator on L 2 ([a, b]). To see that K N is positive, consider f ∈ C ∞ ((s, t)) with compact support and note that The eigenvalues λ 1 ≤ · · · ≤ λ N may be described in terms of Fredholm determinants of the kernel K N [1,6]. In particular, the statistics of the extreme eigenvalues are recovered from the determinantal formula By the Christoffel-Darboux formula [20], we may also write Thus, questions about the asymptotic behavior of K N (x, y) as N → ∞ reduce to the study of the large N asymptotics of L

Kernel estimates
We use Fredholm determinants to show that Condition 2.1 holds with appropriate constants when W = XX * is distributed according to LUE. The main reference for these ideas is [17]. Let A : L 2 ([t, ∞)) → L 2 ([t, ∞)) be a positive trace-class operator with kernel K(x, y). Assume In this way we can get estimates on the tail directly from the large x behavior of K(x, x). Similar considerations follow if, say, A t : As was done in [4], we consider the scaled kernel Next, we pull results from [4] to estimate the kernel K s N (x, y) for LUE near the largest and smallest eigenvalue of W . We first look for the asymptotics of K s N (x, y) for (x, y) ≈ (1, 1), called the soft edge. Letx = 1 + x/(2 2/3 M 2/3 ) and defině Then from [4, Proposition 2]: Proposition 3.1. As N → ∞ the rescaled kernels converge pointwise, and the convergence is uniform for (x, y) in a compact subset of [L, ∞) 2 for any L ∈ R. If x = y then the limit is determined by continuity. Further, there exists a positive, piecewise-continuous functionḠ : (L, ∞) 2 → (0, ∞), such that Furthermore, it suffices to take for a constant C = C(L) > 0 For K s N (x, y) near (0, 0) as the scaling of the kernel depends critically on γ. So, we definê The next proposition essentially follows directly from [4, Proposition 1] and is in fact a little simpler with the scaling chosen here. Furthermore, it suffices to take for a constant C = C(L) > 0 G(x, y) = C(ǧ(x)ȟ(x)χ |x−y|<1 +ǧ(x)ǧ(y)),
We now briefly describe how the estimates in terms ofǧ andȟ arise. The asymptotics of the kernel K s N is given in terms of Bessel functions, after a change of variables. In the regime
Since we may choose d as needed, we assume that αd ∼ N λ → ∞. We then have from Lemma 3.4, with a possibly new constant C, This follows from the fact that (N q ) N −λ → 1 for any value of q. We define δ by 1 + δ = (d) −2 and then the matrix A + σ 2 H satisfies Condition 2.1 with c 1 = 4/3, a = 1 + σ −2 , α → dα and f and δ as defined here. Different values of λ can be used to create different estimates. But for simplicity, we take λ = γ/2 or d = N −γ/2 . Then by (3.1), for small, δ = 1 − (d) −2 ≥ N −γ/3− if N is sufficiently large and log(1 + δ) N γ/2 → ∞, with some power of N . Therefore (1 + δ) −dα/K tends to zero faster than any power of N if K > 0 is fixed. We now establish each estimate by appealing to Theorem 2.5.