Large Deviations Theory and Applications

SMALL RANDOM PERTURBATIONS OF DYNAMIC SYSTEMS
SIMULATED ANNEALING : SPEED AND PARALLELIZATION
AUTOMATIC LEARNING AND VAPNIK BOUNDS

 

SMALL RANDOM PERTURBATIONS OF DYNAMIC SYSTEMS
1975-1985

My first brush with large deviations and rare events was an intensive joint work with Gabriel RUGET. We had both just taken part in a one-year reflection group at Ecole Normale Sup. Paris on adaptive control of dynamic systems, jointly organized with Albert BENVENISTE and Philippe COURREGE; the connected question of probabilistic models for learning processes was evoked there, only to be quickly dismissed due to the decidedly “soft mathematics” colour of the then available literature on probabilistic learning. Fueled by our tête-à-tête discussions on sufficiently generic but still manageable probabilistic models for learning processes, G. RUGET and I started an exciting quest for computable asymptotics of “slow learning” random dynamics.
We decided to study a large class of continuous random dynamic systems, determined by a finite set of random vector fields on a differentiable manifold: at each point in space, the dynamic system picks its tangent vector at random among the finite set of vector fields. These local random choices are defined by a smooth field (µx) of probabilities carried at each space point x by the tangent space Tx. Slowing down the time clock of these “learning” dynamics brings trajectories close to the “mean field” deterministic dynamics, as could be deduced from adequate extensions of local laws of large numbers. We then studied the asymptotics of the necessarily very rare escape paths from basins of attraction of the mean field deterministic dynamics. This opened up a whole Pandora's box of local large deviations theory for our slow learning processes, and we were naturally led to build a complex extension of the major work of VENTCEL-FREIDLIN on small diffusions. Our random systems could indeed be viewed as a whole new class of small random perturbations of deterministic dynamic systems.
We thus proved that for these random systems with slow dynamics, escapes from stable fixed points of the deterministic mean field dynamics were rare events with exponentially decaying probabilities, and we computed a closed formula for an “energy functional” L(f) defined on the path space, providing the exponential rate of decay for the probability of escape along any given path f. Namely, we showed that L(f) was the integral energy of the path speed df/dt for a non-Riemannian metric, defined by a continuous field of positive convex functions Lx on the tangent spaces Tx. In fact, for each x, the function Lx is the CRAMER-CHERNOV transform of the probability µx. This led us to prove that escape from an equilibrium point of the mean field dynamics, whenever it occurred, most likely took place along “critical escape paths”, and that these critical paths solved an explicit variational problem. Namely, critical paths were geodesics of the non-Riemannian metric just defined. It was then possible to write the nonlinear Hamilton-Jacobi partial differential equations solved by the critical paths.
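In modern notation, the objects above can be sketched as follows; the symbols b, Λ and the time horizon [0,T] are my notational choices, not taken from the original papers:

```latex
% Mean field dynamics: the average of the random vector fields under \mu_x
\dot{x}(t) = b(x(t)), \qquad b(x) = \int_{T_x} v \, d\mu_x(v)

% Log-Laplace transform of \mu_x, and L_x as its CRAMER-CHERNOV
% (Legendre) transform on the tangent space T_x
\Lambda_x(\theta) = \log \int_{T_x} e^{\langle \theta, v \rangle} \, d\mu_x(v),
\qquad
L_x(v) = \sup_{\theta} \bigl[ \langle \theta, v \rangle - \Lambda_x(\theta) \bigr]

% Energy functional on path space: the probability of escape along a path f
% defined on [0,T] decays at the exponential rate
L(f) = \int_0^T L_{f(t)}\!\left( \frac{df}{dt} \right) dt
```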
The masterly early papers of Srinivasa VARADHAN and Daniel STROOCK on large deviations theory of course offered us the right conceptual probabilistic framework in which to frame and formulate our results. Inevitably, I had many more opportunities to rely on S.R.S. VARADHAN’s deep contributions to large deviations theory, in particular while directing the PhD thesis of Francis COMETS on large deviations for Gibbs fields. F. COMETS proved interesting and technically difficult results on low temperature asymptotics for transitional dynamics between two equilibria of spin fields, as well as on large deviations for the empirical long time average dynamics of stationary Gibbs fields, in the spirit of S.R.S. VARADHAN’s fundamental papers on large deviations for stationary random processes. COMETS went on to become a recognized expert in this field.
A few years later, I directed the PhD thesis of Isabelle NAGOT to extend the results RUGET and I had obtained for continuous measure-valued random fields (µx) to measure-valued random fields having discontinuities when x belongs to a smooth sub-manifold of the state space. This raised particularly delicate technical issues in analyzing process behaviour across such discontinuity surfaces, but NAGOT obtained explicit formulations of the partial differential equations solved by critical paths.
This in-depth incursion into large deviations theory launched me on a long term systematic study of Rare Events theory, LEGENDRE duality between convex functions, and CRAMER-CHERNOV formulae for large deviations of sums of multivariate random variables. I soon clearly understood the existence of a powerful link between the study of small random perturbations of dynamic systems (or “small diffusions”) à la VENTCEL-FREIDLIN and more classical large deviations theory. The idea I developed was that diffusion processes with small diffusion coefficients could in fact be seen as “almost smooth” pathwise transforms of small ordinary Brownian motion, and that it must hence be possible to transfer, by direct image of probabilities in path spaces, the easily formalized large deviations results for small Gaussian processes into large deviations in path space for small diffusions. This aesthetic and intuitive machinery worked very well, and enabled me to give a new, cleaner proof of the VENTCEL-FREIDLIN results and to obtain, as a side benefit, a natural extension of their results from elliptic to hypo-elliptic small diffusions, a difficult task which was not easily feasible with their original method. I was then invited by Paul-André MEYER, Jacques NEVEU, and Didier DACUNHA-CASTELLE to give a summer course on Large Deviations at the yearly Ecole d'Eté de Probabilités de Saint-Flour (France), where I linked these results to a systematic formalization of rare events theory, in the spirit of the BAHADUR-ZABELL papers, and of course of the VARADHAN-STROOCK formulations. The intellectual impact of my course on a group of very active French probability PhD students encouraged me to intensify my activity in the field of large deviations theory, and to explore applications in physics of small random perturbations in path space.
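In today's vocabulary, the “direct image” transfer sketched above is an instance of the contraction principle; the following hedged formulation uses my own symbols:

```latex
% If (X_\varepsilon) satisfies a large deviations principle with rate function I,
% and F is a continuous map between path spaces, then Y_\varepsilon = F(X_\varepsilon)
% satisfies a large deviations principle with the image rate function
I_F(y) = \inf \{ \, I(x) \; : \; F(x) = y \, \}
% Here X_\varepsilon = \sqrt{\varepsilon}\, W is a small Brownian motion, F is an
% (almost smooth) pathwise solution map of the stochastic differential equation,
% and Y_\varepsilon is the corresponding small diffusion.
```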
I directed the PhD thesis of P. LOTI-VIAUD, who had begun exploring an idea of G. RUGET, and studied the “large deviations” asymptotics of random evolutions for populations where the generating functions governing the offspring of any individual depend on its spatial localization. The goal was to obtain and numerically solve the partial differential equations governing the asymptotics of population propagation in time and space. This was a strong motivation to launch an intensive two-year team effort on “geodesics and small time behaviour of diffusion processes” following MOLCHANOV, as described below. A collaboration with Halim DOSS was then an opportunity to apply probabilistic stationary phase methods on WIENER spaces of trajectories to the asymptotic “semi-classical expansions” of the fundamental solution of the Schroedinger operator as the Planck constant tends to zero.
I then undertook the exploitation of “rare events” probabilistic theory in path spaces to generate rigorous and precise small time asymptotic expansions of fundamental solutions of smooth 2nd order parabolic operators, viewed as probability densities of diffusion processes. Indeed MOLCHANOV’s elegant work only yielded the proper geometric interpretation for the 1st term c(x,y) t^{-k/2} exp[-d²(x,y)/2t] in the small t expansion of the density p(t,x,y) on a Riemannian manifold of dimension k. My approach provided complete small time asymptotic expansions of diffusion probability densities p(t,x,y), replacing the constant c(x,y) by a polynomial in positive integer powers of t^{1/2}, with good estimates of the remainder term. I thus obtained interesting closed form stochastic integral formulae for the coefficients of this polynomial, and introduced for this purpose a key probabilistic tool: the construction of explicit pathwise stochastic Taylor formulas for the random paths of diffusion processes in small time, where the coefficient of t^{n/2} in the pathwise Taylor series involves multiple stochastic integrals of order n. In the particular case of the logarithm of Brownian motion on Lie groups, the corresponding pathwise stochastic Taylor expansions coincided neatly with the explicit and elegant expansions obtained for the Lie group case by G. BEN AROUS in his brilliant PhD thesis. This point of view, applied to delicate asymptotic expansions of the Brownian bridge on Riemannian manifolds, became helpful when the French BOURBAKI group asked me to present a synthetic analysis of J.M. BISMUT’s remarkable probabilistic proof of the ATIYAH-SINGER index theorem, and I was thus able to provide an interesting shortcut for a key probabilistic point of BISMUT's complex and original approach. My understanding of BISMUT’s paper became the topic of a seminar I gave in 1986 at the invitation of S.R.S. VARADHAN, at Courant Inst., New York.
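In display form, the expansions discussed above read as follows (the coefficients c_n(x,y) are written in my notation; the first line is MOLCHANOV's term, the second the complete expansion):

```latex
% MOLCHANOV's first term on a k-dimensional Riemannian manifold,
% with d(x,y) the geodesic distance:
p(t,x,y) \sim c(x,y)\, t^{-k/2} \exp\!\left[ -\frac{d^2(x,y)}{2t} \right],
\qquad t \to 0

% Complete small time expansion, with remainder estimate:
p(t,x,y) = t^{-k/2} \exp\!\left[ -\frac{d^2(x,y)}{2t} \right]
\left( \sum_{n=0}^{N} c_n(x,y)\, t^{n/2} + O\!\left( t^{(N+1)/2} \right) \right)
```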
With L. ELIE, I then launched a one year intensive seminar at University Paris 7 to study in depth the stunning probabilistic work of MOLCHANOV on explicit links between the small time behaviour of fundamental solutions p(t,x,y) of 2nd order elliptic operators on Riemannian manifolds and the geodesic structure of the manifold. MOLCHANOV had proved an elegant geometric formula for the 1st term in the small time asymptotics of the diffusion density p(t,x,y), as c(x,y) t^{-k/2} exp[-d²(x,y)/2t], where d(x,y) is the geodesic distance between points x and y, and k is the dimension of the manifold. We were able to broaden and clarify MOLCHANOV’s approach, and to extend it neatly to hypo-elliptic differential operators. I clarified the links of these results with the “large deviations” asymptotics of small random perturbations of the deterministic geodesic flow, in the spirit of my simultaneous ongoing work on small random perturbations of dynamic systems; finally, I edited and coordinated our joint book on the topic.

References :
Large Deviations and slow learning processes
R. Azencott, G. Ruget, Comptes Rendus Ac. Sci. Paris, vol 281, ser A, pp 529-532, 1975

Random mixing of differential equations : Large deviations theory
R. Azencott, G. Ruget, Zeitschrift Wahrschein. Th. und verw. Gebiete, vol 38, pp 1-54, 1977

(Book) Large Deviations theory and Applications
R.Azencott, invited course, Saint-Flour Summer School in Probability Th.
Lecture Notes Math., vol 774, 176 pages, Springer-Verlag 1980

Stochastic Taylor Formula and Feynman integrals
Lecture Notes Math., vol 921, pp 237-265, Springer-Verlag 1982

Asymptotic small time expansions for densities of diffusion processes
Lecture Notes Math., vol 1059, pp 402-498, Springer-Verlag 1984

Small random perturbations of dynamic systems : asymptotic expansions
Bull. Soc. Math. France, vol 109, pp 253-308, 1985

Schroedinger Equation as « h » tends to zero : a probabilistic approach
R. Azencott, H. Doss in CIRM 1985 colloquium (Editor Albeverio)
Lecture Notes Math., vol 1109, Springer-Verlag 1985

Atiyah-Singer Index theorem : Bismut’s approach
Seminaire Bourbaki 1984 Asterisque (Soc. Math. France) 1985

 


SIMULATED ANNEALING : SPEED AND PARALLELIZATION
1984-1992

In 1983, I attended an advanced seminar at Les Houches, France, on new trends in statistical physics, sponsored by IBM, and organized by physicists (SHERRINGTON, KIRKPATRICK, G. TOULOUSE, M. MEZARD). The seminar focused on Spin Glasses and Simulated Annealing, a seductive stochastic optimization algorithm, originally inspired by clever analogies with the physical dynamics of interactive fields of spin vectors, and by the actual cooling of large arrays of interactive particles. Simulated annealing seemed to have “universal” minimizing virtues for NP-hard combinatorial optimization. I was struck by the variety of massive optimization problems which simulated annealing could apparently attack, and by its “physical model”: the semi-random progressive self organization of large arrays of interactive particles submitted to “slowly cooling” temperature dynamics, which stabilizes the particle system in configurations minimizing a global energy function. Given the lack of formal probabilistic study of the asymptotics of simulated annealing, I jumped on this exciting new subject in 1984, and received another strong impetus when I read the groundbreaking work of Donald and Stuart GEMAN (1984) in an exciting new domain: restoration of noisy digitized images by Markov Random Field “energy” modelling, and energy minimization by simulated annealing. This opened for me a ten-year intensive research activity in three interwoven domains: simulated annealing, Markov Random Fields, and energy minimization for digital image analysis.
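As a concrete illustration of the algorithm evoked above, here is a minimal sketch of Metropolis-type simulated annealing over binary configurations, with a logarithmic cooling schedule; the energy function, the schedule constant, and all identifiers are illustrative choices of mine, not taken from the works cited:

```python
import math
import random

def simulated_annealing(energy, state, n_steps, c=1.0, seed=0):
    """Minimize `energy` over binary configurations by simulated annealing.

    At step k, a random bit flip is proposed and accepted with probability
    min(1, exp(-(E_new - E_old) / T_k)), under the logarithmic cooling
    schedule T_k = c / log(2 + k).
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    state = list(state)
    e = energy(state)
    best_state, best_e = state[:], e
    for k in range(n_steps):
        temp = c / math.log(2 + k)     # slowly decreasing temperature
        i = rng.randrange(len(state))
        state[i] ^= 1                  # propose a single bit flip
        e_new = energy(state)
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temp):
            e = e_new                  # accept the move (always if downhill)
        else:
            state[i] ^= 1              # reject: undo the flip
        if e < best_e:                 # track the best configuration seen
            best_state, best_e = state[:], e
    return best_state, best_e

# Illustrative energy: Hamming distance to an alternating target pattern.
def energy(s):
    return sum(a != (i % 2) for i, a in enumerate(s))

best, e_min = simulated_annealing(energy, [0] * 12, n_steps=5000)
```

At low temperature, uphill moves are almost never accepted and the chain behaves like a randomized descent; the slowly cooling schedule is what allows early escapes from poor configurations.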
I decided to explore these themes in depth and simultaneously, and enrolled a group of about twelve bright young applied mathematicians who became almost contemporaneously PhD students under my direction. With their support, I founded at Ecole Normale Sup., Rue d’Ulm, Paris, a new laboratory called DIAM: distributed artificial intelligence and mathematics. I centered our Markov fields and image analysis activities at University Paris 11 in Orsay, and our mathematical explorations of simulated annealing and neural networks at Ecole Normale Sup., Rue d’Ulm. On the theoretical level, I soon understood that the detailed asymptotic study of simulated annealing was closely related to large deviations estimates of rare events such as excursions out of energy wells for the random gradient descent. My own large deviations approach to VENTCEL-FREIDLIN random excursions for dynamic systems submitted to small perturbations then became an essential guideline to seriously attack precise asymptotics for simulated annealing. I hence focused and directed along this path the PhD research of two unusually gifted mathematicians, Olivier CATONI and Alain TROUVE, to reach a large array of definitive and powerful theoretical results on exact asymptotic rates of convergence for simulated annealing. In particular, CATONI’s PhD thesis completely clarified the links of these asymptotic rates with appropriate energy landscape descriptors, and with a “cycle decomposition” of this energy landscape in the spirit of VENTCEL-FREIDLIN; his deep results brought him a coveted IBM mathematical prize.
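For orientation, the by-now classical criterion on the cooling schedule (due to B. HAJEK, and considerably refined by CATONI's cycle analysis) can be summarized as follows; the statement and notation are mine, with d* denoting a depth descriptor of the energy landscape:

```latex
% Annealing at temperatures (T_k) converges in probability to the set of
% global minima of the energy if and only if
\sum_{k \ge 1} \exp\!\left( -\frac{d^*}{T_k} \right) = \infty
% For logarithmic schedules T_k = c / \log(k+1), this holds iff c \ge d^*,
% where d^* is the maximal depth of the energy wells (cycles) not containing
% a global minimum.
```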
I then oriented my research group towards extensions of these sophisticated large deviations techniques in a very rich direction: the effect of synchronous parallel dynamics on simulated annealing, in particular through the scientific direction of TROUVE’s PhD thesis. The powerful and elegant results of TROUVE brought another mathematical layer to CATONI’s highly technical results, boosting their clarity and generality, and used them to cleverly analyze the unexpected effects of partial and/or total synchronicity for parallel dynamics in simulated annealing. TROUVE’s work was later an efficient building block in Raphael CERF’s clever asymptotic analysis of the so-called “genetic” optimization algorithms.
My interest in parallelization schemes for stochastic relaxation of large arrays of interactive cells or sites was linked to the emergence and availability in France of massively parallel computers such as the Connection Machine with 64,000 processors or the MASPAR with 4,096 processors. I was convinced that massively parallel computing offered astonishing new potential for high acceleration of complex computing tasks which could naturally be mapped onto such arrays of processors, such as low level artificial vision tasks performable by artificial retinas. Thus I began a joint theoretical and applied seminar with specialists of parallel computing and artificial retinas, such as Patrick GARDA, Laurent VIROT, and others, and launched with my DIAM research lab a three-year intense mathematical exploration of massively parallel dynamics for simulated annealing and their impact on convergence acceleration. To complete our theoretical results by extensive simulations on massively parallel computers, I organized a network of scientific collaborations with major computer science labs in France (ETCA, Univ. of Orleans, IRISA, ENS Lyon). This team work led me to initiate, coordinate, and edit a pioneering book on parallelization schemes for simulated annealing (R. AZENCOTT, O. CATONI, M. DREYFUS, P. GARDA, I. GAUDRON, C. GRAFFIGNE, A. TROUVE, L. VIROT).

References :
Simulated Annealing and Image analysis
R. Azencott, Invited Lecture, Soc. Math. France 1986 Yearly Coll. , 16 p., 1986

Gibbs fields , Simulated annealing and Low Level Vision tasks
Proc. RFIA 87 "Shape Recognition" (Antibes), AFCET, vol 2, pp 1183-1191, 1987

Simulated Annealing
Bourbaki Seminar 1988, Lecture 697, pp 223-237, Asterisque 1989

(Book) Simulated annealing : Parallelization techniques
R. Azencott, O. Catoni, M. Dreyfus, P. Garda, C. Graffigne, H. Lutton, A. Trouve, L. Younes
Editor R. Azencott, 242 pages , Interscience Wiley New York 1992
Selected Book Chapters
Simulated Annealing: Speed and Acceleration ; R. Azencott pp 1-10
Large deviations for sequential and parallel annealing; R. Azencott pp 11-24
Parallel annealing : basic techniques; R. Azencott pp 37-46
Parallel annealing : interacting searches; R. Azencott, C.Graffigne pp 81-90

 

AUTOMATIC LEARNING AND VAPNIK BOUNDS
1994-2001

When I became interested in formal neural networks around 1985, I read the then major theoretical papers on the topic, mostly written by inventive physicists, and was struck by their lack of familiarity with deep and relevant mathematical tools from advanced probability theory and theoretical statistics, which could have provided a powerful mathematical background for analyzing learning capacities and computing the complexity of learning problems.
I hence organized with O. CATONI, A. TROUVE, and L. YOUNES several Ecole Normale Sup. seminars circa 1994-95 analyzing major books and papers by VAPNIK, CERVONENKIS, VALIANT, BARON, RISSANEN, and others, to get a better grasp of the links between minimal coding theory, complexity computation, speed of functional approximation by finite dimensional subspaces, non parametric statistical asymptotics, V-C dimension, and learning speed.
In 1996, I invited VAPNIK to give an advanced course at Ecole Normale Sup., which was an opportunity to technically deepen my own mathematical analysis of his work.
Recall that a very large class of automatic learning problems is essentially formalized by the following paradigm: in a fixed infinite dimensional function space H, select a good approximation of an unknown target function F, given N “training examples”, i.e. given the values of F at N randomly selected points. A learning problem can be consistently solved if there is a learning algorithm which optimally matches the number of hidden parameters (empirical model dimension) with the number N of “training examples”, in order to ensure that fixed size estimation errors have probabilities converging to zero as N tends to infinity.
VAPNIK’s major results essentially show that if, for each target function F in the function space H, the corresponding learning problem can always be consistently solved, then convergence to zero for probabilities of fixed size learning errors must occur at exponential speed, the space H must have finite Vapnik-Cervonenkis dimension "V-C dim", and the exponential rates of convergence can be universally computed from the V-C dimension of H.
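For context, a standard form of these universal bounds can be sketched as follows (constants in the exponent vary across versions of the VAPNIK-CERVONENKIS inequality; the notation is mine):

```latex
% Uniform deviations of empirical frequencies \nu_N(h) from probabilities P(h)
% over a class H with growth function m_H:
P\!\left( \sup_{h \in H} \bigl| \nu_N(h) - P(h) \bigr| > \varepsilon \right)
\le 4\, m_H(2N)\, e^{-\varepsilon^2 N / 8}

% SAUER's lemma: if H has finite V-C dimension V, then for N \ge V
m_H(N) \le \sum_{i=0}^{V} \binom{N}{i} \le \left( \frac{eN}{V} \right)^{V}
% so the polynomial growth of m_H cannot defeat the exponential term, and the
% bound decays exponentially in N at a universal rate governed by V.
```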
I intuitively felt that these results could be understood differently, and quite as deeply, as consequences of “uniform” large deviations theory applied to wide classes of underlying probability distributions for the training examples. The universal exponential rate defined by the celebrated V-C dimension would then correspond only to the most pessimistic situation where nothing is known about the underlying probability distribution for the selection of training examples.
After sketching the intuitive steps of a potential proof of this conjecture, I made these ideas explicit in an invited 1997 lecture at the Brown University 50th anniversary Mathematics Congress. In 1998, I began directing Nicolas VAYATIS in his PhD thesis, during which, along with further interesting results, he proved this conjecture in 2001-2002.
In any case, I had already started in 1996 to apply ideas derived from large deviations techniques and from VAPNIK’s deep statistical approaches, such as “support vector machines”, to reach pragmatic, computer implementable algorithms for “parametric dimension selection” in very concrete learning problems, such as the optimal automatic dimensioning of formal neural networks as a function of the available number of training examples and of the regularity of the function to emulate.
From 2001 to 2002, we intensified our R&D efforts in this direction within the MIRIAD consulting group, to conceive and realize classes of completely autonomous software learning agents, endowed with the ability to automatically monitor their own learning.

References:
Large deviations theory and Vapnik bounds for Learning speed
R. Azencott, Invited Lecture, Brown University 50th anniversary, Providence USA, 1997

Distribution-Dependent Vapnik-Chervonenkis Bounds
R. Azencott, N. Vayatis , in "Computational Learning Theory", Eds. P. Fischer, H. Simon
Lect.Notes in Computer Sciences, vol 1572, pp 230-240, 1999

Refined Exponential Rates in Vapnik-Chervonenkis inequalities
R. Azencott, N. Vayatis; Comptes Rendus Ac. Sci. Paris, vol 332, ser. I, pp 563-568, 2001