# The Levenberg – Marquardt Algorithm (LMA)

The Levenberg – Marquardt Algorithm (LMA) is a widely used algorithm for nonlinear least-squares optimization problems. The current NLLS verb is based upon an APL application developed by Gavaris (1988), which has been used extensively in the oceans science community. Indeed, numerous subsequent applications in other languages (e.g. FORTRAN, C, ACON) have been developed. To this list is now added J.

There are two attachments to this wiki page. These can be obtained by clicking on the attachment tab at the bottom of the NLLS page. One is the NLLS script (nlls_j602_5 april 2010), along with an example application from Haddon (2001); the other (ls_prob_2 jan 2010) is a least-squares problem described in Bard (1974), starting in section 5.21, which is referred to in the subsequent chapters. These provide good examples of how NLLS works and are a useful guide for other applications.

## Some Theory

$\theta _{i+1}=\theta _{i}+r_{i}\times v_{i}$

where $r_{i}$ and $v_{i}$ are the step size and direction, respectively, required to update the parameter vector $\theta$ in order to minimize some objective function.

A direction $v_{i}$ is acceptable if and only if there exists a positive definite matrix $R_{i}$ such that

$v_{i}=-R_{i}q_{i}$

where $q_{i}$ is the gradient vector (the vector of first partial derivatives).

Thus the basic equation of the $i$th iteration in any gradient method is

$\theta _{i+1}=\theta _{i}-r_{i}R_{i}q_{i}$

The various gradient methods differ in the manner of choosing $r_{i}$ and $R_{i}$. In the Newton-Raphson method (which guarantees quadratic convergence near the solution), $r_{i}$ is 1 and $R_{i}$ equals $H_{i}^{-1}$, the inverse of the Hessian matrix of second partial derivatives. In general, the iterate approaches a minimum if $H_{i}^{-1}$, and thus $R_{i}$, is positive definite.
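The generic gradient-method update can be sketched as follows. This is an illustration in Python (NLLS itself is written in J); the function name and the quadratic example are invented for demonstration.

```python
import numpy as np

def gradient_step(theta, q, R, r=1.0):
    """One generic gradient-method iteration: theta_{i+1} = theta_i - r * R @ q.

    theta : current parameter vector
    q     : gradient vector at theta
    R     : positive definite direction matrix (identity -> steepest descent,
            inverse Hessian -> Newton-Raphson)
    r     : step size
    """
    return theta - r * R @ q

# Example: one Newton-Raphson step on f(x, y) = x**2 + 2*y**2
theta = np.array([3.0, 2.0])
q = np.array([2 * theta[0], 4 * theta[1]])    # gradient at theta
H = np.array([[2.0, 0.0], [0.0, 4.0]])        # Hessian (constant here)
step = gradient_step(theta, q, np.linalg.inv(H))
```

Because the example objective is quadratic, a single Newton-Raphson step lands exactly on the minimum at the origin.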

In the Levenberg – Marquardt Algorithm (LMA)
$R_{i}=(A_{i}+l_{i}B_{i}^{2})^{-1}$

where $B_{i}^{2}$ is a diagonal matrix whose elements coincide with the absolute values of the diagonal elements of $A_{i}$. Regarding $l_{i}$, it can be shown that if $B_{i}^{2}$ is any positive definite matrix, then $A_{i}+l_{i}B_{i}^{2}$ is positive definite for sufficiently large $l_{i}$, no matter what $A_{i}$ is. As $l_{i}$ approaches infinity, the term $l_{i}B_{i}^{2}$ dominates $A_{i}$, resulting in an extremely short step in the downhill direction ($B_{i}^{2}$ being positive definite). On the other hand, as $l_{i}$ decreases towards zero, $v_{i}=-R_{i}q_{i}$ approaches $-A_{i}^{-1}q_{i}$, the Newton-Raphson direction. Thus $r_{i}$, the step size, is indirectly determined through the choice of $l_{i}$ and, in the LMA, can be taken to be one. In each iteration, progress towards minimization of the objective function is evaluated. If the residual sum of squares is not decreasing, $l_{i}$ is increased according to some rule (e.g. multiplied by an order of magnitude) and the objective function re-evaluated. This has the effect of exploring smaller steps until the residual sum of squares decreases. Once this occurs, $l_{i}$ is decreased, thus increasing the step size. In essence, use of $l_{i}$ ensures that $R_{i}$, the modified inverse Hessian, remains positive definite, so the algorithm always moves towards a minimum and stays away from saddle points.
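The damping scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the NLLS verb itself: it uses the common Gauss-Newton approximation $A = J^{T}J$, $q = J^{T}e$ for the Hessian and gradient, and all names (`lm_fit`, `resid_fn`, `jac_fn`) are invented for the example.

```python
import numpy as np

def lm_fit(resid_fn, jac_fn, theta, lam=1e-3, factor=10.0, max_iter=100, tol=1e-10):
    """Minimal Levenberg-Marquardt sketch: v = -(A + l*B^2)^{-1} q with
    A = J^T J (Gauss-Newton Hessian), q = J^T e, B^2 = diag(|diag(A)|)."""
    rss = np.sum(resid_fn(theta) ** 2)
    for _ in range(max_iter):
        J = jac_fn(theta)
        e = resid_fn(theta)
        A = J.T @ J
        q = J.T @ e
        B2 = np.diag(np.abs(np.diag(A)))
        while True:
            v = -np.linalg.solve(A + lam * B2, q)
            new_rss = np.sum(resid_fn(theta + v) ** 2)
            if new_rss < rss:              # progress: accept, lengthen the step
                theta, rss, lam = theta + v, new_rss, lam / factor
                break
            lam *= factor                  # no progress: shorten the step
            if lam > 1e12:                 # give up once steps are negligible
                return theta, rss, lam
        if np.abs(v).max() < tol:          # converged: parameter change tiny
            break
    return theta, rss, lam

# Fit y = a * exp(b * x) to exact data generated with a = 2, b = 0.5
x = np.linspace(0.0, 2.0, 20)
y = 2.0 * np.exp(0.5 * x)
resid = lambda th: th[0] * np.exp(th[1] * x) - y
jac = lambda th: np.column_stack([np.exp(th[1] * x),
                                  th[0] * x * np.exp(th[1] * x)])
theta, rss, lam = lm_fit(resid, jac, np.array([1.0, 1.0]))
```

The inner loop is the lambda rule described above: multiply by an order of magnitude until the residual sum of squares decreases, then divide once a step is accepted.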

The routine NLLS allows for both unconstrained and constrained optimization by setting a constraint equal to either ‘OFF’ or ‘ON’. Constraints are achieved through a penalty function, p, added to the residual sum of squares produced by the objective function
$p=\sum a/h(\theta )$

For each $\theta$, two $h$ terms are defined:

$h_{u}=ucnstrnt-\theta$

$h_{l}=\theta -lcnstrnt$

The constant, $a$, is a measure of the influence of the constraint: the larger $a$ is, the greater the impact of the constraint. In NLLS, $a$ is scaled by the initial value of each parameter,

$a=constant\times\theta _{init}$

where the constant is user-defined. In NLLS it is set at 0.001 but can obviously be changed if the user desires the penalty function to have more influence on the optimization. Use of $a$ implies that each lower and upper constraint is standardized by the size of $\theta$:

$p=\sum a/h(\theta )=constant\times\left(\frac{\theta _{init}}{\theta -lcnstrnt}+\frac{\theta _{init}}{ucnstrnt-\theta }\right)$

Thus, as $\theta$ approaches a constraint, $p$ increases, causing a change in the direction of the optimization.
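The penalty can be illustrated as follows (a Python sketch of the formula above, not the J code; the function name and example bounds are invented):

```python
import numpy as np

def penalty(theta, theta_init, lower, upper, constant=0.001):
    """Barrier penalty added to the residual sum of squares:
    p = sum over parameters of a/h_l + a/h_u, with a = constant * theta_init
    (the starting value of each parameter) and h the distance to each bound."""
    h_l = theta - lower          # distance to the lower constraint
    h_u = upper - theta          # distance to the upper constraint
    a = constant * theta_init
    return np.sum(a / h_l + a / h_u)

theta_init = np.array([1.0, 5.0])
lower = np.array([0.0, 0.0])
upper = np.array([10.0, 10.0])
p_mid = penalty(np.array([5.0, 5.0]), theta_init, lower, upper)    # far from bounds
p_near = penalty(np.array([9.99, 5.0]), theta_init, lower, upper)  # near upper bound
```

As the first parameter approaches its upper constraint, the penalty grows sharply, steering the optimization back into the feasible region.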

## Inputs to NLLS

The required inputs of the NLLS verb are the y (dependent) variable, the x (independent) variables, a set of starting estimates of the parameters (initial) and the constraints. As well, the least-squares objective function needs to be defined; from the y and x variables and the parameter set, it produces a list of residuals.

Within the NLLS verb, a number of variables relevant to the minimization are used. These can be changed by the user, although experience has shown this to be rarely necessary. They include con = 10, limit = 100 (the maximum number of steps in the main loop), and the convergence and tolerance criteria in the whilst statement relating to the relative change in the parameter set and residual sum of squares between steps.

## Execution

The NLLS verb is executed as

```j
'objective function name' NLLS initial
```

Note that the noun initial can be a scalar (in the case of a one-parameter model), since within NLLS the right argument y is formed into a parameter list by the assignment `par =. ,y`.

Experience with this routine is thus far limited to models with fewer than 40 parameters and 400 observations, although in principle much larger models should be able to be run. One quite complicated model (with lots of processing within the objective function) took just over a minute (Windows XP32 on a laptop with an Intel T7700 (2.4 GHz) processor and 2 GB RAM).

## Outputs

The NLLS verb produces a list of the parameter estimates for each loop executed. A table (stats) is produced upon completion which indicates

* Number of main loops
* Final value of lambda
* Number of observations
* Number of parameters
* Residual sum of squares
* Mean square residual
* AIC (analog for least-squares)
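The usual least-squares analog of AIC is $n\ln(RSS/n)+2k$; whether NLLS uses exactly this variant is an assumption here. A Python sketch of the stats-table quantities (function and key names invented):

```python
import numpy as np

def ls_stats(residuals, n_params):
    """Summary statistics in the spirit of the NLLS stats table.
    AIC here is the common least-squares analog n*ln(RSS/n) + 2k;
    the exact variant used by NLLS is not documented on this page."""
    n = len(residuals)
    rss = float(np.sum(np.square(residuals)))
    msr = rss / n                      # mean square residual
    aic = n * np.log(rss / n) + 2 * n_params
    return {"n_obs": n, "n_params": n_params, "rss": rss,
            "mean_sq_resid": msr, "aic": aic}

stats = ls_stats(np.array([1.0, -1.0, 1.0, -1.0]), 2)
```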

It also produces a number of global variables, including the inverse of the Hessian (hess_inv), the covariance matrix (cov), the standard deviation of each parameter (par_se), the correlation matrix (corr) and the coefficient of variation of each parameter (cv). The final parameter set is in the noun parm.
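How these quantities relate to one another can be sketched in Python under the common Gauss-Newton approximation $H \approx J^{T}J$; the function and variable names below are hypothetical analogs, not the J globals themselves.

```python
import numpy as np

def param_stats(J, residuals, theta):
    """Post-fit statistics analogous to the NLLS globals (a sketch):
    hess_inv = (J^T J)^{-1}, cov = s^2 * hess_inv with s^2 = RSS/(n-k),
    par_se = sqrt(diag(cov)), corr = cov scaled to unit diagonal,
    cv = par_se / |theta|."""
    n, k = J.shape
    hess_inv = np.linalg.inv(J.T @ J)
    s2 = np.sum(residuals ** 2) / (n - k)     # residual variance estimate
    cov = s2 * hess_inv
    par_se = np.sqrt(np.diag(cov))
    corr = cov / np.outer(par_se, par_se)
    cv = par_se / np.abs(theta)
    return hess_inv, cov, par_se, corr, cv

# Toy linear model: design matrix for intercept + slope, small residuals
J = np.column_stack([np.ones(5), np.arange(5.0)])
residuals = np.array([0.1, -0.1, 0.1, -0.1, 0.1])
theta = np.array([1.0, 2.0])
hess_inv, cov, par_se, corr, cv = param_stats(J, residuals, theta)
```

The correlation matrix always has a unit diagonal, and the coefficient of variation expresses each standard error relative to its parameter estimate.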

## Further Developments

This code is offered to the J community as a tool to assist in modeling, and its further development is encouraged. It was initially written when the author was just learning J and thus is a bit 'loopy'. It is hoped that through use, it can be further optimized. It has received a fair bit of testing, but errors are always a possibility. If and when these are encountered, they should be reported to the author, who will make the changes on the wiki.

There's also a page on which to work on the code as a collaborative project.