next up previous
Next: Large Scale Visualization Up: Proximity Visualization of Abstract Data Previous: Multivariate Visualization Techniques

Subsections

   
Multidimensional Scaling

This chapter presents an evaluation of algorithms for generating proximity visualizations. Dissimilarities between pairs of objects from a data collection are given by a suitable coefficient, and then approximated by distances between corresponding pairs of entities in a visual representation. The quality of this approximation is expressed as a loss function, which yields the optimal arrangement at its minimum. The algorithms under test are heuristics for numerically minimising this function. Their effectiveness is measured for a representative sample of data collections, and the relative ranking is statistically inferred, as well as illustrated with examples.

   
Problem Definition

Suppose that we are given a collection of n objects and a way of determining the dissimilarity between any pair $\vec{\Delta}=[\delta_{ij} : i,j=1,\ldots,n]$; as set out in Section 1.2. Metric Multidimensional Scaling (MDS) [Borg97,Cox94] is a procedure for finding a configuration $\vec{X}^*=[x_{ia}^* : a=1,\ldots,p]$ of n points in a p-dimensional space, usually Euclidean, such that point $\vec{x}_i^*=(x_{i1}^*,\ldots,x_{ip}^*)^T$ uniquely represents object i, and the Euclidean distance between points $\vec{x}_i^*$ and $\vec{x}_j^*$:

 \begin{displaymath}
d_{ij}(\vec{X}^*)=\left\Vert\vec{x}_i^*-\vec{x}_j^*\right\Vert=\sqrt{\sum_{a=1}^p\left(x_{ia}^*-x_{ja}^*\right)^2}
\end{displaymath} (3.1)

approximates the corresponding dissimilarity $\delta_{ij}$, for all pairs of objects (i,j):

 \begin{displaymath}
\mathop{\forall}_{i<j}d_{ij}(\vec{X}^*)\approx \delta_{ij}
\end{displaymath} (3.2)

In general it is sufficient to consider each pair of objects (i,j) just once: i<j, since dissimilarities are assumed to be symmetric. Elements of an asymmetric matrix $\vec{\hat{\Delta}}$ have to be averaged out prior to the analysis: $\delta_{ij}=\delta_{ji}=(\hat{\delta}_{ij}+\hat{\delta}_{ji})/2$ [Borg97].

   
Loss Functions

For a given configuration $\vec{X}$ the approximation error in representing the dissimilarity between objects i and j can be defined as follows:

 \begin{displaymath}
e_{ij}\stackrel{\mathrm{def}}{=}\left\vert d_{ij}(\vec{X})-\delta_{ij}\right\vert
\end{displaymath} (3.3)

A least-squares MDS technique defines a loss function that is a weighted and possibly normalised sum of errors over all pairs of objects (i,j), and thus it penalises the overall approximation error. A minimum of this function over $\vec{X}$ is subsequently found through numerical optimisation, to yield the desired configuration $\vec{X}^*$.

   
Raw Stress

Raw Stress [Kruskal64] is the most elementary MDS loss function, as it simply accumulates the total squared representation error:

 \begin{displaymath}
\sigma_r(\vec{X})\stackrel{\mathrm{def}}{=}\sum_{i<j}e_{ij...
...right)^2=\sum_{i<j}\Big(d_{ij}(\vec{X})-f(\delta_{ij})\Big)^2
\end{displaymath} (3.4)

where $\hat{d}_{ij}$ is the disparity between objects i and j, which is arrived at by applying transformation f to the given dissimilarity $\delta_{ij}$. Since dissimilarities are calculated from object attributes, relationships, or other features (see Section 1.2), and there is no error or uncertainty associated with this process, the correct way to model the disparities is to apply a linear transformation to the dissimilarities: $\hat{d}_{ij}=f(\delta_{ij})=a\delta_{ij}$, where a is chosen to minimise the value of Stress.

The optimal a, which shall be denoted as a*, can be found analytically by differentiation. Alternating this step with an iterative improvement to $\vec{X}$ provides an efficient procedure for finding the solution $\vec{X}^*$ satisfying $\vec{\Delta}$, termed ratio MDS [Borg97]. By considering the dissimilarities directly, we restrict ourselves to an identity transformation, and arrive at an absolute MDS model. At the other end of the spectrum, the transformation can be relaxed just to be monotonic, so that only the rank order of dissimilarities is preserved, because it is assumed to be the only reliable information (see Section 1.2.9), which gives an ordinal (nonmetric) MDS model. Nonmetric MDS can be approximated with the ratio model by replacing the dissimilarities with their rank order a priori [Weeks79,Borg97]. Therefore, we conclude that ratio MDS is the most applicable for visualization, and focus our efforts on this model.

   
Normalised Stress

Raw Stress (3.4) can only be successfully minimised in practice under the absolute MDS model. If we permit a transformation of the dissimilarities then by alternating an update of the configuration $\vec{X}$ with an update to the disparities $[\hat{d}_{ij}]=a\vec{\Delta}$ $\sigma_r(\vec{X},a)$ can be made arbitrarily small, with a and $\vec{X}$ shrinking gradually. This deficiency can be overcome by normalising raw Stress by $\sum_{i<j}\hat{d}_{ij}^2$ [Borg97]:

 \begin{displaymath}
\sigma_n(\vec{X})\stackrel{\mathrm{def}}{=}\frac{\sum_{i<j...
...ij}(\vec{X})-\hat{d}_{ij}\right)^2}{\sum_{i<j}\hat{d}_{ij}^2}
\end{displaymath} (3.5)

Loss function (3.5), termed normalised Stress, expands for ratio MDS to:
 
$\displaystyle \sigma_n(\vec{X},\alpha)$ = $\displaystyle \frac{\sum_{i<j}\left(d_{ij}(\vec{X})-\alpha\delta_{ij}\right)^2}{\sum_{i<j}\alpha^2\delta_{ij}^2}$  
  = $\displaystyle \frac{\sum_{i<j}d_{ij}^2(\vec{X})-2\alpha\sum_{i<j}\delta_{ij}d_{ij}(\vec{X})+\alpha^2\sum_{i<j}\delta_{ij}^2}{\alpha^2\sum_{i<j}\delta_{ij}^2}$  
  = $\displaystyle \frac{\eta^2(\vec{X})-2\alpha\rho(\vec{X})+\alpha^2\eta_\delta^2}{\alpha^2\eta_\delta^2}$  
  = $\displaystyle 1-\frac{2\alpha\rho(\vec{X})-\eta^2(\vec{X})}{\alpha^2\eta_\delta^2}$ (3.6)

where $\eta^2(\vec{X})$ is a sum of the squared distances $d_{ij}^2(\vec{X})$, $\rho(\vec{X})$ is a weighted sum of distances $\delta_{ij}d_{ij}(\vec{X})$, and $\eta_\delta^2$ is a constant equal to the sum of the squared dissimilarities $\delta_{ij}^2$, over all pairs of objects (i,j). The optimal $\alpha$ can be found by differentiating (3.6) with respect to $\alpha$:

\begin{eqnarray*}\frac{\partial\sigma_n(\vec{X},\alpha)}{\partial \alpha}&=&\fra...
...rac{\alpha\rho(\vec{X})-\eta^2(\vec{X})}{\alpha^3\eta_\delta^2}
\end{eqnarray*}


which is equal to zero for $\alpha^*=\eta^2(\vec{X})/\rho(\vec{X})$. Inserting $\alpha^*$ into (3.6) yields:
 
$\displaystyle \sigma_n(\vec{X},\alpha^*)$ = $\displaystyle 1-\frac{\displaystyle 2\rho(\vec{X})\frac{\eta^2(\vec{X})}{\rho(\...
...2(\vec{X})}{\displaystyle \eta_\delta^2\frac{\eta^4(\vec{X})}{\rho^2(\vec{X})}}$  
  = $\displaystyle 1-\left(\frac{\rho(\vec{X})}{\eta_\delta\eta(\vec{X})}\right)^2$  
  = $\displaystyle 1-\left(\frac{\sum_{i<j}\delta_{ij}d_{ij}(\vec{X})}{\left(\sum_{i...
...)^{\frac{1}{2}}\left(\sum_{i<j}d_{ij}^2(\vec{X})\right)^{\frac{1}{2}}}\right)^2$  
  = $\displaystyle 1-c\left(\vec{\Delta},\left[d_{ij}(\vec{X})\right]\right)^2$ (3.7)

The last term of (3.7) is the square of Tucker's congruence coefficient c with dissimilarities and distances [Tucker51]. The coefficient is always between 0 and 1, due to the Cauchy-Schwarz inequality:

 \begin{displaymath}
\sum_rp_rq_r\le\left(\sum_rp_r^2\right)^{\frac{1}{2}}\left(\sum_rq_r^2\right)^{\frac{1}{2}}
\end{displaymath} (3.8)

and the fact that dissimilarities and distances are nonnegative. Hence, it holds that $0\le\sigma_n(\vec{X},\alpha^*)\le1$, for the optimal scaling constant $\alpha$.

It is worth noting that scaling the dissimilarities by a factor $\alpha$ is equivalent to scaling the distances by $1/\alpha$:

\begin{eqnarray*}\sigma_n(\vec{X},\alpha)&=&\frac{\sum_{i<j}\left(d_{ij}(\vec{X}...
...}{\sum_{i<j}\delta_{ij}^2}\;\;=\;\;\sigma_n(\alpha^{-1}\vec{X})
\end{eqnarray*}


The derivation of $\sigma_n(b^*\vec{X})$ is equivalent to that presented above [deLeeuw77,Borg97], and yields $b^*=1/\alpha^*$. However, it relies on the assumption that $\vec{X}$ is a local minimum of (3.5), whereas our derivation does not require such an assumption, and thus is applicable to any configuration $\vec{X}$.

   
Kruskal's Stress

Stress formula 1, or Stress-1 for short, is historically an earlier solution to the scale dependency problem of raw Stress (3.4), with the normalising factor $\eta^2(\vec{X})$ [Kruskal64]:

 \begin{displaymath}
\sigma_1^2(\vec{X})\stackrel{\mathrm{def}}{=}\frac{\sum_{i...
...(\vec{X})-\hat{d}_{ij}\right)^2}{\sum_{i<j}d_{ij}^2(\vec{X})}
\end{displaymath} (3.9)

The whole formula is square rooted by analogy of referring to the standard deviation instead of the variance.

Substituting the ratio MDS model yields:

 
$\displaystyle \sigma_1^2(\vec{X},\beta)$ = $\displaystyle \frac{\sum_{i<j}\left(d_{ij}(\vec{X})-\beta\delta_{ij}\right)^2}{\sum_{i<j}d_{ij}^2}$  
  = $\displaystyle \frac{\eta^2(\vec{X})-2\beta\rho(\vec{X})+\beta^2\eta_\delta^2}{\eta^2(\vec{X})}$  
  = $\displaystyle 1-\frac{2\beta\rho(\vec{X})-\beta^2\eta_\delta^2}{\eta^2(\vec{X})}$ (3.10)

The minimum of (3.10) over $\beta$ is obtained by setting the first derivative with respect to $\beta$ equal to 0:

\begin{displaymath}\frac{\partial\sigma_1^2(\vec{X},\beta)}{\partial \beta}=2\frac{\beta\eta_\delta^2-\rho(\vec{X})}{\eta^2(\vec{X})}=0\nonumber
\end{displaymath}

which yields $\beta^*=\rho(\vec{X})/\eta_\delta^2$. Inserting $\beta^*$ into (3.10) gives:

 \begin{displaymath}
\sigma_1^2(\vec{X},\beta^*)=1-\left(\frac{\rho(\vec{X})}{\...
...^2=\sigma_n(\vec{X},\alpha^*)\stackrel{\mathrm{def}}{=}\sigma
\end{displaymath} (3.11)

Therefore, for a given configuration $\vec{X}$, normalised Stress (3.5) is equivalent to the square of Stress-1 (3.9) if we allow for an optimal scaling of dissimilarities in each case, and we shall refer to this unified index as Stress ($\sigma$). Since such a scaling step is an integral part of any ratio MDS procedure, and is alternated with an iterative improvement to $\vec{X}$, the final solution will be a local minimum of both of these loss functions. A graphical illustration of this equivalence is presented in Figure 3.1(a) and 3.1(b). The relationship (3.11) can also be demonstrated by taking the alternative route of rescaling the configuration $\vec{X}$ [Borg97], however an assumption that $\vec{X}$ is a local minimum has to be made again.
    
Figure 3.1: Visualizations of a 6 level complete binary tree computed by minimising different loss functions. The dissimilarity between a pair of nodes is the length of the path connecting them (see Section 1.2.6)


\begin{picture}(999,999)(-0.5,-0.5)%
\put(499,500){\circle{35}}%
\put(395,50...
...2,116){70.2512}}%
\put(776.159,785.912){\line(78,127){59.6828}}%
\end{picture}

\begin{picture}(999,999)(-0.5,-0.5)%
\put(500,500){\circle{35}}%
\put(604,50...
...,127){58.8542}}%
\put(221.126,784.711){\line(-92,116){70.2512}}%
\end{picture} \begin{picture}(999,999)(-0.5,-0.5)%
\put(500,500){\circle{35}}%
\put(378,50...
...6,-35){72.7649}}%
\put(873.595,278.756){\line(53,-94){35.8101}}%
\end{picture}

(a) $\sigma _n=0.04024$ (b) $\sigma _1^2=0.04021$ (c) $\displaystyle \sigma _E=0.05934\atop \displaystyle (\sigma =0.05054)$


   
Energy

Any of the loss functions defined previously can be generalised to weight the contribution of individual pairs of objects, for example weighted normalised Stress acquires the following form:

 \begin{displaymath}
\sigma_{n,w}(\vec{X})\stackrel{\mathrm{def}}{=}\frac{\sum_...
...ec{X})-\hat{d}_{ij}\right)^2}{\sum_{i<j}w_{ij}\hat{d}_{ij}^2}
\end{displaymath} (3.12)

Each weight wij can take any non-negative value, as long as it does not depend on $\vec{X}$. Typical use allows missing dissimilarities to be accommodated by setting the corresponding weights to 0, and the remaining ones to 1. However, by setting $w_{ij}=\hat{d}_{ij}^{-2}$ a new loss function can be obtained, which we shall refer to as Energy:

 \begin{displaymath}
\sigma_E(\vec{X})\stackrel{\mathrm{def}}{=}\frac{2}{n(n-1)...
...{\left(d_{ij}(\vec{X})-\hat{d}_{ij}\right)^2}{\hat{d}_{ij}^2}
\end{displaymath} (3.13)

The ratio MDS model produces the following expansion, with $\displaystyle\frac{1}{\lambda}=\frac{n(n-1)}{2}={n\choose 2}=\sum_{i<j}1$:
 
$\displaystyle \sigma_E(\vec{X},\gamma)$ = $\displaystyle \lambda\sum_{i<j}\frac{\left(d_{ij}(\vec{X})-\gamma\delta_{ij}\right)^2}{\gamma^2\delta_{ij}^2}$  
  = $\displaystyle \lambda\left(\sum_{i<j}1-\sum_{i<j}\frac{2\gamma \delta_{ij}d_{ij}(\vec{X})-d_{ij}^2(\vec{X})}{\gamma^2\delta_{ij}^2}\right)$  
  = $\displaystyle 1-\lambda\left(\frac{2}{\gamma}\sum_{i<j}\frac{d_{ij}(\vec{X})}{\...
...ij}}-\frac{1}{\gamma^2}\sum_{i<j}\frac{d_{ij}^2(\vec{X})}{\delta_{ij}^2}\right)$  
  = $\displaystyle 1-\lambda\left(\frac{2}{\gamma}\psi(\vec{X})-\frac{1}{\gamma^2}\omega(\vec{X})\right)$ (3.14)

An optimal $\gamma$ can be found by differentiating (3.14) with respect to $\gamma$:

\begin{eqnarray*}\frac{\partial\sigma_E(\vec{X},\gamma)}{\partial\gamma}&=&-\lam...
...mbda}{\gamma^3}\left(\gamma\psi(\vec{X})-\omega(\vec{X})\right)
\end{eqnarray*}


which is equal to zero for $\gamma^*=\omega(\vec{X})/\psi(\vec{X})$. Substituting $\gamma^*$ into (3.14) yields:
 
$\displaystyle \sigma_E(\vec{X},\gamma^*)$ = $\displaystyle 1-\lambda\frac{\psi^2(\vec{X})}{\omega(\vec{X})}$  
  = $\displaystyle 1-\lambda\frac{\displaystyle\left(\sum_{i<j}\frac{d_{ij}(\vec{X})...
...{ij}}\right)^2}{\displaystyle\sum_{i<j}\frac{d_{ij}^2(\vec{X})}{\delta_{ij}^2}}$  
  = $\displaystyle 1-\frac{\displaystyle\left(\sum_{i<j}\frac{\delta_{ij}d_{ij}(\vec...
...\delta_{ij}^2}{\delta_{ij}^2}\sum_{i<j}\frac{d_{ij}^2(\vec{X})}{\delta_{ij}^2}}$  
  = $\displaystyle 1-\left(\frac{\sum_{i<j}w_{ij}\delta_{ij}d_{ij}(\vec{X})}{\left(\...
...ac{1}{2}}\left(\sum_{i<j}w_{ij}d_{ij}^2(\vec{X})\right)^{\frac{1}{2}}}\right)^2$  
  = $\displaystyle 1-c_w\left(\vec{\Delta},[d_{ij}(\vec{X})]\right)^2$ (3.15)

The last term of (3.15) is the square of weighted Tucker's congruence coefficient cw with dissimilarities and distances [Tucker51]. The coefficient is always between 0 and 1, due to the Cauchy-Schwarz inequality (3.8). Hence, it holds that $0\le\sigma_E(\vec{X},\gamma^*)\le1$, for the optimal scaling constant $\gamma$.

Energy (3.13) penalises error (3.3) in representing short dissimilarities more than the same error for larger dissimilarities. Such a proportional contribution of error is likely to agree with user's expectations, because it does not matter exactly how distant dissimilar objects are from a given object, as long as they are far in relation to similar ones. Figure 3.1 gives a visual comparison of the results of minimising Stress (3.11) and Energy. Stress, on the other hand, minimises absolute error, since errors in representing large and small dissimilarities are penalised equally, and does not preserve the local detail as well as Energy. Therefore, we prefer to use Energy for proximity visualization, with the exception of Chapter 5.

For simplicity we drop the normalising constant $\lambda$ when deriving MDS algorithms, and minimise raw Energy instead:

 \begin{displaymath}
\sigma_{Er}(\vec{X})\stackrel{\mathrm{def}}{=}\sum_{i<j}\frac{\left(d_{ij}(\vec{X})-\hat{d}_{ij}\right)^2}{\hat{d}_{ij}^2}
\end{displaymath} (3.16)

which can be derived from the weighted form of raw Stress (3.4):

 \begin{displaymath}
\sigma_{r,w}(\vec{X})\stackrel{\mathrm{def}}{=}\sum_{i<j}w_{ij}\left(d_{ij}(\vec{X})-\hat{d}_{ij}\right)^2
\end{displaymath} (3.17)

by setting the appropriate weights. We note that minimising (3.16) is equivalent to minimising (3.13), and we use the normalised form when reporting results.

(3.16) may be interpreted as the total energy of a fully connected spring system, with an anchor for each object, and springs connecting it to all other anchors. The relaxed length of a spring connecting two anchors is given by the optimally scaled dissimilarity between the corresponding pair of objects. The actual length of the spring is the Euclidean distance between the anchors. The spring constant is $\hat{d}_{ij}^{-2}$ for a spring connecting anchors i and j, which causes the spring to be rigid if $\delta_{ij}$ is small in relation to other dissimilarities, and therefore to contribute substantially to the value of (3.16) if deformed. Equilibrium of this spring system corresponds to a local energy minimum.

   
Algorithms

Each of the least-squares MDS algorithms presented here is an instance of a well-known function minimisation heuristic. Such a heuristic constitutes a theoretical framework, and typically a large number of options or variations contributed by practitioners in the operational research and other fields. Some of the choices are simply determined by the application domain, others can be made on theoretical grounds, or empirical evidence - either found in literature or established on our own. In each case we outline the heuristic framework, giving references to dedicated texts, and detail the decisions and customisations made to provide a complete algorithm.

   
Classical Scaling

Classical scaling [Gower66,Cox94,Borg97] is an analytical technique for solving the metric MDS problem, and it was the first to be developed. Dissimilarities are treated as Euclidean distances, and a matching configuration $\vec{Z}$ is recovered from the eigendecomposition of $-\frac{1}{2}\vec{J}\vec{\Delta}^{(2)}\vec{J}=\vec{Z}\vec{Z}^T$, where $\vec{J}$ is the centring matrix, and $\vec{\Delta}^{(2)}=[\delta_{ij}^2]$. $p\le n-1$ eigenvalues will be non-zero, and they represent the minimum number of dimensions required to preserve all the dissimilarities, so that (3.2) holds exactly. Selecting p'<p largest eigenvalues and corresponding eigenvectors is equivalent to selecting the first p' principal components $\vec{Y}$ of the p-dimensional configuration $\vec{Z}$, i.e. the projection of $\vec{Z}$ onto p' orthogonal axes that preserves the maximum variance (see Section 4.3). This combined procedure is termed Principal Coordinate Analysis (PCO)% latex2html id marker 17163
\setcounter{footnote}{2}\fnsymbol{footnote} [Gower66], and analytically minimises the following criterion over $\vec{Y}$:

 \begin{displaymath}\sum_{i<j}\left(d_{ij}^2(\vec{Z})-{d}_{ij}^2(\vec{Y})\right) ...
...v \sum_{i<j}\left(\delta_{ij}^2- {d}_{ij}^2(\vec{Y})\right)
\end{displaymath} (3.18)

Note, that under the projection ${d}_{ij}(\vec{Y}) \le d_{ij}(\vec{Z})=\delta_{ij}$, for all pairs (i,j), i.e. distortion of the dissimilarities will be biased, unlike for the least-squares MDS loss functions of Section 3.2.

   
Newton-Raphson

The initial MDS algorithm that we have developed is based on the generalised Newton-Raphson (NR) method [Press92]. It is an iterative technique for solving a vector equation of the form $\vec{f(x)=0}$. If $\vec{f(x)}$ represents the first partial derivatives of a multidimensional function $e(\vec{x})$ this procedure will find an extremum of this function.

The gradient vector $\vec{g}$ and the Hessian matrix $\vec{H}$ of loss function (3.16) at point $\vec{x}_k=\left(x_{k1},\ldots,x_{kp}\right)^T$ can be established analytically (see Appendix A for details):

$\displaystyle g_a(\vec{x}_k)$ = $\displaystyle \frac{\partial \sigma_{Er}(\vec{X})}{\partial x_{ka}}=2\sum_{l\ne...
...rac{d_{kl}(\vec{X})-\hat{d}_{kl}}{d_{kl}(\vec{X})\hat{d}_{kl}^2}(x_{ka}-x_{la})$  
$\displaystyle H_{ab}(\vec{x}_k)$ = $\displaystyle \frac{\partial ^2\sigma_{Er}(\vec{X})}{\partial x_{ka}\partial x_...
...{X})-(x_{ka}-x_{la})^2}{d_{kl}^3(\vec{X})\hat{d}_{kl}} & a=b
\end{array}\right.$  

An extremum of the loss function with respect to point $\vec{x}_k$ can then found by repeatedly applying the NR iteration:

\begin{eqnarray*}\tilde{\vec{x}}_k&=&\vec{x}_k-\vec{H}^{-1}\vec{g}\\
\vec{H}\left(\tilde{\vec{x}}_k-\vec{x}_k \right)&=&-\vec{g}
\end{eqnarray*}


Let the ith column of $\vec{H}$ be denoted by $\vec{H}_i=\left(H_{1i},\ldots,H_{pi}\right)^T$
 
$\displaystyle \tilde{\vec{x}}_k$ = $\displaystyle \vec{x}_k+\left(\frac{\det\left(-\vec{g},\vec{H}_2,\ldots,\vec{H}...
...ac{\det\left(\vec{H}_1,-\vec{g}, \ldots,\vec{H}_p\right)}{\det \vec{H}},\right.$  
    $\displaystyle \left.\cdots,\frac{\det\left(\vec{H}_1,\vec{H}_2,\ldots,-\vec{g}\right)}{\det \vec{H}}\right)^T$ (3.19)

Figure 3.2(a) depicts the Energy function (3.13) for a trivial case when there is a single constant (fixed) point $\vec{x}_f$ to optimise $\vec{x}_k$ against in 2 dimensions. A circle of radius $\hat{d}_{fk}$ centred at $\vec{x}_f$ is the set of all minima. There is a local maximum at $\vec{x}_f$ where the function takes value 1. However, it is a singular point, as is apparent from Figures 3.2(b) and 3.2(c); consequently, gradient $\vec{g}$ does not have a root at $\vec{x}_f$. Moreover, if $\vec{x}_k$ gets inside the minimum circle it will be repelled with a subsequent NR iteration (3.19), because the direction of $\vec{g}$ is reversed over the cusp at point $\vec{x}_f$, whereas Hessian $\vec{H}$ is not affected. This argument extends to the case of multiple constant points; see Figure 3.3 for an illustration. Because of the parabolic characteristic of the component Energy functions, no new maxima can be introduced by their summation, and thus the NR method will only converge to a minimum. Unlike the steepest descent method [Press92] (also see Section 3.3.5) an upward move - a transition from $\vec{x}_k$ to $\tilde{\vec{x}}_k$ that increases Energy - is possible with an NR iteration, especially if the current solution is far from a minimum. This will prevent the method from getting stuck in a poor local minimum, in general.


   
Figure 3.2: Plots of the Energy function and its first partial derivatives for one fixed point in 2 dimensions

\epsfig{file=images/energy.eps, width=4cm}

\epsfig{file=images/partialx.eps, width=4cm}

\epsfig{file=images/partialy.eps, width=4cm}

(a) Surface plot of $\sigma_E(\vec{x}_k)$
(b) Surface plot of $\partial \sigma_E(\vec{x}_k)/\partial x_{k1}$ (c) Contour plot of $\partial \sigma_E(\vec{x}_k)/\partial x_{k2}$ (symmetrical to xk1 derivative)


 
Figure 3.3: Surface plot of $\sigma_E(\vec{x}_k)$ for 10 fixed points in 2 dimensions. These points can be identified as distinct peaks - local maxima. The white elliptical area encompasses the global minimum

\epsfig{file=images/surface.eps, width=9cm}



To minimise the loss function with respect to the whole configuration, the algorithm applies a single NR iteration (3.19) to each point in turn, and continually cycles through the configuration until convergence occurs. Since the focus of our work is visualization, we only need to consider MDS in at most three dimensions. By deriving analytical formulae for up to 3 variables and hardcoding them, we have been able to substantially increase the overall efficiency of the algorithm. An algorithm for drawing general undirected graphs [Kamada89] also minimises (3.16) using the NR method, and its basic details are identical.

The sequence in which points are considered when updating the configuration is randomised to ensure faster convergence [Frick94]. This has a side effect of randomising output of the algorithm, even if the same starting configuration is used. To detect convergence a record of previous configurations is kept. The algorithm terminates if the current configuration is identical to a prior one, to within a certain level of precision $\epsilon$ (NR.epsilon parameter); this can be triggered either by a genuine convergence or an occurrence of a cycle. Using a hash value of a complete configuration minimises the storage requirements of this scheme, and makes the test for equality of two configurations take constant time. The Secure Hash Algorithm [NIST95] has been chosen for its property of making it computationally infeasible to find two different messages - MDS configurations in this case - that produce the same hash value.

It is possible for an NR iteration (3.19) to result in a radically different new location of a point from its original one, normally causing a drastic increase in Energy (3.13). This will have a knock on effect on points considered subsequently, and may result in a perpetual instability of the configuration. This would not constitute a cycle in general, and therefore to trap this behaviour a separate mechanism is required. A cyclic buffer of size l (NR.buffersize parameter) is used to record the Energy of the l most recent configurations. If the current configuration is found to be inferior or equivalent to the oldest configuration in the buffer, a flag is set. l configuration updates are then performed unconditionally; if the situation does not reoccur within another l updates the flag is cleared. Otherwise, the flag is kept set, and $\epsilon$ is increased by one order of magnitude. If instability persists $\epsilon$ will increase to a point where any two configurations will be judged equivalent, and the algorithm will necessarily stop.

   
Tabu Search

Tabu search (TS) [Glover95] is a meta-heuristic that improves on known numerical optimisation heuristics by exploiting principles of intelligent problem solving. The key element is the use of flexible memory to guide the optimisation process around difficult regions, and allow boundaries of local optimality to be crossed, so that new regions of solution space can be explored. This is achieved by systematically imposing and releasing restrictions on certain search moves (transitions from one solution to the next), based on their recency, frequency, quality, and influence. These restrictions can be implemented by directly excluding such moves and classing them as `tabu' or `forbidden'; an alternative is to modify loss function evaluations for these moves or the probabilities of their selection.

To illustrate the basic principles of tabu search we will describe the main features of our initial application of this method to MDS. A search move consists of applying a single Newton-Raphson iteration (3.19) to a point in the configuration. Instead of simply cycling through the moves, as is the case for the NR algorithm (see Section 3.3.2), they are selected based on the optimisation history. If a move has been recently performed it is classified as tabu, and excluded from consideration for the number of iterations equal to the duration of the tabu restriction. However, an important exception is taken into account: the restriction is revoked if the move would result in the best solution found so far. Such a mechanism for overriding the tabu classification is termed an aspiration criterion. Another aspiration criterion is used alongside the best solution criterion: aspiration by search direction permits a tabu move that improves on the current solution if it also caused improvement the last time it was performed.

For each move its associated value can be determined as the change to the value of raw Energy (3.16) it would cause if it was applied. Of all admissible (non-tabu) moves the one with the highest value is selected, as long as it is positive. If no improving move exists frequency-based memory is used to determine the winner. The moves are sorted in the descending order of their values, and the one with the smallest sum of its rank and frequency (count of its prior occurrence) is selected. A tie is broken by selecting the least frequent move. Frequency memory operates over the long term, and its application here diversifies the search, driving it into new regions.

We found however that this algorithm had poor convergence performance. The cause of the problem was that moves which consistently had large negative values would only be selected after frequency memory disqualified other moves, i.e. after a large number of iterations. Such moves are temporarily disadvantageous but crucial if a minimum is to be found. Consequently, many iterations were being wasted considering only a subset of the configuration. The second shortcoming was that in order to calculate move values incrementally between iterations, for efficiency reasons, a different loss function to raw Energy (3.16) had to be used. This is due to the fact that (3.16) involves a square root calculation, and its update for each move cannot be determined analytically, i.e. in fixed time. We overcame this problem by squaring both terms in the numerator of (3.16):

 \begin{displaymath}
\sigma_{Er^2}(\vec{X})\stackrel{\mathrm{def}}{=}\sum_{i<j}...
...ft(d_{ij}^2(\vec{X})-\hat{d}_{ij}^2\right)^2}{\hat{d}_{ij}^2}
\end{displaymath} (3.20)

which can be derived from weighted S-Stress [Takane77] by setting $w_{ij}=\hat{d}_{ij}^{-2}$:

 \begin{displaymath}
\sigma_{r^2}(\vec{X})\stackrel{\mathrm{def}}{=}\sum_{i<j}w_{ij}\left(d_{ij}^2(\vec{X})-\hat{d}_{ij}^2\right)^2
\end{displaymath} (3.21)

However, the expanded analytical formulas for updating (3.20) were complicated and expensive to compute.

The reason that we thought TS should perform better than NR is that it would not indiscriminately apply the moves, and hence could prevent configuration instability altogether. Since it is important for all points to be updated regularly, but avoid drastic moves (see Section 3.3.2), we decided to rely solely on frequency memory for selection, combined with move values, in the revised version of the algorithm. The value of a move is defined as the contribution of the originating point $\vec{x}_i$ to the overall value of (3.16), i.e. it is the total energy of springs stemming from it:

 \begin{displaymath}
\sigma_{Er}(\vec{x}_i)=\sum_{j\ne i}\frac{\left(d_{ij}-\hat{d}_{ij}\right)^2}{\hat{d}_{ij}^2}
\end{displaymath} (3.22)

These values can be updated between iterations with little overhead. The move with the lowest sum of its value rank in descending order and its frequency is selected, favouring the least frequent one if there is a tie. This will have an effect of keeping frequency of all moves roughly equal, and making subsequent selection of a drastic move very likely, which will give the corresponding point an opportunity to finally assume a favourable location. The termination criterion is the same as in the NR algorithm, i.e. when the configuration reaches stability up to a desired level of numerical precision (TS.epsilon parameter).

   
Genetic Algorithm

Genetic Algorithms (GAs) [Goldberg89,Reeves95] exploit random search in an intelligent manner to solve optimisation problems. Because they work with a coding of the parameter set, and not the parameters themselves, they are generic and largely domain independent. To steer the process only some measure of performance or fitness is needed, for example evaluation of the loss function in the case of a function optimisation problem. GAs draw a parallel to the mechanics of natural evolution. A population of candidate solutions (chromosomes) is maintained, and they are selectively combined by means of a crossover operator into a new generation of solutions. A mutation operator is applied probabilistically to diversify the population, and prevent premature convergence. Although selection is random, it is biased towards solutions with high fitness, giving their parameters (genes) a better chance of propagating to subsequent generations, and therefore influencing the optimisation process. A record of the best solution is kept, and returned upon termination.

Because MDS is a continuous optimisation problem, we have decided to use the real coding [Surry95] in our algorithm. A solution is encoded as a vector of floating-point numbers (genes), which directly correspond to coordinates of points in the configuration. Blend crossover (BLX) [Surry95] is used for recombining genes. It takes a parameter $\alpha$ that controls its precise effect, and a pair of values x and y to recombine. Assuming $x\le y$ without loss of generality, its outcome is a uniform random number drawn from the interval $[x-\alpha(y-x), y+\alpha(y-x)]$. BLX0 operates on the range [x,y], and is therefore biased towards the centre. This effect is countered by using $\alpha=0.5$. A new solution is derived from a pair of chromosomes by applying BLX0.5 to each pair of corresponding genes.

The most appropriate form of mutation to use with the real coding is creep mutation [Surry95]. Instead of replacing the affected gene with a new random value, the current value is perturbed by a small amount. This can be achieved by adding on a number drawn from the Gaussian distribution of zero mean and standard deviation sigma (GA.sigma parameter), for example. The crossover operator is most effective when the population is diverse. At this stage of optimisation too high a level of mutation will interfere with crossover. However, once the population starts to converge, crossover becomes ineffective, and a good rate of mutation is essential if different solutions are to be explored, and optimisation to continue [Reeves95]. Therefore, we concluded that an adaptive rate of mutation would work best. Whenever BLX is attempted on a pair of genes that are identical up to a certain level of precision (GA.epsilon parameter), Gaussian mutation is performed instead. This form of mutation reintroduces diversity only when needed.

The final facet of the algorithm is the process of selecting which chromosomes will undergo the genetic operations described above, to form the next generation. To steer the optimisation process, it is important to give a higher likelihood of selection to solutions with high fitness $f(\vec{X})=c-\sigma_E(\vec{X})$, where $c\ge \sigma_E(\vec{X})$ for all $\vec{X}$. However, simply making the probability of selection proportional to fitness has undesirable effects. Initially, a relatively good solution could take over the population, and cause it to converge prematurely around a poor local minimum. At the later stages of optimisation, when the population has largely converged, and differences between solutions are small, best solutions will not receive enough emphasis. The desired level of competition among members of the population can be attained throughout the algorithm run if rank of solution fitness is used instead, as is effectively done in the tournament selection scheme [Reeves95].

Tournament selection takes its name from performing a tournament on a random subset of t solutions from the population of size m (GA.populationsize parameter), and choosing the solution with the highest fitness value as the winner. This selection is performed m times, and the subsets are determined by randomly perturbing the population sequence, and taking successive groups of t elements. Once the pool of solutions is exhausted, a new permutation is created; t repetitions are needed overall. Setting t=2 results in an adequate selective pressure [Surry95], in that the best solution will be chosen twice, the worst not at all, and the median once on average.

Successive pairs of tournament winners are subject to the combined crossover and mutation operator twice, to yield two new solutions. m needs to be divisible by 2t to ensure that these pairs come from a single permutation, and hence avoid the possibility of a solution being mated with itself. A new generation of m solutions is created in this way, and the algorithm can proceed to the next iteration. If a superior solution is encountered it is noted, otherwise the best solution found so far replaces the worst solution in the new population. This mechanism is a form of elitist selection [deJong75], and guarantees monotonic improvement of maximum population fitness, and ultimately convergence. Termination occurs after a set number of iterations (GA.iterationlimit parameter) have been performed without finding an improving solution.

   
Majorization Algorithm

Iterative Majorization (IM) [Ortega70,deLeeuw77] belongs to a class of descent techniques for function minimisation [Press92], as it generates a non-increasing sequence of function values. If a complicated function f to be minimised is bounded from below, like Energy (3.13) for example, IM will converge to a local minimum of f. Instead of minimising f(x) directly, an auxiliary function g(x,z), where z is a constant, is defined and minimised, to get an approximation to a minimum of f(x). By choosing g to be quadratic in x, its unique minimum can be found analytically, although other function forms can be useful, too. g has to meet two conditions to be called a majorizing function of f:
1.
$f(x)\le g(x,z)$, for all x
2.
f(z)=g(z,z), with z referred to as the supporting point
Assuming that x* is the minimum of g(x,z) over x, the majorizing conditions imply the following chain of inequalities:

 \begin{displaymath}
f(x^*)\le g(x^*,z)\le g(z,z)=f(z)
\end{displaymath} (3.23)

Once x* is found it is used as the new supporting point: z'=x*, and thus a better approximation to the minimum of f is established iteratively.

The principle of IM can be easily generalised to multivariable functions. There exists a quadratic majorizing function for weighted raw Stress (3.17), and it forms the basis of the SMACOF algorithm [deLeeuw77,Borg97] (the acronym stands for Scaling by MAjorizing a COmplicated Function). The scale dependency of (3.17) (see Section 3.2.2) is circumvented by explicitly normalising $\{\hat{d}_{ij}\}$, so that $\sum_{i<j}w_{ij}\hat{d}_{ij}^2=n(n-1)/2$. By setting each $w_{ij}=\delta_{ij}^{-2}$ this algorithm can be made to minimise raw Energy (3.16). The explicit normalisation under the ratio MDS model ( $\hat{d}_{ij}=a\delta_{ij}$) becomes:

\begin{eqnarray*}\sum_{i<j}\delta_{ij}^{-2}a^2\delta_{ij}^2&=&\frac{n(n-1)}{2}\\
a^2\sum_{i<j}1&=&\sum_{i<j}1\\
a&=&1
\end{eqnarray*}


that is we arrive at the absolute MDS model with loss function (3.16).

The most recent implementation of the SMACOF algorithm is ProxScal version 6.4 [Busing97], and we decided to use it as a benchmark for our MDS algorithms (see section 3.5). The stopping criteria for the algorithm are met when either the rate of change of Energy (3.13) drops below a specified threshold value (ProxScal.rate parameter), or a maximum number of iterations has been reached (ProxScal.iterationlimit parameter).

   
Simulated Annealing

Simulated Annealing (SA) [Dowsland95] generalises descent methods of local optimisation (see Section 3.3.5 for an example) by allowing uphill moves, i.e. ones increasing the value of the loss function. It is hoped that by introducing such moves the method will be able to escape local minima, and ultimately converge to a global minimum% latex2html id marker 17429
\setcounter{footnote}{3}\fnsymbol{footnote}. To satisfy this objective the acceptance probability for an uphill move has to be related to the magnitude of the increase, and the optimisation history. A convenient formula can be borrowed from thermodynamics:

 \begin{displaymath}
p(\delta E)=\exp \left(-\frac{\delta E}{kT}\right)
\end{displaymath} (3.24)

which gives the probability of an increase $\delta E$ in energy at temperature T. k is Boltzmann's constant, and is obviously meaningless in the function minimisation context, thus the whole denominator is replaced by a single parameter t, although the thermal nomenclature is retained. For negative $\delta E$ the formula (3.24) produces $p(\delta E)>1$, thus a downhill move is always accepted. Initially the temperature is high, and so is the probability of accepting an uphill move of a given magnitude. During the course of the optimisation the temperature is reduced, and so is the associated probability, until both attain low enough values for the procedure to effectively reduce to a very slow descent, at which stage it can be stopped. SA draws obvious analogies to the physical process of annealing, as it originates from simulation studies in this field [Metropol53,Kirkpatr83].

At a given temperature t all n points in the configuration are considered one at a time. A move is generated by adding a deviate from the Cauchy distribution to each coordinate of point $\vec{x}_i$:

 \begin{displaymath}
\tilde{x}_{ik}=x_{ik}+t\tan p
\end{displaymath} (3.25)

where p is drawn uniformly from $(-\pi/2,\pi/2)$, and temperature t is acting as the scale parameter of the distribution. The update formula (3.25) will produce Gaussian like local displacement most of the time, with an occasional long jump, which allows radical moves to be tried even at low temperatures in a controlled manner. Thus temperature can be reduced quicker, compared to using a deviate from the Gaussian distribution in (3.25), while maintaining the properties of global optimisation [Szu87].

The new value of raw Energy (3.16) can be calculated efficiently since only one point is affected, by maintaining a breakdown of the contribution (3.22) of each point to the overall value, as in Section 3.3.3. Rather than calculating the acceptance probability from (3.24), and comparing it with a uniform random number $p\in[0,1]$ to decide whether to accept the move, the maximum acceptable value of (3.22) can be calculated in advance by rearranging (3.24):

 \begin{displaymath}
\xi=\sigma_{Er}(\vec{x}_i)-t\ln p
\end{displaymath} (3.26)

allowing the calculation of $\sigma_{Er}(\tilde{\vec{x}}_i)$ to be aborted as soon as it exceeds $\xi$ [Wright89]. This modification improves the performance, especially when the rejection ratio is high. There is no need to randomise the order of points, as in Section 3.3.2, since not all the moves will be accepted according to (3.26), and thus the sequence of accepted moves will automatically be scrambled.

Central to the design of a simulated annealing algorithm is the choice of a cooling schedule, i.e. the manner in which the temperature parameter t is reduced. It is a common observation that most of the useful work is done in the middle of the schedule [Dowsland95], and thus it is important to chose an appropriate range of temperatures, over which the cooling occurs. However, the optimal temperature range is not only application dependent, but will be different for every instance of the problem, in general. The actual temperature reduction function is less important, and a geometric function ti+1=ati is the most common choice in practice, for some $a\in(0,1)$ (SA.geometric parameter) [Dowsland95].

Since the amount of useful work done by the algorithm is the overriding consideration, we decided to specify a target acceptance ratio $\bar{\mu}$ (SA.acceptance parameter) instead of prescribing the temperature range [Dowsland93]. The actual acceptance ratio $\mu$ is the proportion of both downhill and uphill moves accepted to the total number of moves n at each temperature. At high temperature many moves generated by (3.25) will result in a significant increase to (3.22), and according to formula (3.26) will be rejected, thus $\mu$ will be low. Conversely, at lower temperature more moves will be downhill, and the remaining uphill ones are likely to be more local, and cause smaller increases to (3.22), therefore, $\mu$ will be higher.

Temperature is updated according to the following formula:

\begin{displaymath}t_{i+1}=a^{(\bar{\mu}-\mu_i)n}t_i
\end{displaymath} (3.27)

If temperature ti is too high $\mu_i$ will be less than $\bar{\mu}$, and for each deficit acceptance temperature is multiplied by a, and thus reduced in proportion to the deficit. Conversely, too low a value of ti causes $\mu_i$ to be greater than $\bar{\mu}$, and the temperature is divided by a for each surplus acceptance, i.e. proportionally increased. Therefore, this schedule will eventually converge at iteration j to a state where $\mu_k\approx \bar{\mu}$, for all subsequent iterations $k\ge j$. The speed of this convergence is dependent on the value of a and t0 (SA.initialtemperature parameter), although we expect the algorithm to be fairly insensitive to their exact choice, as only the value of $\bar{\mu}$ determines the equilibrium temperature. The algorithm goes through a set number of temperature updates (SA.iterationlimit parameter), and maintains a record of the best solution generated.

   
Experimental Setup

In order to evaluate the relative merits of each algorithm described in Section 3.3, we used a test bed consisting of 66 data collections: 2 dissimilarity matrices, 2 image collections, 6 graphs, and 56 data tables (see Section B.1 and B.2 in the appendices for details). The number of objects in a collection ranged from 9 to 1484, with the median value of 159. Since least-squares MDS algorithms are heuristics for function minimisation, and can only be expected to find local minima in general, we decided to scale each data collection 10 times, to give us a more complete picture of the algorithms' performance. We recorded the minimum and average Energy (3.13) values of 10 trials, and the mean running time per trial for every combination of data collection and algorithm. PCO is an analytical method (see Section 3.3.1), and only a single trial per data collection was necessary.

Parameters of the algorithms were fixed to the following values:


 
Table 3.1: Wilcoxon test results for SA.acceptance parameter


relationship1 # cases2 rank sum3 p 4
$10\downarrow 5$ 53 1938 0.001
$10\uparrow 5$ 13 273  
$15\downarrow 10$ 49 1868 0.001
$15\uparrow 10$ 17 343  
$20\downarrow 15$ 43 1599 0.001
$20\uparrow 15$ 23 612  
$25\downarrow 20$ 39 1384 0.076
$25\uparrow 20$ 27 827  
$30\downarrow 25$ 36 1109 0.985
$30\uparrow 25$ 30 1102  

(a) minimum Energy

 
relationship # cases rank sum p
$10\downarrow 5$ 50 1861 0.001
$10\uparrow 5$ 16 350  
$15\downarrow 10$ 41 1564 0.003
$15\uparrow 10$ 25 647  
$20\downarrow 15$ 32 1157 0.746
$20\uparrow 15$ 34 1054  
$25\downarrow 20$ 30 1151 0.775
$25\uparrow 20$ 36 1060  
$30\downarrow 25$ 27 778  
$30\uparrow 25$ 39 1433 0.036

(b) average Energy



1 $t\downarrow u$ denotes that Energy for acceptance ratio $t\%$ is lower than for $u\%$, and $t\uparrow u$ implies the opposite
2 the number out of 66 cases for which the relationship holds
3 absolute differences in Energy for both values of the acceptance ratio parameter across all cases are ordered on their magnitude. Ranks for which differences are negative are summed separately from the positive ones, and their total is $\sum_{r=1}^{66}r=2211$. These two sums should be roughly equal under H0: no overall difference in Energy for both acceptance ratios
4 probability of rejecting H0 when it is true, nominally an acceptable value - the significance level of the test - is no greater than 5%

 

   
Statistical Analysis

The statistical analysis of MDS algorithms of Section 3.3 is structured into three distinct parts: minimum Energy, average Energy, and running time experiments, each of which consists of 6 related measurements, one for each algorithm, made for every data collection in the test bed. For such an experimental design there are two statistical tests: two-way analysis of variance, or its non-parametric cousin, the Friedman test [Siegel56]. The former can only be used if measurements are independently drawn from normally distributed populations, these populations have identical variance, and their means are linear combinations of effects due to differences in algorithms and data collections. The latter test is more general and robust, as it does not make any of these assumptions, and we preferred to use it for our statistical analysis.

   
Minimum Energy


 
Table 3.2: Friedman test results for minimum and average Energy

algorithm minimum average
NR 2.86 3.27
TS 2.29 2.80
GA 4.11 3.89
SA 3.27 2.67
IM 2.55 2.61
PCO 5.92 5.76
$\chi_r^2$ 1 171.38 137.33
W 2 0.519 0.416


1 $\chi_r^2$ is Friedman's chi-square statistic; the critical value at p=0.001 level is 20.52
2 W is Kendall's coefficient of concordance; it ranges between 0 (no agreement between cases) to 1 (complete agreement); the critical value at p=0.001 level is 0.062
The figures in rows 2-7 are mean ranks across 66 cases, which are computed by summing the relative order of minimum or average Energy for a particular algorithm over all cases, and dividing by their total number

A Friedman test has been carried out with the null hypothesis H0 of no difference between algorithms with regard to minimum Energy, and the alternative hypothesis H1 of at least one algorithm having a different ranking from the others; the results are presented in the second column of Table 3.2. The values of both Friedman's chi-square statistic and Kendall's coefficient of concordance are statistically significant at a level exceeding 0.001, which allows H0 be rejected in favour of H1. This permits the use of the Tukey Multiple Comparison procedure to test for significant differences between algorithm rankings in pairs [Keselman77].


 
Table 3.3: Tukey Multiple Comparison results


rank comparison Q 1 p 2
PCO>TS 15.79 * 3
PCO>IM 14.64 *
PCO>NR 13.32 *
PCO>SA 11.51 *
PCO>GA 7.90 *
GA>TS 7.90 *
GA>IM 6.74 *
GA>NR 5.43 *
GA>SA 3.62 0.003
SA>TS 4.28 *
SA>IM 3.13 0.010
SA>NR $\ll 4.03$ 0.244
NR>TS 2.47 0.008
NR>IM $\ll 4.03$ 0.037
IM>TS $\ll 4.03$ 0.089

(a) minimum Energy

 
rank comparison Q p
PCO>IM 13.69 *
PCO>SA 13.42 *
PCO>TS 12.83 *
PCO>NR 10.79 *
PCO>GA 8.09 *
GA>IM 5.59 *
GA>SA 5.33 *
GA>TS 4.74 *
GA>NR 2.70 0.103
NR>IM 2.90 0.005
NR>SA $\ll 4.03$ 0.003
NR>TS $\ll 4.03$ 0.045
TS>IM $\ll 4.03$ 0.165
TS>SA $\ll 4.03$ 0.159
SA>IM $\ll 4.03$ 0.439

(b) average Energy



1 Q is Tukey's stati