next up previous
Next: Large Scale Visualization Up: Proximity Visualization of Abstract Data Previous: Multivariate Visualization Techniques

Subsections

   
Multidimensional Scaling

This chapter presents an evaluation of algorithms for generating proximity visualizations. Dissimilarities between pairs of objects from a data collection are given by a suitable coefficient, and then approximated by distances between corresponding pairs of entities in a visual representation. The quality of this approximation is expressed as a loss function, which yields the optimal arrangement at its minimum. The algorithms under test are heuristics for numerically minimising this function. Their effectiveness is measured for a representative sample of data collections, and the relative ranking is statistically inferred, as well as illustrated with examples.

   
Problem Definition

Suppose that we are given a collection of n objects and a way of determining the dissimilarity between any pair $\vec{\Delta}=[\delta_{ij} : i,j=1,\ldots,n]$; as set out in Section 1.2. Metric Multidimensional Scaling (MDS) [Borg97,Cox94] is a procedure for finding a configuration $\vec{X}^*=[x_{ia}^* : a=1,\ldots,p]$ of n points in a p-dimensional space, usually Euclidean, such that point $\vec{x}_i^*=(x_{i1}^*,\ldots,x_{ip}^*)^T$ uniquely represents object i, and the Euclidean distance between points $\vec{x}_i^*$ and $\vec{x}_j^*$:

 \begin{displaymath}
d_{ij}(\vec{X}^*)=\left\Vert\vec{x}_i^*-\vec{x}_j^*\right\Vert=\sqrt{\sum_{a=1}^p\left(x_{ia}^*-x_{ja}^*\right)^2}
\end{displaymath} (3.1)

approximates the corresponding dissimilarity $\delta_{ij}$, for all pairs of objects (i,j):

 \begin{displaymath}
\mathop{\forall}_{i<j}d_{ij}(\vec{X}^*)\approx \delta_{ij}
\end{displaymath} (3.2)

In general it is sufficient to consider each pair of objects (i,j) just once: i<j, since dissimilarities are assumed to be symmetric. Elements of an asymmetric matrix $\vec{\hat{\Delta}}$ have to be averaged out prior to the analysis: $\delta_{ij}=\delta_{ji}=(\hat{\delta}_{ij}+\hat{\delta}_{ji})/2$ [Borg97].

   
Loss Functions

For a given configuration $\vec{X}$ the approximation error in representing the dissimilarity between objects i and j can be defined as follows:

 \begin{displaymath}
e_{ij}\stackrel{\mathrm{def}}{=}\left\vert d_{ij}(\vec{X})-\delta_{ij}\right\vert
\end{displaymath} (3.3)

A least-squares MDS technique defines a loss function that is a weighted and possibly normalised sum of errors over all pairs of objects (i,j), and thus it penalises the overall approximation error. A minimum of this function over $\vec{X}$ is subsequently found through numerical optimisation, to yield the desired configuration $\vec{X}^*$.

   
Raw Stress

Raw Stress [Kruskal64] is the most elementary MDS loss function, as it simply accumulates the total squared representation error:

 \begin{displaymath}
\sigma_r(\vec{X})\stackrel{\mathrm{def}}{=}\sum_{i<j}e_{ij...
...right)^2=\sum_{i<j}\Big(d_{ij}(\vec{X})-f(\delta_{ij})\Big)^2
\end{displaymath} (3.4)

where $\hat{d}_{ij}$ is the disparity between objects i and j, which is arrived at by applying transformation f to the given dissimilarity $\delta_{ij}$. Since dissimilarities are calculated from object attributes, relationships, or other features (see Section 1.2), and there is no error or uncertainty associated with this process, the correct way to model the disparities is to apply a linear transformation to the dissimilarities: $\hat{d}_{ij}=f(\delta_{ij})=a\delta_{ij}$, where a is chosen to minimise the value of Stress.

The optimal a, which shall be denoted as a*, can be found analytically by differentiation. Alternating this step with an iterative improvement to $\vec{X}$ provides an efficient procedure for finding the solution $\vec{X}^*$ satisfying $\vec{\Delta}$, termed ratio MDS [Borg97]. By considering the dissimilarities directly, we restrict ourselves to an identity transformation, and arrive at an absolute MDS model. At the other end of the spectrum, the transformation can be relaxed just to be monotonic, so that only the rank order of dissimilarities is preserved, because it is assumed to be the only reliable information (see Section 1.2.9), which gives an ordinal (nonmetric) MDS model. Nonmetric MDS can be approximated with the ratio model by replacing the dissimilarities with their rank order a priori [Weeks79,Borg97]. Therefore, we conclude that ratio MDS is the most applicable for visualization, and focus our efforts on this model.

   
Normalised Stress

Raw Stress (3.4) can only be successfully minimised in practice under the absolute MDS model. If we permit a transformation of the dissimilarities then by alternating an update of the configuration $\vec{X}$ with an update to the disparities $[\hat{d}_{ij}]=a\vec{\Delta}$ $\sigma_r(\vec{X},a)$ can be made arbitrarily small, with a and $\vec{X}$ shrinking gradually. This deficiency can be overcome by normalising raw Stress by $\sum_{i<j}\hat{d}_{ij}^2$ [Borg97]:

 \begin{displaymath}
\sigma_n(\vec{X})\stackrel{\mathrm{def}}{=}\frac{\sum_{i<j...
...ij}(\vec{X})-\hat{d}_{ij}\right)^2}{\sum_{i<j}\hat{d}_{ij}^2}
\end{displaymath} (3.5)

Loss function (3.5), termed normalised Stress, expands for ratio MDS to:
 
$\displaystyle \sigma_n(\vec{X},\alpha)$ = $\displaystyle \frac{\sum_{i<j}\left(d_{ij}(\vec{X})-\alpha\delta_{ij}\right)^2}{\sum_{i<j}\alpha^2\delta_{ij}^2}$  
  = $\displaystyle \frac{\sum_{i<j}d_{ij}^2(\vec{X})-2\alpha\sum_{i<j}\delta_{ij}d_{ij}(\vec{X})+\alpha^2\sum_{i<j}\delta_{ij}^2}{\alpha^2\sum_{i<j}\delta_{ij}^2}$  
  = $\displaystyle \frac{\eta^2(\vec{X})-2\alpha\rho(\vec{X})+\alpha^2\eta_\delta^2}{\alpha^2\eta_\delta^2}$  
  = $\displaystyle 1-\frac{2\alpha\rho(\vec{X})-\eta^2(\vec{X})}{\alpha^2\eta_\delta^2}$ (3.6)

where $\eta^2(\vec{X})$ is a sum of the squared distances $d_{ij}^2(\vec{X})$, $\rho(\vec{X})$ is a weighted sum of distances $\delta_{ij}d_{ij}(\vec{X})$, and $\eta_\delta^2$ is a constant equal to the sum of the squared dissimilarities $\delta_{ij}^2$, over all pairs of objects (i,j). The optimal $\alpha$ can be found by differentiating (3.6) with respect to $\alpha$:

\begin{eqnarray*}\frac{\partial\sigma_n(\vec{X},\alpha)}{\partial \alpha}&=&\fra...
...rac{\alpha\rho(\vec{X})-\eta^2(\vec{X})}{\alpha^3\eta_\delta^2}
\end{eqnarray*}


which is equal to zero for $\alpha^*=\eta^2(\vec{X})/\rho(\vec{X})$. Inserting $\alpha^*$ into (3.6) yields:
 
$\displaystyle \sigma_n(\vec{X},\alpha^*)$ = $\displaystyle 1-\frac{\displaystyle 2\rho(\vec{X})\frac{\eta^2(\vec{X})}{\rho(\...
...2(\vec{X})}{\displaystyle \eta_\delta^2\frac{\eta^4(\vec{X})}{\rho^2(\vec{X})}}$  
  = $\displaystyle 1-\left(\frac{\rho(\vec{X})}{\eta_\delta\eta(\vec{X})}\right)^2$  
  = $\displaystyle 1-\left(\frac{\sum_{i<j}\delta_{ij}d_{ij}(\vec{X})}{\left(\sum_{i...
...)^{\frac{1}{2}}\left(\sum_{i<j}d_{ij}^2(\vec{X})\right)^{\frac{1}{2}}}\right)^2$  
  = $\displaystyle 1-c\left(\vec{\Delta},\left[d_{ij}(\vec{X})\right]\right)^2$ (3.7)

The last term of (3.7) is the square of Tucker's congruence coefficient c with dissimilarities and distances [Tucker51]. The coefficient is always between 0 and 1, due to the Cauchy-Schwarz inequality:

 \begin{displaymath}
\sum_rp_rq_r\le\left(\sum_rp_r^2\right)^{\frac{1}{2}}\left(\sum_rq_r^2\right)^{\frac{1}{2}}
\end{displaymath} (3.8)

and the fact that dissimilarities and distances are nonnegative. Hence, it holds that $0\le\sigma_n(\vec{X},\alpha^*)\le1$, for the optimal scaling constant $\alpha$.

It is worth noting that scaling the dissimilarities by a factor $\alpha$ is equivalent to scaling the distances by $1/\alpha$:

\begin{eqnarray*}\sigma_n(\vec{X},\alpha)&=&\frac{\sum_{i<j}\left(d_{ij}(\vec{X}...
...}{\sum_{i<j}\delta_{ij}^2}\;\;=\;\;\sigma_n(\alpha^{-1}\vec{X})
\end{eqnarray*}


The derivation of $\sigma_n(b^*\vec{X})$ is equivalent to that presented above [deLeeuw77,Borg97], and yields $b^*=1/\alpha^*$. However, it relies on the assumption that $\vec{X}$ is a local minimum of (3.5), whereas our derivation does not require such an assumption, and thus is applicable to any configuration $\vec{X}$.

   
Kruskal's Stress

Stress formula 1, or Stress-1 for short, is historically an earlier solution to the scale dependency problem of raw Stress (3.4), with the normalising factor $\eta^2(\vec{X})$ [Kruskal64]:

 \begin{displaymath}
\sigma_1^2(\vec{X})\stackrel{\mathrm{def}}{=}\frac{\sum_{i...
...(\vec{X})-\hat{d}_{ij}\right)^2}{\sum_{i<j}d_{ij}^2(\vec{X})}
\end{displaymath} (3.9)

The whole formula is square rooted by analogy of referring to the standard deviation instead of the variance.

Substituting the ratio MDS model yields:

 
$\displaystyle \sigma_1^2(\vec{X},\beta)$ = $\displaystyle \frac{\sum_{i<j}\left(d_{ij}(\vec{X})-\beta\delta_{ij}\right)^2}{\sum_{i<j}d_{ij}^2}$  
  = $\displaystyle \frac{\eta^2(\vec{X})-2\beta\rho(\vec{X})+\beta^2\eta_\delta^2}{\eta^2(\vec{X})}$  
  = $\displaystyle 1-\frac{2\beta\rho(\vec{X})-\beta^2\eta_\delta^2}{\eta^2(\vec{X})}$ (3.10)

The minimum of (3.10) over $\beta$ is obtained by setting the first derivative with respect to $\beta$ equal to 0:

\begin{displaymath}\frac{\partial\sigma_1^2(\vec{X},\beta)}{\partial \beta}=2\frac{\beta\eta_\delta^2-\rho(\vec{X})}{\eta^2(\vec{X})}=0\nonumber
\end{displaymath}

which yields $\beta^*=\rho(\vec{X})/\eta_\delta^2$. Inserting $\beta^*$ into (3.10) gives:

 \begin{displaymath}
\sigma_1^2(\vec{X},\beta^*)=1-\left(\frac{\rho(\vec{X})}{\...
...^2=\sigma_n(\vec{X},\alpha^*)\stackrel{\mathrm{def}}{=}\sigma
\end{displaymath} (3.11)

Therefore, for a given configuration $\vec{X}$, normalised Stress (3.5) is equivalent to the square of Stress-1 (3.9) if we allow for an optimal scaling of dissimilarities in each case, and we shall refer to this unified index as Stress ($\sigma$). Since such a scaling step is an integral part of any ratio MDS procedure, and is alternated with an iterative improvement to $\vec{X}$, the final solution will be a local minimum of both of these loss functions. A graphical illustration of this equivalence is presented in Figure 3.1(a) and 3.1(b). The relationship (3.11) can also be demonstrated by taking the alternative route of rescaling the configuration $\vec{X}$ [Borg97], however an assumption that $\vec{X}$ is a local minimum has to be made again.
    
Figure 3.1: Visualizations of a 6 level complete binary tree computed by minimising different loss functions. The dissimilarity between a pair of nodes is the length of the path connecting them (see Section 1.2.6)


\begin{picture}(999,999)(-0.5,-0.5)%
\put(499,500){\circle{35}}%
\put(395,50...
...2,116){70.2512}}%
\put(776.159,785.912){\line(78,127){59.6828}}%
\end{picture}

\begin{picture}(999,999)(-0.5,-0.5)%
\put(500,500){\circle{35}}%
\put(604,50...
...,127){58.8542}}%
\put(221.126,784.711){\line(-92,116){70.2512}}%
\end{picture} \begin{picture}(999,999)(-0.5,-0.5)%
\put(500,500){\circle{35}}%
\put(378,50...
...6,-35){72.7649}}%
\put(873.595,278.756){\line(53,-94){35.8101}}%
\end{picture}

(a) $\sigma _n=0.04024$ (b) $\sigma _1^2=0.04021$ (c) $\displaystyle \sigma _E=0.05934\atop \displaystyle (\sigma =0.05054)$


   
Energy

Any of the loss functions defined previously can be generalised to weight the contribution of individual pairs of objects, for example weighted normalised Stress acquires the following form:

 \begin{displaymath}
\sigma_{n,w}(\vec{X})\stackrel{\mathrm{def}}{=}\frac{\sum_...
...ec{X})-\hat{d}_{ij}\right)^2}{\sum_{i<j}w_{ij}\hat{d}_{ij}^2}
\end{displaymath} (3.12)

Each weight wij can take any non-negative value, as long as it does not depend on $\vec{X}$. Typical use allows missing dissimilarities to be accommodated by setting the corresponding weights to 0, and the remaining ones to 1. However, by setting $w_{ij}=\hat{d}_{ij}^{-2}$ a new loss function can be obtained, which we shall refer to as Energy:

 \begin{displaymath}
\sigma_E(\vec{X})\stackrel{\mathrm{def}}{=}\frac{2}{n(n-1)...
...{\left(d_{ij}(\vec{X})-\hat{d}_{ij}\right)^2}{\hat{d}_{ij}^2}
\end{displaymath} (3.13)

The ratio MDS model produces the following expansion, with $\displaystyle\frac{1}{\lambda}=\frac{n(n-1)}{2}={n\choose 2}=\sum_{i<j}1$:
 
$\displaystyle \sigma_E(\vec{X},\gamma)$ = $\displaystyle \lambda\sum_{i<j}\frac{\left(d_{ij}(\vec{X})-\gamma\delta_{ij}\right)^2}{\gamma^2\delta_{ij}^2}$  
  = $\displaystyle \lambda\left(\sum_{i<j}1-\sum_{i<j}\frac{2\gamma \delta_{ij}d_{ij}(\vec{X})-d_{ij}^2(\vec{X})}{\gamma^2\delta_{ij}^2}\right)$  
  = $\displaystyle 1-\lambda\left(\frac{2}{\gamma}\sum_{i<j}\frac{d_{ij}(\vec{X})}{\...
...ij}}-\frac{1}{\gamma^2}\sum_{i<j}\frac{d_{ij}^2(\vec{X})}{\delta_{ij}^2}\right)$  
  = $\displaystyle 1-\lambda\left(\frac{2}{\gamma}\psi(\vec{X})-\frac{1}{\gamma^2}\omega(\vec{X})\right)$ (3.14)

An optimal $\gamma$ can be found by differentiating (3.14) with respect to $\gamma$:

\begin{eqnarray*}\frac{\partial\sigma_E(\vec{X},\gamma)}{\partial\gamma}&=&-\lam...
...mbda}{\gamma^3}\left(\gamma\psi(\vec{X})-\omega(\vec{X})\right)
\end{eqnarray*}


which is equal to zero for $\gamma^*=\omega(\vec{X})/\psi(\vec{X})$. Substituting $\gamma^*$ into (3.14) yields:
 
$\displaystyle \sigma_E(\vec{X},\gamma^*)$ = $\displaystyle 1-\lambda\frac{\psi^2(\vec{X})}{\omega(\vec{X})}$  
  = $\displaystyle 1-\lambda\frac{\displaystyle\left(\sum_{i<j}\frac{d_{ij}(\vec{X})...
...{ij}}\right)^2}{\displaystyle\sum_{i<j}\frac{d_{ij}^2(\vec{X})}{\delta_{ij}^2}}$  
  = $\displaystyle 1-\frac{\displaystyle\left(\sum_{i<j}\frac{\delta_{ij}d_{ij}(\vec...
...\delta_{ij}^2}{\delta_{ij}^2}\sum_{i<j}\frac{d_{ij}^2(\vec{X})}{\delta_{ij}^2}}$  
  = $\displaystyle 1-\left(\frac{\sum_{i<j}w_{ij}\delta_{ij}d_{ij}(\vec{X})}{\left(\...
...ac{1}{2}}\left(\sum_{i<j}w_{ij}d_{ij}^2(\vec{X})\right)^{\frac{1}{2}}}\right)^2$  
  = $\displaystyle 1-c_w\left(\vec{\Delta},[d_{ij}(\vec{X})]\right)^2$ (3.15)

The last term of (3.15) is the square of weighted Tucker's congruence coefficient cw with dissimilarities and distances [Tucker51]. The coefficient is always between 0 and 1, due to the Cauchy-Schwarz inequality (3.8). Hence, it holds that $0\le\sigma_E(\vec{X},\gamma^*)\le1$, for the optimal scaling constant $\gamma$.

Energy (3.13) penalises error (3.3) in representing short dissimilarities more than the same error for larger dissimilarities. Such a proportional contribution of error is likely to agree with user's expectations, because it does not matter exactly how distant dissimilar objects are from a given object, as long as they are far in relation to similar ones. Figure 3.1 gives a visual comparison of the results of minimising Stress (3.11) and Energy. Stress, on the other hand, minimises absolute error, since errors in representing large and small dissimilarities are penalised equally, and does not preserve the local detail as well as Energy. Therefore, we prefer to use Energy for proximity visualization, with the exception of Chapter 5.

For simplicity we drop the normalising constant $\lambda$ when deriving MDS algorithms, and minimise raw Energy instead:

 \begin{displaymath}
\sigma_{Er}(\vec{X})\stackrel{\mathrm{def}}{=}\sum_{i<j}\frac{\left(d_{ij}(\vec{X})-\hat{d}_{ij}\right)^2}{\hat{d}_{ij}^2}
\end{displaymath} (3.16)

which can be derived from the weighted form of raw Stress (3.4):

 \begin{displaymath}
\sigma_{r,w}(\vec{X})\stackrel{\mathrm{def}}{=}\sum_{i<j}w_{ij}\left(d_{ij}(\vec{X})-\hat{d}_{ij}\right)^2
\end{displaymath} (3.17)

by setting the appropriate weights. We note that minimising (3.16) is equivalent to minimising (3.13), and we use the normalised form when reporting results.

(3.16) may be interpreted as the total energy of a fully connected spring system, with an anchor for each object, and springs connecting it to all other anchors. The relaxed length of a spring connecting two anchors is given by the optimally scaled dissimilarity between the corresponding pair of objects. The actual length of the spring is the Euclidean distance between the anchors. The spring constant is $\hat{d}_{ij}^{-2}$ for a spring connecting anchors i and j, which causes the spring to be rigid if $\delta_{ij}$ is small in relation to other dissimilarities, and therefore to contribute substantially to the value of (3.16) if deformed. Equilibrium of this spring system corresponds to a local energy minimum.

   
Algorithms

Each of the least-squares MDS algorithms presented here is an instance of a well-known function minimisation heuristic. Such a heuristic constitutes a theoretical framework, and typically a large number of options or variations contributed by practitioners in the operational research and other fields. Some of the choices are simply determined by the application domain, others can be made on theoretical grounds, or empirical evidence - either found in literature or established on our own. In each case we outline the heuristic framework, giving references to dedicated texts, and detail the decisions and customisations made to provide a complete algorithm.

   
Classical Scaling

Classical scaling [Gower66,Cox94,Borg97] is an analytical technique for solving the metric MDS problem, and it was the first to be developed. Dissimilarities are treated as Euclidean distances, and a matching configuration $\vec{Z}$ is recovered from the eigendecomposition of $-\frac{1}{2}\vec{J}\vec{\Delta}^{(2)}\vec{J}=\vec{Z}\vec{Z}^T$, where $\vec{J}$ is the centring matrix, and $\vec{\Delta}^{(2)}=[\delta_{ij}^2]$. $p\le n-1$ eigenvalues will be non-zero, and they represent the minimum number of dimensions required to preserve all the dissimilarities, so that (3.2) holds exactly. Selecting p'<p largest eigenvalues and corresponding eigenvectors is equivalent to selecting the first p' principal components $\vec{Y}$ of the p-dimensional configuration $\vec{Z}$, i.e. the projection of $\vec{Z}$ onto p' orthogonal axes that preserves the maximum variance (see Section 4.3). This combined procedure is termed Principal Coordinate Analysis (PCO)% latex2html id marker 17163
\setcounter{footnote}{2}\fnsymbol{footnote} [Gower66], and analytically minimises the following criterion over $\vec{Y}$:

 \begin{displaymath}\sum_{i<j}\left(d_{ij}^2(\vec{Z})-{d}_{ij}^2(\vec{Y})\right) ...
...v \sum_{i<j}\left(\delta_{ij}^2- {d}_{ij}^2(\vec{Y})\right)
\end{displaymath} (3.18)

Note, that under the projection ${d}_{ij}(\vec{Y}) \le d_{ij}(\vec{Z})=\delta_{ij}$, for all pairs (i,j), i.e. distortion of the dissimilarities will be biased, unlike for the least-squares MDS loss functions of Section 3.2.

   
Newton-Raphson

The initial MDS algorithm that we have developed is based on the generalised Newton-Raphson (NR) method [Press92]. It is an iterative technique for solving a vector equation of the form $\vec{f(x)=0}$. If $\vec{f(x)}$ represents the first partial derivatives of a multidimensional function $e(\vec{x})$ this procedure will find an extremum of this function.

The gradient vector $\vec{g}$ and the Hessian matrix $\vec{H}$ of loss function (3.16) at point $\vec{x}_k=\left(x_{k1},\ldots,x_{kp}\right)^T$ can be established analytically (see Appendix A for details):

$\displaystyle g_a(\vec{x}_k)$ = $\displaystyle \frac{\partial \sigma_{Er}(\vec{X})}{\partial x_{ka}}=2\sum_{l\ne...
...rac{d_{kl}(\vec{X})-\hat{d}_{kl}}{d_{kl}(\vec{X})\hat{d}_{kl}^2}(x_{ka}-x_{la})$  
$\displaystyle H_{ab}(\vec{x}_k)$ = $\displaystyle \frac{\partial ^2\sigma_{Er}(\vec{X})}{\partial x_{ka}\partial x_...
...{X})-(x_{ka}-x_{la})^2}{d_{kl}^3(\vec{X})\hat{d}_{kl}} & a=b
\end{array}\right.$  

An extremum of the loss function with respect to point $\vec{x}_k$ can then found by repeatedly applying the NR iteration:

\begin{eqnarray*}\tilde{\vec{x}}_k&=&\vec{x}_k-\vec{H}^{-1}\vec{g}\\
\vec{H}\left(\tilde{\vec{x}}_k-\vec{x}_k \right)&=&-\vec{g}
\end{eqnarray*}


Let the ith column of $\vec{H}$ be denoted by $\vec{H}_i=\left(H_{1i},\ldots,H_{pi}\right)^T$
 
$\displaystyle \tilde{\vec{x}}_k$ = $\displaystyle \vec{x}_k+\left(\frac{\det\left(-\vec{g},\vec{H}_2,\ldots,\vec{H}...
...ac{\det\left(\vec{H}_1,-\vec{g}, \ldots,\vec{H}_p\right)}{\det \vec{H}},\right.$  
    $\displaystyle \left.\cdots,\frac{\det\left(\vec{H}_1,\vec{H}_2,\ldots,-\vec{g}\right)}{\det \vec{H}}\right)^T$ (3.19)

Figure 3.2(a) depicts the Energy function (3.13) for a trivial case when there is a single constant (fixed) point $\vec{x}_f$ to optimise $\vec{x}_k$ against in 2 dimensions. A circle of radius $\hat{d}_{fk}$ centred at $\vec{x}_f$ is the set of all minima. There is a local maximum at $\vec{x}_f$ where the function takes value 1. However, it is a singular point, as is apparent from Figures 3.2(b) and 3.2(c); consequently, gradient $\vec{g}$ does not have a root at $\vec{x}_f$. Moreover, if $\vec{x}_k$ gets inside the minimum circle it will be repelled with a subsequent NR iteration (3.19), because the direction of $\vec{g}$ is reversed over the cusp at point $\vec{x}_f$, whereas Hessian $\vec{H}$ is not affected. This argument extends to the case of multiple constant points; see Figure 3.3 for an illustration. Because of the parabolic characteristic of the component Energy functions, no new maxima can be introduced by their summation, and thus the NR method will only converge to a minimum. Unlike the steepest descent method [Press92] (also see Section 3.3.5) an upward move - a transition from $\vec{x}_k$ to $\tilde{\vec{x}}_k$ that increases Energy - is possible with an NR iteration, especially if the current solution is far from a minimum. This will prevent the method from getting stuck in a poor local minimum, in general.


   
Figure 3.2: Plots of the Energy function and its first partial derivatives for one fixed point in 2 dimensions

\epsfig{file=images/energy.eps, width=4cm}

\epsfig{file=images/partialx.eps, width=4cm}

\epsfig{file=images/partialy.eps, width=4cm}

(a) Surface plot of $\sigma_E(\vec{x}_k)$
(b) Surface plot of $\partial \sigma_E(\vec{x}_k)/\partial x_{k1}$ (c) Contour plot of $\partial \sigma_E(\vec{x}_k)/\partial x_{k2}$ (symmetrical to xk1 derivative)


 
Figure 3.3: Surface plot of $\sigma_E(\vec{x}_k)$ for 10 fixed points in 2 dimensions. These points can be identified as distinct peaks - local maxima. The white elliptical area encompasses the global minimum

\epsfig{file=images/surface.eps, width=9cm}



To minimise the loss function with respect to the whole configuration, the algorithm applies a single NR iteration (3.19) to each point in turn, and continually cycles through the configuration until convergence occurs. Since the focus of our work is visualization, we only need to consider MDS in at most three dimensions. By deriving analytical formulae for up to 3 variables and hardcoding them, we have been able to substantially increase the overall efficiency of the algorithm. An algorithm for drawing general undirected graphs [Kamada89] also minimises (3.16) using the NR method, and its basic details are identical.

The sequence in which points are considered when updating the configuration is randomised to ensure faster convergence [Frick94]. This has a side effect of randomising output of the algorithm, even if the same starting configuration is used. To detect convergence a record of previous configurations is kept. The algorithm terminates if the current configuration is identical to a prior one, to within a certain level of precision $\epsilon$ (NR.epsilon parameter); this can be triggered either by a genuine convergence or an occurrence of a cycle. Using a hash value of a complete configuration minimises the storage requirements of this scheme, and makes the test for equality of two configurations take constant time. The Secure Hash Algorithm [NIST95] has been chosen for its property of making it computationally infeasible to find two different messages - MDS configurations in this case - that produce the same hash value.

It is possible for an NR iteration (3.19) to result in a radically different new location of a point from its original one, normally causing a drastic increase in Energy (3.13). This will have a knock on effect on points considered subsequently, and may result in a perpetual instability of the configuration. This would not constitute a cycle in general, and therefore to trap this behaviour a separate mechanism is required. A cyclic buffer of size l (NR.buffersize parameter) is used to record the Energy of the l most recent configurations. If the current configuration is found to be inferior or equivalent to the oldest configuration in the buffer, a flag is set. l configuration updates are then performed unconditionally; if the situation does not reoccur within another l updates the flag is cleared. Otherwise, the flag is kept set, and $\epsilon$ is increased by one order of magnitude. If instability persists $\epsilon$ will increase to a point where any two configurations will be judged equivalent, and the algorithm will necessarily stop.

   
Tabu Search

Tabu search (TS) [Glover95] is a meta-heuristic that improves on known numerical optimisation heuristics by exploiting principles of intelligent problem solving. The key element is the use of flexible memory to guide the optimisation process around difficult regions, and allow boundaries of local optimality to be crossed, so that new regions of solution space can be explored. This is achieved by systematically imposing and releasing restrictions on certain search moves (transitions from one solution to the next), based on their recency, frequency, quality, and influence. These restrictions can be implemented by directly excluding such moves and classing them as `tabu' or `forbidden'; an alternative is to modify loss function evaluations for these moves or the probabilities of their selection.

To illustrate the basic principles of tabu search we will describe the main features of our initial application of this method to MDS. A search move consists of applying a single Newton-Raphson iteration (3.19) to a point in the configuration. Instead of simply cycling through the moves, as is the case for the NR algorithm (see Section 3.3.2), they are selected based on the optimisation history. If a move has been recently performed it is classified as tabu, and excluded from consideration for the number of iterations equal to the duration of the tabu restriction. However, an important exception is taken into account: the restriction is revoked if the move would result in the best solution found so far. Such a mechanism for overriding the tabu classification is termed an aspiration criterion. Another aspiration criterion is used alongside the best solution criterion: aspiration by search direction permits a tabu move that improves on the current solution if it also caused improvement the last time it was performed.

For each move its associated value can be determined as the change to the value of raw Energy (3.16) it would cause if it was applied. Of all admissible (non-tabu) moves the one with the highest value is selected, as long as it is positive. If no improving move exists frequency-based memory is used to determine the winner. The moves are sorted in the descending order of their values, and the one with the smallest sum of its rank and frequency (count of its prior occurrence) is selected. A tie is broken by selecting the least frequent move. Frequency memory operates over the long term, and its application here diversifies the search, driving it into new regions.

We found however that this algorithm had poor convergence performance. The cause of the problem was that moves which consistently had large negative values would only be selected after frequency memory disqualified other moves, i.e. after a large number of iterations. Such moves are temporarily disadvantageous but crucial if a minimum is to be found. Consequently, many iterations were being wasted considering only a subset of the configuration. The second shortcoming was that in order to calculate move values incrementally between iterations, for efficiency reasons, a different loss function to raw Energy (3.16) had to be used. This is due to the fact that (3.16) involves a square root calculation, and its update for each move cannot be determined analytically, i.e. in fixed time. We overcame this problem by squaring both terms in the numerator of (3.16):

 \begin{displaymath}
\sigma_{Er^2}(\vec{X})\stackrel{\mathrm{def}}{=}\sum_{i<j}...
...ft(d_{ij}^2(\vec{X})-\hat{d}_{ij}^2\right)^2}{\hat{d}_{ij}^2}
\end{displaymath} (3.20)

which can be derived from weighted S-Stress [Takane77] by setting $w_{ij}=\hat{d}_{ij}^{-2}$:

 \begin{displaymath}
\sigma_{r^2}(\vec{X})\stackrel{\mathrm{def}}{=}\sum_{i<j}w_{ij}\left(d_{ij}^2(\vec{X})-\hat{d}_{ij}^2\right)^2
\end{displaymath} (3.21)

However, the expanded analytical formulas for updating (3.20) were complicated and expensive to compute.

The reason that we thought TS should perform better than NR is that it would not indiscriminately apply the moves, and hence could prevent configuration instability altogether. Since it is important for all points to be updated regularly, but avoid drastic moves (see Section 3.3.2), we decided to rely solely on frequency memory for selection, combined with move values, in the revised version of the algorithm. The value of a move is defined as the contribution of the originating point $\vec{x}_i$ to the overall value of (3.16), i.e. it is the total energy of springs stemming from it:

 \begin{displaymath}
\sigma_{Er}(\vec{x}_i)=\sum_{j\ne i}\frac{\left(d_{ij}-\hat{d}_{ij}\right)^2}{\hat{d}_{ij}^2}
\end{displaymath} (3.22)

These values can be updated between iterations with little overhead. The move with the lowest sum of its value rank in descending order and its frequency is selected, favouring the least frequent one if there is a tie. This will have an effect of keeping frequency of all moves roughly equal, and making subsequent selection of a drastic move very likely, which will give the corresponding point an opportunity to finally assume a favourable location. The termination criterion is the same as in the NR algorithm, i.e. when the configuration reaches stability up to a desired level of numerical precision (TS.epsilon parameter).

   
Genetic Algorithm

Genetic Algorithms (GAs) [Goldberg89,Reeves95] exploit random search in an intelligent manner to solve optimisation problems. Because they work with a coding of the parameter set, and not the parameters themselves, they are generic and largely domain independent. To steer the process only some measure of performance or fitness is needed, for example evaluation of the loss function in the case of a function optimisation problem. GAs draw a parallel to the mechanics of natural evolution. A population of candidate solutions (chromosomes) is maintained, and they are selectively combined by means of a crossover operator into a new generation of solutions. A mutation operator is applied probabilistically to diversify the population, and prevent premature convergence. Although selection is random, it is biased towards solutions with high fitness, giving their parameters (genes) a better chance of propagating to subsequent generations, and therefore influencing the optimisation process. A record of the best solution is kept, and returned upon termination.

Because MDS is a continuous optimisation problem, we have decided to use the real coding [Surry95] in our algorithm. A solution is encoded as a vector of floating-point numbers (genes), which directly correspond to coordinates of points in the configuration. Blend crossover (BLX) [Surry95] is used for recombining genes. It takes a parameter $\alpha$ that controls its precise effect, and a pair of values x and y to recombine. Assuming $x\le y$ without loss of generality, its outcome is a uniform random number drawn from the interval $[x-\alpha(y-x), y+\alpha(y-x)]$. BLX0 operates on the range [x,y], and is therefore biased towards the centre. This effect is countered by using $\alpha=0.5$. A new solution is derived from a pair of chromosomes by applying BLX0.5 to each pair of corresponding genes.

The most appropriate form of mutation to use with the real coding is creep mutation [Surry95]. Instead of replacing the affected gene with a new random value, the current value is perturbed by a small amount. This can be achieved by adding on a number drawn from the Gaussian distribution of zero mean and standard deviation sigma (GA.sigma parameter), for example. The crossover operator is most effective when the population is diverse. At this stage of optimisation too high a level of mutation will interfere with crossover. However, once the population starts to converge, crossover becomes ineffective, and a good rate of mutation is essential if different solutions are to be explored, and optimisation to continue [Reeves95]. Therefore, we concluded that an adaptive rate of mutation would work best. Whenever BLX is attempted on a pair of genes that are identical up to a certain level of precision (GA.epsilon parameter), Gaussian mutation is performed instead. This form of mutation reintroduces diversity only when needed.

The final facet of the algorithm is the process of selecting which chromosomes will undergo the genetic operations described above, to form the next generation. To steer the optimisation process, it is important to give a higher likelihood of selection to solutions with high fitness $f(\vec{X})=c-\sigma_E(\vec{X})$, where $c\ge \sigma_E(\vec{X})$ for all $\vec{X}$. However, simply making the probability of selection proportional to fitness has undesirable effects. Initially, a relatively good solution could take over the population, and cause it to converge prematurely around a poor local minimum. At the later stages of optimisation, when the population has largely converged, and differences between solutions are small, best solutions will not receive enough emphasis. The desired level of competition among members of the population can be attained throughout the algorithm run if rank of solution fitness is used instead, as is effectively done in the tournament selection scheme [Reeves95].

Tournament selection takes its name from performing a tournament on a random subset of t solutions from the population of size m (GA.populationsize parameter), and choosing the solution with the highest fitness value as the winner. This selection is performed m times, and the subsets are determined by randomly perturbing the population sequence, and taking successive groups of t elements. Once the pool of solutions is exhausted, a new permutation is created; t repetitions are needed overall. Setting t=2 results in an adequate selective pressure [Surry95], in that the best solution will be chosen twice, the worst not at all, and the median once on average.

Successive pairs of tournament winners are subject to the combined crossover and mutation operator twice, to yield two new solutions. m needs to be divisible by 2t to ensure that these pairs come from a single permutation, and hence avoid the possibility of a solution being mated with itself. A new generation of m solutions is created in this way, and the algorithm can proceed to the next iteration. If a superior solution is encountered it is noted, otherwise the best solution found so far replaces the worst solution in the new population. This mechanism is a form of elitist selection [deJong75], and guarantees monotonic improvement of maximum population fitness, and ultimately convergence. Termination occurs after a set number of iterations (GA.iterationlimit parameter) have been performed without finding an improving solution.

   
Majorization Algorithm

Iterative Majorization (IM) [Ortega70,deLeeuw77] belongs to a class of descent techniques for function minimisation [Press92], as it generates a non-increasing sequence of function values. If a complicated function f to be minimised is bounded from below, like Energy (3.13) for example, IM will converge to a local minimum of f. Instead of minimising f(x) directly, an auxiliary function g(x,z), where z is a constant, is defined and minimised, to get an approximation to a minimum of f(x). By choosing g to be quadratic in x, its unique minimum can be found analytically, although other function forms can be useful, too. g has to meet two conditions to be called a majorizing function of f:
1.
$f(x)\le g(x,z)$, for all x
2.
f(z)=g(z,z), with z referred to as the supporting point
Assuming that x* is the minimum of g(x,z) over x, the majorizing conditions imply the following chain of inequalities:

 \begin{displaymath}
f(x^*)\le g(x^*,z)\le g(z,z)=f(z)
\end{displaymath} (3.23)

Once x* is found it is used as the new supporting point: z'=x*, and thus a better approximation to the minimum of f is established iteratively.

The principle of IM can be easily generalised to multivariable functions. There exists a quadratic majorizing function for weighted raw Stress (3.17), and it forms the basis of the SMACOF algorithm [deLeeuw77,Borg97] (the acronym stands for Scaling by MAjorizing a COmplicated Function). The scale dependency of (3.17) (see Section 3.2.2) is circumvented by explicitly normalising $\{\hat{d}_{ij}\}$, so that $\sum_{i<j}w_{ij}\hat{d}_{ij}^2=n(n-1)/2$. By setting each $w_{ij}=\delta_{ij}^{-2}$ this algorithm can be made to minimise raw Energy (3.16). The explicit normalisation under the ratio MDS model ( $\hat{d}_{ij}=a\delta_{ij}$) becomes:

\begin{eqnarray*}\sum_{i<j}\delta_{ij}^{-2}a^2\delta_{ij}^2&=&\frac{n(n-1)}{2}\\
a^2\sum_{i<j}1&=&\sum_{i<j}1\\
a&=&1
\end{eqnarray*}


that is we arrive at the absolute MDS model with loss function (3.16).

The most recent implementation of the SMACOF algorithm is ProxScal version 6.4 [Busing97], and we decided to use it as a benchmark for our MDS algorithms (see section 3.5). The stopping criteria for the algorithm are met when either the rate of change of Energy (3.13) drops below a specified threshold value (ProxScal.rate parameter), or a maximum number of iterations has been reached (ProxScal.iterationlimit parameter).

   
Simulated Annealing

Simulated Annealing (SA) [Dowsland95] generalises descent methods of local optimisation (see Section 3.3.5 for an example) by allowing uphill moves, i.e. ones increasing the value of the loss function. It is hoped that by introducing such moves the method will be able to escape local minima, and ultimately converge to a global minimum% latex2html id marker 17429
\setcounter{footnote}{3}\fnsymbol{footnote}. To satisfy this objective the acceptance probability for an uphill move has to be related to the magnitude of the increase, and the optimisation history. A convenient formula can be borrowed from thermodynamics:

 \begin{displaymath}
p(\delta E)=\exp \left(-\frac{\delta E}{kT}\right)
\end{displaymath} (3.24)

which gives the probability of an increase $\delta E$ in energy at temperature T. k is Boltzmann's constant, and is obviously meaningless in the function minimisation context, thus the whole denominator is replaced by a single parameter t, although the thermal nomenclature is retained. For negative $\delta E$ the formula (3.24) produces $p(\delta E)>1$, thus a downhill move is always accepted. Initially the temperature is high, and so is the probability of accepting an uphill move of a given magnitude. During the course of the optimisation the temperature is reduced, and so is the associated probability, until both attain low enough values for the procedure to effectively reduce to a very slow descent, at which stage it can be stopped. SA draws obvious analogies to the physical process of annealing, as it originates from simulation studies in this field [Metropol53,Kirkpatr83].

At a given temperature t all n points in the configuration are considered one at a time. A move is generated by adding a deviate from the Cauchy distribution to each coordinate of point $\vec{x}_i$:

 \begin{displaymath}
\tilde{x}_{ik}=x_{ik}+t\tan p
\end{displaymath} (3.25)

where p is drawn uniformly from $(-\pi/2,\pi/2)$, and temperature t is acting as the scale parameter of the distribution. The update formula (3.25) will produce Gaussian like local displacement most of the time, with an occasional long jump, which allows radical moves to be tried even at low temperatures in a controlled manner. Thus temperature can be reduced quicker, compared to using a deviate from the Gaussian distribution in (3.25), while maintaining the properties of global optimisation [Szu87].

The new value of raw Energy (3.16) can be calculated efficiently since only one point is affected, by maintaining a breakdown of the contribution (3.22) of each point to the overall value, as in Section 3.3.3. Rather than calculating the acceptance probability from (3.24), and comparing it with a uniform random number $p\in[0,1]$ to decide whether to accept the move, the maximum acceptable value of (3.22) can be calculated in advance by rearranging (3.24):

 \begin{displaymath}
\xi=\sigma_{Er}(\vec{x}_i)-t\ln p
\end{displaymath} (3.26)

allowing the calculation of $\sigma_{Er}(\tilde{\vec{x}}_i)$ to be aborted as soon as it exceeds $\xi$ [Wright89]. This modification improves the performance, especially when the rejection ratio is high. There is no need to randomise the order of points, as in Section 3.3.2, since not all the moves will be accepted according to (3.26), and thus the sequence of accepted moves will automatically be scrambled.

Central to the design of a simulated annealing algorithm is the choice of a cooling schedule, i.e. the manner in which the temperature parameter t is reduced. It is a common observation that most of the useful work is done in the middle of the schedule [Dowsland95], and thus it is important to chose an appropriate range of temperatures, over which the cooling occurs. However, the optimal temperature range is not only application dependent, but will be different for every instance of the problem, in general. The actual temperature reduction function is less important, and a geometric function ti+1=ati is the most common choice in practice, for some $a\in(0,1)$ (SA.geometric parameter) [Dowsland95].

Since the amount of useful work done by the algorithm is the overriding consideration, we decided to specify a target acceptance ratio $\bar{\mu}$ (SA.acceptance parameter) instead of prescribing the temperature range [Dowsland93]. The actual acceptance ratio $\mu$ is the proportion of both downhill and uphill moves accepted to the total number of moves n at each temperature. At high temperature many moves generated by (3.25) will result in a significant increase to (3.22), and according to formula (3.26) will be rejected, thus $\mu$ will be low. Conversely, at lower temperature more moves will be downhill, and the remaining uphill ones are likely to be more local, and cause smaller increases to (3.22), therefore, $\mu$ will be higher.

Temperature is updated according to the following formula:

\begin{displaymath}t_{i+1}=a^{(\bar{\mu}-\mu_i)n}t_i
\end{displaymath} (3.27)

If temperature ti is too high $\mu_i$ will be less than $\bar{\mu}$, and for each deficit acceptance temperature is multiplied by a, and thus reduced in proportion to the deficit. Conversely, too low a value of ti causes $\mu_i$ to be greater than $\bar{\mu}$, and the temperature is divided by a for each surplus acceptance, i.e. proportionally increased. Therefore, this schedule will eventually converge at iteration j to a state where $\mu_k\approx \bar{\mu}$, for all subsequent iterations $k\ge j$. The speed of this convergence is dependent on the value of a and t0 (SA.initialtemperature parameter), although we expect the algorithm to be fairly insensitive to their exact choice, as only the value of $\bar{\mu}$ determines the equilibrium temperature. The algorithm goes through a set number of temperature updates (SA.iterationlimit parameter), and maintains a record of the best solution generated.

   
Experimental Setup

In order to evaluate the relative merits of each algorithm described in Section 3.3, we used a test bed consisting of 66 data collections: 2 dissimilarity matrices, 2 image collections, 6 graphs, and 56 data tables (see Section B.1 and B.2 in the appendices for details). The number of objects in a collection ranged from 9 to 1484, with the median value of 159. Since least-squares MDS algorithms are heuristics for function minimisation, and can only be expected to find local minima in general, we decided to scale each data collection 10 times, to give us a more complete picture of the algorithms' performance. We recorded the minimum and average Energy (3.13) values of 10 trials, and the mean running time per trial for every combination of data collection and algorithm. PCO is an analytical method (see Section 3.3.1), and only a single trial per data collection was necessary.

Parameters of the algorithms were fixed to the following values:


 
Table 3.1: Wilcoxon test results for SA.acceptance parameter


relationship1 # cases2 rank sum3 p 4
$10\downarrow 5$ 53 1938 0.001
$10\uparrow 5$ 13 273  
$15\downarrow 10$ 49 1868 0.001
$15\uparrow 10$ 17 343  
$20\downarrow 15$ 43 1599 0.001
$20\uparrow 15$ 23 612  
$25\downarrow 20$ 39 1384 0.076
$25\uparrow 20$ 27 827  
$30\downarrow 25$ 36 1109 0.985
$30\uparrow 25$ 30 1102  

(a) minimum Energy

 
relationship # cases rank sum p
$10\downarrow 5$ 50 1861 0.001
$10\uparrow 5$ 16 350  
$15\downarrow 10$ 41 1564 0.003
$15\uparrow 10$ 25 647  
$20\downarrow 15$ 32 1157 0.746
$20\uparrow 15$ 34 1054  
$25\downarrow 20$ 30 1151 0.775
$25\uparrow 20$ 36 1060  
$30\downarrow 25$ 27 778  
$30\uparrow 25$ 39 1433 0.036

(b) average Energy



1 $t\downarrow u$ denotes that Energy for acceptance ratio $t\%$ is lower than for $u\%$, and $t\uparrow u$ implies the opposite
2 the number out of 66 cases for which the relationship holds
3 absolute differences in Energy for both values of the acceptance ratio parameter across all cases are ordered on their magnitude. Ranks for which differences are negative are summed separately from the positive ones, and their total is $\sum_{r=1}^{66}r=2211$. These two sums should be roughly equal under H0: no overall difference in Energy for both acceptance ratios
4 probability of rejecting H0 when it is true, nominally an acceptable value - the significance level of the test - is no greater than 5%

 

   
Statistical Analysis

The statistical analysis of MDS algorithms of Section 3.3 is structured into three distinct parts: minimum Energy, average Energy, and running time experiments, each of which consists of 6 related measurements, one for each algorithm, made for every data collection in the test bed. For such an experimental design there are two statistical tests: two-way analysis of variance, or its non-parametric cousin, the Friedman test [Siegel56]. The former can only be used if measurements are independently drawn from normally distributed populations, these populations have identical variance, and their means are linear combinations of effects due to differences in algorithms and data collections. The latter test is more general and robust, as it does not make any of these assumptions, and we preferred to use it for our statistical analysis.

   
Minimum Energy


 
Table 3.2: Friedman test results for minimum and average Energy

algorithm minimum average
NR 2.86 3.27
TS 2.29 2.80
GA 4.11 3.89
SA 3.27 2.67
IM 2.55 2.61
PCO 5.92 5.76
$\chi_r^2$ 1 171.38 137.33
W 2 0.519 0.416


1 $\chi_r^2$ is Friedman's chi-square statistic; the critical value at p=0.001 level is 20.52
2 W is Kendall's coefficient of concordance; it ranges between 0 (no agreement between cases) to 1 (complete agreement); the critical value at p=0.001 level is 0.062
The figures in rows 2-7 are mean ranks across 66 cases, which are computed by summing the relative order of minimum or average Energy for a particular algorithm over all cases, and dividing by their total number

A Friedman test has been carried out with the null hypothesis H0 of no difference between algorithms with regard to minimum Energy, and the alternative hypothesis H1 of at least one algorithm having a different ranking from the others; the results are presented in the second column of Table 3.2. The values of both Friedman's chi-square statistic and Kendall's coefficient of concordance are statistically significant at a level exceeding 0.001, which allows H0 be rejected in favour of H1. This permits the use of the Tukey Multiple Comparison procedure to test for significant differences between algorithm rankings in pairs [Keselman77].


 
Table 3.3: Tukey Multiple Comparison results


rank comparison Q 1 p 2
PCO>TS 15.79 * 3
PCO>IM 14.64 *
PCO>NR 13.32 *
PCO>SA 11.51 *
PCO>GA 7.90 *
GA>TS 7.90 *
GA>IM 6.74 *
GA>NR 5.43 *
GA>SA 3.62 0.003
SA>TS 4.28 *
SA>IM 3.13 0.010
SA>NR $\ll 4.03$ 0.244
NR>TS 2.47 0.008
NR>IM $\ll 4.03$ 0.037
IM>TS $\ll 4.03$ 0.089

(a) minimum Energy

 
rank comparison Q p
PCO>IM 13.69 *
PCO>SA 13.42 *
PCO>TS 12.83 *
PCO>NR 10.79 *
PCO>GA 8.09 *
GA>IM 5.59 *
GA>SA 5.33 *
GA>TS 4.74 *
GA>NR 2.70 0.103
NR>IM 2.90 0.005
NR>SA $\ll 4.03$ 0.003
NR>TS $\ll 4.03$ 0.045
TS>IM $\ll 4.03$ 0.165
TS>SA $\ll 4.03$ 0.159
SA>IM $\ll 4.03$ 0.439

(b) average Energy



1 Q is Tukey's statistic for multiple comparisons; the critical value at p=0.05 level is 4.03
2 the p value associated with a one-tailed Wilcoxon signed-rank test
3 * denotes a significant Tukey comparison, and thus no need for a Wilcoxon test

Table 3.3(a) demonstrates that PCO has the highest overall ranking, meaning that it achieves the worst minimum Energy. The comparisons between GA and the remaining algorithms, except SA, are significant at p<0.05 level. To verify the tie of GA and SA, we decided to perform a one-tailed Wilcoxon signed-rank test, because it is considerably more powerful [Siegel56]. It produces a highly significant p value of 0.003 (see Table 3.3(a)), which taken together with the multiple comparison results implies that GA is the second worst in terms of minimum Energy. The only remaining significant Tukey comparison is that between SA and TS; additionally, the Wilcoxon test is significant for (SA,IM), (NR,TS), and (NR,IM) pairs. Thus, it can be inferred that IM and TS are tied in the first place, and are significantly better than both SA and NR, which in turn are also equivalent in performance.

Average Energy

The third column of Table 3.2 contains the results of the Friedman test with the null hypothesis H0 of no difference between algorithms with regard to average Energy, and the alternative hypothesis H1 of at least one algorithm having a different ranking from the others. Both Friedman's chi-square statistic and Kendall's coefficient of concordance are statistically significant at p<0.001 level, which allows H0 be rejected in favour of H1. The Tukey Multiple Comparison procedure can subsequently be performed, to find out which pairs of algorithm rankings are significantly different.

Since PCO is an analytical method (see Section 3.3.1) the minimum and average values of Energy are taken to be the same. The Tukey comparisons between PCO and the other algorithms are significant at the 5% level according to Table 3.3(b), implying that PCO has the highest overall ranking with regard to average Energy. The multiple comparisons between GA and the remaining algorithms are significant at p<0.05 level, except with NR, and this tie is confirmed by the Wilcoxon test. All remaining Tukey comparisons are not statistically significant; however, the Wilcoxon tests of NR versus IM, SA, and TS in turn are significant at the 5% level. Thus the following ordering can be inferred: SA, IM, and TS jointly achieve the lowest average Energy, and are significantly better than NR and GA, which are tied themselves.

   
Running Time

The algorithms under test have different time complexities: PCO is O(n3) as it is based on an eigendecomposition of an $n\times n$ matrix [Press92] (also see Section 3.3.1 and 4.3), the other algorithms are least-squares techniques, and are O(n2) per iteration% latex2html id marker 17911
\setcounter{footnote}{4}\fnsymbol{footnote}, except that IM is O(n3) per iteration when minimising raw Energy (3.16), as it involves a multiplication of two $n\times n$ matrices [deLeeuw77,Borg97]. Therefore, we decided not to perform the analysis of variance on the time measurements, as the majority of data sets in the test bed are small (see Section 3.4), and an unfair advantage would be given to PCO and IM, which are efficient for such data, but are quickly outperformed when n grows.


 
Table 3.4: Comparison of algorithm running time

algorithm total time1 time ratio2 linear fit3
SA 136 n/a n/a
NR 152 1.12 1.34
PCO 165 1.22 1.51
IM 345 2.54 2.74
TS 1074 7.91 9.48
GA 86696 4 638.36 764.03


1 average time in minutes required for a single test trial with the test bed
2 ratio of average time for a given algorithm to that of SA
3 the parameter of a least-squares fit of SA timings to that of each algorithm
4 approximately 60 days

The second column of Table 3.4 reports the average running time of a single test trial for each algorithm. The fastest algorithm with our test bed is SA, and it is closely followed by NR and PCO. For convenience, the ratio of running time for every algorithm to that for SA is presented in the third column of Table 3.4. IM and TS are considerably slower, but are within one order of magnitude of SA. GA is by far the slowest algorithm, as it required almost one and a half years of CPU time of an Intel Pentium II 300 MHz workstation to complete the experiment% latex2html id marker 17915
\setcounter{footnote}{5}\fnsymbol{footnote}.

The total running time and its ratios in Table 3.4 are peculiar to the composition of the test bed. To derive a more robust estimation of relative speed of the algorithms, we decided to perform a linear regression of measurements for SA to that of other algorithms. The timings for an individual algorithm i are collected in the vector $\vec{m}^{(i)}$, with one element for each data collection in the test bed. The parameter ai that gives the best least-squares fit of the model $\vec{m}^{(i)}=a_i\vec{m}^{(SA)}$ is taken as an empirical estimate of how much slower algorithm i is compared to SA, and is reported for each algorithm in the last column of Table 3.4. These parameters are in a good agreement with the running time ratios, and thus the conclusions of the previous paragraph hold.

   
Identifying the Best Algorithm

SA was identified as the fastest algorithm for our experimental settings, while at the same time sharing the lowest average and second lowest minimum Energy ranking. Close contenders are IM and TS, which achieved the lowest minimum and average Energy, however at a 2.5 and 8 fold increase in running time, respectively (see Table 3.4). This comparison may be unfair on SA, which exhibits good global optimisation characteristics, confirmed by the average Energy performance, but perhaps was not given enough time to refine its solutions, and achieve competitive minimum Energy scores. Therefore, we decided to re-run the SA tests, setting SA.iterationlimit to 2500 this time (see Section 3.3.6 and 3.4).


 
Table 3.5: Wilcoxon test results for SA-2500 and hybrid algorithms

  SA-2500 hybrid
algorithm #cases1 rank sum2 p 3 #cases rank sum p
PCO 65 2166 0.001 65 2209 0.001
GA 47 1670 0.001 59 2057 0.001
SA-1000 53 1773 0.001 66 2211 0
NR 30 1203 0.538 47 1623 0.001
IM 34 991 0.469 44 1374 0.001
TS 25 897 0.185 44 1481 0.007
SA-2500 n/a 46 1624 0.001


1 the number out of 66 cases for which minimum Energy achieved by a given algorithm was greater than that of SA-2500 or hybrid
2 absolute differences in minimum Energy for both algorithms across all cases are ordered on their magnitude. The sum of ranks for which SA-2500 or hybrid achieved lower Energy than a given algorithm is reported; the rank total is $\sum_{r=1}^{66}r=2211$. The rank sum should be about a half of the total under H0: no overall difference in minimum Energy for both algorithms
3 probability of rejecting H0 when it is true, nominally an acceptable value - the significance level of the test - is no greater than 5%

Table 3.5 shows the results of the two-tailed Wilcoxon signed-rank test between SA-2500 and all other algorithms. The null hypothesis is H0: no difference in minimum Energy between SA-2500 and a given algorithm. H0 is rejected for PCO, GA, and SA-1000; moreover, SA-2500 can be identified to be significantly better that these algorithms from the relevant rank sums. There is no evidence to reject H0 for NR, IM, and TS, however. Since SA-1000 was tied with NR, too, but worse than the other two algorithms in terms of minimum Energy (see Section 3.5.1), SA-2500 is only slightly better, and occupies the middle ground.

   
Hybrid Algorithm

Section 3.5.4 demonstrated that prolonging the SA run did improve the minimum Energy performance, but not as much as we hoped. We realised that the extra time could be spent on running a different heuristic after SA-1000, instead. IM seemed the best choice, as it is a strictly local minimisation procedure, i.e. will converge to the local minimum nearest to an SA-1000 solution% latex2html id marker 18018
\setcounter{footnote}{6}\fnsymbol{footnote}. Therefore, the best configuration of 10 SA-1000 runs for each data collection in the test bed was used as a starting point for IM, and the final value of Energy (3.13) recorded. The total time of the test for this hybrid algorithm was 2.87 times that of SA-1000 alone. This is comparable to 2.99 for SA-2500, and 2.54 for IM (see Table 3.4). The linear regression coefficients (see Section 3.5.3) are also similar at 3.01, 2.99, and 2.74 respectively. Thus, the best algorithm can be decided on the minimum Energy performance alone.

The results of the two-tailed Wilcoxon signed-rank test between hybrid and the remaining algorithms can be found in Table 3.5. The null hypothesis H0 of no difference between algorithms can be rejected in every case. The rank sums are in favour of hybrid, thus it can clearly be identified as the best MDS algorithm. As expected, hybrid improves on SA-1000 configurations for all data collections in the test bed.


 
Table 3.6: Linear regression comparison of MDS algorithms

algorithm ratio1
PCO 1.9873
GA 1.0561
SA-1000 1.0476
NR 1.0446
SA-2500 1.0397
IM 1.0018
TS 1.0016


1 overall ratio of minimum Energy for a given algorithm to hybrid

To complete the comparison of the various algorithms described in this chapter, we decided to perform a linear regression of minimum Energy measurements for the best algorithm to that of the others. Assuming that measurements for all data collections in the test bed with a given algorithm i are collected in the vector $\vec{m}^{(i)}$, the linear model becomes: $\vec{m}^{(i)}=a_i\vec{m}^{(\mathit{hybrid})}$. Table 3.6 presents all coefficients ai in the descending order. This ordering is exactly the same as inferred by the statistical tests, namely: PCO is the worst in terms of minimum Energy, followed by GA, the next place is jointly occupied by NR and SA-1000; SA-2500 is only slightly better. TS and IM achieve considerably lower Energy overall, and are only 0.2% worse than hybrid on average. The improvement between SA-1000 and hybrid is almost 5%, and fully justifies using a local minimisation heuristic to refine SA solutions.

   
Qualitative Evaluation

In Section 3.5 we were able to demonstrate by means of statistical tests that the algorithms of Section 3.3 differ significantly in terms of output quality. It is a much more objective and efficient method of comparing MDS configurations for all $6\times 66$ combinations of an algorithm and a test data collection, than inspecting them visually. Nonetheless, it is important to show some examples, so that the magnitude and character of the differences can be appreciated. For this purpose we have selected a sample data set from the test bed, which appears in Section B.1 in the appendices as cpu-performance. It is a tabular record of 7 quantitative attributes for 209 CPUs.


   
Figure 3.4: Visualizations of the cpu-performance data table with different MDS algorithms. Dots represent individual CPUs - rows in the table. Each configuration was fitted to the one for the hybrid algorithm by Procrustes analysis; $\sigma_E(hybrid)=0.0308$. Lines link pairs of corresponding points in both configurations, and thus show the extent of the differences


\epsfig{file=images/mds-pco.eps}

\epsfig{file=images/mds-ga.eps}

(a) $\sigma_E(\mathit{PCO})=0.0746$

(b) $\sigma_E(\mathit{GA})=0.0475$


  
\epsfig{file=images/mds-nr.eps}

\epsfig{file=images/mds-ts.eps}

(c) $\sigma_E(\mathit{NR})=0.0418$

(d) $\sigma_E(\mathit{TS})=0.0318$


  
\epsfig{file=images/mds-sa.eps}

\epsfig{file=images/mds-im.eps}

(e) $\sigma_E(\mathit{SA})=0.0327$

(f) $\sigma_E(\mathit{IM})=0.0325$


Visualizations produced by all algorithms are arranged side by side in Figure 3.4. The experimental setup of Section 3.4 has been used, thus the best configuration of 10 runs for each algorithm is shown, except that the PCO configuration is unique. To ease the task of comparing these visualizations, and only focus on significant differences, a common orientation of the figures was assumed. Each configuration has been transformed by Procrustes Analysis [Borg97,Cox94] with regard to a reference configuration - that for the hybrid algorithm, as it achieved the lowest value of Energy (3.13) of all algorithms. Procrustes Analysis seeks the optimal transformation of a configuration to match a given target configuration. Since Energy (and Stress (3.11) for that matter) with the ratio MDS model are invariant under dilation and rigid motions: translation, rotation, and reflection of a given configuration, these transformations were allowed.

As can be seen from Figure 3.4(e), the configurations for SA and hybrid - the reference configuration - are very similar. Since hybrid consists of an SA run followed by IM, i.e. it finds the local minimum closest to the configuration produced by SA, such a correspondence was expected.

The level of agreement between a pair of configurations $\vec{X}$ and $\vec{Y}$ can be measured by the Tucker's congruence coefficient [Tucker51] (also see Section 3.2.2):

 \begin{displaymath}
c(\vec{X},\vec{Y})=\frac{\sum_{i<j}d_{ij}(\vec{X}) d_{ij}(...
...}{2}}\left(\sum_{i<j} d_{ij}^2(\vec{Y})\right)^{\frac{1}{2}}}
\end{displaymath} (3.28)

The coefficient achieves its maximum value 1 when $d_{ij}(\vec{X})=bd_{ij}(\vec{Y})$, for all i,j. The congruence coefficient is invariant under dilations of $\vec{X}$ and $\vec{Y}$:

\begin{displaymath}c(a\vec{X},b\vec{Y})=\frac{\sum_{i<j}ad_{ij}(\vec{X}) bd_{ij}...
...b^2d_{ij}^2(\vec{Y})\right)^{\frac{1}{2}}}=c(\vec{X},\vec{Y})
\end{displaymath}

and under rigid motions, since distances are not affected by them. Therefore, $c(\vec{X},\vec{Y})$ will not change after carrying out Procrustes Analysis between configurations $\vec{X}$ and $\vec{Y}$, and determines their fit in advance.


 
Table 3.7: Congruence coefficient for pairs of MDS configurations

  hybrid SA IM TS NR GA
SA 0.9989          
IM 0.9946 0.9938        
TS 0.9918 0.9905 0.9923      
NR 0.9909 0.9881 0.9897 0.9921    
GA 0.9734 0.9725 0.9733 0.9748 0.9736  
PCO 0.9783 0.9765 0.9800 0.9796 0.9796 0.9604


Let the configuration for algorithm z be denoted as $\vec{X}^{(z)}$. The value of $c(\vec{X}^{(\mathit{SA})},\vec{X}^{(\mathit{hybrid})})$, which can be found in Table 3.7, is higher than for any other pair, and very close to the maximum value possible. Overall, $\vec{X}^{(\mathit{IM})}$ in Figure 3.4(f) corresponds well to $\vec{X}^{(\mathit{SA})}$, except that some outliers assume an opposite orientation with regard to the main group of points. Nonetheless, Energy (3.13) is virtually the same in both cases, and the congruence coefficient is very high among these two configurations and also $\vec{X}^{(\mathit{hybrid})}$. This is an apt demonstration that there can exist equivalently good proximity visualizations of a given data collection, with nontrivial differences between them. This point is emphasised when considering $\vec{X}^{(\mathit{TS})}$, which exhibits even more differences to $\vec{X}^{(\mathit{hybrid})}$, but is judged similar by both the loss function and the congruence coefficient.

$\vec{X}^{(\mathit{NR})}$ is a significant departure from the reference configuration, and consequently congruence with $\vec{X}^{(\mathit{SA})}$ and $\vec{X}^{(\mathit{IM})}$ is moderate, relative to the values of coefficient c for the algorithms considered so far. However, $c(\vec{X}^{(\mathit{NR})},\vec{X}^{(\mathit{TS})})$ is the fifth highest, despite NR being much worse in terms of Energy, and many similarities between Figures 3.4(c) and 3.4(d) can be noticed.

PCO and GA have produced very different configurations to that of the hybrid algorithm, since many long lines can be seen in Figure 3.4(a) and 3.4(b). Values of the congruence coefficient with configurations for the other algorithms are relatively low, and $c(\vec{X}^{(\mathit{PCO})},\vec{X}^{(\mathit{GA})})$ is actually the lowest. Energy of $\vec{X}^{(\mathit{PCO})}$ is much worse than for $\vec{X}^{(\mathit{GA})}$, which in turn is inferior to the remaining configurations. The PCO configuration is completely different to that of the least-squares algorithms. Points are close to two perpendicular lines - principal components (see Section 3.3.1), and local detail is not represented as well as with the other algorithms.

   
Discussion

Several MDS algorithms have been described in this chapter, and compared on the basis of their performance with a sample of small and medium-sized data collections. A histogram of Energy (3.13) minima calculated by the hybrid algorithm for all data sets is presented in Figure 3.5. The distribution of data collections between buckets is typical of a balanced sample, with most having moderate Energy, and extreme cases being in minority. The minimum Energy for a data set is indicative of its complexity, and a challenge it represents to an MDS algorithm in finding a good low-dimensional representation for it. Therefore, we are confident in the validity of statistical inference presented in Section 3.5, and conclude that these results go beyond just the data collections considered there.


 
Figure 3.5: Histogram of Energy minima calculated by the hybrid algorithm for all collections in the test bed


\epsfig{file=images/mds-histogram.eps, width=8.25cm}


PCO is shown to achieve inferior results both objectively and subjectively, and this can be explained by the characteristics of the loss function (3.18) it actually minimises, compared to Energy (3.13). However, its solutions are unique, and can be computed efficiently for smaller data sets; consequently, with such data PCO might serve for other MDS algorithms as the source of a better starting configuration than a purely random one [Kruskal78b]. The least-squares algorithms are much more effective at minimising Energy. The overall difference in Energy between the best and the worst heuristic is less than 6%; thus the correctness of these algorithms is cross-validated.

It is not practical to run GA on standard computer hardware, as it takes prohibitively long to produce its solutions. However, this algorithm can easily be parallelised [Reeves95], and merits consideration with such an architecture. NR and SA are the fastest MDS algorithms, and actually perform slightly better than GA. IM and TS are able to achieve even lower Energy on average, but at the same time are significantly slower. If one is willing to accept such an increase in running time then a combination of SA followed by IM performs the best. Such a hybrid approach takes advantage of the global optimisation characteristics of Simulated Annealing, in order to find the neighbourhood of a global or a good local minimum, and converge on this minimum with a descent technique of local optimisation.

Related Work

Multidimensional Scaling has been applied in many other areas, and consequently the literature is vast and dispersed over many periodicals and books. We will not attempt to give an overview of the developments to date, and refer the reader elsewhere [Borg97,Cox94]. Also, the Psychometrika journal is the single richest source of reference for MDS, as it regularly includes articles on the subject. Here, we give a brief summary of other work directly related to the algorithms discussed in this chapter.

The graph drawing community proposed a number of algorithms for constructing straight-line drawings of general undirected graphs, and an evaluation of the five most popular has been carried out [Brandenb95]. The study compared running time, number of edge crossings, and edge length uniformity across a collection of 79 graphs. One of the algorithms [Kamada89] is in fact an MDS implementation minimising raw Energy (3.16) (see Section 3.3.2). This algorithm was found to produce drawings of comparable quality to that of other algorithms, with the least computational cost. This favourable result should, therefore, extend to other MDS algorithms for this domain. The analysis presented (in the form of charts) was inconclusive for the aesthetic criteria, and the recommendations were based purely on running time of the algorithms.

In Sections 3.3.6 and 3.5 it has been demonstrated that the Simulated Annealing heuristic improves the likelihood of finding a global minimum of a loss function like Energy (3.13). The Deterministic Annealing approach [Klock97] capitalises on the benefits of its stochastic counterpart, while avoiding the multitude of choices and challenges in designing and fine tuning a successful algorithm. It does so by considering the statistical behaviour of Simulated Annealing as an entropy maximisation procedure, and deriving an approximate but computationally tractable solution. An algorithm for minimising Stress (3.11) based on this method is subsequently proposed.


next up previous
Next: Large Scale Visualization Up: Proximity Visualization of Abstract Data Previous: Multivariate Visualization Techniques

© 2001 Wojciech Basalaj