Next: Proximity Grid
Up: Proximity Visualization of Abstract Data
Previous: Multidimensional Scaling
Subsections
Large Scale Visualization
Abstract data collections can be visualized effectively with Multidimensional Scaling; however, the process becomes very time consuming for large data sets. This chapter explores alternatives that aim to achieve a substantial speed up, without sacrificing the quality of visualization too much. A novel technique is proposed that combines cluster analysis with least-squares MDS, to reduce both running time and space requirements of traditional MDS, while adequately representing proximity relationships in a large data collection. Multivariate statistics provides an alternative method of visualizing data tables, which is equivalent to classical scaling, but dependent on the number of attributes instead of objects, and thus more efficient for large data sets. Both methods are compared statistically and qualitatively on a sample of large data tables.
Challenges of Visualizing Large Data Collections
The amount of screen space available to each icon representing an object from a data collection becomes very limited if the collection is large. Here, it is more important to convey a proper overview of relationships among the objects than individual detail. Proximity Visualization is particularly suitable, as it will accurately portray the distribution of objects and clusters. Detailed information can be accessed by interacting with this overall visualization. For example, an object could be singled out, and its attributes shown to the user. Alternatively, a subset of the objects could be specified with a visual query, and visualized on its own with more detailed icons, as in Chapter 2.
Traditional multidimensional scaling is unsuitable for visualizing a large data collection, because the amount of pairwise dissimilarities that have to be considered grows quadratically in the number N of objects (see Section 3.1). However, there is a degree of redundancy in collecting all
dissimilarities, especially when no error is present [Young72], as is the case with the dissimilarity coefficients of Section 1.2. Thus we may attempt to identify and discard unimportant dissimilarities without affecting the accuracy of the visual representation.
Let us define the number of degrees of freedom for configuration
of N points in p dimensions [Young70]:
 |
(4.1) |
A simple relation has been empirically shown to exist between
and the proportion
of dissimilarities that must be retained for an MDS procedure to successfully recover a matching configuration
[Spence74]:
Admittedly, the study is only based on
and p=3, but shows that there is scope for discarding some dissimilarities without compromising the recovery of
.
In practice, most large data collections are multivariate tables, for example the test data of Section B.3 in the appendices. Dissimilarity matrices with a comparable number of objects are rare, as such a huge number of pairwise experimental measurements would be impractical to obtain, except when derived from an intermediate representation, e.g. some sort of summary for each object, which may be expressed as a multivariate table, anyway. Alternatively, an incomplete design could be applied, where only a subset of all dissimilarities is collected [Spence74], adhering to the guideline provided by (4.2) when deciding on the subset size
. Likewise, large connected graphs are not very common, and disconnected graphs can be visualized by considering each connected component individually.
The advantage of considering only tabular data is that the variables (columns) can be manipulated to derive a smaller number of variables, preferably two or three to facilitate direct visualization. Attribute clustering is one possibility, whereby correlated groups of variables are hierarchically aggregated, starting with as many groups as the original variables [Anderber73]. A particular level in the hierarchy with the desired number of groups is then selected, and these groups serve as the new variables. However, a more fine-grained composition of the original variables is possible with Principal Components Analysis [Pearson01,Hotellin33]. With this reclassification method, each derived variable is expressed as a linear combination of the original variables, subject to the new variables being mutually uncorrelated, or orthogonal in other words.
Incremental Multidimensional Scaling
An incremental MDS method, that achieves a good compromise between solution quality and the number of dissimilarities considered, is proposed here. The discarded dissimilarities are those which are not needed to preserve the overall shape of the MDS configuration. This overall shape is established by first carrying out a single-link clustering of the objects from a collection using a suitable dissimilarity coefficient (see Section 1.2). The clustering is used to select a subset of objects, which make up cluster diameters at a certain level of the cluster hierarchy. Standard MDS is then applied to this subset - full scaling stage - thus establishing a skeleton of marker points. The remaining objects are positioned within this skeleton configuration by only considering their dissimilarities to the fixed objects - single scaling stage. For a very large data collection the skeleton configuration can itself be built up incrementally.
The particular choice of an MDS method has little significance to the overall incremental approach described in the following sections. However, the method should allow new objects to be added to an existing configuration (single scaling) with little cost. Least-squares metric MDS fulfils this criterion; see Section 3.3 for examples of algorithms that could be applied.
Single-link Clustering
To determine an ordering of objects for the incremental method a Minimum Spanning Tree (MST) is computed for a fully connected graph, with one vertex for each object, and edge weights taken to be dissimilarities between vertices. MST is a connected graph with N vertices and N-1 edges, having the minimal cumulative weight of its edges [Harary69]. Prim's algorithm [Prim57] is used, as it is the most efficient MST algorithm for dense graphs [Sedgewic92].
There is a close relationship between a single-link clustering and an MST [Rohlf82], and the former can be derived from the latter trivially. Edges of the MST with the greatest weight link the highest level clusters in the cluster hierarchy, and thus define the overall structure of data. The order of objects is determined by taking vertices of MST edges sorted in descending order of weight, with only the first occurrence of a vertex noted. Therefore, the position of an object in the sequence will depend on its structural importance, with the most important objects coming first.
This discussion assumes no significant effect of outliers. In cases where they are a real concern, a data preparation stage might be necessary, for example replacing actual dissimilarities by their rank order (see Sections 1.2.9 and 3.2.1). Other cluster analysis techniques can be used instead, e.g. complete or average linkage; however, single-link clustering is the least computationally intensive [Anderber73].
The Algorithm Outline
- 1.
- objects are sorted in a descending Minimum Spanning Tree order
- 2.
-
are assigned random coordinates;
- 3.
- full scaling is performed on
- 4.
-
are added to the configuration by single scaling with respect to
- 5.
-
, and steps 3 and 4 are repeated until there are no more objects to add: ni=N
The choice of n1 determines how far to go down the single-link cluster hierarchy in order to form the first skeleton of marker points. The configuration is established by performing full scaling starting from random positions. Further objects from the lower levels in the cluster hierarchy up to a total of n2 are then included in the configuration by single scaling. The extended set of n2 points is taken as the initial configuration for a further full scaling run, and the new skeleton so formed is enlarged to n3 points by single scaling. The process of refining to form a skeleton by full scaling and then enlarging the configuration by single scaling is repeated until there are no more objects to be included. Note that the saving in cost arises because at the final step new objects are incorporated using single scaling only. These objects are associated with the lowest levels in the cluster hierarchy, and so may be positioned independently of one another without affecting the overall structure of the configuration.
A sensible choice of n1 - the size of the initial skeleton - is crucial to the success of the incremental method. The initial full scaling starts from random positions, and if a large skeleton is to be determined the computational cost may be high. On the other hand, if too few objects are included the configuration will not represent the overall structure adequately. It may be advantageous to scale several times from different random starting points and select the configuration with the lowest Energy (3.13), since the computational cost is dominated by later scaling runs, and the quality of the initial skeleton determines the final outcome.
The sequence
of constants is chosen in the following way:
| u1 |
= |
N |
|
| ui+1 |
= |
 |
|
| ni |
= |
uk-i+1 |
(4.3) |
where k is the smallest natural number such that
for some constant
, and
. Thus, the number n1 of objects to be included in the initial skeleton is determined by
, for given N and
:

. The choice of a suitable value for the exponent
is discussed in Section 4.2.3.
Time Complexity
- 1.
-
O(N2)
- 2.
-
- 3.
- O(ni2) per iteration, with O(ni) iterations: O(ni3)
- 4.
- O(ni) per iteration per object, with O(ni) iterations:
O((ni+1-ni) ni2)=O(ni+1ni2)
| T(N) |
= |
 |
|
| |
= |
 |
(4.4) |
Space Complexity
- 1.
- O(N) , dissimilarities are not stored, but computed on demand
- 2.
- O(1)
- 3.
- O(ni2), dissimilarities are cached for efficiency
- 4.
- O(ni), dissimilarities are cached for efficiency
- 5.
- O(N) for the configuration itself
| S(N) |
= |
 |
|
| |
= |
 |
(4.5) |
Empirical Characteristics
Table 4.1:
Energy and Stress at various stages of optimisation
| Energy |
| data |
initial |
100 iterations |
200 iterations |
final |
| 3D |
0.0782655 | 0.0697988 | 0.0695496 | 0.0694058 |
| 4D |
0.1127830 | 0.0997244 | 0.0989172 | 0.0989170 |
| 5D |
0.1376181 | 0.1161879 | 0.1159678 | 0.1159672 |
| 6D |
0.1567037 | 0.1264447 | 0.1263261 | 0.1262402 |
| Stress |
| 3D |
0.0847296 | 0.0636327 | 0.0641757 | 0.0643987 |
| 4D |
0.1315574 | 0.0944356 | 0.0926183 | 0.0926292 |
| 5D |
0.1779833 | 0.1125812 | 0.1117612 | 0.1117596 |
| 6D |
0.2166460 | 0.1253326 | 0.1248531 | 0.1245055 |
|
Table 4.1 shows results of tests run on synthetic data. Each data set consists of 5000 points randomly placed within a 3-, 4-, 5- or 6-dimensional cube, with a unit diagonal to keep the dissimilarity coefficient - Euclidean distance - uniformly in the range [0,1]. Incremental MDS, based on the NR algorithm of Section 3.3.2, has been applied to represent these data in two dimensions. The initial value of an index, either Energy (3.13) or Stress (3.11), refers to its value after single scaling from 300 to 5000 points. The remaining columns contain values after 100 iterations of full scaling on 5000 points, 200 iterations, and the final value after the algorithm has converged. To circumvent the natural variability of the method, each figure is taken as an average of 10 tests. Because Stress is a different loss function, it is possible for its value to increase when the value of Energy decreases during a minimisation procedure. This effect can be seen for the 3D and 4D tests in the last two columns of Table 4.1.
The results indicate that the initial value of Energy in all cases is already quite low. With 100 iterations it can be brought very close to the final value, i.e. a minimum. Performing a fixed number of iterations of full scaling, for example 100, on all objects, without caching the dissimilarities, would not affect time and space complexities of the algorithm.
Figure 4.1:
Plot of Energy and Stress minima vs. exponent for N-body simulation data
|
In order to suggest a suitable exponent
that determines the sequence (4.3) of constants in the incremental algorithm, extensive testing has been performed on a sample data set, which appears as nbody-2 in Section B.3 of the appendices. This data set is a snapshot of an astronomical N-body simulation [Tout97] consisting of N=14898 objects with 12 attributes. A range
[0.5,0.91] of exponents has been sampled at intervals of 0.0025. For every exponent
the sequence
has been used with the incremental version of the NR algorithm (see Section 3.3.2), and the final value of Energy and Stress in two dimensions recorded. The test has been repeated 10 times, and the resulting minimum value of Energy and Stress for every exponent is plotted in Figure 4.1. The chart also includes a corresponding plot of the minimum over the first three measurements for each exponent. Best of 3 and best of 10 plots coincide for the range
[0.65,0.91], which indicates that incremental MDS for large enough exponents is consistent between multiple runs.
It can be seen from Figure 4.1 that the relationship between exponent and indices is negative exponential:
. The
-axis scale is logarithmic in effect, because
is actually used in the tests. This results in an inverse polynomial relationship between
and indices:
, which implies an even slower decay rate. Therefore, choosing an exponent is a matter of trading off speed for accuracy, with a diminishing advantage for larger exponent values. A value in the range [2/3,3/4] seems to be a good compromise. The fitted exponential regressions predict the Energy minimum for this particular data set at 0.05551, and Stress at 0.03143. Standard MDS for all 14898 objects has yielded 0.05418 and 0.03283 respectively, which suggests that these regressions are a good fit to the measurements. It is worth noting that this single MDS run takes about a week to compute on an Intel Pentium II 300 MHz workstation, which is enough time to test the range
[0.5,0.75] 10 times. However, to extend the measurements up to
requires an additional month of computation.
Principal Components Analysis
Principal Components Analysis (PCA) is a multivariate statistics method for linearly transforming a sample of N p-variate vectors
into a new sample of q-variate vectors
, such that the columns of
are uncorrelated, and
[Flury97]. Thus, each of the q derived variables is expressed as a linear combination of p correlated, measured variables [Hotellin33].
The q derived variables are referred to as Principal Components (PCs), and form a system of orthogonal axes. If we consider
to be a configuration of N points
in a p-dimensional Euclidean space then the first PC defines a line through this space that minimises the sum of squared distances of the points from it, and thus maximises the variance of coordinates
of an orthogonal projection of
on this line [Pearson01]. The second PC is a line that maximises the variance of the projection coordinates
of the points, subject to being perpendicular to the first PC. Taken together the first two PCs give a plane of closest fit to the configuration
, i.e. one that minimises the sum of squared distances of the points to that plane. The remaining PCs are defined recursively in this manner, and individually account for an increasingly smaller amount of the total variance of the p variables of
. Thus, the first q'<q principal components give the best linear approximation, and q' can be chosen such that a sufficient proportion of the total variance is preserved, or set to 2 or 3 to facilitate direct visualization.
Let us define the column centred matrix
, with
, the sample mean vector. An eigendecomposition of the sample covariance matrix
yields eigenvalues
, and associated eigenvectors
. The
matrix
defines the transformation to the first q' PCs:
[Flury97]. Principal Coordinates Analysis (see Section 3.3.1) on the dissimilarity matrix
, with d the Euclidean distance (3.1), results in the same configuration
[Gower66]. However, PCO requires the eigendecomposition of an
matrix of scalar products
, whereas in the case of PCA
is only of order
. Eigendecomposition of a
matrix is an O (k3) computation [Press92], and for
, as is normally the case, PCA will be significantly faster. However, PCA is less general because it requires
as input, which can only be constructed for quantitative, ordinal, or binary data, or a combination thereof (see Section 1.2). When PCA can be applied it is equivalent to PCO, and thus the results of Sections 3.5 and 3.6 are applicable to both.
We have gathered 10 large data tables ranging from 2000 to 58000 rows (see Section B.3 in the appendices), and used them as the basis for comparison of incremental MDS and PCA
. Full scaling (see Section 4.2) was performed with the Simulated Annealing algorithm of Section 3.3.6, single scaling with a version of this algorithm modified to position individual objects (rows) with respect to a fixed skeleton. The parameters of the algorithm are identical to that documented in Section 3.4. Each data set was subject to PCA, and incremental MDS with exponent initially set to
and then to
. To avoid degeneracies of an incremental solution, e.g. a skeleton configuration representing a bad local minimum, the solution was taken as the best in terms of Energy (3.13) of three trials. The target size
was used for the initial skeleton (see Section 4.2.2).
Table 4.2:
Percentage of the total variance explained by the first two or three principal components
| data |
2D |
3D |
| abalone |
94.8 |
97.6 |
| letter |
43.7 |
56.3 |
| mfeat |
91.6 |
98.7 |
| pageblock |
81.4 |
94.1 |
| sat |
84.7 |
89.9 |
| segment |
73.8 |
83.9 |
| nbody-1 |
69.1 |
82.9 |
| nbody-2 |
70.5 |
86.9 |
| shuttle |
89.3 |
99.6 |
| spambase |
23.6 |
29.2 |
|
Table 4.2 shows the proportion of the total variance explained by selecting only the first two or the first three principal components for each data set, which corresponds to the optimal orthogonal projection from the original high dimensional space to a two- or three-dimensional subspace (see Section 4.3). The third dimension substantially improves on the amount of explained variance over just the first two PCs, except for abalone data table, which is already represented well in two dimensions, at 95% of the total variance. On the other hand, letter and spambase are poorly represented even in three dimensions, and are examples of data tables with relatively uncorrelated attributes, which will pose a challenge to any dimensionality reduction method. However, for other data sets in the test bed over 80% of the total variance is accounted for in three dimensions, resulting in respectable visualizations. Since, PCA is likely to fare much better with three-dimensional visualizations, we decided to consider them also in the statistical comparison.
Statistical Analysis
Our experiment consists of a group of Energy measurements - one for each visualization method being evaluated - made for every data collection in the test bed, and has been carried out in both two and three dimensions. We chose to perform the statistical analysis with the Friedman test [Siegel56], for the same reasons as outlined in Section 3.5.
Table 4.3(a) contains Energy measurements in two dimensions for all combinations of a method and a data table. These measurements were subject to the Friedman test with the null hypothesis H0 of no overall difference in Energy between visualization methods, and the alternative hypothesis H1 of at least one method having a different ranking from the others. The values of both Friedman's chi-square statistic and Kendall's coefficient of concordance are not significant at the 5% level, and thus there is no evidence to reject H0. Consequently, post hoc testing, in the form of the Tukey Multiple Comparison procedure [Keselman77] for example, cannot be carried out.
Table 4.3:
Friedman test results for Energy
- 1 an average of the relative order of Energy for a particular method over all data sets
- 2
is Friedman's chi-square statistic; the critical value at p=0.05 level is 5.99
- 3 W is Kendall's coefficient of concordance; it ranges between 0 (no agreement between cases) to 1 (complete agreement); the critical value at p=0.05 level is 0.3
- Values in each row are coded in greyscale, based on their relative rank.
|
Nonetheless, certain patterns emerge from Table 4.3(a). Incremental MDS with
has the highest rank of Energy in most cases, and it achieves the lowest rank only for the shuttle data table. In this case it actually outperforms incremental MDS with
, which is unusual, but possible with an iterative minimisation technique such as this one. PCA and the
method have an almost identical mean rank, and are one rank apart for most data tables. None of the methods construct a good representation for the spambase data set, judging from the Energy values (3.13) alone. This merits a visual inspection, which is presented in Section 4.4.2
Analogously, we have performed the Friedman test on the Energy measurements for three-dimensional representations, with the same null and alternative hypotheses. As can be seen from Table 4.3(b), this time both Friedman's chi-square statistic and Kendall's coefficient of concordance are significant at the 5% level, allowing H0 to be rejected in favour of H1. Tukey Multiple Comparison procedure can now be performed to test for significant differences between method rankings in pairs.
At p<0.05 level the only significant comparison is between incremental MDS with
and
, demonstrating that the latter has a lower overall ranking of Energy measurements. To verify the other two pairwise comparisons, we performed a more powerful one-tailed Wilcoxon signed-rank test [Siegel56]. The test is in agreement with the Tukey Comparison for (PCA,
), in that there is no significant difference in their mean ranks. However, PCA is shown to have a significantly lower ranking than
at p<0.05 level.
An informal inspection of Table 4.3(b) leads to the same conclusions as the statistical inference, and also the evaluation of the two-dimensional representations. Incremental MDS with
has the highest rank of Energy for most data tables, and consequently the highest mean rank of all the methods. PCA and the
method achieve a very similar mean rank, and alternate for the lowest and the second lowest rank for individual data sets.
Table 4.4:
Comparison of running time for PCA and incremental MDS
| 2D |
| method |
total time1 |
time ratio2 |
linear fit3 |
| PCA |
433 |
n/a |
 |
1155 |
2.67 |
2.59 |
 |
2535 |
5.86 |
5.65 |
| 3D |
| PCA |
431 |
n/a |
 |
1198 |
2.78 |
2.64 |
 |
2619 |
6.08 |
5.69 |
- 1 average time in minutes required for a single test trial with the test bed
- 2 ratio of average time for a given incremental method to that of PCA
- 3 the parameter of a least-squares fit of PCA timings to that of each incremental method
|
The total time required for each method to generate visualizations of all data tables in the test bed can be found in the second column of Table 4.4. For incremental MDS each figure is an average over 3 trials; since PCA is an analytical method, only one trial has been performed, and its duration is reported. The total time for each method is very similar across two and three dimensions. With PCA all Principal Components are calculated at once, regardless of how many will actually be used. However, for incremental MDS there is some overhead in calculating distances over an extra dimension, and the total time for three dimensions is slightly higher than for two.
The timings include the calculation of Energy (3.13) for each configuration. For PCA this computation completely dominates the construction of the correlation matrix and its eigendecomposition (see Section 4.3), which itself are only a matter of seconds even for the largest data table. For ease of comparison the ratios of the total time for both variants of the incremental method to that of PCA are given in the third column of Table 4.4. Thus, it can be seen that evaluation of Energy accounts for 37% (
) of the total time to perform incremental MDS with
on the test bed, and 17% (
) for
.
Analogously to Section 3.5.3 we decided to perform a linear regression of PCA timings to that of incremental MDS, to get a more accurate empirical estimate of their ratios. The timings over all data tables in the test bed are collected in a vector
for a given method j in p-dimensions. The linear model is expressed as
, and the parameter ai,p that gives the best least-squares fit can be interpreted as an estimate of how much slower method i is than PCA in p-dimensions, taking the calculation of Energy into account. The values of
and
reported in the final column of Table 4.4 agree well with the corresponding running time ratios, and even better with
and
respectively, signifying that the estimation is accurate.
The time to calculate Energy of a configuration is comparable to that of a single iteration of full scaling (see Section 4.2) with this configuration, since both computations involve the inspection of the entire dissimilarity matrix
. Therefore, following an incremental MDS procedure with full scaling of the complete configuration will add substantially to the running time, for each additional iteration.
Qualitative Evaluation
Table 4.5:
Statistics for two-dimensional visualizations of mfeat, segment, and spambase data tables
| Energy |
| data |
PCA |
 |
 |
MDS |
| mfeat |
0.0578 |
0.0313 |
0.0256 |
0.0201 |
| segment |
0.0927 |
0.0759 |
0.0617 |
0.0469 |
| spambase |
0.2599 |
0.3532 |
0.6638 |
0.1460 |
| Stress |
| mfeat |
0.0127 |
0.0128 |
0.0108 |
0.0141 |
| segment |
0.0441 |
0.0418 |
0.0310 |
0.0261 |
| spambase |
0.3120 |
0.0955 |
0.0861 |
0.1401 |
| time |
| mfeat |
14 |
123 |
235 |
2376 |
| segment |
43 |
190 |
334 |
3582 |
| spambase |
495 |
960 |
1535 |
8228 |
|
We have chosen the three smallest data tables to present visually here. They are easier to represent in a limited space, to allow a side-by-side comparison, but manage to bring out key differences between the methods, nonetheless. We have been able to perform standard MDS on these data due to their manageable size, and include it in the evaluation of PCA and incremental MDS, as the limiting case of the latter. For ease of comparison the relevant Energy values (3.13) of Table 4.3(a) are repeated alongside the results for standard MDS in Table 4.5; Stress (3.11) and the running time for each data set are also given. In all cases MDS achieves the lowest Energy, however, at a marked increase in the running time. Stress results are less consistent, which is understandable as it is Energy that has actually been minimised; thus observations of Section 4.2.3 are replicated.
Visualizations produced by all the methods are arranged side by side in Figures 4.2-4.4 for each data set respectively. Analogously to Section 3.6, MDS configurations have been transformed by Procrustes Analysis [Borg97,Cox94] to have the same orientation as the corresponding PCA configuration. Tucker's congruence coefficient (3.28) has been used to ascertain the agreement between configurations in pairs [Tucker51], and all possible combinations are given in Table 4.6.
Figure 4.2:
Visualizations of the mfeat data set. Black dots in the incremental MDS plots denote the skeleton configuration
|
|
| (a) PCA |
(b)
|
|
|
(c)
| (d) MDS
|
The overall structure of the mfeat data set is represented well by the first two principal components. Three distinct clusters can be seen in Figure 4.2(a); however, the intra-cluster detail is not discernible, and this where the 8% of unexplained variance (see Table 4.2) must lie. Incremental and standard MDS plots in Figures 4.2(b)-4.2(d) are superior, in that they spread the cluster contents out, and portray more clusters than PCA.
and MDS plots are very similar, which is confirmed by the highest value of the Tucker's congruence coefficient (3.28) for this pair, reported in Table 4.6(a). Note that the position of marker points is in a good agreement for these two configurations, and thus attests the validity of the incremental MDS procedure. The match between
and MDS configurations is not as good, due to a smaller number of marker points defining the skeleton configuration:
versus
.
Figure 4.3:
Visualizations of the segment data set. Dark dots in the incremental MDS plots denote the skeleton configuration
|
|
| (a) PCA |
(b)
|
|
|
(c)
| (d) MDS
|
It is possible to identify three clusters in the PCA representation of the segment data set in Figure 4.3(a). There are some dense clumps of points, highlighting areas of unexplained variance (see Table 4.2). Figures 4.3(b), 4.3(c), and 4.3(d) in turn exhibit an increasingly better dispersion of points, showcasing the bias of Energy (3.13) towards preserving local detail (see Section 3.2.4 for more details). Again,
and MDS plots are the most similar visually, and objectively according to the congruence coefficient (see Table 4.6(b)). The distinction between two of the clusters is lost in the
configuration, suggesting that an insufficient size of the skeleton has been specified.
Figure 4.4:
Visualizations of the spambase data set. Black dots in the incremental MDS plots denote the skeleton configuration
|
|
| (a) PCA |
(b)
|
|
|
|
(c)
| (d) MDS
|
The spambase data set presents a challenge to PCA, as according to Table 4.2 only 24% of the total variance is explained by the first two PCs. Only a few outlying data points are represented well in Figure 4.4(a), the remaining ones are grouped in a single clump, with little apparent structure. Incremental MDS is able to spread the contents of the clump, and a dense distribution of the points can be seen in the middle of Figures 4.4(b) and 4.4(c), surrounded by layers of other points, mainly markers. Such a concentric arrangement is indicative of a severe mismatch between dimensionality of the representation and the original data [deLeeuw84]. The two incremental plots are similar, and the congruence coefficient, reported in Table 4.6(c), is much higher for this pair than any other. As can be seen from Figure 4.4(d), standard MDS pushes more points towards the perimeter, and achieves a much lower value of Energy (see Table 4.5); however, it cannot bridge the dimensionality gap, and visually the effect is not very different to incremental MDS plots.
For the spambase data set the subjective observations are in contrast to the statistical analysis of Section 4.4.1. Incremental MDS is judged to be much worse than PCA by the Energy function (3.13), as can be gleaned from Table 4.5. It seems that local detail cannot be represented well for this data set, unless all dissimilarities are considered, as is the case for standard MDS. However, the converse ranking of PCA and incremental MDS is true for Stress (3.11). This loss function assesses the absolute error of a configuration, and thus agrees with the overall, subjective impressions. The reader is referred to Section 3.2.4 for a further comparison of Energy and Stress.
Table 4.6:
Congruence coefficient for pairs of configurations
|
|
The incremental Multidimensional Scaling method presented here alleviates the high time and space complexity associated with standard MDS. This method is capable of visualizing data collections in excess of 100,000 objects at moderate cost. Inevitably, some additional inaccuracy is introduced into configurations. However, structurally significant objects will be represented accurately with respect to one another, and error will be spread over a large number of less significant objects. If enough computing resources are available the resulting configuration can be refined by submitting it as a starting point for a least-squares MDS procedure.
Multivariate data tables can alternatively be visualized with Principal Components Analysis, provided that none of the attributes are measured on a nominal scale. For such data PCA is equivalent to Principal Coordinates Analysis, at a fraction of its computational cost. As has been pointed out in Section 3.7, these methods will not represent the local proximity of objects accurately. However, incremental MDS also compromises local detail to achieve its speed up, and instead focuses on preserving the overall shape of the configuration. The difference is that the trade-off between quality and algorithm responsiveness can be varied, and approach the ultimate quality of least-squares MDS in the limiting case, with the associated long running time.
The objective evaluation of incremental MDS versus PCA was somewhat inconclusive, because of the limited number of data collections on which it was based. Real-world data of the required size is difficult to obtain, but more importantly an exhaustive assessment of the methods with each data set is very time consuming. The key observation is that at the trade-off level where incremental MDS matches PCA for overall quality, the former is much more expensive computationally. However, visual inspection of configurations produced by both methods reveals that objective measurements are misleading in this case, and that incremental MDS provides more pleasing arrangements than PCA, even at a lower objective quality and quicker response setting.
A least-squares MDS algorithm has been proposed that computes the configuration update in linear time in the number N of objects [Chalmers96]. When calculating a new position of an object only two small subsets of the remaining objects are considered. One of them contains the most similar objects to the given object; the other is sampled at random. By keeping the sizes of these subsets constant an O(N) time complexity is achieved for a single iteration. Assuming that O(N) iterations are necessary to find an optimal configuration, the algorithm can be seen to have O(N2) overall time complexity. There are two possible disadvantages to this approach: instability due to optimising against a changing set of objects, and inaccuracies in representing large dissimilarities, which can result in a distorted configuration.
FastMap [Faloutso95] is an analytical MDS algorithm that achieves O(kN) running time, where k is the number of dimensions in the configuration. It is an enormous improvement over PCO (see Section 3.3.1), which is expensive as it involves the eigendecomposition of a
matrix - an O(N3) computation [Press92]. A line passing through the two most distant (dissimilar) data points in the multidimensional space approximates the first principal axis (principal component), which is a line that yields the maximum variance when data points are orthogonally projected onto it (see Section 4.3). All data points are projected onto the hyperplane perpendicular to this line. Successive applications of this procedure to the resulting subspace approximate the remaining principal axes. The resulting axes are necessarily orthogonal. Although very fast, this method only approximates a PCO solution, which itself is likely to be inferior to a least-squares MDS configuration (see Sections 3.5 and 3.6).
An incremental arrangement method [Cohen97] has been proposed to speed up drawing of undirected graphs, and avoid suboptimal solutions. The order of N vertices to be introduced into the drawing is determined by performing a depth-first search from an arbitrary vertex, and reshuffling this list so that every
th vertex is taken first, then every
th, and so on, ignoring duplicates. The first batch of vertices is placed using standard MDS. The next batch is then added to the partial drawing, by positioning each new vertex in proximity of the two vertices introduced previously that are its nearest neighbours in the graph. Another round of MDS follows, and these two stages are applied alternately while there are vertices to be added. The algorithm has been shown empirically to achieve a factor of 2 to 10 speed improvement.
Next: Proximity Grid
Up: Proximity Visualization of Abstract Data
Previous: Multidimensional Scaling
© 2001 Wojciech Basalaj