Next: Multidimensional Scaling
Up: Proximity Visualization of Abstract Data
Previous: Introduction
Multivariate Visualization Techniques
Visual exploration of multivariate data is of great interest in both Statistics and Information Visualization. Many methods have been proposed in the two fields, ranging from the very useful to the quirky. This chapter introduces a few of the most established multivariate visualization techniques by example. The criteria for selection were generality and suitability for the non-interactive, flat medium of paper. The methods are compared for effectiveness, and their limitations are assessed.
Running Example
Table 2.1: Details of the cars data table
| country | model name | mpg | weight (1000 lbs) | drive ratio | hp | disp. (cu. in.) | cyl. |
|---|---|---|---|---|---|---|---|
| USA | Buick Estate Wagon | 16.9 | 4.360 | 2.73 | 155 | 350 | 8 |
| USA | Ford Country Squire Wagon | 15.5 | 4.054 | 2.26 | 142 | 351 | 8 |
| USA | Chevy Malibu Wagon | 19.2 | 3.605 | 2.56 | 125 | 267 | 8 |
| USA | Chrysler LeBaron Wagon | 18.5 | 3.940 | 2.45 | 150 | 360 | 8 |
| USA | Chevette | 30.0 | 2.155 | 3.70 | 68 | 98 | 4 |
| Japan | Toyota Corona | 27.5 | 2.560 | 3.05 | 95 | 134 | 4 |
| Japan | Datsun 510 | 27.2 | 2.300 | 3.54 | 97 | 119 | 4 |
| USA | Dodge Omni | 30.9 | 2.230 | 3.37 | 75 | 105 | 4 |
| Germany | Audi 5000 | 20.3 | 2.830 | 3.90 | 103 | 131 | 5 |
| Sweden | Volvo 240 GL | 17.0 | 3.140 | 3.50 | 125 | 163 | 6 |
| Sweden | Saab 99 GLE | 21.6 | 2.795 | 3.77 | 115 | 121 | 4 |
| France | Peugeot 694 SL | 16.2 | 3.410 | 3.58 | 133 | 163 | 6 |
| USA | Buick Century Special | 20.6 | 3.380 | 2.73 | 105 | 231 | 6 |
| USA | Mercury Zephyr | 20.8 | 3.070 | 3.08 | 85 | 200 | 6 |
| USA | Dodge Aspen | 18.6 | 3.620 | 2.71 | 110 | 225 | 6 |
| USA | AMC Concord D/L | 18.1 | 3.410 | 2.73 | 120 | 258 | 6 |
| USA | Chevy Caprice Classic | 17.0 | 3.840 | 2.41 | 130 | 305 | 8 |
| USA | Ford LTD | 17.6 | 3.725 | 2.26 | 129 | 302 | 8 |
| USA | Mercury Grand Marquis | 16.5 | 3.955 | 2.26 | 138 | 351 | 8 |
| USA | Dodge St Regis | 18.2 | 3.830 | 2.45 | 135 | 318 | 8 |
| USA | Ford Mustang 4 | 26.5 | 2.585 | 3.08 | 88 | 140 | 4 |
| USA | Ford Mustang Ghia | 21.9 | 2.910 | 3.08 | 109 | 171 | 6 |
| Japan | Mazda GLC | 34.1 | 1.975 | 3.73 | 65 | 86 | 4 |
| Japan | Dodge Colt | 35.1 | 1.915 | 2.97 | 80 | 98 | 4 |
| USA | AMC Spirit | 27.4 | 2.670 | 3.08 | 80 | 121 | 4 |
| Germany | VW Scirocco | 31.5 | 1.990 | 3.78 | 71 | 89 | 4 |
| Japan | Honda Accord LX | 29.5 | 2.135 | 3.05 | 68 | 98 | 4 |
| USA | Buick Skylark | 28.4 | 2.670 | 2.53 | 90 | 151 | 4 |
| USA | Chevy Citation | 28.8 | 2.595 | 2.69 | 115 | 173 | 6 |
| USA | Olds Omega | 26.8 | 2.700 | 2.84 | 115 | 173 | 6 |
| USA | Pontiac Phoenix | 33.5 | 2.556 | 2.69 | 90 | 151 | 4 |
| USA | Plymouth Horizon | 34.2 | 2.200 | 3.37 | 70 | 105 | 4 |
| Japan | Datsun 210 | 31.8 | 2.020 | 3.70 | 65 | 85 | 4 |
| Italy | Fiat Strada | 37.3 | 2.130 | 3.10 | 69 | 91 | 4 |
| Germany | VW Dasher | 30.5 | 2.190 | 3.70 | 78 | 97 | 4 |
| Japan | Datsun 810 | 22.0 | 2.815 | 3.70 | 97 | 146 | 6 |
| Germany | BMW 320i | 21.5 | 2.600 | 3.64 | 110 | 121 | 4 |
| Germany | VW Rabbit | 31.9 | 1.925 | 3.78 | 71 | 89 | 4 |
Several multivariate visualization techniques have been presented with a challenge in the form of the cars data table [Henderso81], which is reproduced in Table 2.1, and is also mentioned in Section B.1 of the appendices. This data table contains a record of 38 cars manufactured in the period 1978-79, with the following attributes:
1. primary country of the manufacturer
2. model name
3. miles per gallon - a measure of petrol efficiency assessed on the race track
4. weight in thousands of lbs
5. drive ratio in the highest gear
6. horsepower
7. engine displacement in cubic inches
8. number of cylinders
The first attribute is measured on a nominal scale (see Section 1.2.3), the second is a label (see Section 1.2.8), and the remaining six attributes are quantitative (see Section 1.2.1). The task set for each visualization method is to bring out the differences and similarities between cars on the basis of their drive parameters. We felt that including the first two attributes would prejudice this analysis, causing cars from the same manufacturer, or merely the same country, to appear more similar.
Parallel Coordinates
A single row $\mathbf{u}_i = (u_{i1}, \ldots, u_{iq})$ of a data table with q attributes, measured on any scale apart from nominal (see Section 1.2.3), can be thought of as a point in a q-dimensional Cartesian coordinate system, with the coordinate on the ath axis given by $u_{ia}$. For q>3 such configurations of points cannot be visualized directly; the method of Parallel Coordinates overcomes this limitation by drawing the axes as parallel vertical lines, spaced uniformly across the plane [Inselber85]. Point $\mathbf{u}_i$ is then represented by a polygonal line connecting the corresponding coordinates $u_{i1}, \ldots, u_{iq}$ on the parallel axes.
It is apparent from the parallel coordinates visualization of the cars data table in Figure 2.1(a) that the last three attributes are substantially correlated. Moreover, lines representing the individual rows cross over between mpg, weight, drive ratio, and horsepower attributes, suggesting that these attributes might be negatively correlated in pairs. Inverting the mpg and drive ratio axes leads to a much clearer visualization in Figure 2.1(b), which could be improved further by permuting the order of axes. The need for such a high level of customisation presents the ultimate obstacle in effectively visualizing many variables with this method, made worse if the number of observations, and hence lines, is large.
Figure 2.1: Parallel coordinates: (a) original; (b) correlated
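The construction described above, including the inversion of selected axes, can be sketched in a few lines. This is only an illustration: the function names `parallel_coordinates` and `crossings`, the rescaling of every attribute to [0, 1], and the choice of three cars from Table 2.1 are assumptions of the sketch, not part of the original method description.

```python
import numpy as np

# Quantitative attributes of three cars from Table 2.1:
# mpg, weight, drive ratio, horsepower, displacement, cylinders.
rows = np.array([
    [16.9, 4.360, 2.73, 155, 350, 8],   # Buick Estate Wagon
    [21.9, 2.910, 3.08, 109, 171, 6],   # Ford Mustang Ghia
    [31.9, 1.925, 3.78,  71,  89, 4],   # VW Rabbit
])

def parallel_coordinates(data, inverted=()):
    """Return an (n, q, 2) array of polyline vertices, one polyline per row.

    Axis a is the vertical line x = a. Each attribute is rescaled to [0, 1]
    over the rows supplied; axes listed in `inverted` are flipped.
    """
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    y = (data - lo) / (hi - lo)          # assumes no attribute is constant
    for a in inverted:
        y[:, a] = 1.0 - y[:, a]
    x = np.broadcast_to(np.arange(data.shape[1], dtype=float), y.shape)
    return np.stack([x, y], axis=-1)

def crossings(lines):
    """Count gaps between adjacent axes where the first two polylines cross."""
    diff = lines[0, :, 1] - lines[1, :, 1]
    return int(np.sum(diff[:-1] * diff[1:] < 0))

original = parallel_coordinates(rows)
flipped = parallel_coordinates(rows, inverted=(0, 2))  # invert mpg and drive ratio

print(crossings(original), crossings(flipped))
```

For these three rows the first two polylines cross between each adjacent pair of the mpg, weight, drive ratio, and horsepower axes, and the crossings disappear once the mpg and drive ratio axes are inverted.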
Andrews Plot
In an Andrews plot each row $\mathbf{u}_i = (u_{i1}, \ldots, u_{iq})$ of a data table with q attributes is represented by a line, similarly to parallel coordinates. In this case it is a curve defined by the following trigonometric function [Andrews72]:

$$f_{\mathbf{u}_i}(t) = \frac{u_{i1}}{\sqrt{2}} + u_{i2}\sin t + u_{i3}\cos t + u_{i4}\sin 2t + u_{i5}\cos 2t + \ldots \qquad (2.1)$$

plotted over the interval $-\pi < t < \pi$. It is recommended that the most important attributes are associated with the low frequency terms, as they determine the overall shape of the curve. This might entail an iterative and exploratory approach to determine a satisfactory assignment, in the same way as for parallel coordinates.
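As a minimal sketch, the Andrews function can be evaluated as follows, assuming the standard form given in [Andrews72]: a constant term $u_{i1}/\sqrt{2}$ followed by alternating sine and cosine terms of increasing frequency. The function name `andrews_curve` and the sample row are hypothetical; in practice the attributes would usually be brought to comparable scales first.

```python
import numpy as np

def andrews_curve(u, t):
    """Evaluate the Andrews function of one row u at the points t.

    f_u(t) = u[0]/sqrt(2) + u[1] sin t + u[2] cos t + u[3] sin 2t + ...
    """
    u = np.asarray(u, dtype=float)
    t = np.asarray(t, dtype=float)
    f = np.full(t.shape, u[0] / np.sqrt(2.0))
    for a in range(1, len(u)):
        k = (a + 1) // 2                      # frequency: 1, 1, 2, 2, 3, ...
        basis = np.sin if a % 2 == 1 else np.cos
        f = f + u[a] * basis(k * t)
    return f

# Hypothetical row with q = 6 attributes, sampled over the plotting interval.
row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
t = np.linspace(-np.pi, np.pi, 201)
curve = andrews_curve(row, t)
```

Because the first few attributes multiply the lowest frequencies, reordering the entries of `row` changes the overall shape of the curve, which is why the assignment of attributes to terms matters.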
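Let $\bar{\mathbf{u}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{u}_i$ denote the mean of the n rows $\mathbf{u}_1, \ldots, \mathbf{u}_n$ of the data table; function (2.1) preserves this mean:

$$f_{\bar{\mathbf{u}}}(t) = \frac{1}{n}\sum_{i=1}^{n} f_{\mathbf{u}_i}(t) \qquad (2.2)$$

so that the plot of $f_{\bar{\mathbf{u}}}$ is a pointwise average of the plots for individual rows. Another useful property of (2.1) is that it preserves, up to a constant factor, the Euclidean distance $\delta_{ij} = \|\mathbf{u}_i - \mathbf{u}_j\|$ between pairs of points in the q-dimensional space:

$$\int_{-\pi}^{\pi} \left[f_{\mathbf{u}_i}(t) - f_{\mathbf{u}_j}(t)\right]^2 dt = \pi \sum_{a=1}^{q} (u_{ia} - u_{ja})^2 = \pi\,\delta_{ij}^2 \qquad (2.3)$$

Thus, close points will result in similar plots, and plots for distant points will be distinct. These features are useful for detecting clusters and outliers, and are shared with the parallel coordinates technique. Andrews plots have a number of other characteristics that are especially helpful in statistical analysis of the underlying data [Andrews72].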
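Both properties can be checked numerically. The sketch below assumes the standard Andrews function from [Andrews72] (constant term divided by $\sqrt{2}$, then alternating sine and cosine terms) and uses randomly generated rows, so the data and the helper name `andrews_basis` are purely illustrative.

```python
import numpy as np

def andrews_basis(q, t):
    """Rows are 1/sqrt(2), sin t, cos t, sin 2t, cos 2t, ... evaluated at t."""
    rows = [np.full(t.shape, 1.0 / np.sqrt(2.0))]
    for a in range(1, q):
        k = (a + 1) // 2
        rows.append(np.sin(k * t) if a % 2 == 1 else np.cos(k * t))
    return np.array(rows)                      # shape (q, len(t))

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 6))                    # 5 hypothetical rows, q = 6

# Equally spaced grid over one full period; the rectangle rule integrates
# trigonometric polynomials of this degree essentially exactly.
m = 2000
t = -np.pi + 2.0 * np.pi * np.arange(m) / m
dt = 2.0 * np.pi / m

curves = U @ andrews_basis(U.shape[1], t)      # one Andrews curve per row

# Mean preservation: curve of the mean row equals the mean of the curves.
mean_curve = U.mean(axis=0) @ andrews_basis(U.shape[1], t)
mean_gap = np.max(np.abs(mean_curve - curves.mean(axis=0)))

# Distance preservation: integrated squared difference equals pi times the
# squared Euclidean distance between the two rows.
i, j = 0, 1
integral = np.sum((curves[i] - curves[j]) ** 2) * dt
euclid_sq = np.sum((U[i] - U[j]) ** 2)

print(mean_gap, integral, np.pi * euclid_sq)
```

The factor $\pi$ arises because each basis function, including the constant $1/\sqrt{2}$, has squared integral $\pi$ over the interval, and distinct basis functions are orthogonal.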
Figure 2.2 is an Andrews plot of the cars data table. There seem to be two extreme clusters of cars. The remaining observations fall between the extremes and form a loose cluster, which can be separated from the first two at t=-1 and t=2. Additional insight could be gained by plotting these clusters separately; in fact, it is recommended that no more than 10 points $\mathbf{u}_i$ are plotted at a time for a detailed examination [Andrews72].
Multidimensional Scaling
Like parallel coordinates and Andrews plots, Multidimensional Scaling can also be used to visualize multivariate data [Borg97,Cox94]. However, the original q axes and coordinates of points $\mathbf{u}_i$ do not enter the visualization directly. Instead, a configuration of points $\mathbf{x}_1, \ldots, \mathbf{x}_n$ is found in a space of lower dimension p<q, such that all inter-point distances $d_{ij}$ between $\mathbf{x}_i$ and $\mathbf{x}_j$ match as closely as possible the original distances $\delta_{ij}$ between $\mathbf{u}_i$ and $\mathbf{u}_j$. A two- or three-dimensional embedding is an obvious choice for visualization; higher values of p can be useful for statistical analysis. A more elaborate description of this method is presented in Chapter 3.
It might be helpful to envisage the process of multidimensional scaling in two dimensions as wrapping a surface - an elastic sheet - around the points $\mathbf{u}_i$ in the original high dimensional space, and taking $\mathbf{x}_i$ as the projection of $\mathbf{u}_i$ onto this surface. In effect a non-linear mapping between the two configurations is established, and it is likely to be superior for purposes of visualization to rotating a rigid plane in the high dimensional space to find the closest fit to the points $\mathbf{u}_i$, a procedure known as Principal Components Analysis [Pearson01].
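To make the idea of matching distances concrete, the sketch below finds a two-dimensional configuration by iteratively reducing the sum of squared differences between embedded and original distances. It uses SMACOF-style majorization (the Guttman transform) as a stand-in; this is one common algorithm and not necessarily the one described in Chapter 3. The function names, the eight-car subset of Table 2.1, and the standardisation of attributes are all assumptions of the example.

```python
import numpy as np

# mpg, weight, drive ratio, hp, displacement, cylinders for eight cars
# taken from Table 2.1.
cars = np.array([
    [16.9, 4.360, 2.73, 155, 350, 8],   # Buick Estate Wagon
    [30.0, 2.155, 3.70,  68,  98, 4],   # Chevette
    [27.5, 2.560, 3.05,  95, 134, 4],   # Toyota Corona
    [20.3, 2.830, 3.90, 103, 131, 5],   # Audi 5000
    [17.0, 3.140, 3.50, 125, 163, 6],   # Volvo 240 GL
    [17.6, 3.725, 2.26, 129, 302, 8],   # Ford LTD
    [34.1, 1.975, 3.73,  65,  86, 4],   # Mazda GLC
    [31.9, 1.925, 3.78,  71,  89, 4],   # VW Rabbit
])

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(X, delta):
    """Sum over pairs of squared (embedded - original) distance differences."""
    return float(((pairwise_distances(X) - delta) ** 2).sum() / 2.0)

def smacof(delta, p=2, iterations=100, seed=0):
    """Minimise raw stress by repeated Guttman transforms (unit weights)."""
    n = delta.shape[0]
    X = np.random.default_rng(seed).normal(size=(n, p))   # random start
    history = [raw_stress(X, delta)]
    for _ in range(iterations):
        d = pairwise_distances(X)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(d > 0, delta / d, 0.0)
        B = -ratio
        B[np.diag_indices(n)] = ratio.sum(axis=1)
        X = B @ X / n                                      # Guttman transform
        history.append(raw_stress(X, delta))
    return X, history

U = (cars - cars.mean(axis=0)) / cars.std(axis=0)   # comparable scales
delta = pairwise_distances(U)                        # original distances
X, history = smacof(delta)
print(history[0], history[-1])
```

Each iteration cannot increase the stress, so `history` is non-increasing; the result is only a local minimum and depends on the random starting configuration.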
Figure 2.3: Multidimensional scaling
Figure 2.4: Scatterplot matrix
A two-dimensional multidimensional scaling configuration for the cars data set is presented in Figure 2.3. Inspection of the corresponding Andrews plot in Section 2.3 led to the conclusion that there are three clusters of cars. These clusters are apparent from Figure 2.3, and can be readily verified to group cars with 8 cylinders on the left hand side of the figure, 6 and 5 in the middle, and 4 on the right. In effect a map is constructed that charts individual cars based on the overall similarity of their drive parameters - a Proximity Visualization, in other words.
Scatterplot Matrix
A scatterplot matrix is a collection of scatterplots organised analogously to a covariance matrix, with variable a plotted against variable b in the ath row and bth column of the matrix [Clevelan84]. The diagonal plots can show the distribution of individual variables, or simply be placeholders for variable names, as is the case for the scatterplot matrix representation of the cars data table in Figure 2.4. Individual scatterplots can reveal relationships between variables, such as linear correlation, and the complete matrix can be useful for an initial exploration of a data set. However, the display becomes overwhelming with anything more than a few variables; the lack of a unified representation of the data is also a serious drawback.
Definite correlations between attributes of the cars data table can be seen in Figure 2.4. For example, the weight of a car is positively correlated with its horsepower, engine displacement, and number of cylinders, and negatively correlated with its drive ratio and miles per gallon. This justifies the decision in Section 2.2 to invert the parallel coordinate axes for the latter two attributes. The number of cylinders attribute stands out as having only four levels, which separate most other attributes into distinct clusters.
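The signs of these relationships can also be read off a correlation matrix, which has the same row and column layout as the scatterplot matrix. The sketch below uses only eight cars from Table 2.1, chosen for illustration, so the coefficients it prints are not those of the full data table.

```python
import numpy as np

names = ["mpg", "weight", "ratio", "hp", "disp", "cyl"]

# Eight rows of Table 2.1 (quantitative attributes only).
cars = np.array([
    [16.9, 4.360, 2.73, 155, 350, 8],   # Buick Estate Wagon
    [30.0, 2.155, 3.70,  68,  98, 4],   # Chevette
    [27.5, 2.560, 3.05,  95, 134, 4],   # Toyota Corona
    [20.3, 2.830, 3.90, 103, 131, 5],   # Audi 5000
    [17.0, 3.140, 3.50, 125, 163, 6],   # Volvo 240 GL
    [17.6, 3.725, 2.26, 129, 302, 8],   # Ford LTD
    [34.1, 1.975, 3.73,  65,  86, 4],   # Mazda GLC
    [31.9, 1.925, 3.78,  71,  89, 4],   # VW Rabbit
])

# Entry (a, b) is the Pearson correlation between attributes a and b.
corr = np.corrcoef(cars, rowvar=False)

w = names.index("weight")
for b, name in enumerate(names):
    if b != w:
        sign = "positive" if corr[w, b] > 0 else "negative"
        print(f"weight vs {name}: {sign} ({corr[w, b]:+.2f})")
```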
Iconographic Displays
In an iconographic display each icon or glyph represents a single row of a data table. Icons can be arranged in a grid, as in Figures 2.5(a) and 2.5(b), to enable a systematic assessment of similarities and differences between the rows, and also between the attributes. Alternatively, the position of glyphs in the plane can be driven by two of the attributes, provided their spatial interpretation is meaningful. An iconographic display can be combined with the corresponding proximity visualization, by using icons instead of labelled points, to give the resulting visual representation a degree of redundancy.
Figure 2.5: Iconographic visualizations: (a) star glyphs; (b) Chernoff faces
A star is composed of equally spaced radii, as many as there are attributes in the data table, emanating from a common centre. The length of the rightmost spike is proportional to the value of the first attribute for a given row; the remaining attributes are assigned to the other spikes in counterclockwise order [Fienberg79]. The result of applying this prescription to the cars data table is shown in Figure 2.5(a).
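The geometry of a single star can be sketched as follows. The function name `star_vertices` and the scaling of each attribute by its column maximum (so that the longest spike has unit length) are assumptions of this example; the source only states that spike length is proportional to the attribute value.

```python
import numpy as np

# Quantitative attributes of three cars from Table 2.1:
# mpg, weight, drive ratio, horsepower, displacement, cylinders.
rows = np.array([
    [16.9, 4.360, 2.73, 155, 350, 8],   # Buick Estate Wagon
    [21.9, 2.910, 3.08, 109, 171, 6],   # Ford Mustang Ghia
    [31.9, 1.925, 3.78,  71,  89, 4],   # VW Rabbit
])

def star_vertices(values, col_max):
    """Return the (q, 2) spike end points of one star glyph.

    Spike a points at angle 2*pi*a/q, so the first attribute is the
    rightmost spike and the others follow counterclockwise. The spike
    length is the attribute value divided by its column maximum.
    """
    values = np.asarray(values, dtype=float)
    q = len(values)
    radii = values / col_max
    angles = 2.0 * np.pi * np.arange(q) / q
    return np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

col_max = rows.max(axis=0)
stars = [star_vertices(r, col_max) for r in rows]
print(stars[0])
```

Joining consecutive end points of `stars[k]` gives the outline of the kth star; drawing a segment from the origin to each end point gives the spikes themselves.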
The clarity of a star display will suffer as the number of attributes increases, and grouping correlated attributes to provide smooth transitions between spikes might be beneficial. The similarity or dissimilarity of a pair of stars can be appreciated visually; however, gaining a proper overview of a large data table can become a tedious task. This sort of processing is best left to the computer, so that proximity between rows of a data table can be represented in a direct spatial form, as in Section 2.4.
The stars of Figure 2.5(a) can be classified into a few tight groups. This clustering would become more obvious with the aid of automatic or interactive sorting of stars, to bring the similar ones together. However, this amounts to carrying out multidimensional scaling, as pointed out earlier. An interesting observation is that roughly circular stars, e.g. the one for 'Ford Mustang Ghia', appear in the middle of the proximity visualization of Figure 2.3; many more analogies can be found between the two visualizations.
Chernoff faces take advantage of the natural familiarity and recognition of human faces [Chernoff73]. Each facial feature represents one variable; obviously, some features are more prominent, and a possible assignment in the decreasing order of importance is:
- area of the face
- shape of the face
- length of the nose
- location of the mouth
- curve of the smile
- width of the mouth
- location, separation, angle, shape, and width of the eyes
- location of the pupil
- location, angle, and width of the eyebrows
In total, 15 attributes can be represented, and additional variables could be encoded by making faces asymmetric [Flury81]. The trouble is that the appearance of a face will vary with the order in which variables are assigned to facial features, and the perceived similarity of faces will be affected.
Figure 2.5(b) is a collection of Chernoff faces representing the rows of the cars data table. Only the first six facial features are used, and the rest are set to a neutral expression. Overall, faces are as effective at portraying similarities as star glyphs. Differences between rows can be detected; however, their magnitude is much more difficult to judge without detailed knowledge of the assignment of attributes to facial features. Also, it is no longer possible to tell which icons represent average or extreme observations; compare, for example, 'Ford Mustang Ghia' and 'VW Rabbit'. Since Chernoff faces also share the disadvantages of star glyphs, they are the inferior of the two.
The advantage of multidimensional scaling over other multivariate visualization techniques is that it is independent of the number of variables. As long as the high dimensional distance between observations can be ascertained, for example by using the dissimilarity coefficients of Section 1.2, a low dimensional embedding can be found. The type of variables is also immaterial: even heterogeneous data can be visualized with the aid of the general dissimilarity coefficient (1.8), including nominal variables, which elude other multivariate visualization methods.
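As an illustration of how nominal and quantitative attributes can contribute to a single dissimilarity, the sketch below uses a Gower-style average of per-attribute contributions. Coefficient (1.8) is defined outside this chapter and may differ in detail, so this is a stand-in rather than the author's formula; the function name `mixed_dissimilarity` and the choice of three attributes are hypothetical.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Average per-attribute dissimilarity between two rows a and b.

    Nominal attributes contribute 0 if equal and 1 otherwise; quantitative
    attributes contribute the absolute difference divided by the range.
    """
    total = 0.0
    for x, y, kind, r in zip(a, b, kinds, ranges):
        if kind == "nominal":
            total += 0.0 if x == y else 1.0
        else:
            total += abs(x - y) / r
    return total / len(kinds)

# country (nominal), mpg and weight (quantitative), taken from Table 2.1.
kinds = ("nominal", "quantitative", "quantitative")
# Ranges over all 38 cars: mpg 15.5 to 37.3, weight 1.915 to 4.360.
ranges = (None, 37.3 - 15.5, 4.360 - 1.915)

buick = ("USA", 16.9, 4.360)      # Buick Estate Wagon
ltd = ("USA", 17.6, 3.725)        # Ford LTD
rabbit = ("Germany", 31.9, 1.925) # VW Rabbit

print(mixed_dissimilarity(buick, ltd, kinds, ranges))
print(mixed_dissimilarity(buick, rabbit, kinds, ranges))
```

A matrix of such values for all pairs of rows could then serve as the original distances to be matched by a low dimensional embedding.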
The multidimensional scaling technique scales well with the number of observations, since labelled points or small icons constitute the visual representation of individual observations. Thus, identification of observations is provided, and their actual relationships are represented by proximity. With more observations, the density of icons will increase; however, their relative proximity will be unaffected. Therefore, an informative overview of the data set is presented, highlighting clusters and outliers. Interesting groups of observations can then be analysed separately with this or other multivariate visualization techniques.
© 2001 Wojciech Basalaj