Next: Multivariate Visualization Techniques
Up: Proximity Visualization of Abstract Data
Previous: Abstract
Subsections
The focus of this dissertation is visualization of abstract data collections, and an evaluation of a particular technique - Proximity Visualization - suitable for the task. Database tables with many attributes, graphs, and multimedia collections are types of data collections for which it is difficult to design useful visual representations, since there is no immediate way to control the position of elements based on their content. However, similarity between elements of such collections can be measured, and a good overview picture should respect this proximity information, by positioning similar elements close to one another, and far from dissimilar ones. Such a visualization is in effect a topology preserving map of the underlying data collection.
Data Types
The proximity of a pair of objects from a data collection can be expressed either as their similarity, mutual agreement, or dissimilarity - their distance in the abstract domain of the collection. The latter form is preferable for constructing a proximity visualization of the collection, as distances between objects in the visual representation are to match the corresponding distances in the abstract domain. However, there exist transformations between the two variants of proximity measurements, and as long as at least one can be established for pairs of objects from a given data collection, its proximity visualization can be formed.
A function that measures dissimilarity between a pair of objects from a collection is a dissimilarity coefficient. There may be a number of coefficients that could be used with a particular data collection, which in general will emphasise different aspects of object proximity. This section discusses suitable dissimilarity coefficients for a comprehensive range of types of abstract data collection, which reflect our experiences with applying proximity visualization. It seems reasonable to assume that other types could be serviced by one of these dissimilarity coefficients, perhaps after certain reductions.
Quantitative
A quantitative variable can either be measured on an interval or a ratio measurement scale, for example temperature. If temperature is measured in degrees Celsius, i.e. an interval scale, the difference between two measurements t1 and t2 is meaningful, and so is the ratio of such differences:
(t1-t3)/(t2-t3). Additionally, a ratio scale, e.g. the Kelvin temperature scale, has a well defined zero point, and thus a ratio of two measurements t1/t2 becomes meaningful. Since the dissimilarity between two measurements can be defined in terms of their difference, both scales can be treated in the same way.
Suppose we have a set of n objects with the
ath attribute measured on the quantitative scale
, and the range of this attribute is
. The dissimilarity
between a pair of objects (i,j) with regard to the
ath attribute can be defined as follows [Anderber73]:
 |
(1.1) |
Dissimilarity coefficient (1.1) is always between 0 and 1, and can be viewed as a fractional disagreement between objects i and j on the
ath attribute, relative to the maximum disagreement given by Ra.
For quantitative measurement scales (and also those of Sections 1.2.2, 1.2.3, and 1.2.4) there exist many ways of combining dissimilarity scores on multiple attributes into an overall dissimilarity between a pair of objects [Anderber73]. However, we will not go into their details here, as an assumption that attributes are homogeneous, i.e. measured on the same scale, is very restrictive, and we prefer to treat this condition as a special case of heterogeneous data in Section 1.2.5.
Ordinal
An ordinal scale is weaker than quantitative, in that it also induces an ordering of objects, but does not make any statement about the magnitude of the differences. An example of an ordinal scale is the A-F system of grades; although there is a well defined ordering
, it is impossible to say how the differences between pairs of grades relate, for instance whether the difference between A and B is greater than that between B and C. However, the difference between A and C must be greater than the difference between A and B, due to the inequalities, and an effective dissimilarity coefficient for a pair of objects measured on an ordinal scale can be based on differences in their rank.
Let a set of n objects with the
ath attribute measured on an ordinal scale be denoted as
. If the rank order of an observation uia is given by r(uia), and the number of distinct ranks is
, the dissimilarity between objects i and j on the
ath attribute takes the following form [Anderber73]:
 |
(1.2) |
Coefficient (1.2) achieves in effect a transformation of an ordinal variable into a quantitative one, by replacing the ordinal measurements
with their rank order
. Analogously to Section 1.2.1, (1.2) is also in the range [0,1], and represents the fractional disagreement between ranks of objects i and j on the
ath attribute, relative to the maximum observed rank difference.
Nominal
A nominal scale allows for the weakest form of measurement, as it does not enforce any ordering of objects. A number of non-overlapping categories are defined and an object can belong to a single one of these. The similarity of a pair of objects is defined in terms of their category membership: if they belong to the same category they are judged to be equivalent, otherwise they are completely different. An example of such an exclusive categorisation is a person's zodiac sign. For a set of n objects whose
ath attribute is measured on a nominal scale
pairwise dissimilarity between objects is defined as [Anderber73]:
 |
(1.3) |
Binary
A binary scale of measurement is a special case of a nominal scale with only two categories. If both categories have the same significance, for example male-female, (1.3) can be used as the dissimilarity coefficient for such an attribute. However, if one of the categories denotes an absence of some property, e.g. yes-no, on-off, a negative match of a pair of objects is not considered as significant as a positive match. The dissimilarity between a pair of objects from the set
, with the
ath attribute measured on a binary scale, then becomes [Anderber73]:
 |
(1.4) |
In creating ontologies entities are scored on a number of binary attributes, and typically only a small proportion of them will be positive for any given entity. In such cases it might be advantageous to treat an attribute on which both entities scored negative as missing (see Section 1.2.5), so as not to unduly bias the judgement of their pairwise similarity [Gower71].
Heterogeneous
A collection of entity descriptions may be conveniently represented by a set of tuples or a set of objects with appropriate attributes. The utility of relational and object databases is based on this premise. The canonical representation of such data is a table with one row for each object, and a column for each attribute - Table 2.1 provides an example. In general, the attributes cannot be expected to be homogeneous, and thus a dissimilarity coefficient for data tables has to combine attributes measured on arbitrary scales, to give an overall dissimilarity between pairs of objects (rows).
A general similarity coefficient [Gower71] is a suitable coefficient for heterogeneous data, and can accommodate missing values and conditional attributes - a crucial ability if any possible real world data is to be considered. An analogous general dissimilarity coefficient can be defined for a set of n objects with q attributes
as follows:
 |
(1.5) |
where wija takes value 1 if objects i and j can be compared on the
ath attribute, and 0 otherwise; wa is the weight given to attribute a;
is the dissimilarity between objects i and j as measured on the
ath attribute according to (1.1), (1.2), (1.3), or (1.4) depending on the type of this attribute.
Each
provides the distance in a uniform scale between a given pair of objects (i,j) with respect to a single attribute a. The general dissimilarity coefficient is a framework for combining individual distances for all attributes into an overall distance in a uniform representation space. It can be seen that a weighted City block metric, normalised by the maximum distance, is used in effect:
 |
(1.6) |
where

is a vector of distances for individual axes. Therefore, the general dissimilarity coefficient can be generalised to any instance of the weighted Minkowski metric
:
 |
(1.7) |
to give the
-general dissimilarity coefficient:
 |
(1.8) |
For
and purely quantitative attributes (1.8) simplifies to the weighted Euclidean distance normalised to the range [0,1], and for this desirable property we decided to use this form of the
-general dissimilarity coefficient for data tables in the sequel.
Relationships
Binary relationships
within a set V of entities can be expressed as a graph
, with an edge (a,b) whenever entities a and b enter into a relationship, and an entity being synonymous to a vertex. In general, a pair of entities might enter into an indirect relationship through one or more intermediate entities. Such a chain of vertices defines a path through the graph G that has a length equal to the number of chained vertices less one, i.e. the number of edges that have to be traversed.
The graph theoretic distance l is a function that for a given pair of vertices
returns the length of the shortest path between them. For undirected graphs l satisfies metric properties:
- 1.
- non-negativity:
, for all
- 2.
- identity: l(a,b)=0, if and only if a=b
- 3.
- symmetry:
l(a,b)=l(b,a), for all
- 4.
- triangle inequality:
, for all
. If c is a vertex on the shortest path between a and b then
l(a,b)=l(a,c)+l(c,b). Otherwise,
l(a,b)>l(a,c)+l(c,b) cannot hold because a path through c would be shorter.
These properties are preserved for a disconnected graph if the length of the path between vertices from separate components is considered to be infinite, though for practical purposes a value greater than the longest path in all of the connected components will suffice. Thus, l is a suitable and intuitive dissimilarity coefficient for undirected graphs [Kruskal78a].
Directed graphs can be accommodated by treating all edges as undirected for the purposes of calculating dissimilarity between vertices, and taking account of the direction of edges only in their visual representation. Alternatively, the third metric property could be dispensed with, making dissimilarities asymmetric. If the strength of relationships can vary appropriate weights can be assigned to edges, and l generalised to return the path with the least cumulative weight for a given pair of vertices.
Images
Humans are adept at determining similarity between a given pair of images, by utilising the capabilities of the visual cortex to recognise objects in both images, and establishing semantic relationships between these objects. This process is difficult to emulate on a computer system, with the present knowledge of neurology and artificial learning. However, it is practical to automatically extract low-level features from the images, such as colour and texture, and compare their similarity to give an overall similarity between the two images. In our work we have been using IRIS (the acronym stands for Image Regions In Summary) - an image similarity coefficient developed at AT&T Laboratories in Cambridge, which combines global image properties in the form of colour histograms and local, region based features [Rodden99a].
During a pre-processing step an image is segmented into regions of broadly homogeneous colour properties. Regions are then classified as either large or small. The large regions are further classified as either textured or smooth, and the small regions as regularly or irregularly shaped. The image is partitioned into nine areas in a 3 by 3 grid. Subsequently, a colour histogram for each of the 4 types of region is recorded, and the largest (dominant) region is identified, in an area summary. The dissimilarity between a pair of images is computed from their summaries: the
statistic is used to determine the distance between histograms from corresponding areas, the Mahalanobis distance in RGB colour space is used for dominant regions, and the final coefficient is a weighted sum of these distances.
IRIS has been designed specifically to perform well on indexing and searching of image databases. There exist a multitude of alternatives, contributed by the Image Retrieval community, for calculating low-level visual similarity between pairs of images. These coefficients are of varied complexity and retrieval performance, however they seem to be roughly equivalent for the purposes of visualization [Rodden00].
Text Corpus
It is common practice to index a corpus of documents on themes occurring within it, e.g. nouns or phrases, to facilitate querying for relevant themes. In a classic model of Information Retrieval - the vector model - each document is represented by a weight vector
, where element wa specifies the relative importance of theme ta in the document, and q is the total number of themes and the dimensionality of the weight vector space [Baeza-Ya99]. If the theme ta does not occur in the document, the element wa is simply set to 0, otherwise wa is made proportional to the frequency of ta's occurrence in the document, and inversely proportional to the commonness of ta within the corpus. Thus, a theme frequently occurring in a given document will be assumed to have high relevance, unless it is common to all the documents, e.g. `computer' in a corpus of computer science articles.
A measure of similarity sij between documents i and j, or equivalently vectors
and
in the vector model, is a cosine of the angle between
and
:
 |
(1.9) |
where
denotes the Euclidean norm (length) of the vector
. Since theme weights are non-negative, sij will be in the range [0,1], and to convert it to a dissimilarity
between documents i and j, the following formula can be used:
 |
(1.10) |
Let
be the matrix of normalised weights for all n documents in the corpus; the complete similarity matrix can be expressed as
, and is positive semi-definite by definition. Consequently, the dissimilarity matrix calculated by applying (1.10) to every element of
is Euclidean
[Gower86] - a desirable property for its visual representation.
Proximity
There are data collections that consist solely of proximity measurements, for example results of an experiment where a subject has been asked to rate how similar pairs of stimuli are. The only meaningful visual representation for such data is one based on proximity. If similarity measurements have actually been collected they have to be transformed to the corresponding dissimilarity scores, prior to constructing their visualization, by using (1.10) for instance. Additionally, unstandardised similarities have to be scaled by the largest similarity to bring them to the range [0,1], which is a precondition for the transformation (1.10).
Since judgements of dissimilarity have been collected directly, they are unlikely to obey Euclidean or even metric properties - most importantly triangle inequality - which will make their proximity visualization of dubious quality. However, by adding a sufficiently large positive constant to every dissimilarity, except self-dissimilarities which should always be 0, they can be made metric and Euclidean [Gower86,Borg97].
The magnitude of dissimilarity judgements may be deemed not to be reliable, and only their relative order, exactly the distinction between a quantitative scale (see Section 1.2.1) and an ordinal scale (see Section 1.2.2). In this case the dissimilarities should be replaced by their rank order. Such a procedure will cancel the effect of outlying dissimilarity measurements, which could otherwise greatly distort the visual representation.
For the types of data discussed in Section 1.2 there exist many visualization techniques; however, they are applicable to just a single data type, in general. On the other hand, proximity visualization is generic, and can be used with any data collection, as long as a suitable dissimilarity coefficient can be defined for it. Analysis of quantitative data is the main concern of Statistics and Data Visualization, and thus the greatest number of visualization methods target collections of objects measured on multiple quantitative variables. Chapter 2 presents an overview of multivariate visualization, and compares the most influential techniques.
An established technique for creating proximity visualizations is introduced in Chapter 2, and its detailed account is the purpose of Chapter 3. The underlying theory is presented first, followed by examples of practical algorithms for meeting that specification. To put the discussion on a more rigorous footing, these algorithms are subsequently evaluated based on the quality of proximity visualizations they supply, and their responsiveness.
The algorithms of Chapter 3 are unsuitable for very large data collections, as they will require an inordinate amount of time for computing the corresponding proximity visualizations. Chapter 4 discusses the use of cluster analysis to pre-process the data, so that the most important objects from a collection are identified. The proximity of such marker objects is visualized exactly; the remaining objects are located by reference to these markers only. Thus, a way of achieving a compromise between quality and responsiveness is established.
Chapter 5 takes a different approach to proximity visualization. The emphasis is not on preserving dissimilarities as well as possible, or coping with large data collections, but with achieving a display with high information density. Such a display allows the use of complex icons to augment the visual representation of proximity, useful for browsing a multimedia collection or searching for a specific element from it, for example. This scenario is further explored in Chapter 6.
A browsing interface for a collection of images, based on proximity visualizations using both visual proximity of images and proximity of image captions, is just one of the case studies presented in Chapter 6. Interactions between proteins in a cell can be considered as an undirected graph, and an example protein graph drawing is studied in comparison to an equivalent textbook diagram. The final case study is that of database visualization, where a unified view of data and metadata can be achieved by means of proximity visualization.
The results of this dissertation are summarised in Chapter 7, and some concluding remarks are offered.
Chapters 3, 4, and 5 each present a number of alternatives for creating proximity visualizations. Given a visual representation, it is possible to assess objectively how well dissimilarities among all pairs of objects from a collection are preserved. Thus, every algorithm is exercised with a test bed of data collections, and an objective measurement of quality is taken for each test case. Subsequently, statistical analysis is applied to rank the algorithms by their overall quality. An analogous analysis is performed with respect to the time taken to compute proximity visualizations with each algorithm, and conclusions are drawn from the trade-off between these two criteria.
It is important to complement the statistical analysis with a subjective assessment of proximity visualizations produced with different algorithms. It is conceivable to imagine that algorithms can be comparable in terms of objective quality, but deliver proximity visualizations with different distribution of objects, or other disparate characteristics. Consequently, examining these visualizations might establish that the output of a certain algorithm is more pleasing or intuitive, and thus favour it over others. For algorithms that differ significantly on objective quality it is still advantageous to compare visualizations generated by them, to find out the character of the differences.
Next: Multivariate Visualization Techniques
Up: Proximity Visualization of Abstract Data
Previous: Abstract
© 2001 Wojciech Basalaj