The analysis of protein-protein interactions within a cell is concerned with understanding the function of individual proteins and networks of proteins. The presence of an interaction between a pair of proteins implies the existence of a binary relationship. There are various types of interactions, but the ones we analysed are all bidirectional. As discussed in Section 1.2.6, such a collection of entities and relationships between them can be expressed as a graph, which is undirected in this case. Thus, the shortest path length between a pair of proteins in the graph is taken as their dissimilarity.
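As a minimal sketch of this dissimilarity measure, the shortest path lengths of an undirected, unweighted interaction graph can be obtained with a breadth-first search from every vertex. The function name and the toy edge list below are hypothetical and chosen only for illustration; they are not the data or code used in this study.

```python
from collections import deque

def shortest_path_dissimilarities(edges):
    """Return {u: {v: hops}} for an undirected, unweighted graph.

    Pairs with no connecting path are simply absent from the result.
    """
    adjacency = {}
    for a, b in edges:
        # Each interaction is bidirectional, so store it in both directions.
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    result = {}
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        result[source] = dist
    return result

# Hypothetical toy interactions: a chain A - B - C and a separate pair D - E.
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
dissimilarity = shortest_path_dissimilarities(toy_edges)
```

In this sketch a disconnected pair such as A and D has no entry at all; how such pairs should be treated (for example, by assigning a large constant) is a separate design decision that the sketch leaves open.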
A study of a protein interaction cluster in yeast, a unicellular organism, is presented here. This cluster is of great interest to pharmaceutical companies, as it represents a signal transduction pathway that is responsible for passing stimuli from the cell membrane to its nucleus. This process is initiated by a membrane-bound receptor, a G-protein, GPA1 [Rens-Dom95]. Figure 6.1(a) is a diagram put together by a domain expert of this network of interacting proteins, with arrows to represent both enzymatic mechanisms and movement. Figure 6.1(b) is a proximity visualization of the same protein cluster, with interactions gathered from biochemical and library-based experiments. In general, only the proteins relevant to the transduction mechanism are common to both drawings, and other proteins might not have a counterpart in the other drawing.
Overall, Figure 6.1(b) captures more information, though a few interactions are missing as a result of experimental errors. The essential edges are: GPA1 interacts with SST2, which in turn interacts with MPT5; and the cascade STE11, STE7, FUS3/KSS1. Unfortunately, the interaction between STE11 and STE7 is missing, but the rest of the architecture is there: STE5, STE50, and STE11. According to the biologists who inspected both drawings, the correlations between them are impressive.
Figure 6.1(b) has been created with the hybrid algorithm of Section 3.5.5 by minimising Energy (3.13). It follows from the argument of Section 3.2.4 that any deviation in distance between adjacent vertices is penalised more than the same error for non-adjacent vertices. Therefore, a straight-line drawing based on an optimal vertex layout will have uniform edge lengths. At the same time an even spread of vertices is achieved, because the distance between a non-adjacent pair of vertices tends to be proportional to their dissimilarity, i.e. their shortest path length. Overall, the dissimilarities between vertices will be preserved as closely as possible, and hence the topology of the graph. Multidimensional scaling has been used for generating graph drawings before [Kruskal78a,Kamada89,Cohen97]. It is worth noting that this method does not explicitly minimise edge crossings or maximise drawing symmetry, other important graph drawing aesthetics [Battista94], but still achieves pleasing results.
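The weighting argument can be illustrated with a small numerical sketch. Energy (3.13) is not reproduced in this section, so the function below is an assumption: it uses a commonly seen weighted stress in which each squared distance error is divided by the squared dissimilarity. It is meant only to show how the same absolute error costs more for an adjacent pair than for a non-adjacent one, and it should not be read as the energy actually minimised by the hybrid algorithm.

```python
import math

def weighted_stress(positions, dissim):
    """Sum of squared distance errors, each scaled by 1 / dissimilarity^2.

    positions: {name: (x, y)}; dissim: {u: {v: shortest path length}}.
    ASSUMPTION: this weighting is illustrative, not Energy (3.13).
    """
    names = sorted(positions)
    total = 0.0
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            target = dissim[u][v]
            actual = math.dist(positions[u], positions[v])
            total += (actual - target) ** 2 / target ** 2
    return total

# The same absolute error of 0.5 applied to two different pairs.
adjacent_cost = weighted_stress(
    {"A": (0.0, 0.0), "B": (1.5, 0.0)}, {"A": {"B": 1}}
)
non_adjacent_cost = weighted_stress(
    {"A": (0.0, 0.0), "B": (2.5, 0.0)}, {"A": {"B": 2}}
)
```

Under this assumed weighting the adjacent pair (dissimilarity 1) contributes 0.25, while the non-adjacent pair (dissimilarity 2) contributes 0.0625 for the same error, which is the behaviour the argument relies on.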
There are many alternative algorithms for drawing graphs [Battista94]; however, the advantage of using MDS for visualizing protein interactions is that it is not specific to a graph representation of these data. Instead of a dissimilarity coefficient based on graph theoretic distances, one that takes the location of proteins within the cell into account can be used, for example. A comparison of configurations resulting from these two representations might help to clarify the relationship between the location of a protein and its interactions.
It is customary for collections of protein interactions to be drawn manually, or to be presented in tabular form. However, these approaches are not practical for anything but small collections, and automatically generated two- or three-dimensional graph drawings are a welcome alternative. To our knowledge, this is the first time a graph drawing technique has been applied in this context. With typically large quantities of protein interactions in most collections, visual representation is an ideal method to communicate such complex information within the biological community. Moreover, proximity visualizations are invaluable for discovering data inconsistencies, and carrying out comparative studies between organisms [Basalaj99].
Thanks to improvements in storage and networking technologies, large digital collections of images are now commonplace. To facilitate access to these collections, it is necessary to perform some indexing of their content. If textual annotations exist it is possible to use conventional information retrieval methods, for example the vector model of Section 1.2.8. However, annotating images is a time-consuming process, and the results tend to be highly subjective. An alternative is to classify images based directly on their content. The most common approach is to use low-level visual features of images such as colour and texture, extracted by the IRIS similarity coefficient of Section 1.2.7, for instance.
Forming a visual query to an image collection is considerably more difficult than a textual query. The user is required to either sketch a prototype image or select a suitable example by browsing the collection. The query result will be a set of images judged by the system to be the most similar. This is based, however, on the low-level visual features, and thus might disagree with the user's perception. There can be many images satisfying a query, especially if it is not well formed. These difficulties make provision of good support for browsing paramount.
A proximity visualization of an image collection, with each individual image represented by a thumbnail, appears to constitute a good paradigm for a browsing interface. To prevent thumbnails from overlapping one another, they can be arranged in a proximity grid. Figure 5.1(b) serves as an example, arranging 100 images of Kenya by visual similarity. The combination of the image similarity coefficient and proximity grid has uncovered a natural structure in the collection. Photographs of people are clustered in the top right corner, buildings are to the left, and wildlife is at the bottom. Existing image browsing tools commonly employ grid arrangements, as they assist in systematic scanning of a collection; however, thumbnails are positioned arbitrarily or chronologically, instead of according to their similarity.
Figure 6.2(a) is another example of an image collection, 100 photographs of New York this time, arranged by visual similarity in a proximity grid. This display makes it very easy to locate photographs with similar colour composition, e.g. sunsets in the top right corner, night time photos at the top left, and panoramas at the bottom. Each image has a caption associated with it, describing its main features in a few keywords. It is thus possible to give an alternative view of the collection, with photographs arranged by caption similarity. All thematically related images are grouped in Figure 6.2(b), e.g. photographs of the Statue of Liberty. Both views are complementary, as they support different types of browsing tasks.
Because of this inherent structural complexity, a database schema (metadata) deserves to be visualized in its own right. A preliminary metadata analysis can assist in planning and structuring a subsequent data analysis. There is an enormous benefit in integrating both forms of visualization, because the metadata presentation provides context for exploring and navigating through detail presentations of data. Although a database is likely to change over time, its schema, if properly defined to start with, will undergo few changes. Having a stable starting point for the visualization process is important, and metadata presentation is an ideal choice.
Integration of metadata visualization, to convey the overall structure of a given database, and data visualization, to reveal the actual detail, enables the database to be analysed in its entirety. Most existing database visualization systems, e.g. DataSpace [Petajan97], Exbase [Lee96], VisDB [Keim94], focus solely on data visualization, disregarding the benefits of metadata visualization. Such an approach is satisfactory for a homogeneous database, but fails in the more general and common heterogeneous case.
An E-R diagram is, in essence, a graph whose vertices have complex properties and may be nested. Visualizing it in a proximity grid will allow large visual representations of entities to be accommodated, without causing them to overlap. It is convenient to arrange for a viewer to drill down to the entity (type) instance level by interacting with this schema representation. At the same time the E-R diagram provides context for visualization of this higher level of detail, and thus assists in exploration of the database. For example, a fisheye lens metaphor [Furnas86] can be employed, so that details of data and relationships can be explored, while maintaining the context at the periphery.
Figure 6.3 gives an example of the proximity visualization of a database table, based on the crcars data table [Donoho83], which also appears in Section B.1 in the appendices. The following attributes of 406 cars have been recorded: model name, miles per gallon, number of cylinders, engine displacement, horsepower, weight, acceleration, model year, and origin - American, European, or Japanese; with some values missing. The Euclidean-general dissimilarity coefficient (1.8) has been used to calculate dissimilarities between cars in pairs. All attributes have been given equal weight, except model name and origin, which have been given zero weight, and thus effectively excluded from the calculation of dissimilarity for the same reasons as in Section 2.1.
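The effect of attribute weights and missing values on a pairwise dissimilarity can be sketched as follows. Coefficient (1.8) is not reproduced in this section, so the formula below is an assumption: a weighted Euclidean distance over range-normalised numeric attributes, which skips any attribute that has zero weight or is missing in either record. The attribute names, ranges, and car records are hypothetical values made up for illustration and are not taken from the crcars table.

```python
import math

def general_dissimilarity(a, b, ranges, weights):
    """Weighted Euclidean dissimilarity over range-normalised attributes.

    ASSUMPTION: illustrative formula only, not coefficient (1.8).
    Attributes with zero weight, or missing (None) in either record,
    do not contribute to the result.
    """
    numerator = 0.0
    weight_sum = 0.0
    for key, w in weights.items():
        x, y = a.get(key), b.get(key)
        if w == 0 or x is None or y is None:
            continue
        numerator += w * ((x - y) / ranges[key]) ** 2
        weight_sum += w
    if weight_sum == 0:
        return float("nan")  # no comparable attributes
    return math.sqrt(numerator / weight_sum)

# Hypothetical records and ranges, for illustration only.
weights = {"cylinders": 1, "weight": 1, "mpg": 1, "origin": 0}
ranges = {"cylinders": 4.0, "weight": 4000.0, "mpg": 40.0}
car_a = {"cylinders": 8, "weight": 4000, "mpg": None, "origin": "American"}
car_b = {"cylinders": 4, "weight": 2000, "mpg": 30, "origin": "Japanese"}
d_ab = general_dissimilarity(car_a, car_b, ranges, weights)
```

In this sketch the missing miles-per-gallon value and the zero-weighted origin are both ignored, so the result depends only on the number of cylinders and the weight.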
Individual cars are represented as solid circles in Figure 6.3, with the green colour component assigned to the number of cylinders attribute, and the red component linked to model year. These two attributes can be seen to explain the data well, with the former determining the horizontal location of cars, and the vertical direction coinciding with the latter attribute. Three clusters can be clearly identified, from left to right: 8, 5 or 6, and 4 cylinder cars. This structure is easy to explain once it is realised that the number of cylinders strongly determines other characteristics of a car, like its weight and horsepower, as do technological advances captured in the model year attribute.
A Minimum Spanning Tree (MST) can be computed for the cars data table, as in Section 4.2.1, to identify the most similar pairs of cars - leaves in a single-link cluster hierarchy. The MST is superimposed on the configuration of Figure 6.3, where the weight of an edge connecting a pair of cars is equal to their dissimilarity, and is coded in greyscale, ranging from black for an identical pair to light grey for the most dissimilar pair. In a perfect geometrical representation of the data, no edges would cross over, and every vertex (car) would be connected by an edge to its closest neighbour. Thus, overlaying the MST on the configuration allows one to see easily which links are badly represented, and which vertices are positioned inaccurately [Gordon81]. Based on this criterion the visualization of Figure 6.3 can be seen to have a small degree of error. Such a graph representation ties in naturally with the schema graph.
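A minimal sketch of computing an MST from a matrix of pairwise dissimilarities is given below, using Prim's algorithm. The choice of algorithm is an assumption made for illustration and may differ from the method of Section 4.2.1; the function name and the small matrix are hypothetical.

```python
def minimum_spanning_tree(dissim):
    """Prim's algorithm on a symmetric dissimilarity matrix (list of lists).

    Returns a list of (i, j, weight) edges; assumes all pairs are comparable.
    """
    n = len(dissim)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known link from the tree to each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree with the cheapest link to it.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dissim[parent[u]][u]))
        # Update the cheapest links of the remaining vertices.
        for v in range(n):
            if not in_tree[v] and dissim[u][v] < best[v]:
                best[v] = dissim[u][v]
                parent[v] = u
    return edges

# Hypothetical dissimilarities between four objects.
toy_matrix = [
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
]
mst_edges = minimum_spanning_tree(toy_matrix)
```

Each returned edge carries its dissimilarity, so a drawing routine could map that value to a grey level when superimposing the tree on a configuration.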
Proximity visualization can be used to represent more than one database table or object extent at a time. Potentially, if two sets of tuples or objects have some overlapping attributes, it will be possible to calculate pairwise dissimilarities within the amalgamated set, and subsequently apply proximity visualization. This could be an effective way of visualizing relationships and inheritance at the detailed, instance level, analogously to drawing a bipartite graph.
On the other hand, proximity visualization is also suitable for representing a subset of instances from a database table or an object extent, resulting from a query, for example. Thus, a viewer could select a subset of elements in a visualization, in order to analyse the corresponding instances in more detail. Apart from a specialised view of data, this query-by-example mechanism can be used to construct a generalised view. Such a natural evolution of analysis, coupled with provision for backtracking, will encourage the viewer to explore data.
© 2001 Wojciech Basalaj