Computational Prediction of Protein Complexes from Protein Interaction Networks. Sriganesh Srihari
Читать онлайн книгу.where (u, v) is the length of the shortest path between two connected nodes u and v. This average shortest path length of the PPI network is significantly shorter than that of an ER random network; in an ER network, the average shortest path length is proportional to ln n. However, both PPI networks as well as ER networks fall under “small-world” networks, because the average path length between any two nodes is still significantly smaller than the network size: n >> k >> ln n >> 1, where k is the average node degree. The average clustering coefficient of a PPI network is significantly higher than a random network constructed on the same protein set and with the same average shortest path length.
This small-world property can also be quantified using two coefficients: closeness and betweenness [Hormozdiari et al. 2007]. The closeness of v is defined based on its average shortest path length to all other proteins reachable from v (that is, within the same connected component as v) in the network,
where R(G, v) is the set of proteins reachable from v. Closeness is thus the inverse of the average distance of v to all other nodes in the network. The betweenness of the node v measures the extent to which v lies “between” any pair of proteins connected to v in the network. Let Sxy be the number of shortest paths between all pairs x, y ∈ R(G, v), and let Sxy(v) be the number of these shortest paths that pass through v. The betweenness Bet(v) of v is defined as:
The distributions of closeness and betweenness coefficients for PPI networks are significantly different from that of random networks [Hormozdiari et al. 2007]. In particular, PPI networks contain central proteins which have high betweenness and these connect and hold together different groups of proteins or regions of the network.
2.4 Theoretical Models for PPI Networks
Studying the topological properties of PPI networks has gained considerable attention in the last several years, and in particular various network models have been proposed to describe PPI networks. As new PPI data became available, these theoretical models have been improved to accurately model the data. These include the earliest models such as the Erdös-Rényi model [Erdös 1960, Bollobás 1985], and more recent ones such as the Barabási-Albert [Barabási 1999, Albert and Barabási 2002], Watts-Strogatz [Watts and Strogatz 1998], and hierarchical [Ravasz and Barabási 2003] network models.
With a fixed probability for each edge, the Erdös-Rényi (ER) networks (discussed above) look to be intuitive and seemingly correct, and were once commonly used to model real-world networks. However, in reality, ER networks are rarely found in real-world examples. Most real-world networks—including road and airline routes, social contacts, webpage links, and also PPI networks—do not have evenly distributed degrees. Moreover, although ER networks have the small-world property, they have almost no clustering effects. For these reasons, using ER networks as null models for comparison against real-world networks including PPI networks is usually inappropriate.
In the Barabási-Albert (BA) model [Barabási 1999, Albert and Barabási 2002], nodes are added one at a time to simulate network growth. The probability of an edge forming between an incoming new node and existing nodes is based on the principle of preferential attachment—that is, existing nodes with higher degrees have a higher likelihood of forming an edge with the incoming node. Hence, the degree distribution follows a power-law distribution (scale-free) P(k) ∼ k−γ and exhibits small-world behavior, but like ER, lacks modular organization (low clustering behavior). The BA model is particularly important as one of the earliest instances of network models proposed as a suitable null model, instead of ER, for comparisons against biological networks [Barabási 1999, Albert and Barabási 2002]. This led to a panoply of works describing the properties of various biological network types including metabolic [Wagner and Fell 2001] and PPI networks [Yook et al. 2004] using BA. At the same time, the deficiencies of the BA model started to become more clear: while the scale-free property is preserved, modularity is not. Thus, better null models are needed.
The Watts-Strogatz (WS) model [Watts and Strogatz 1998] is seldom used for modeling biological networks but demonstrates how easy it is to transition from a non-small world network (e.g., a lattice graph) to small-world network with random rewiring of a few edges. The generation procedure is amazingly simple: From a ring lattice with n vertices and k edges per vertex, each edge is rewired at random with probability p. If p = 0, then the graph is completely regular, and if 1, the graph is completely random/disordered. For the WS model, the intermediate region, where 0 < p < 1 can be selected to examine its intrinsic properties. Aside from potential use for investigating the level of ordered-ness of an observed network, the WS model has seen few applications in biology. Its major contribution toward network biology is that given how easy the small-world property can emerge from minor disruption of a regular lattice graph, it is not a unique defining property. Therefore, the myriad of research papers that describe the biological network under analysis as small world are not reporting particularly useful information.
The BA model is able to capture the scale-free property observed in biological networks but has very low internal clustering. This seemed at odds with the idea that biological networks, or at least PPI networks, are modular in nature. Proteins achieve their functionality by virtue of extensive interconnections with other proteins, forming simple physical interactions, which at higher levels can be envisaged as complexes and functional modules. To better capture the clustering effects in real biological networks, hierarchical models were proposed. Hierarchical network (HN) models [Ravasz and Barabási 2003] are iterative approaches for generating networks that encapsulate both scale-free and highly clustered behavior. For a real network, the average clustering coefficient for the entire network is higher than the BA model and ER model. To construct a HN model, tightly clustered cores with high clustering coefficient are first generated. These are then iteratively connected by selecting random nodes in each core, and having these connect to another. The downside, however, is that developing HN models requires making certain assumptions. For instance, we have to assume that we know the distribution of the sizes and clustering densities of the embedded modules. We also assume that these modules combine in an iterative manner. In biological networks, however, it seems that the distinction boundary between modules is not very sharp with high levels of interconnectivity.
PPI networks from the yeast S. cerevisiae and the bacteria H. pylori resulting from some of the high-throughput studies mentioned earlier [Uetz et al. 2000, Ito et al. 2000, Xenarios et al. 2002] have been shown to have scale-free degree distributions [Pržulj et al. 2004, Jeong et al. 2001, Maslov and Sneppen 2002]. However, the larger D. melanogaster (fruit fly) PPI network has been shown to decay faster than a power law [Giot et al. 2003]. Furthermore, the shortest path distribution and the frequencies of cycles of 3–15 nodes in the fruitfly network differ from those of randomly rewired networks which preserve the same degree distribution as the original PPI network [Giot et al. 2003]. To better capture these frequency distributions of node-cycles, geometric random graph models were proposed. In a geometric random graph model, the nodes correspond to independently and uniformly distributed points in a metric space, and two nodes are linked by an edge if the distance between them is smaller than or equal to some radius r, where the distance is an arbitrary distance norm in the metric space [Penrose 2003]. Pržulj et al. [2004] used geometric random graphs to model PPI networks, by defining the points in 2D, 3D, and 4D Euclidean space, and distance between the points measured using Euclidean distance. Pržulj et al. studied the similarity between geometric random graphs and PPI networks using the distributions for graphlets. A graphlet is a connected network with a small number of nodes, and graphlet frequency is the number of occurrences of a graphlet in a network. The authors defined 29 types of graphlets using 3–5 nodes (Figure 2.3). The relative frequency of a graphlet i