Computational Prediction of Protein Complexes from Protein Interaction Networks. Sriganesh Srihari
2.6 Building High-Confidence PPI Networks
From our discussions on experimental protocols in earlier sections, we know that some protocols, including the AP/MS ones, offer only pulled-down complexes consisting of baits and their preys, without specifying the binary interactions between these components. Therefore, binary interactions need to be specifically inferred between the bait and each of its preys within the pulled-down complexes. However, not all preys in a pulled-down complex interact directly with the bait; some are pulled down only because they interact with other preys in the complex. Therefore, it is necessary to infer binary interactions not just between the bait and its preys but also between the interacting preys. Yet care should be taken to avoid inferring spurious (false-positive) interactions between preys that do not interact. To overcome these uncertainties, a balance is often sought between two kinds of models, spoke and matrix, which are used to transform pulled-down complexes into binary interactions between the proteins [Gavin et al. 2006, Krogan et al. 2006, Spirin and Mirny 2003, Zhang et al. 2008].
The spoke model assumes that the only interactions in the complex are between the bait and its preys, like the spokes of a wheel. This model is useful for reducing the complexity of the data, but it misses all (true) prey–prey interactions. On the other hand, the matrix model assumes that every pair of proteins within a complex interacts. This model can cover all possible true interactions, but it can also predict a large number of spurious interactions. An empirical evaluation using 1,993 baits and 2,760 preys from the dataset of Gavin et al. [2006] against 13,384 pairwise interactions between proteins within the expert-curated MIPS complexes [Mewes et al. 2006] revealed 80.2% false-negative (missing) interactions and 39% false-positive (spurious) interactions in the spoke model, and 31.2% false-negative interactions but 308.7% false-positive interactions in the matrix model [Zhang et al. 2008]. However, note that many of the missing interactions could be due to the lack of protein coverage in these experiments. A balance is therefore struck between the two models that covers as many true interactions between the baits and preys as possible without allowing too many false interactions [Gavin et al. 2006]; see Figure 2.5.
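The difference between the two models can be made concrete with a minimal sketch. The pull-down, the reference set, and all function names below are hypothetical toy choices made for illustration only; they are not taken from the Gavin et al. or MIPS data. The sketch reports raw counts of missing and spurious pairs, because the normalization behind the percentages quoted above is not stated in this section.

```python
from itertools import combinations


def spoke_edges(bait, preys):
    """Spoke model: connect the bait to each of its preys, and nothing else."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_edges(bait, preys):
    """Matrix model: connect every pair of proteins in the pull-down."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}


def compare_to_reference(inferred, reference):
    """Return (missing, spurious) counts relative to a reference edge set."""
    missing = reference - inferred    # true interactions the model did not infer
    spurious = inferred - reference   # inferred interactions absent from the reference
    return len(missing), len(spurious)


# Hypothetical pull-down: bait "B" purified together with three preys.
bait, preys = "B", ["P1", "P2", "P3"]

# Hypothetical reference interactions among the same proteins.
reference = {
    frozenset(pair)
    for pair in [("B", "P1"), ("B", "P2"), ("P1", "P2"), ("P2", "P3")]
}

spoke_result = compare_to_reference(spoke_edges(bait, preys), reference)
matrix_result = compare_to_reference(matrix_edges(bait, preys), reference)

print("spoke  (missing, spurious):", spoke_result)
print("matrix (missing, spurious):", matrix_result)
```

On this toy input the spoke model misses the two prey–prey interactions, whereas the matrix model misses none but infers more spurious pairs, which mirrors the trade-off described above.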
Gaining Confidence in PPI Networks
Although high-throughput studies have been successful in mapping large fractions of interactomes from multiple organisms, the datasets generated from these studies are not free from errors. High-throughput PPI datasets often contain a considerable number of spurious interactions, while missing a substantial number of true interactions [Von Mering et al. 2002, Bader and Hogue 2002, Cusick et al. 2009]. Consequently, a crucial challenge in adopting these datasets for downstream analysis, including protein complex prediction, is overcoming these errors.
Figure 2.5 Inferring protein interactions from pull-down protein complexes. Bait–prey relationships from pull-down complexes are assembled using the spoke model, where the bait is connected to each of the preys (A); or using the matrix model, where every bait–prey and prey–prey pair is connected (B). However, these models either miss many true interactions or produce too many spurious interactions. Therefore, a combination of the spoke and matrix models is used, where a balance is sought between the two models by weighting the interactions (C). Interactions with low weights are discarded to give the final set of high-confidence inferred interactions (D).
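Panels C and D of Figure 2.5 can be illustrated with a deliberately simplified sketch. It assumes a hypothetical weighting in which each bait–prey observation contributes a full unit of evidence and each prey–prey co-purification contributes half a unit; these weights, the threshold, and the toy pull-downs are illustrative assumptions, and this is not the scoring scheme used by Gavin et al. [2006] or any other cited work.

```python
from collections import defaultdict
from itertools import combinations


def weighted_interactions(pulldowns, spoke_weight=1.0, matrix_weight=0.5):
    """Accumulate a weight per protein pair over all pull-downs.

    spoke_weight and matrix_weight are hypothetical values: bait-prey
    evidence is trusted more than prey-prey co-purification evidence.
    """
    weights = defaultdict(float)
    for bait, preys in pulldowns:
        prey_set = set(preys) - {bait}
        for prey in prey_set:
            weights[frozenset((bait, prey))] += spoke_weight
        for a, b in combinations(sorted(prey_set), 2):
            weights[frozenset((a, b))] += matrix_weight
    return dict(weights)


def high_confidence(weights, threshold):
    """Keep only the pairs whose accumulated weight reaches the threshold."""
    return {pair for pair, weight in weights.items() if weight >= threshold}


# Hypothetical pull-downs, each given as (bait, list of preys).
pulldowns = [
    ("A", ["B", "C", "D"]),
    ("B", ["A", "C"]),
    ("C", ["B", "E"]),
]

weights = weighted_interactions(pulldowns)
kept = high_confidence(weights, threshold=1.5)

for pair in sorted(kept, key=sorted):
    print(sorted(pair), weights[pair])
```

Pairs supported by several pull-downs, such as B and C here, accumulate weight from both kinds of evidence, while pairs seen only once as co-purified preys fall below the threshold and are discarded.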
Spurious Interactions
Spurious or false-positive interactions in high-throughput screens may arise from technical limitations in the underlying experimental protocols, or from limitations in the (computational) inference of interactions from the screen. For example, the Y2H system, despite being in vivo, does not consider localization (compartmentalization), time, or cellular context while testing for binding partners. Since all proteins are tested within one compartment (the nucleus), there is a high chance that two proteins that belong to different compartments, and are therefore unlikely to meet during their lifetimes in live cells, test positive for interaction. Similarly, in vitro TAP pull-downs are carried out using cell lysates in an environment where every protein is present in the same “uncompartmentalized soup” [Mackay et al. 2007, Welch 2009]. Therefore, even though two proteins interact under these laboratory conditions, it is not certain that they will ever meet or interact during their lifetimes in live cells. There is also ample opportunity for “sticky” molecules to act as bridges between proteins, causing these proteins to interact promiscuously with partners that they never interact with in live cells [Mackay et al. 2007]. Once these complexes are pulled down, the model used to infer binary interactions, between bait and prey or between preys, can also result in the inference of spurious interactions (see the spoke and matrix models). Recent analyses showed that only 30–50% of interactions inferred from high-throughput screens actually occur within cells [Shoemaker and Panchenko 2007, Welch 2009], while the remaining interactions are false positives.
Missing Interactions and Lack of Concordance Between Datasets
Comparisons between datasets from different techniques have shown a striking lack of concordance, with each technique producing a unique distribution of interactions [Shoemaker and Panchenko 2007, Von Mering et al. 2002, Bader and Hogue 2002, Cusick et al. 2009]. Moreover, certain interactions depend on post-translational modifications such as disulfide-bridge formation, glycosylation, and phosphorylation, which may not be supported in the adopted system. Many of these techniques also show a bias toward abundant proteins (e.g., soluble proteins) and a bias against certain kinds of proteins (e.g., membrane proteins). For example, AP/MS screens predict relatively few interactions for proteins involved in transport and sensing (trans-membrane proteins), while Y2H screens, being targeted in the nucleus, fail to cover extracellular proteins [Shoemaker and Panchenko 2007]. These limitations result in a considerable number of missed interactions in interactome datasets.
Welch [2009] summed up the status of interactome maps, in light of the above limitations, as “fuzzy,” i.e., error-prone, yet filled with promise.
Estimating Reliabilities of Interactions
The coverage of true interactions can be increased by integrating datasets from multiple experiments. This integration helps ensure that all or most regions of the interactome are sufficiently represented in the PPI network. However, overcoming spurious interactions remains a challenge, and one that is magnified when datasets are integrated. Therefore, it becomes necessary to estimate the reliabilities of interactions, so that only the highly reliable interactions are kept while the spurious or less reliable ones are discarded.
Confidence or reliability scoring schemes assign a score (weight) to each interaction in the PPI network. For an interaction (u, v) ∈ E in the scored (weighted) PPI network G = ⟨V, E, w⟩, the score w(u, v) encodes the confidence in the physical interaction between the two proteins u and v. The scoring function w: V × V → ℝ accounts for the biological variability and technical limitations of the experiments used to infer the interactions. The scoring schemes can be classified into three broad categories (Table 2.4): (i) sampling or counting-based, (ii) biological evidence-based, and (iii) topology-based schemes.
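As a rough illustration of a weighted network G = ⟨V, E, w⟩ and of discarding low-confidence interactions, the sketch below uses a naive counting-based score: the fraction of integrated datasets that report a given pair. This score, the cutoff, the dataset names, and the function names are assumptions chosen for illustration; they do not correspond to any specific scheme listed in Table 2.4.

```python
from collections import Counter


def normalize(u, v):
    """Represent an undirected interaction as an ordered tuple."""
    return tuple(sorted((u, v)))


def count_based_scores(datasets):
    """Score each pair by the fraction of datasets that report it."""
    counts = Counter()
    for edges in datasets:
        # Deduplicate within a dataset so each dataset votes at most once per pair.
        for pair in {normalize(u, v) for u, v in edges if u != v}:
            counts[pair] += 1
    n = len(datasets)
    return {pair: count / n for pair, count in counts.items()}


def filter_network(scores, min_score):
    """Return (V, E) restricted to interactions with w(u, v) >= min_score."""
    edges = {pair: w for pair, w in scores.items() if w >= min_score}
    vertices = {protein for pair in edges for protein in pair}
    return vertices, edges


# Hypothetical interaction lists from three different experiments.
y2h = [("A", "B"), ("B", "C"), ("C", "D")]
apms = [("B", "A"), ("B", "C"), ("D", "E")]
curated = [("A", "B"), ("D", "C"), ("A", "E")]

scores = count_based_scores([y2h, apms, curated])
V, E = filter_network(scores, min_score=0.5)

for (u, v), w in sorted(E.items()):
    print(u, v, round(w, 2))
```

Integrating the three lists increases coverage (five distinct pairs in total), while the cutoff retains only the pairs that more than one experiment supports.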