Bioinformatics and Medical Applications. Group of authors
Predictors
Beyond the scientific interest, there is a practical interest in determining the predominant function of a protein. Proteins generally do not act on a single pathogen; on the contrary, a given protein may act on several pathogens. For this reason, when examining a database specialized in proteins, it is common to find the same protein reported several times, each time with a different reported action.
On the other hand, determining the action or function of a protein involves costly experiments and/or clinical trials. Moreover, some proteins with action against pathogens are increasingly difficult to find in nature, as is the case of SCAAPs (selective cationic amphipathic antibacterial peptides), which are highly toxic to bacterial membranes and harmless to erythrocytes.
SCAAPs are also very short peptides, 6 to 14 amino acids (aa) long. These characteristics make them very valuable in the production of pharmaceutical drugs; however, they are increasingly difficult to find in nature.
In this scenario, it is very useful to have computational mathematical predictors that can identify the preponderant function of a protein from its sequence alone. This makes it possible to search databases for peptides with a specific function. Of course, this does not replace experimental testing, but it substantially reduces the number of proteins that must be tested.
Algorithms that predict the predominant function of a protein can be classified in several ways. Some represent proteins in three-dimensional space rather than as a linear sequence. Others use stochastic rather than deterministic algorithms, and may or may not evaluate physico-chemical properties.
In this work, we use two main divisions: supervised algorithms and non-supervised algorithms. Both are discussed below.
3.4.1 Supervised Algorithms
A supervised algorithm is a computational code that requires calibration or training to know what to look for. This makes such codes programmer-dependent: in a first stage the code is calibrated and, in a second stage, once calibrated, it can search for a particular profile.
In the proteomics and genomics fields, there are many different algorithms designed under this assumption.
3.4.2 Non-Supervised Algorithms
A non-supervised algorithm is a computational code that does not require calibration or training to know what to look for; if calibration is required, it involves only one part of the code, and the code modifies itself to adjust the search criteria. Running these codes does not depend on the programmer.
Algorithms of this type also exist in the proteomics and genomics fields; although they are fewer, they are very useful.
In this chapter, we use an algorithm of this type, named Polarity Index Method®, to explore the SARS-CoV-2 structural proteins.
3.5 Polarity Index Method®
The non-supervised algorithm named Polarity Index Method® (PIM®) is a system programmed in FORTRAN 77 and run on Linux. It calculates the PIM® profile of the target protein group and compares it with the profiles of other groups, modifying the profile of the target group to make it representative of that group and discriminant with respect to the other protein groups it is compared with.
3.5.1 The PIM® Profile
The metric of the PIM® profile consists of evaluating the 16 charge/polarity interactions identified by reading the sequence of a protein by pairs of residues, from left to right. The PIM® system has three stages:
1. The amino acid sequence is converted to the numeric charge/polarity-related annotations P+, P−, N, and NP, where P+ are H, His; K, Lys; and R, Arg; P− are D, Asp; and E, Glu; N are C, Cys; G, Gly; N, Asn; Q, Gln; S, Ser; T, Thr; and Y, Tyr; and NP are A, Ala; F, Phe; I, Ile; L, Leu; M, Met; P, Pro; V, Val; and W, Trp.
2. With the sequence expressed in FASTA format, all incidences of these pairs of amino acids are registered in a 4 × 4 matrix whose rows and columns are the four PIM® profile groups. Once all amino acid pairs are recorded, the incidence matrix is normalized.
3. A 16-element vector is created by placing, from left to right, the 16 positions of the incidence matrix in increasing or decreasing order of incidence. Two proteins are equal if their 16-element vectors are the same.
Two proteins whose 16-element vectors are equal are taken to share the same preponderant function.
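The three stages above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' FORTRAN 77 implementation: the names `pim_matrix` and `pim_vector` are hypothetical, pairs are assumed to be overlapping neighbors, the matrix is assumed to be normalized by the total number of counted pairs, the vector is given in decreasing order, and residues outside the 20 standard amino acids are simply skipped.

```python
from itertools import product

# Charge/polarity groups from stage 1 (one-letter codes; N is Asn).
GROUPS = {
    "P+": "HKR",
    "P-": "DE",
    "N": "CGNQSTY",
    "NP": "AFILMPVW",
}
ORDER = ["P+", "P-", "N", "NP"]  # row/column order of the 4 x 4 matrix
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def pim_matrix(sequence):
    """Return the normalized 4 x 4 incidence matrix as a dict keyed by (row, col)."""
    counts = {pair: 0 for pair in product(ORDER, repeat=2)}
    seq = sequence.upper()
    # Assumption: read left to right by overlapping pairs of neighboring residues.
    for left, right in zip(seq, seq[1:]):
        if left in RESIDUE_TO_GROUP and right in RESIDUE_TO_GROUP:
            counts[(RESIDUE_TO_GROUP[left], RESIDUE_TO_GROUP[right])] += 1
    total = sum(counts.values())
    # Assumption: normalize by the total number of counted pairs.
    return {pair: (n / total if total else 0.0) for pair, n in counts.items()}


def pim_vector(sequence):
    """Return the 16 interactions ordered from most to least frequent."""
    matrix = pim_matrix(sequence)
    # Ties are broken by the fixed matrix position so the result is deterministic.
    position = {pair: i for i, pair in enumerate(product(ORDER, repeat=2))}
    return sorted(matrix, key=lambda pair: (-matrix[pair], position[pair]))
```

Calling `pim_vector` on a protein sequence (the residue lines of a FASTA record, without the header line) returns a list of 16 `(group, group)` tuples. The tie-breaking rule is a choice made for this sketch; the source does not say how equal incidences are ordered.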
3.5.2 Advantages
The main advantage of this method is that the metric acts on the linear representation of the protein and not on its three-dimensional representation, which makes a simple analysis possible. In addition, only one physico-chemical property is evaluated: the polarity/charge of the protein.
The analysis is comprehensive because the full spectrum of PIM® profile incidences is examined; in other words, the PIM® profile is not a single number but a 16-element vector. Thus, two proteins have the same PIM® profile if their 16-element vectors are equal.
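Because the profile is a vector rather than a single number, comparing two proteins amounts to an element-by-element comparison. A minimal Python sketch follows; the function names and the position-match count are hypothetical additions for illustration and are not part of the PIM® system as described.

```python
def same_profile(vec_a, vec_b):
    """True only if both 16-element vectors agree at every position."""
    if len(vec_a) != 16 or len(vec_b) != 16:
        raise ValueError("a PIM profile vector must have 16 elements")
    return list(vec_a) == list(vec_b)


def matching_positions(vec_a, vec_b):
    """Hypothetical helper: number of positions where the two vectors agree."""
    return sum(a == b for a, b in zip(vec_a, vec_b))


# Toy vectors: the 16 interactions are labeled 1..16 here for brevity.
profile_a = list(range(1, 17))
profile_b = list(range(1, 17))
profile_c = [2, 1] + list(range(3, 17))  # first two interactions swapped
```

With these toy vectors, `same_profile(profile_a, profile_b)` is true, while `profile_a` and `profile_c` differ even though only two neighboring interactions are exchanged. A single summary number could hide that difference.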
3.5.3 Disadvantages
Its integration into a biochip is not yet complete, which restricts its use to determining the predominant function of a protein. Nowadays, however, it is not enough to identify the function/structure of a protein; it is also necessary to detect it in the blood of an organism and to quantify it.
Completing this integration would enable the use of the PIM® profile as a rapid detection test.
3.5.4 SARS-CoV-2 Recognition Using PIM® Profile
The PIM® profile (Section 3.5.1) was determined for the four SARS-CoV-2 structural proteins: spike, membrane, envelope, and nucleocapsid (Section 3.3.3), and their smooth curves were plotted (Figure 3.1). The PIM® profiles are similar, except in the region between the polar interactions [P−, P−] and [N, P−] (see Figure 3.2).
Figure 3.1 Relative frequency distribution of the four SARS-CoV-2 structural protein groups, represented by “smooth curves”. Graphs were produced using Excel software. The X-axis represents the 16 charge/polarity interactions. The ellipse shows the region where the curves do not match the trend.
Figure 3.2 shows that the PIM® profiles of the spike and envelope proteins behave particularly differently, while the membrane profile is a translation of the nucleocapsid profile.
A review of the histograms (Figure 3.3) of the relative frequency distribution of the residues in the sequences of the SARS-CoV-2 structural proteins (spike, envelope, membrane, and nucleocapsid) shows that none of them are similar. In general, this behavior does not necessarily depend on the length of the sequence.
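A relative frequency distribution of residues, of the kind summarized in such histograms, can be computed as in the following Python sketch. The function name `residue_frequencies` is hypothetical, and the choice to ignore characters outside the 20 standard amino acids is an assumption of this sketch.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def residue_frequencies(sequence):
    """Return the relative frequency of each standard residue in the sequence."""
    # Assumption: characters outside the 20 standard residues are ignored.
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues)
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
```

Because each count is divided by the number of residues, the resulting values sum to 1 for any non-empty sequence, so histograms of proteins of different lengths are on the same scale.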