Our understanding of the importance of lysine post-translational modifications in mediating protein function has led to a significant improvement in the experimental tools aimed at characterizing their existence. Nevertheless, it remains likely that at present we have only experimentally detected a small fraction of all lysine modification sites across the commonly studied proteomes. As a result, online computational tools aimed at predicting lysine modification sites have the potential to provide valuable insight to researchers developing hypotheses regarding these modifications. This chapter discusses the metrics and procedures used to assess predictive tools and surveys 11 online computational tools aimed at the prediction of the four most widely studied lysine post-translational modifications (acetylation, methylation, SUMOylation and ubiquitination). Analyses using unbiased testing data sets suggest that nine of the 11 lysine post-translational modification tools perform no better than random, or have false-positive rates which make them unusable by the experimental biologist, despite self-reported sensitivity and specificity values to the contrary. The implications of these findings for those using and creating lysine post-translational modification software are discussed.
The chapter titles in this volume illustrate the depth and diversity of cellular processes mediated through the PTM (post-translational modification) of lysine residues. Given the importance of lysine-based modifications (acetylation, SUMOylation etc.) in controlling protein activity and, in turn, affecting human disease, it follows that characterizing lysine PTM states is fundamental to comprehensively understanding protein function.
Although our ability to detect and accurately localize protein PTMs on lysine residues has improved significantly in the past several years due to enhancements in the techniques and reagents used to enrich for low-abundance modification sites from complex mixtures, and in tandem MS (mass spectrometry) instrumentation, it remains likely that we have as yet uncovered only a minute portion of the complete lysine ‘modifyome’. Even for a single protein, testing every lysine residue for all possible modifications is typically prohibitively costly and laborious. As such, computational predictive techniques may serve as the first line of hypothesis generation prior to carrying out experimental work. While the computational techniques for PTM prediction can be relatively complex, often predictive algorithms are translated into user-friendly online web tools whereby experimentalists can upload their protein sequences and receive lysine modification prediction results within minutes. Caution should be taken, however, as this chapter will show that it is advantageous for the user to carry out simple tests to assess the predictive capacity of an online bioinformatic tool prior to incorporating it into an experimental pipeline.
What follows is a discussion of the creation and assessment of PTM prediction algorithms, as well as an overview of the current online bioinformatic tools aimed at the prediction of acetylation, methylation, SUMOylation and ubiquitination sites in proteins.
Strategies of prediction: from consensus sequences to machine learning
Imagine you are given the sites and surrounding protein sequences from ten instances of a new lysine PTM called ‘randomylation’. Your task is to find additional randomylation sites. Since you are on a tight budget and have limited time, you decide to attempt to first predict potential randomylation sites in an effort to direct your later wet-lab experiments. How might you accomplish this task?
Among the earliest strategies for PTM prediction was the generation of consensus sequences (also known as motifs), which effectively transformed the information contained in sequence alignments into character-based patterns which could then be scanned against a protein of interest. In Figure 1, taking the most frequent residue at each position (i.e. column) in the alignment results in the consensus sequence ECLAkDELK (where lowercase k represents the modified amino acid). A ‘back of the envelope’ calculation (assuming equal amino acid frequencies for the 20 residues and independence of all nine positions) indicates that this sequence should appear once every 20^9 (approximately 5 × 10^11) amino acids (several orders of magnitude larger than the entire human proteome!). To handle this level of stringency, consensus sequences can be created which allow for multiple residues at a particular position and may only include residues exceeding a frequency threshold. Setting a frequency threshold of 60% in the example above results in the consensus sequence CxAkxE (where x denotes any amino acid). It is important to note, however, that only two of the ten sequences contain this full consensus and that one of the ten sequences contains no aspects of the consensus other than the central modified residue.
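The consensus procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any published tool: the five aligned windows and the test protein are invented for the example (they are not the sequences of Figure 1), and the function names `consensus` and `scan` are hypothetical.

```python
from collections import Counter
import re

# Hypothetical aligned 9-residue windows centred on the modified lysine (index 4).
# Invented for illustration; NOT the alignment shown in Figure 1.
WINDOWS = [
    "ECLAKDELK",
    "DCLAKDEGR",
    "ECRAKSELK",
    "GCTAKQESP",
    "ASVVKNTPQ",
]

def consensus(windows, threshold, centre=4):
    """Build a consensus string; positions below the threshold become 'x'."""
    out = []
    for i, column in enumerate(zip(*windows)):
        residue, count = Counter(column).most_common(1)[0]
        if i == centre:
            out.append("k")  # lowercase k marks the modified lysine
        elif count / len(windows) >= threshold:
            out.append(residue)
        else:
            out.append("x")
    return "".join(out).strip("x")  # drop uninformative flanking positions

def scan(motif, protein):
    """Return 1-based positions of the lysine in every match of the motif."""
    pattern = motif.replace("x", ".").replace("k", "K")
    offset = motif.index("k")
    # The lookahead allows overlapping matches to be reported.
    return [m.start() + offset + 1 for m in re.finditer(f"(?=({pattern}))", protein)]

strict = consensus(WINDOWS, threshold=0.0)   # most frequent residue everywhere
relaxed = consensus(WINDOWS, threshold=0.6)  # keep residues seen in >= 60% of windows
hits = scan(relaxed, "MACGAKTEPLCRAKDEQ")    # hypothetical protein sequence
expected_spacing = 20 ** 9                   # one chance match per 20^9 residues
```

With this toy alignment the strict consensus is a full nine-residue pattern, while the 60% threshold leaves only the well-conserved positions, which is why the relaxed motif matches far more often in a real proteome.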
To overcome the deficiencies of consensus sequences and to allow for the incorporation of additional variables into the prediction methodology, supervised machine learning algorithms have become increasingly used for the prediction of lysine PTMs over the past 5 years. Classification algorithms, such as decision trees, neural networks, hidden Markov models and SVMs (support vector machines), typically use positive and negative training data and their associated features to refine functions aimed at categorizing unknown data. In the case of modification prediction, this translates to using modified (true positive) and unmodified (true negative) sites along with developer-selected sequence features (amino acid, surface exposure, helicity etc.) to predict as yet unknown modification sites. Owing to their widespread success as classifiers in other fields, SVMs have been heavily used in the most recent lysine PTM prediction literature. Briefly, SVMs map categorical training data with a large number of features in a multi-dimensional space and attempt to find the decision surface that maximally separates the data. New data may be categorized by determining on which side of the surface it falls. A two-dimensional example is shown in Figure 2.
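The final step described above, deciding on which side of a decision surface a new data point falls, can be illustrated for the simplest case of a linear surface in two dimensions. This sketch deliberately omits training: the weights, bias and feature values are hypothetical numbers chosen for the example, whereas a real SVM would learn them from the positive and negative training data.

```python
def side_of_surface(weights, bias, point):
    """Classify a point by the sign of w.x + b for a linear decision surface."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return "modified" if score >= 0 else "unmodified"

# Hypothetical surface: the line feature_1 = feature_2 in a two-feature space.
WEIGHTS = (1.0, -1.0)
BIAS = 0.0

label_a = side_of_surface(WEIGHTS, BIAS, (0.8, 0.2))  # falls on the positive side
label_b = side_of_surface(WEIGHTS, BIAS, (0.1, 0.9))  # falls on the negative side
```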
Assessing predictive performance
The predictive performance of lysine PTM software is most commonly assessed using two important metrics, Sn (sensitivity) and Sp (specificity). These metrics are calculated using the equations:
$$\mathrm{Sn} = \frac{TP}{TP + FN} \qquad (1)$$
$$\mathrm{Sp} = \frac{TN}{TN + FP} \qquad (2)$$
where TN, TP, FN and FP stand for the number of true negatives, true positives, false negatives and false positives respectively. In simpler terms, sensitivity is the fraction of the positive sites in a test data set (i.e. modified lysine residues) that are correctly predicted as positive, whereas specificity is the fraction of the negative sites in a test data set (i.e. unmodified lysine residues) that are correctly predicted as negative.
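As a small illustration of these definitions, the following Python sketch counts TP, FN, TN and FP from paired lists of known states and predictions and returns Sn and Sp. The function name and the ten example sites are hypothetical and serve only to show the arithmetic.

```python
def sensitivity_specificity(truth, predicted):
    """Return (Sn, Sp) from paired booleans: True means 'modified'."""
    pairs = list(zip(truth, predicted))
    tp = sum(t and p for t, p in pairs)          # modified, predicted modified
    fn = sum(t and not p for t, p in pairs)      # modified, predicted unmodified
    tn = sum(not t and not p for t, p in pairs)  # unmodified, predicted unmodified
    fp = sum(not t and p for t, p in pairs)      # unmodified, predicted modified
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical test set: four modified and six unmodified lysine residues.
truth = [True] * 4 + [False] * 6
predicted = [True, True, False, False, True, False, False, False, False, False]
sn, sp = sensitivity_specificity(truth, predicted)
```

Note that the function fails with a division by zero if the test set contains no positive or no negative sites, which mirrors the fact that each metric is defined on only one class of data.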
It is critical to note that sensitivity is calculated by only taking into account the positive (post-translationally modified) data and specificity is calculated by only taking into account the negative (post-translationally unmodified) data. Because it is expected that the vast majority of lysine residues in most proteins will remain in an unmodified state, the specificity level reached by an algorithm is often the most important metric for researchers wishing to obtain accurate results from a predictive tool. Thus tools with specificity levels much lower than 90% are likely to be overwhelmed by false positive predictions (see the example below).
Example: the importance of specificity
Assume we have a protein with 50 lysine residues. Of these 50 lysine residues, assume we know a priori that ten are post-translationally modified and the remaining 40 are unmodified. An algorithm with a 30% sensitivity and a 95% specificity would correctly predict three out of the ten positive sites, and would incorrectly predict two out of the 40 negative sites. Thus a researcher wishing to experimentally validate all of the algorithm's positive predictions would find that 60% (three out of five) were correctly predicted as modified. If, however, the algorithm had a 30% sensitivity and a 75% specificity, then three out of the ten positive sites would still remain correctly predicted, but ten out of the 40 unmodified sites would be incorrectly predicted. In this case, a researcher wishing to experimentally test all of the algorithm's positive predictions would find that only 23% (three out of 13) are correct (only marginally better than randomly picking lysine residues, which would result in a 20% success rate). Thus, due to the expected large inequality between modified and unmodified sites (typically even higher than the present example), achieving a high specificity level is imperative for any biologically relevant predictive tool.
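The arithmetic of this example can be reproduced with a short Python sketch, which also makes it easy to try other ratios of modified to unmodified sites. The function name is hypothetical; the numbers are those of the example above.

```python
def fraction_correct(n_modified, n_unmodified, sn, sp):
    """Expected fraction of positive predictions that are truly modified."""
    true_positives = sn * n_modified
    false_positives = (1 - sp) * n_unmodified
    return true_positives / (true_positives + false_positives)

high_sp = fraction_correct(10, 40, sn=0.30, sp=0.95)  # 3 TP vs 2 FP
low_sp = fraction_correct(10, 40, sn=0.30, sp=0.75)   # 3 TP vs 10 FP
random_pick = 10 / (10 + 40)                          # picking lysines at random
```

Rounded to two decimal places these evaluate to 0.60, 0.23 and 0.20 respectively, matching the values quoted in the example.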
For published predictive tools, the determination of sensitivity and specificity values is usually carried out through a cross-validation procedure. This procedure involves setting aside a certain percentage of the total data set to be used in the evaluation of an algorithm's performance metrics. Data are thus divided into a ‘training set’ (used for the refinement of algorithm parameters) and a ‘testing set’ (used for the calculation of sensitivity and specificity). In ‘k-fold’ cross-validation, the procedure of removing a percentage of the total data set is repeated k times (each time using a different subset of the data) and an average sensitivity and specificity is reported. It is critically important, however, that data from test sets are not included in any of the algorithmic refinement procedures. Unfortunately it is not uncommon for researchers to determine important data features to be included in a predictor using the complete data set (i.e. training and testing data combined). This procedure typically results in the (inadvertent) inclusion of testing data in the feature selection procedure, and therefore leads to an overestimation of sensitivity and specificity values when cross-validation is performed. In these cases, the algorithms are said to be ‘over-fit’ to their data and therefore the reported values of sensitivity and specificity are not reflective of the true sensitivity and specificity when non-training data are provided. One option for selecting important data features without biasing an algorithm involves splitting the total data into three subsets. The first and second subsets should then be used to determine important data features to be included in the algorithm, while the third subset should be used only to determine the sensitivity and specificity of the approach.
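Both partitioning schemes described above can be sketched in plain Python. This is one possible way to organize the splits, not a description of how any of the surveyed tools were built; the function names and the 30 placeholder sites are hypothetical. The key point is where feature selection is allowed to happen.

```python
import random

def k_fold_splits(items, k, seed=0):
    """Yield (training, testing) lists; each item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, testing in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        # Feature selection and parameter refinement belong here,
        # using `training` only; `testing` is for Sn and Sp alone.
        yield training, testing

def three_way_split(items, seed=0):
    """Split items into two development subsets and one held-out subset."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    third = len(shuffled) // 3
    return shuffled[:third], shuffled[third:2 * third], shuffled[2 * third:]

sites = list(range(30))  # stand-ins for 30 labelled lysine sites
splits = list(k_fold_splits(sites, k=5))
subset_1, subset_2, held_out = three_way_split(sites)
```

Choosing features once on the full data set before calling `k_fold_splits` would reintroduce exactly the leak described in the text, even though the folds themselves never overlap.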
Thresholds and ROC (receiver operating characteristic) curves
While computational biologists strive to develop predictors with high sensitivity and high specificity, it is intuitive that the relationship between sensitivity and specificity involves a tradeoff. That is, predictors may capture a large proportion of true modification sites, but include a significant number of false positive sites (high sensitivity, low specificity); or alternatively, they may limit the number of false positive sites at the expense of capturing a certain proportion of true modification sites (high specificity, low sensitivity). The discussion of sensitivity and specificity calculations in this section has assumed that predictors provide simple ‘yes/no’ answers for potential modification sites; however, in reality, prediction algorithms typically provide a score associated with each potential modification site. Thus it is possible, by imposing a variety of score thresholds, to plot the general relationship between sensitivity and specificity for a given predictor. This plot, or more precisely, the plot of sensitivity against the false positive rate (1 − specificity), has been historically referred to as a ROC curve, and is the most widely used metric to assess the performance of PTM predictors (see Figure 3). To provide users with the ability to predict modification sites at various points along the ROC curve, it is common for online prediction tools to have a user-adjustable threshold parameter corresponding to several predefined levels of sensitivity and specificity.
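To make the construction of such a curve concrete, the sketch below sweeps a threshold over a set of prediction scores and records the false positive rate and sensitivity at each step. The scores and labels are hypothetical values invented for the example, and the function name is an assumption.

```python
def roc_points(scores, labels):
    """Return (false positive rate, sensitivity) pairs, one per score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    # A site is predicted 'modified' when its score meets the threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and y for s, y in zip(scores, labels))
        fp = sum(s >= threshold and not y for s, y in zip(scores, labels))
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical scores for six lysine residues; True marks a known modified site.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels)
```

The strictest threshold gives the point closest to the origin (few predictions, no false positives), and the most permissive threshold always ends at a false positive rate and sensitivity of 100%.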
Computational tools for lysine modification prediction
To be included in this chapter, a bioinformatic tool for lysine PTM prediction had to meet the following criteria: (i) a description of the tool and algorithm needed to be published in a peer-reviewed journal, (ii) the algorithm needed to have a corresponding web application whereby users could upload data and receive real-time results, (iii) the online web application needed to be functional at the time of chapter preparation, and (iv) suitable testing data to assess the performance of the predictor needed to be available.
Not surprisingly, the number of predictive tools for each lysine modification type was approximately proportional to the amount of data available for the particular modification. Table 1 provides an overview of the various predictive tools assessed in this chapter.
Performance of online tools for lysine modification prediction
While each of the published online tools shown in Table 1 has self-reported sensitivity and specificity values, often these values can deviate significantly from the actual sensitivity and specificity of the predictor for two reasons. First, developers of predictive software can mistakenly include testing data in the refinement of their approach, leading to an ‘over-fit’ predictor (see the section on cross-validation above). Secondly, as new lysine modification data are added to the literature using new experimental techniques, the overall characteristics of the data may change. For example, just over a decade ago the majority of known lysine acetylation sites were found on nuclear proteins, whereas today a large fraction of acetylation sites are known to occur on cytosolic and mitochondrial proteins. It has been shown that acetylated proteins from different cellular compartments have different acetylation motifs [3,4]. Thus a predictor created solely with data available 10 years ago would probably perform poorly on a random sampling of acetylation sites available in the literature today.
Testing methodology and results
In order to fairly and adequately assess the predictive performance of the lysine modification prediction tools listed in Table 1, predictive tools were subjected to random sets of positive and negative lysine PTM data from the recent literature. As described below, when possible, data used to train a predictor were excluded from the random testing data used for this evaluation.
To assess the PAIL, scan-x, PredMod, LysAcet and N-Ace lysine acetylation predictors, a test data set was created by choosing random acetylated proteins from the Choudhary et al. proteome-wide acetylation study (a study whose data were not used to train these acetylation predictors). All lysine acetylation sites within these proteins were tabulated, and overlapping acetylation sites that were found in the PhosphoSitePlus acetylation database (a comprehensive PTM database) were removed to avoid the possibility of including training data in the testing set. For each protein an equivalent number of lysine residues not known to be acetylated were randomly chosen and recorded to create an equivalently sized negative data set. In total, the data set consisted of 24 proteins with 71 acetylated lysine residues and 71 non-acetylated lysine residues. The sequences of each of the 24 proteins were uploaded to each predictive tool's website and prediction results for the specific acetylated and non-acetylated lysine residues were tabulated. The sensitivity and specificity values were calculated as described in eqns (1) and (2) above. To assess the Phosida predictor (which was trained using data solely from Choudhary et al.), acetylation and non-acetylation data from the PhosphoSitePlus database not included in Choudhary et al. were randomly collected using the same procedure as described above. The Phosida test set consisted of 30 proteins with 70 acetylated and 70 non-acetylated lysine residues (the difference in site numbers for the data sets resulted from the fact that random proteins were added until the total number of sites reached approximately 70). Equivalent procedures were used to collect data and test predictors for the remaining PTMs. In most cases, recent modification data were downloaded from the PhosphoSitePlus database with care being taken to not include data used to train the predictors (when such information was known).
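The per-protein bookkeeping in this procedure (dropping sites that overlap a database, then drawing an equal number of lysine residues not known to be modified) can be sketched as follows. This is an illustration of the logic only, not the script used to build the chapter's test sets: the sequence, site positions and function name are hypothetical, and capping the sample at the number of available lysines is an added assumption.

```python
import random

def build_test_sites(sequence, study_sites, database_sites, rng):
    """Return (positive, negative) 1-based lysine positions for one protein."""
    # Positives: sites from the study that are absent from the database,
    # so that possible training data are kept out of the test set.
    positives = sorted(set(study_sites) - set(database_sites))
    # Negatives: lysines not known to be modified in either source.
    known = set(study_sites) | set(database_sites)
    candidates = [i + 1 for i, aa in enumerate(sequence)
                  if aa == "K" and i + 1 not in known]
    n = min(len(positives), len(candidates))  # assumption: cap at availability
    negatives = sorted(rng.sample(candidates, n))
    return positives, negatives

# Hypothetical protein with lysines at positions 2, 5, 8, 11, 14 and 17.
sequence = "MKAAKGGKLLKPPKEEK"
positives, negatives = build_test_sites(
    sequence, study_sites=[2, 5, 8], database_sites=[8], rng=random.Random(0))
```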
All testing data sets, as well as detailed results of these analyses, may be obtained from the author of this chapter.
Unfortunately, the tested values of sensitivity and/or specificity for the vast majority of web tools for lysine modification prediction were far below the self-reported values obtained from their respective papers (see Table 1). Figure 4 shows a graph of sensitivity against the false positive rate (1 − specificity) for each of the predictive tools at common predefined thresholds available to users online. It should be noted that, on such a graph, the diagonal represents a random predictor and the upper left-hand corner represents a perfect predictor (i.e. 100% sensitivity and 0% false positives). The only predictors with false positive rates useful to the experimental biologist (<10%) and a reasonable distance above the diagonal were scan-x and SUMOsp 2.0. Of the lysine PTMs with prediction tools available, only SUMOylation has a clearly defined motif, [ILVMF]KxE. Assessing the usefulness of SUMOsp 2.0 required comparing its performance relative to the standard SUMOylation motif as well as a more general form of the motif (KxE). While the [ILVMF]KxE motif had a false positive rate of 0%, its sensitivity (61.4%) was lower than that of SUMOsp 2.0 which obtained a sensitivity of 67.1% at a false positive rate of only 1.4% (under the ‘high’ threshold). However, the more general form of the SUMOylation motif, KxE, exceeded the performance of the SUMOsp 2.0 predictor under both low and high thresholds (see Figure 4), suggesting that, at present, the best way to predict SUMOylation sites in proteins is to simply search for KxE motifs.
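Searching a sequence for the two forms of the SUMOylation motif requires only a regular expression. The sketch below reports the 1-based position of the lysine in each match; the example sequence is hypothetical and the function name is an assumption.

```python
import re

# Lookarounds keep the match on the lysine itself, so that its position is
# reported and overlapping motifs are all found.
STRICT_MOTIF = re.compile(r"(?<=[ILVMF])K(?=.E)")  # [ILVMF]KxE
GENERAL_MOTIF = re.compile(r"K(?=.E)")             # KxE

def motif_sites(pattern, sequence):
    """Return 1-based positions of lysines that sit within the motif."""
    return [m.start() + 1 for m in pattern.finditer(sequence)]

sequence = "MALKTEAAKDEGGKLE"  # hypothetical protein fragment
general_hits = motif_sites(GENERAL_MOTIF, sequence)
strict_hits = motif_sites(STRICT_MOTIF, sequence)
```

In this fragment the general motif flags three lysines, whereas the strict motif keeps only the one preceded by a hydrophobic residue, illustrating the trade-off between sensitivity and false positives discussed above.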
Conclusions and perspectives
While the development of computational tools for the accurate prediction of lysine PTMs is a worthy goal given the importance of these modifications in modern molecular biology and the incomplete state of our present knowledge, the findings presented in this chapter of 11 online prediction tools spanning four lysine PTMs suggest that the majority of these tools perform no better than randomly picking lysine residues from proteins of interest and flipping a coin to predict their modification state. There are several major implications of this chapter. First, it is good practice to check the performance of any predictor with known data prior to using it in an experimental pipeline. Ideally, one should compile a set of at least ten positive and ten negative modification sites (preferably not used in the predictor's training set) to verify predictor performance. In general one should be wary of a predictor that reports high values of sensitivity and specificity while including only sequence-specific information in the prediction methodology. The actions of post-translational modifying enzymes (e.g. acetyltransferases, E3 ligases etc.) are highly spatially and temporally regulated; thus the addition of these non-sequence-based variables will probably be required to achieve sensitivities above 50%, while maintaining specificities relevant to the biologist. Secondly, researchers in the field of PTM prediction should be cautious regarding the test sets they use to derive their reported sensitivity and specificity values. Often it is best to divide data into three subsets. The first and second subsets may be used to train and optimize the predictor. Only after all optimization is complete should one use the third subset to determine the true sensitivity and specificity of the method.
Finally, the aforementioned issues associated with lysine PTM prediction should not discourage new researchers as they in fact suggest that fresh perspectives and clever ideas can lead to big improvements in this relatively young and exciting field!
• The modification state of lysine residues is important for understanding protein function at the molecular level.
• Experimental methodologies for the discovery of lysine PTMs have improved significantly, yet the vast majority of modified lysine residues remain to be discovered.
• Online computational tools for the prediction of lysine modifications, if accurate, have the potential to provide the first line of hypothesis generation to researchers studying protein function.
• Sensitivity and specificity are determined through a cross-validation procedure, and are the two most common metrics used to assess predictive tools.
• For the experimentalist, specificity is often more important than sensitivity.
• If performing cross-validation, care should be taken to only use training sets for algorithm optimization and only use testing sets for assessment of predictive performance.
• The majority of current online lysine PTM predictors perform no better than random.
• Before using an online predictive tool to develop experimental hypotheses, it is a good idea to perform a quick test with a known data set that was not used to train the predictor.
• Scanning a protein sequence for the motif KxE is the best way to predict potential SUMOylation sites.
• Given the poor performance of nearly all lysine PTM predictors, it is a good time to make a big dent in a small field!
I thank Natalie Schwartz and Joshua Lubner for their critical review of the chapter and Ahmet Mingir for his assistance with the creation of randomized PTM data sets used to assess predictor performance.
- © The Authors Journal compilation © 2012 Biochemical Society