Prediction of intrinsically disordered regions and protein binding sites within them

Sometimes you will find no reliable three-dimensional model of the protein you are interested in. There can be several reasons for this, such as the lack of suitable templates or the failure of fold recognition methods to identify them. In intrinsically disordered proteins or protein regions, the amino acid chain has a relatively flat energy landscape, so it samples a broad set of conformations that cannot be reasonably approximated by a single structure. Let’s have a look at the Genome3D entry for the human SNW domain-containing protein (also called nuclear protein SkiP) - please open this link in a new tab or page:  Q13573

The ‘Predicted 3D structure’ section in the ‘Annotation’ tab shows that only one prediction, made by the VIVACE pipeline, is available, and that it models just 80 of the 536 residues as a single long helix. Whatever the accuracy of this prediction, it clearly tells us little about the biological role of the protein.

In fact, SkiP appears to be largely disordered in vivo and the figure below highlights some distinguishing features. The golden line shows the average hydrophobicity score calculated over a window of 9 residues; positive values indicate more hydrophobic regions (that would promote folding), while negative values correspond to more hydrophilic segments. Charged amino acids are also highlighted with red (positively charged) and blue (negatively charged) vertical bars. The high proportion of hydrophilic and charged residues is typical of intrinsically disordered proteins and regions.
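
A profile like the golden line can be sketched in a few lines of Python. This is only an illustration: it assumes the Kyte-Doolittle hydropathy scale (the tutorial does not say which scale the figure uses), it treats only K/R as positively charged and D/E as negatively charged, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (an assumed scale; the figure may use another).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def windowed_hydrophobicity(sequence, window=9):
    """Return the mean hydropathy of every full window along the sequence.

    Element i is the average over residues i .. i+window-1 (0-based), so a
    sequence of length n yields n - window + 1 values (none if n < window).
    Non-standard residue letters raise a KeyError.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def charged_positions(sequence):
    """Return 1-based positions of (positive, negative) charged residues."""
    residues = list(enumerate(sequence.upper(), start=1))
    positive = [i for i, aa in residues if aa in "KR"]  # histidine left out on purpose
    negative = [i for i, aa in residues if aa in "DE"]
    return positive, negative
```

Positive window averages mark hydrophobic stretches and negative averages mark hydrophilic ones, matching the sign convention of the plot.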

Let’s use DISOPRED to analyze the amino acid sequence of SkiP, obtain residue-level disorder predictions and detect protein binding sites within the disordered regions. Click on the 'PSIPRED' button under the 'Associated tools' heading and you will be redirected to the submission form of the PSIPRED webserver. Please make sure you tick the 'DISOPRED3 and DISOPRED2' box for this tutorial. You are welcome to run the available tools against other proteins in the future, but we kindly ask you not to run other analyses during this tutorial due to the limited time available. Press 'Predict' to run the job.

The results summary page recaps the prediction output and maps it onto the input sequence as per the key. At the top of the page you can usually see the short identifier you provide for the job and the unique private ID assigned by the PSIPRED server. Below this, separate tabs allow you to view the specific outputs for each tool that was run. For the purpose of this tutorial, all analyses have been run in advance and the different tabs show the pre-calculated results; the identifier at the top of the page is therefore set to 'test_Q13573'.

The large majority of amino acids are predicted to be disordered (highlighted in red boxes), and some are also likely to bind other proteins (shown in green boxes). The 'DISOPRED' tab gives a more detailed graphical representation of the disorder predictions. The disorder profile plot shows the DISOPRED3 disorder confidence level at each sequence position as a solid blue line. The grey dashed horizontal line marks the threshold above which amino acids are regarded as disordered. For disordered residues, the orange line shows the confidence that the residue is involved in protein-protein interactions. Disordered amino acids are predicted to form protein binding sites when this confidence score is larger than 0.5. The 'Downloads' tab allows you to save the results of the analysis locally, in both graphical and text format.
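
The way the plot is read can be written down as a small sketch. It assumes you already have the two per-residue confidence profiles as Python lists; the function names are hypothetical, and the disorder threshold of 0.5 is an assumption (the text above only gives a numeric value for the binding threshold).

```python
from itertools import groupby


def classify_residues(disorder_conf, binding_conf,
                      disorder_threshold=0.5, binding_threshold=0.5):
    """Label each residue as 'ordered', 'disordered' or 'binding'.

    Binding is only considered for residues already called disordered.
    disorder_threshold=0.5 is an assumed default, not taken from the tutorial.
    """
    states = []
    for disorder, binding in zip(disorder_conf, binding_conf):
        if disorder <= disorder_threshold:
            states.append("ordered")
        elif binding > binding_threshold:
            states.append("binding")
        else:
            states.append("disordered")
    return states


def segments(states):
    """Collapse per-residue labels into (start, end, state) runs, 1-based."""
    result = []
    position = 1
    for state, run in groupby(states):
        length = len(list(run))
        result.append((position, position + length - 1, state))
        position += length
    return result
```

For example, `segments(classify_residues(disorder, binding))` turns the two curves into a list of contiguous regions that can be compared with the coloured boxes on the summary page.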

How reliable are these data? Using NMR spectroscopy, a recent study showed that positions 1-172 are disordered in isolation and that the segment spanning positions 59-79 folds upon binding the protein PPIL1 (Wang X et al. "A large intrinsically disordered region in SKIP and its disorder-order transition induced by PPIL1 binding revealed by NMR." J Biol Chem. 2009). DISOPRED correctly classifies approximately 65% of the 172 N-terminal disordered residues, which roughly mirrors the accuracy achieved in the independent CASP benchmarking experiments. The disordered protein binding site from position 59 to 79 is predicted with 38% precision. Because experimental data are lacking for the rest of the protein, we cannot make definitive statements about the quality of the other predicted disordered regions, but these appear to be consistent with common assumptions and with consensus data in external resources such as MobiDB and D2P2.
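
The two kinds of figure quoted above can be computed from sets of residue positions. The sketch below is illustrative only: it reads the 65% value as the fraction of experimentally disordered residues that were recovered (recall), which is an interpretation of the wording, and the toy positions are invented and are not DISOPRED output.

```python
def recall(predicted, reference):
    """Fraction of reference (experimental) positions that were predicted."""
    if not reference:
        return 0.0
    return len(predicted & reference) / len(reference)


def precision(predicted, reference):
    """Fraction of predicted positions that are in the reference set."""
    if not predicted:
        return 0.0
    return len(predicted & reference) / len(predicted)


# Toy data (invented): experimental site at positions 1-10,
# predicted site at positions 8-12, so three positions overlap.
reference_site = set(range(1, 11))
predicted_site = set(range(8, 13))

toy_recall = recall(predicted_site, reference_site)        # 3 of 10 recovered
toy_precision = precision(predicted_site, reference_site)  # 3 of 5 correct
```

Keeping the two measures separate matters here: a predictor can recover most of a disordered region while still placing a binding site imprecisely.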

Prediction of protein function from sequence (FFPred), and helical packing arrangement for transmembrane proteins (MEMPACK)

In order to try some other tools among those provided by the main PSIPRED server, let's now consider a very different protein.

Human Rhodopsin is one of the proteins responsible for the perception of light in our species. It is a transmembrane protein, belonging to the G-protein-coupled receptor (GPCR) family.

As usual, let's start by visiting the Genome3D page for Rhodopsin by opening this link in a new tab or page:  P08100

Click on the 'Annotations' tab. The page summarizes the structural information in the way we are already familiar with. Once again, we can retrieve the PSIPRED server home page by clicking on the 'PSIPRED' button under 'Associated tools' at the bottom of the page.

Notice how the PSIPRED server page has already been filled in with the protein's amino acid sequence. The tool is ready to run. As before, we will visit a cached result for this protein during this Workshop. In real usage, you need to remember to provide a short identifier for your PSIPRED jobs ('Short identifier for submission'); today, the identifier will be automatically set to 'test_P08100'. Please remember that you are kindly requested not to alter the submission sequence for the purpose of this tutorial - however, please feel free to use these tools as you wish in the future.

The PSIPRED suite contains a tool that predicts protein function directly from the amino acid sequence, using limited or possibly no homology information. This tool is called FFPred, and its latest version (v2.0) can be selected by ticking the corresponding checkbox under 'Choose Prediction Methods'. The suite also includes a tool for predicting the transmembrane helical packing arrangement (MEMPACK): please select this checkbox as well. You may also want to de-select the PSIPRED checkbox, which is usually ticked by default.

After clicking 'Predict' you are redirected to the cached result for this protein (actually running the job would be time-consuming). On the results page, the tabs we are interested in are those named after the tools we just mentioned.

The 'FFPred' tab shows the output of FFPred 2.0 for human Rhodopsin. The top section of the output includes two tables, one for each of the Gene Ontology (GO) domains 'Biological Process' and 'Molecular Function'. The tables list the "GO terms" from those domains that are predicted to apply to Rhodopsin; GO terms are the standard way to annotate proteins with their functional characterisation. Each line in the tables contains the GO term and its description, followed by the posterior probability of the prediction being correct and, finally, an indication of the overall reliability, high (H) or low (L), of that particular GO term.

To understand the output, we need to know in a bit more detail how FFPred makes its predictions. FFPred first analyses the input protein sequence (Rhodopsin in our case) by running a series of prediction tools and extracting biologically relevant "features" of the protein - for instance, the number of alpha helices (if any), the average hydrophobicity of the protein and many more. We'll look at these features more closely later on. Then, for each GO term in its vocabulary, FFPred runs a Support Vector Machine (SVM) on Rhodopsin's set of features, thereby comparing Rhodopsin with sets of proteins whose functional characterisation for that particular GO term is known. This allows FFPred to return the probability (shown in the tables) that Rhodopsin should be annotated with each GO term - the higher the probability, the "safer" the prediction.
Finally, GO terms that are in general much harder to predict, and are therefore only ever offered as speculation about possible functional characterisation, are flagged as low (L) reliability predictions on a red background. All other GO terms are considered highly reliable (H) and are always shown at the top of the table, regardless of the predicted probability. This simply means that "red" GO terms should always be considered less reliably predicted than the others, even when they are predicted with a high probability for this particular protein, as happens for "cellular protein modification process" in this case.
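
The overall logic (one classifier per GO term, each returning a probability, followed by a table in which H terms always come first) can be mimicked with a toy sketch. Everything numeric here is invented: the feature values, the weights and the simple linear-plus-logistic scorer are hypothetical stand-ins for FFPred's trained SVMs, used only to show the flow of data and the ordering rule.

```python
import math

# Hypothetical feature vector for one protein (values are made up).
features = {"n_tm_helices": 7.0, "mean_hydrophobicity": 0.6, "disorder_fraction": 0.1}

# One toy model per GO term: (weights, bias, reliability class).
# Real FFPred models are trained SVMs; these numbers are invented.
models = {
    "G-protein coupled receptor signaling pathway":
        ({"n_tm_helices": 0.5, "mean_hydrophobicity": 1.0}, -2.0, "H"),
    "detection of stimulus":
        ({"n_tm_helices": 0.3}, -1.5, "H"),
    "cellular protein modification process":
        ({"disorder_fraction": 2.0, "mean_hydrophobicity": 1.0}, 0.5, "L"),
}


def term_probability(features, weights, bias):
    """Toy scorer: linear score squashed to (0, 1) by a logistic function."""
    score = bias + sum(weights.get(name, 0.0) * value
                       for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def predict_table(features, models):
    """Return (term, probability, reliability) rows, H terms first,
    each reliability group sorted by decreasing probability."""
    rows = [(term, term_probability(features, weights, bias), reliability)
            for term, (weights, bias, reliability) in models.items()]
    rows.sort(key=lambda row: (row[2] != "H", -row[1]))
    return rows


table = predict_table(features, models)
```

In this toy run the L term receives a higher probability than one of the H terms but is still listed last, which is the behaviour described above for "red" GO terms.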

In our case, looking at highly reliable GO terms only, we can see that Rhodopsin is correctly predicted, with very high probability values, to be annotated with GO terms like "G-protein coupled receptor signaling pathway" and "detection of stimulus" (Biological Process), and "G-protein coupled receptor activity" and "signal transducer activity" (Molecular Function). This should not be surprising: Rhodopsin is a well-known example of a protein with these properties, and the corresponding SVMs have recognised this. Other predicted GO term annotations have lower probabilities, however, and can be interpreted as suggestions of further functional characterisation that may be tested in experimental work.

The bottom section of the 'FFPred' tab indicates which "features" of Rhodopsin's sequence have been used to obtain the predictions. Features range from structural features (remember, however, that no structural data is used: these are computationally predicted features) to disorder, post-translational modifications, PEST regions, amino acid composition and physico-chemical properties of the protein. Some of these are themselves predicted using our other tools - PSIPRED is used to predict secondary structure, DISOPRED is used for disorder and MEMSAT-SVM is used to predict transmembrane helices and their topology.
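
Some of the simpler sequence-derived features can be computed directly, as in the minimal sketch below. It is not FFPred's own feature extraction: the function names and the choice of D, E, K and R as the charged residues are assumptions made for illustration.

```python
from collections import Counter


def composition(sequence):
    """Return the fraction of each amino acid present in the sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    return {aa: counts[aa] / len(sequence) for aa in sorted(counts)}


def charged_fraction(sequence, charged="DEKR"):
    """Return the fraction of residues that are charged (assumed set: D, E, K, R)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(aa in charged for aa in sequence) / len(sequence)
```

Values like these are whole-protein summaries, which is why they can be fed to a classifier as a fixed-length feature vector regardless of sequence length.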

Lastly, let's focus on the transmembrane helices that are predicted for Rhodopsin. You can see the 7 predicted helices either in the cartoon on this tab, as mentioned in the previous paragraph, or by clicking the results tab for MEMSAT-SVM, which is the tool that was used to obtain this prediction. It is well known that Rhodopsin, as a GPCR, has 7 transmembrane helices - however, can we say any more about the arrangement of these helices in the lipid bilayer?

Clicking on the 'MEMPACK' results tab shows the prediction of the Rhodopsin helical packing arrangement made by this other tool within the PSIPRED server suite. The seven predicted helices are depicted in a diagram in the most likely predicted conformation. Lines connecting residues on different helices represent interactions that are thought to stabilise this conformation and make it the most favoured one.

This diagram can be downloaded directly by clicking on it; many more useful analysis files for Rhodopsin, including those containing FFPred output details, can be downloaded by visiting the 'Downloads' tab of these results pages.