Gene3D provides predicted CATH structural domain annotations for major protein sequence databases and sequenced genomes, using an accurate and sensitive automated homologue recognition protocol.

Structural representatives (S-reps) from CATH superfamilies are selected as seeds for an iterative search process to identify homologues. On average each superfamily has four representatives, though some have many more and many have only a single representative. The resulting sequences are aligned using MAFFT and refined using an internal protocol, and each resulting multi-sequence profile is used to construct a Hidden Markov Model (HMM) using the HMMER3 package. Query sequences are then searched against this HMM library to produce a set of potential domain assignments. The potentially complex set of overlapping matches is then resolved into a single domain architecture with confident domain boundaries using an in-house protocol, DomainFinder. DomainFinder represents the matches as nodes in a network weighted by E-value and selects the maximally weighted clique. In essence, this clique is the set of matches that gives the most likely combination and positions of domains predicted by the model library.
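
The resolution step described above can be illustrated with a small sketch. This is not the DomainFinder implementation, whose details are not given here; it is a minimal brute-force example under stated assumptions: two matches are treated as compatible when their residue ranges do not overlap at all, each match is weighted by -log10 of its E-value, and the chosen architecture is the set of mutually compatible matches with the largest total weight. The names Match, weight, compatible and best_architecture are hypothetical and exist only for this illustration.

```python
import math
from typing import List, NamedTuple


class Match(NamedTuple):
    model: str     # hypothetical HMM / superfamily identifier
    start: int     # first residue covered by the match (inclusive)
    end: int       # last residue covered by the match (inclusive)
    evalue: float  # E-value reported by the HMM search


def weight(m: Match) -> float:
    # Assumption: smaller E-values mean larger weights.
    # Clamp to avoid log10(0) for extremely significant matches.
    return -math.log10(max(m.evalue, 1e-300))


def compatible(a: Match, b: Match) -> bool:
    # Assumption: matches are compatible only if they share no residues.
    return a.end < b.start or b.end < a.start


def best_architecture(matches: List[Match]) -> List[Match]:
    """Return the maximum-weight set of mutually compatible matches.

    Every such set is a clique in the graph whose edges join compatible
    matches. The search enumerates all cliques, so it is exponential in
    the worst case and only suitable for small inputs.
    """
    best: List[Match] = []
    best_w = 0.0

    def extend(chosen: List[Match], candidates: List[Match], w: float) -> None:
        nonlocal best, best_w
        if w > best_w:
            best, best_w = list(chosen), w
        for i, m in enumerate(candidates):
            # Keep only later candidates compatible with the new member.
            rest = [c for c in candidates[i + 1:] if compatible(m, c)]
            extend(chosen + [m], rest, w + weight(m))

    extend([], list(matches), 0.0)
    return sorted(best, key=lambda m: m.start)


if __name__ == "__main__":
    # Made-up matches for illustration only.
    hits = [
        Match("A", 1, 100, 1e-30),
        Match("B", 50, 150, 1e-10),
        Match("C", 120, 220, 1e-25),
        Match("D", 90, 130, 1e-5),
    ]
    for m in best_architecture(hits):
        print(m.model, m.start, m.end, m.evalue)
```

With the made-up matches in the usage example, A and C are the only pair that do not overlap, and their combined weight exceeds that of any single match, so the sketch selects A followed by C. A production protocol would likely tolerate small overlaps and use a more efficient search; both are outside the scope of this sketch.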

Group Leader: Christine Orengo