Overview - Flexible Structural Neighborhood (FSN) by FATCAT
Want to see a simple workflow of FSN server first?
The FATCAT Structure Neighborhood server includes a database of precomputed alignments between PDB proteins,
as calculated by FATCAT, a flexible protein structure alignment program.
The database, searchable by PDB code,
provides a list of proteins with statistically significant structural similarity to the query;
on lower menu levels it provides detailed alignments, interactive superposition of structures, and the positions of hinges
that were identified in the comparison.
Protein Structure Database and the Framework
Many PDB proteins consist of multiple domains.
Protein domains are considered to provide more detailed functional information than protein chains.
However, domain annotation of PDB proteins usually lags far behind PDB updates.
To keep our server both up to date and as informative as possible for functional annotation,
our protein structure database combines protein structures at the chain level and at the domain level.
We use SCOP,
one of the most widely recognized protein structure classification databases, to define protein domains.
The framework of the FATCAT Structure Neighborhood Server is shown in Figure 1.
All proteins in PDB are divided into two groups, new PDB entries and PDB entries annotated by SCOP,
based on whether the entry is annotated by SCOP.
A pool of domain-level protein structures is constructed from the PDB entries that are annotated by SCOP,
excluding PDB chains in which no domains are identified.
Protein domains from SCOP classes h, i, j and k (which are not true classes) are also excluded from the structure database.
Chain-level protein structures, on the other hand, include only protein chains from new PDB entries that are NOT annotated by SCOP,
except protein chains that are defined as trivial. Thus, for any PDB protein identified by a PDB code,
its structures are
included in the searchable database either in domain format if the protein is in SCOP, or in chain format otherwise, but not in both.
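The partitioning rules above can be sketched in a few lines of Python. This is only an illustration: the `PdbEntry` and `Chain` records, their field names, and the identifiers in the usage example are hypothetical and are not taken from the server's implementation.

```python
from dataclasses import dataclass, field

# SCOP classes h, i, j and k are excluded by the text above ("not true classes").
EXCLUDED_SCOP_CLASSES = {"h", "i", "j", "k"}


@dataclass
class Chain:
    chain_id: str
    is_trivial: bool = False
    # (domain id, SCOP class letter) pairs; empty if no domain was identified.
    domains: list = field(default_factory=list)


@dataclass
class PdbEntry:
    pdb_id: str
    in_scop: bool
    chains: list


def build_pool(entries):
    """Return (domain_pool, chain_pool) following the rules described above."""
    domain_pool, chain_pool = [], []
    for entry in entries:
        for chain in entry.chains:
            if entry.in_scop:
                # SCOP-annotated entries contribute domains only; chains
                # without identified domains contribute nothing.
                for domain_id, scop_class in chain.domains:
                    if scop_class not in EXCLUDED_SCOP_CLASSES:
                        domain_pool.append(domain_id)
            elif not chain.is_trivial:
                # New entries contribute their non-trivial chains only.
                chain_pool.append(f"{entry.pdb_id}:{chain.chain_id}")
    return domain_pool, chain_pool


# Usage with made-up identifiers.
example = [
    PdbEntry("1abc", True, [Chain("A", domains=[("d1abca1", "a"), ("d1abca2", "k")]),
                            Chain("B")]),
    PdbEntry("2xyz", False, [Chain("A"), Chain("B", is_trivial=True)]),
]
domains, chains = build_pool(example)
```

Because each entry takes exactly one branch, a given PDB code can only ever appear in one of the two pools, which matches the "but not in both" rule.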
The protein structure pool constructed above, which includes both protein domains and protein chains, is very large.
Based on PDB as of April 25, 2005 and SCOP 1.67, there are 61,775 protein domain structures and 12,126 new protein chains.
It is neither practical nor necessary to calculate structure comparisons between all of these structures (73,901),
because most of these proteins are redundant.
Therefore, we calculated structure comparisons only for a representative set (selected chains + selected domains in Figure 1)
of the protein structure pool at 90% sequence identity.
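To give a sense of scale, the short calculation below counts the unordered pairs that an all-by-all comparison of the full pool would require, using only the two counts quoted above. The result is roughly 2.7 billion pairwise comparisons.

```python
n_domains = 61_775   # protein domain structures (SCOP 1.67)
n_chains = 12_126    # new protein chains
n_total = n_domains + n_chains

# Number of unordered pairs in an all-by-all comparison: n * (n - 1) / 2
n_pairs = n_total * (n_total - 1) // 2

print(n_total)
print(n_pairs)
```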
The structural neighborhood of each remaining structure (filtered chains + filtered domains in Figure 1) is obtained
through its representative structure. Specifically, the selected domains and filtered domains are generated by
clustering the SCOP protein domain structures at 90% sequence identity.
The selected chains from new PDB entries are generated in two steps.
First, BLAST is used to filter out the new PDB chains
that have ≥ 90% sequence identity with any PDB chain annotated by SCOP.
Second, CD-HIT is used to cluster the remaining new PDB chains at 90% sequence identity.
Therefore, the selected chains have < 90% sequence identity to one another and to the PDB chains that are in SCOP.
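The two-step selection can be illustrated with a small, self-contained sketch. It does not call BLAST or CD-HIT: the `percent_identity` function is a toy stand-in that compares ungapped positions, and the greedy loop is a simplified substitute for real clustering. All function names and the sample sequences are assumptions made for illustration.

```python
def percent_identity(a, b):
    """Toy identity: percent of matching positions over the shorter sequence.

    Assumption for illustration only; BLAST and CD-HIT derive identity
    from real sequence alignments.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


def select_new_chains(new_chains, scop_chains, threshold=90.0,
                      identity=percent_identity):
    """Two-step selection of representative new chains.

    new_chains, scop_chains: dicts mapping a chain id to its sequence.
    Returns the ids of the selected (representative) new chains.
    """
    # Step 1: drop new chains that reach the threshold against any SCOP chain.
    remaining = {
        cid: seq
        for cid, seq in new_chains.items()
        if all(identity(seq, s) < threshold for s in scop_chains.values())
    }

    # Step 2: greedy clustering, longest sequence first. A chain becomes a
    # representative only if it stays below the threshold against every
    # representative chosen so far.
    representatives = []
    for cid, seq in sorted(remaining.items(), key=lambda kv: -len(kv[1])):
        if all(identity(seq, remaining[r]) < threshold for r in representatives):
            representatives.append(cid)
    return representatives


# Usage with made-up sequences.
scop = {"s1": "AAAAAAAAAA"}
new = {"n1": "AAAAAAAAAA", "n2": "CCCCCCCCCCCC", "n3": "CCCCCCCCCC", "n4": "GGGGGGGGGG"}
selected = select_new_chains(new, scop)
```

In this toy input, `n1` is removed in step 1 because it matches a SCOP chain, and `n3` is absorbed by the longer `n2` in step 2, so the selected chains are below the threshold both to one another and to the SCOP chains.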
An all-by-all flexible structure comparison is calculated using FATCAT on all structures in this
representative set, including both selected domains and selected chains. The set of selected chains is called
NEWPDB_90 and the set of selected domains is called SCOP1XX_90 (1XX: SCOP version number)
in the database selection field of the
server's page. The structural homologs found by the server for any given protein come from SCOP1XX_90,
NEWPDB_90 or both, depending on which set of structures the user chooses.
The FSN server accepts either a protein (PDB ID) or a domain (SCOP ID) as a query
(Figure 3: A). In the former case, the server
first displays information on the chains and domains of the given protein. Users can then retrieve
similar structures for a domain (if domain information is available, i.e. the protein is included in SCOP),
or for a chain otherwise. These are the two main stages of the output for a query: the first stage
displays the protein structure summary page and the second stage displays the structure neighborhood page.
However, when the query is a domain, the first stage is skipped and the structure neighborhood page is displayed
directly. To see other chains and domains that share the same PDB ID as the query domain,
the user has to enter the PDB ID of this domain on the FSN homepage.
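The two-stage behavior can be summarized with a small dispatch sketch. The identifier patterns below are assumptions that are not stated in the text: a four-character PDB ID starting with a digit, and a seven-character SCOP domain identifier of the form `d` + PDB ID + chain character + domain character. The function names are hypothetical and do not describe the server's code.

```python
import re

# Assumed identifier formats (not specified in the text above).
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")
SCOP_ID = re.compile(r"^d[0-9][a-z0-9]{3}[a-z0-9_.][a-z0-9_]$")


def first_page_for(query):
    """Return which page a query would land on first."""
    q = query.strip()
    if PDB_ID.match(q):
        # Protein query: stage one (summary) comes first.
        return "structure summary"
    if SCOP_ID.match(q.lower()):
        # Domain query: stage one is skipped.
        return "structure neighborhood"
    raise ValueError(f"not a recognized PDB or SCOP identifier: {query!r}")


def pdb_id_of_domain(scop_id):
    """Extract the PDB ID embedded in an assumed SCOP domain identifier."""
    return scop_id.strip().lower()[1:5]


# Usage with made-up identifiers.
page_for_protein = first_page_for("1abc")
page_for_domain = first_page_for("d1abca1")
parent = pdb_id_of_domain("d1abca1")
```

A helper such as `pdb_id_of_domain` mirrors the manual step described above: to see sibling chains and domains of a query domain, one resubmits its PDB ID.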
The protein structure summary page
(Figure 3: B) graphically displays the constituent chains and domains of a
protein identified by a PDB code and the relationships among them, such as identical chains and
domain composition along each chain's primary structure, especially for domains that are composed of multiple
fragments. Moreover, for PDB proteins that are not annotated by SCOP, the page shows the predicted domain
composition if they are homologous to PDB proteins in SCOP at 90% sequence identity.
In the structure neighborhood page
(Figure 3: C), a list of structures similar to the query is displayed in increasing
order of P-values. By clicking on the
one can interactively view the details of an individual alignment
(Figure 3: E).
Users can retrieve a customized set of structural homologs by changing alignment parameters,
such as the P-value cutoff and the number of hinges. For example, one may want to study only the structural homologs that
have a certain number of hinges relative to the query structure. The page also provides a graphical view of the
flexibility of a structure as compared with its structural homologs
(Figure 3: D). Users can also download
and save the results in a variety of formats.
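The kind of customization described above amounts to filtering and sorting a list of hits. The sketch below assumes a hypothetical `Neighbor` record with a P-value and a hinge count; the field names, parameter names, and sample values are illustrative only and are not the server's data model.

```python
from dataclasses import dataclass


@dataclass
class Neighbor:
    structure_id: str
    p_value: float
    hinges: int


def filter_neighbors(neighbors, max_p_value=0.05, min_hinges=0, max_hinges=None):
    """Keep hits within the P-value cutoff and hinge range, best P-value first."""
    kept = [
        n for n in neighbors
        if n.p_value <= max_p_value
        and n.hinges >= min_hinges
        and (max_hinges is None or n.hinges <= max_hinges)
    ]
    # The neighborhood page lists hits in increasing order of P-value.
    return sorted(kept, key=lambda n: n.p_value)


# Usage with made-up hits: keep only flexible matches (at least one hinge).
hits = [
    Neighbor("hitA", 1e-6, 0),
    Neighbor("hitB", 3e-4, 2),
    Neighbor("hitC", 0.2, 1),
    Neighbor("hitD", 1e-5, 1),
]
flexible = filter_neighbors(hits, max_p_value=0.05, min_hinges=1)
```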
Statistical Estimation of the Structure Similarity
The suggested significance threshold for flexible alignments is P-value ≤ 0.05 [Ye and Godzik].
However, at that P-value flexible FATCAT is more sensitive but less accurate for all-alpha proteins than for all-beta proteins.
Proteins with both alpha and beta elements fall in between
(Figure 4). Therefore, we suggest
a stricter significance threshold of P-value ≤ 0.01 for all-alpha proteins,
and a threshold of P-value ≤ 0.05 for other proteins.
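The suggested thresholds can be written as a small helper. The sketch assumes the structural class is given as a SCOP class letter, where `a` denotes all-alpha proteins. Whether the class of the query or of the hit should drive the choice is not stated above, so the function simply takes a class letter; its name is hypothetical.

```python
def significance_threshold(scop_class):
    """Suggested P-value cutoff: stricter for all-alpha proteins (SCOP class 'a')."""
    return 0.01 if scop_class.lower() == "a" else 0.05


def is_significant(p_value, scop_class):
    """True if the P-value meets the suggested cutoff for this class."""
    return p_value <= significance_threshold(scop_class)


# Usage: the same P-value can pass for one class and fail for another.
alpha_result = is_significant(0.03, "a")
beta_result = is_significant(0.03, "b")
```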
"trivial chain": A protein chain is considered trivial if it has fewer than 30 aa,
or if the number of non-standard amino acids is less than 60% of its length and the number of standard amino acids is less than 100 aa.
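The definition can be turned into a predicate as follows. This sketch follows the wording of the definition literally, including the condition on non-standard amino acids exactly as stated. It assumes chains are given as one-letter sequences in which any character outside the 20 standard amino acids counts as non-standard; the function name is hypothetical.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_trivial_chain(sequence):
    """Apply the 'trivial chain' definition literally, as worded above."""
    length = len(sequence)
    n_standard = sum(1 for residue in sequence if residue in STANDARD_AMINO_ACIDS)
    n_nonstandard = length - n_standard

    # Rule 1: fewer than 30 residues.
    if length < 30:
        return True
    # Rule 2: non-standard count below 60% of the length AND
    # standard count below 100, exactly as the definition states.
    return n_nonstandard < 0.6 * length and n_standard < 100


# Usage with made-up sequences.
short_chain = is_trivial_chain("A" * 29)
long_chain = is_trivial_chain("A" * 150)
```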