MAFFTash
MAFFTash URL: http://sysimm.org/MAFFTash//
[Overview]
MAFFTash is a server that calculates multiple sequence alignments from sequences and structures. It combines two existing programs, MAFFT and ASH. ASH is a structural alignment program that uses an extension of the double dynamic programming algorithm to maximize the number of structurally equivalent residues between two proteins [1-3]. The pairwise structural alignments are then passed to MAFFT, a widely used multiple sequence alignment program [4-9]. MAFFT uses them to construct an overall multiple alignment that is as consistent as possible with the pairwise structural alignments. Sequence homologs with no structural information can also be included in the alignment.
[Usage]
To run MAFFTash you must provide a list of sequences and/or PDB and chain identifiers. The list may be pasted into the text area or uploaded from an external file. In either case, the sequences must be input in FASTA format, and the PDB and chain identifier must be joined as a string of length 5 (e.g. 1nagA). Each PDB and chain identifier line must be preceded by a line containing the string '>PDBID' and nothing else. For example:
>PDBID
3ygsC
>Q6Q899|DDX58_MOUSE| 1-91
MTAAQRQNLQAFRDYIKKILDPTYILSYMSSWLEDEEVQYIQAEKNNKGPMEAASLFLQY
LLKLQSEGWFQAFLDALYHAGYCGLCEAIES
>Q6Q899|DDX58_MOUSE| 101-176
EEHRLLLRRLEPEFKATVDPNDILSELSECLINQECEEIRQIRDTKGRMAGAEKMAECLI
RSDKENWPKVLQLALE
>PDBID
2p1hA
is valid input. Note that chain identifiers are now mandatory for all PDB entries. Whitespace (' '), dashes ('-'), and underscores ('_') are not acceptable chain identifiers. If you are uncertain about which chain IDs to use, please use PDBj Mine (the PDBj search engine). Type in your PDB ID, then click on 'sequence information (FASTA format)'. You will see the PDB sequence for each chain in FASTA format. Note also that MAFFTash provides a tool for automatically picking up a set of PDB IDs, given a set of (FASTA-formatted) sequences. To use this feature, click 'Prep-MAFFTash' under the Example on the MAFFTash top page.
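The input rules above can be checked before submitting a job. The following is a minimal sketch, not part of MAFFTash itself: the function names are hypothetical, and it assumes that a PDB ID is four alphanumeric characters followed by a one-character chain ID.

```python
import re

# Assumption: 4 alphanumeric characters for the PDB ID, then 1 chain character.
# Per the rules above, the chain may not be whitespace, '-' or '_'.
PDB_CHAIN = re.compile(r"[0-9A-Za-z]{4}[^\s_-]")


def is_valid_pdb_chain(identifier):
    """Return True if identifier is a 5-character PDB+chain string."""
    return PDB_CHAIN.fullmatch(identifier) is not None


def build_mafftash_input(pdb_chains, sequences, width=60):
    """Assemble MAFFTash-style input text.

    pdb_chains: list of 5-character strings such as '1nagA'.
    sequences:  list of (header, sequence) pairs, header without '>'.
    """
    lines = []
    for ident in pdb_chains:
        if not is_valid_pdb_chain(ident):
            raise ValueError(f"invalid PDB/chain identifier: {ident!r}")
        # Each identifier is preceded by a line holding only '>PDBID'.
        lines += [">PDBID", ident]
    for header, seq in sequences:
        lines.append(">" + header)
        # Wrap the sequence at a fixed width, as in ordinary FASTA files.
        lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"
```

The resulting text can be pasted into the text area or saved and uploaded as a file.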
You are not limited to PDB entries and may provide your own PDB-formatted structures. To upload your own structures, first specify the number of files to be uploaded and a new form will be generated. The ‘Structure weight’ (default value 0.2) controls how much influence ASH has on the MAFFT alignment. You may need to experiment with different values, depending on the ratio of structures to sequences.
[Methods]
MAFFTash works by first aligning all pairs of structures using a modified version of the program ASH, then extracting the aligned residue pairs and constructing a multiple sequence alignment of all sequences with a reward for the structurally aligned residue pairs.
Structural alignment
ASH was modified so that each structure is first partitioned into domains using Protein Domain Parser, then all pairs of domains are aligned using conventional ASH. Finally, a complete pairwise alignment of the whole structure is formed from a dynamic programming calculation constructed from the complete set of domain-domain alignments. In this way, the ASH alignment is 'rigid' within domains but 'flexible' between domains.
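To give a feel for how domain-domain results might be combined by dynamic programming, here is a toy sketch. It is not the ASH implementation: it assumes that each domain pair has been reduced to a single hypothetical similarity score and that the chosen domain pairs must preserve domain order in both structures.

```python
def chain_domain_alignments(scores):
    """Pick an order-preserving set of domain pairs with maximal total score.

    scores[i][j] is a hypothetical similarity score for aligning domain i of
    structure A with domain j of structure B.
    Returns (total_score, list of (i, j) pairs).
    """
    n = len(scores)
    m = len(scores[0]) if n else 0
    # best[i][j]: best total using the first i domains of A and first j of B.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],                              # skip domain i of A
                best[i][j - 1],                              # skip domain j of B
                best[i - 1][j - 1] + scores[i - 1][j - 1],   # pair them
            )
    # Trace back to recover which domain pairs were chosen.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    return best[n][m], pairs
```

In this picture each selected pair contributes its own rigid domain-level alignment, while the freedom to skip or re-pair domains is what makes the result flexible between domains.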
Sequence alignment
A multiple sequence alignment is computed using a modified version of the program MAFFT.
[Prep-MAFFTash]
MAFFTash provides a tool for automatically preparing valid MAFFTash input from a limited set of sequences or PDB IDs. To use this feature, click Prep-MAFFTash under the Example on the MAFFTash top page.
The Prep-MAFFTash entry form looks like the MAFFTash page. There is a text window where sequences and/or PDB IDs can be pasted; however, there are a number of additional options. These are grouped into three sections:
1. Add structures. This feature uses BLAST to search the PDB, using any input you type in the text box as a query. Three parameters control what Prep-MAFFTash retrieves:
- a. Max seq ID between added structures (default 90%). This parameter prevents many instances of a particular structure from being retrieved. If you want fewer structures, lower the value; if you want more, increase it; using 100 will add all PDB entries that are homologous to your input. The pruning of sequences is performed using the program cd-hit [10].
- b. Min seq ID from original input (default 20%). This parameter controls what BLAST considers a sequence homolog. Increasing this parameter will reduce the number of PDB entries retrieved; decreasing it will increase the number retrieved. However, an internal parameter prevents PDB entries with e-values > 0.01 from being included.
- c. Min coverage of original input (default 50%). This parameter determines how much of the query sequence a particular PDB entry must ‘cover’. Ideally, the structure would cover all or most of the query; if it does not, you might consider breaking your query sequences into domains.
2. Add ASH structural neighbors. This feature allows you to pull in structural homologs to your query sequence(s). We maintain a database of ASH structural alignments. If one or more of your queries can be matched to one or more of the structures for which pre-computed alignments are available, the list of structural ‘neighbors’ can be added subject to the following constraints:
- a. Max seq ID between added structures (default 90%). This parameter is analogous to 1.a (above) except that it applies to the ASH structural neighbors.
- b. Min seq ID from original input (default 0). This parameter is analogous to 1.b (above) except that it applies to the ASH structural neighbors.
3. Add sequences. This feature allows you to pull in sequences from the UniRef database. The options are similar to those above.
- a. Max seq ID between added sequences (default 90%). This option is analogous to 1.a (above) except that it applies to the UniRef100 sequences. Be careful about making this too large as there are potentially many homologous sequences.
- b. Min seq ID from original input (default 0). This parameter is analogous to 1.b (above) except that it applies to the UniRef100 sequences. Again, be careful about adding too many sequences, unless you are sure that is what you want.
The output of Prep-MAFFTash is just a MAFFTash-formatted input file. You can paste it into the MAFFTash text window or upload it as a file.
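The 'Add structures' thresholds described above act as simple filters on search hits. The sketch below illustrates the idea only and is not the Prep-MAFFTash code: the `Hit` record and `filter_hits` function are hypothetical, the e-value rule is assumed to exclude hits with e-values above 0.01, and the cd-hit redundancy step (Max seq ID between added structures) is not reproduced.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A hypothetical summary of one BLAST hit against the PDB."""
    pdb_chain: str           # e.g. '1nagA'
    percent_identity: float  # sequence identity to the query, 0-100
    query_coverage: float    # percent of the query covered by the hit, 0-100
    evalue: float


def filter_hits(hits, min_seq_id=20.0, min_coverage=50.0, max_evalue=0.01):
    """Keep hits that pass the identity, coverage and e-value thresholds.

    The defaults mirror the documented defaults: Min seq ID 20%,
    Min coverage 50%, and an internal e-value cutoff of 0.01.
    """
    return [
        h for h in hits
        if h.percent_identity >= min_seq_id
        and h.query_coverage >= min_coverage
        and h.evalue <= max_evalue
    ]
```

Raising `min_seq_id` or `min_coverage` shrinks the list of retrieved entries, and lowering them enlarges it, which matches the behaviour described for options 1.b and 1.c.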
[MAFFTash Output]
MAFFTash will send an email containing a link to your results. The results consist of a FASTA-formatted multiple sequence alignment (a text file) and a Jalview [11] link from which you can view the multiple sequence alignment in your web browser.
Figure 1. MAFFTash alignment viewed through
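If you want to process the alignment text file yourself rather than view it in Jalview, a FASTA-formatted alignment is straightforward to read. The following is a minimal sketch with a hypothetical function name; it assumes the usual conventions that header lines start with '>' and that every aligned row has the same length once its lines are joined.

```python
def read_fasta_alignment(text):
    """Parse FASTA-formatted alignment text into a list of (header, row)."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    # All rows of an alignment should have the same number of columns.
    if len({len(row) for _, row in records}) > 1:
        raise ValueError("rows differ in length; not a valid alignment")
    return records
```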
[References]
1. Standley, Toh, Nakamura. ASH structure alignment package: sensitivity and selectivity in domain classification. BMC Bioinformatics 8 (4), 116 (2007)
2. Standley, Toh, Nakamura. GASH: an improved algorithm for maximizing the number of equivalent residues between two protein structures. BMC Bioinformatics 6, 221 (2005)
3. Standley, Toh, Nakamura. Detecting local structural similarity in proteins by maximizing number of equivalent residues. Proteins 57 (2), 381-91 (2004)
4. Katoh, Asimenos, Toh. Multiple Alignment of DNA Sequences with MAFFT. In Bioinformatics for DNA Sequence Analysis, edited by D. Posada, Methods in Molecular Biology 537, 39-64 (2009)
5. Katoh, Toh. Improved accuracy of multiple ncRNA alignment by incorporating structural information into a MAFFT-based framework. BMC Bioinformatics 9, 212 (2008)
6. Katoh, Toh. Recent developments in the MAFFT multiple sequence alignment program. Briefings in Bioinformatics 9, 286-298 (2008)
7. Katoh, Toh. PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23, 372-374 (2007)
8. Katoh, Kuma, Toh, Miyata. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33, 511-518 (2005)
9. Katoh, Misawa, Kuma, Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059-3066 (2002)
10. Li, Jaroszewski, Godzik. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17, 282-283 (2001)
11. Waterhouse, Procter, Martin, Clamp, Barton. Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics 25 (9), 1189-1191 (2009)