The number of sequences generated by genome projects far exceeds our ability to manually assign them meaningful biological annotation. Automated methods are essential to analyze the flood of "unknown" or "hypothetical" sequences in a reasonable time frame. A frequent assumption facilitating automated analysis is that a sequence has the same function as its closest relative. Using best BLAST hits to find these close relatives may be a viable option in many cases; however, it has been shown that the best BLAST hit is not necessarily the closest sequence relative (Koski & Golding 2001), which casts doubt on the reliability of this approach. Calculating a phylogeny for each sequence and using the trees to find the closest relatives is a more robust, if computationally intensive, approach.
A good alignment is the basis for a good phylogeny, as misaligned regions, incorrect gap placement or an unfortunate selection of sequence representatives can lead to erroneous trees. Sequence selection and alignment therefore seem to us the most critical steps on the path from starting sequence to phylogeny. When producing alignments it is necessary to decide between aligning full-length sequences and aligning only conserved regions for which sequence similarity, presumably due to shared descent, is unambiguously determinable. Using conserved regions only may cause remotely related but still alignable regions to be missed, but it also greatly reduces the likelihood of aligning nonhomologous regions, as may be found in multidomain, fused or circularly permuted proteins.
BLAST and PSI-BLAST are currently the methods of choice for local sequence similarity detection, as these programs are fast, reliable and sensitive. Unfortunately, the alignments they generate are subject to excessive and inconsistent gapping, caused by converting pairwise local alignments into multiple sequence alignments.
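To illustrate where this gapping comes from, the following is a minimal sketch of query-anchored stacking, in which each pairwise local alignment is projected onto the query coordinates. It is not BLAST's own code; the `HSP` record, its field names and the function `stack_hsps` are hypothetical and chosen only for this example. It assumes 0-based query start positions and local alignments that do not begin with a gap.

```python
from collections import namedtuple

# Hypothetical record for one pairwise local alignment (illustrative fields).
HSP = namedtuple("HSP", "subject q_start q_aln s_aln")


def stack_hsps(query, hsps):
    """Project pairwise HSPs onto the query to form a multiple alignment."""
    max_ins = [0] * len(query)   # longest insertion after each query residue
    parsed = []
    for h in hsps:
        matched, inserted = {}, {}
        pos = h.q_start - 1      # index of the last query residue consumed
        for q_char, s_char in zip(h.q_aln, h.s_aln):
            if q_char == "-":    # subject residue with no query partner
                inserted[pos] = inserted.get(pos, "") + s_char
            else:
                pos += 1
                matched[pos] = s_char   # "-" here means a deletion
        for i, ins in inserted.items():
            if 0 <= i < len(query):
                max_ins[i] = max(max_ins[i], len(ins))
        parsed.append((h.subject, matched, inserted))

    def row(matched, inserted):
        # Every row is padded to the widest insertion seen at each position.
        return "".join(
            matched.get(i, "-") + inserted.get(i, "").ljust(max_ins[i], "-")
            for i in range(len(query))
        )

    rows = [("query", row(dict(enumerate(query)), {}))]
    rows += [(name, row(m, ins)) for name, m, ins in parsed]
    return rows


query = "MKTAYIAK"
hsps = [
    HSP("seqA", 0, "MKTAYIAK", "MKSAYLAK"),
    HSP("seqB", 2, "TA--YIA", "TAGGYVA"),   # two-residue insertion in seqB
]
rows = stack_hsps(query, hsps)
for name, aligned in rows:
    print(f"{name:6s} {aligned}")
```

In this toy input the two-residue insertion in `seqB` opens two gap columns in the query and in `seqA`, although neither of those pairwise alignments contained a gap. With many hits, each carrying insertions at slightly different positions, such columns accumulate, and residues inserted at the same query position are placed side by side without ever having been compared to each other.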
We have implemented a few post-processing steps that alleviate the above-mentioned problems. First, full-length sequences for all HSPs (high-scoring segment pairs) are extracted. Then the HSPs with E-values better than a specified cutoff are converted to a multiple alignment. To remedy the excessive and inconsistent gapping that this step generates, the gapped regions in the resulting alignment are realigned using a global alignment program (here ClustalW).
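The realignment step can be sketched as follows. How a "gapped region" is delimited is not specified here, so this example simply assumes that it is a maximal run of alignment columns in which at least one sequence has a gap. The function names `gapped_blocks` and `residues_to_realign` are hypothetical, and the call to the global alignment program is deliberately left out.

```python
def gapped_blocks(rows):
    """Return half-open column ranges (start, end) that contain any gap."""
    width = len(rows[0])
    has_gap = [any(r[c] == "-" for r in rows) for c in range(width)]
    blocks, start = [], None
    for col, gap in enumerate(has_gap):
        if gap and start is None:
            start = col                    # a gapped run begins
        elif not gap and start is not None:
            blocks.append((start, col))    # the run ended before this column
            start = None
    if start is not None:                  # run extends to the last column
        blocks.append((start, width))
    return blocks


def residues_to_realign(rows, block):
    """Strip gaps from one block, giving the input for a global aligner."""
    start, end = block
    return [r[start:end].replace("-", "") for r in rows]


# Toy alignment; all rows must have the same length.
alignment = [
    "MKTA--YIAK",
    "MKSA--YLAK",
    "MKTAGGYVAK",
    "MK-AG-YIAK",
]
blocks = gapped_blocks(alignment)
segments = [residues_to_realign(alignment, b) for b in blocks]
print(blocks)
print(segments)
```

Each list of ungapped segments would then be passed to the global aligner, and the realigned block spliced back between the untouched flanking columns. Leading and trailing gaps that merely reflect where a local hit starts and ends would also be reported by this sketch and would need separate handling in practice.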
To increase sensitivity and better define alignable sequence regions, a profile hidden Markov model (HMM) search against the full-length sequences extracted in the first step can be performed. Generating an HMM is beneficial for two reasons: (1) assigning insert states to highly variable regions removes most of the edge-creep effect present in PSI-BLAST searches; (2) the increased sensitivity of profile search methods can recover sequence regions that a simple BLAST search missed. Once all alignments are generated, phylogenies can be inferred by any tree construction program that produces trees in Newick (New Hampshire bracket) format.
A large repository of phylogenetic trees is of little use unless there is a way to separate relevant from irrelevant data for the question at hand. We provide a tool that reduces the number of trees that have to be examined manually by extracting from the database all 'interesting' trees, i.e. all those containing specific topological features (see the README for PHAT (Phylome Analysis Tool) included in the PhyloGenie package).
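As an illustration of what filtering by a topological feature can look like, the sketch below parses compact Newick strings and keeps only those trees in which all sequences in the sister group of a query belong to a given set. This is not PHAT's query syntax, which is documented in the README. The feature, the function names and the leaf names are hypothetical, the parser ignores branch lengths, internal labels and quoted names, and each tree is treated as rooted exactly as written.

```python
def parse_newick(text):
    """Parse a compact Newick string into nested lists of leaf names."""
    text = text.strip().rstrip(";")
    pos = 0

    def label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos].split(":")[0].strip()  # drop branch length

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                      # consume ")"
            label()                       # skip internal label and length
            return children
        return label()

    return node()


def leaves(tree):
    """Collect all leaf names below a node."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def sister_leaves(tree, query):
    """Leaves of the query's sibling subtrees, or None if query is absent."""
    if isinstance(tree, str):
        return None
    if query in tree:                     # query is a direct child here
        return set().union(*(leaves(c) for c in tree if c != query))
    for child in tree:
        found = sister_leaves(child, query)
        if found is not None:
            return found
    return None


def is_interesting(newick, query, group):
    """True if the query has sisters and all of them are in `group`."""
    sisters = sister_leaves(parse_newick(newick), query)
    return bool(sisters) and sisters <= group


trees = {
    "fam1": "((query:0.1,arch1:0.2):0.05,(bact1:0.3,bact2:0.1):0.2);",
    "fam2": "((query:0.1,bact1:0.2):0.05,(arch1:0.3,bact2:0.1):0.2);",
    "fam3": "(query:0.1,(arch1:0.2,arch2:0.1):0.3,bact1:0.4);",
}
group = {"arch1", "arch2"}
hits = [name for name, nwk in trees.items()
        if is_interesting(nwk, "query", group)]
print(hits)
```

Only `fam1` passes in this toy set: in `fam2` the query's sister is `bact1`, and in `fam3` the query sits at a three-way split whose other branches include `bact1`. Because most inferred trees are unrooted, a production filter would have to consider every possible rooting or work on bipartitions instead.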