Hal: a program for semi-automated
phylogenomic analyses
B. Robbertse, J. Reeves, C.
Schoch and J.W. Spatafora
Dept. Botany and Plant
Pathology
Oregon State University
Hal version 1.1. The current version of Hal is a set of scripts written in PERL v5.8.4 that
takes all versus all BLASTP text output (from any number of proteomes) as input
to identify orthologs, extract orthologous sequences and concatenate them into
a super alignment. These steps are
achieved by linking existing programs such as Tribe-MCL and ClustalW with PERL
scripts. A flow diagram of the
method is presented in Figure 1 and a more detailed description along with a
phylogenomic analysis of ascomycete fungi will be published soon. Additional modified scripts were also
developed to parse sets of TBLASTN and BLASTX searches in order to extract
orthologous sequences from unannotated genomes and EST libraries. Figure 2 is
an example of the phylogenetic relationships between 17 taxa when using a super alignment produced by Hal from annotated genomes. Several fungal genomes are being sequenced and it is our
goal to expand the number of taxa in the analysis to incorporate all sequenced
genomes in the kingdom Fungi.
Identification of
orthologs. Identification of orthologs begins with an all versus
all BLASTP search (Altschul et al., 1990)
performed on annotated genomes. BLAST results are then analyzed with the
program Tribe-MCL (Enright et al., 2002; van Dongen, 2000),
which creates a similarity matrix from BLAST e-values and then uses Markov
Clustering (MCL) to group proteins into related clusters. The main parameter
that influences the size of a cluster in Tribe-MCL is the inflation value,
which can be adjusted from 1.1 (producing fewer clusters but with more members)
to 5.0 (producing more clusters with fewer members). Using a range of settings
for the inflation value (1.1 to 5.0) and filtering for clusters that consist
of one protein from each proteome in the analysis, we are able to find putative
orthologs. Only clusters of orthologs with best hits to a protein within the
same cluster are considered for phylogenetic analysis. The protein sequences
used for the phylogenetic analysis are those reported for each protein when
compared to its best hit in the BLASTP results.
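
The cluster filter described above can be illustrated with a minimal Python sketch. It is not the Hal code itself (Hal is written in PERL). The "taxon|protein" identifier convention, the function names and the toy data are all assumptions made for the example, and the best-hit rule is read here as "every member's best hit is also a member of the cluster".

```python
from collections import Counter

def taxon_of(protein_id):
    # Assumed ID convention for this sketch: "taxon|protein".
    return protein_id.split("|", 1)[0]

def single_copy_clusters(clusters, taxa):
    """Keep clusters that hold exactly one protein from every taxon."""
    taxa = set(taxa)
    kept = []
    for members in clusters:
        counts = Counter(taxon_of(p) for p in members)
        if set(counts) == taxa and all(n == 1 for n in counts.values()):
            kept.append(members)
    return kept

def best_hits_stay_inside(members, best_hit):
    """True if the best BLASTP hit of every member is also in the cluster."""
    inside = set(members)
    return all(best_hit.get(p) in inside for p in members)

# Toy clusters, as they might be read from a clustering result (hypothetical).
clusters = [
    ["A|p1", "B|p7", "C|p3"],          # one protein per taxon
    ["A|p2", "A|p9", "B|p4", "C|p5"],  # two proteins from taxon A
    ["A|p5", "B|p6"],                  # taxon C missing
    ["A|p8", "B|p2", "C|p9"],          # one per taxon, but a best hit leaves the cluster
]
# Best hit of each protein (self-hits excluded), hypothetical values.
best_hit = {
    "A|p1": "B|p7", "B|p7": "A|p1", "C|p3": "A|p1",
    "A|p8": "B|p2", "B|p2": "A|p8", "C|p9": "A|p2",
}

candidates = single_copy_clusters(clusters, ["A", "B", "C"])
orthologs = [c for c in candidates if best_hits_stay_inside(c, best_hit)]
print(orthologs)
```

In this toy data only the first cluster passes both filters. Running the same filter on the clusters obtained at each inflation value and pooling the results would correspond to the range of settings mentioned above.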
Alignments and model
assignment. Using Perl scripts
(http://spatafora.science.oregonstate.edu/node/view/145), identified
sequences are extracted from the BLASTP report, concatenated and parsed into FASTA
format, and each cluster of orthologous protein sequences is aligned with
ClustalW (Thompson et al., 1997).
All aligned orthologous clusters are concatenated into a single alignment and
converted into Phylip format. ProtTest v1.2.6 (Abascal et al., 2005)
is used to estimate the best model of evolution for each set of orthologs and
the concatenated super alignment. Gap-containing columns are excluded from
phylogenetic analyses.
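
As a rough illustration of the concatenation and gap-removal steps, the following Python sketch joins per-cluster alignments taxon by taxon, drops every column that contains a gap, and prints the result in a simple sequential Phylip layout. The function names, taxon labels and sequences are hypothetical, and the sketch assumes each alignment is already available as a mapping from taxon name to aligned sequence.

```python
def concatenate(alignments, taxa):
    """Join per-cluster alignments end to end, taxon by taxon."""
    return {t: "".join(aln[t] for aln in alignments) for t in taxa}

def drop_gap_columns(alignment, gap="-"):
    """Remove every column in which at least one taxon has a gap."""
    taxa = list(alignment)
    columns = zip(*(alignment[t] for t in taxa))
    kept = [col for col in columns if gap not in col]
    return {t: "".join(col[i] for col in kept) for i, t in enumerate(taxa)}

def to_phylip(alignment):
    """Sequential Phylip text: a count line, then names padded to 10 characters."""
    n_sites = len(next(iter(alignment.values())))
    lines = [f"{len(alignment)} {n_sites}"]
    for name, seq in alignment.items():
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines)

# Two toy ortholog alignments (hypothetical data).
taxa = ["taxonA", "taxonB", "taxonC"]
alignments = [
    {"taxonA": "MK-L", "taxonB": "MKAL", "taxonC": "MRAL"},
    {"taxonA": "GTW", "taxonB": "G-W", "taxonC": "GSW"},
]

supermatrix = drop_gap_columns(concatenate(alignments, taxa))
print(to_phylip(supermatrix))
```

Of the seven concatenated columns, the two that contain a gap are dropped, leaving five sites per taxon.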
Phylogenetic analyses. Tree construction can be performed using standard
phylogenetic algorithms, including maximum likelihood (ML) with Phyml (Guindon and Gascuel, 2003),
maximum parsimony (MP) in
PAUP* v10b (Swofford, 2002), and neighbor
joining (NJ) distance analyses using PHYLIP v3.65 (Felsenstein, 1981). Models used in
ML analyses are chosen as either the best fit model for the super alignment or
as the most frequent model from individual orthologs. Support for the nodes of
the phylogenetic tree is estimated using 100 bootstrap replicates.
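
The two ideas in this paragraph, picking the most frequent model across orthologs and resampling alignment columns for bootstrap support, can be sketched in Python as follows. This is only an illustration: the model names and sequences are made-up values, the function names are hypothetical, and in practice the phylogenetic programs named above perform the resampling themselves.

```python
import random
from collections import Counter

def most_frequent_model(best_models):
    """Return the model chosen most often across the individual orthologs."""
    return Counter(best_models).most_common(1)[0][0]

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (nonparametric bootstrap)."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {t: "".join(seq[i] for i in picks) for t, seq in alignment.items()}

# Best-fit model reported for each ortholog set (hypothetical values).
per_ortholog_models = ["WAG+G", "WAG+G", "JTT+G", "WAG+I+G", "WAG+G"]
model = most_frequent_model(per_ortholog_models)
print(model)

# Toy super alignment (hypothetical data).
alignment = {"taxonA": "MKLGW", "taxonB": "MKLGW", "taxonC": "MRLGW"}
rng = random.Random(1)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
print(len(replicates), "replicates of", len(replicates[0]["taxonA"]), "sites")
```

Each replicate has the same number of sites as the original alignment. A tree would be built from every replicate, and the support for a node is the percentage of replicate trees that contain it.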
Figure 1. A flowchart showing the steps taken to
produce a super alignment by linking existing programs (grey ovals) with Perl
scripts (yellow boxes), which pass input/output files (white boxes) to the next
step. The parsing of BLASTP output
into a super alignment is fully automated and performed by the program Hal, which takes approximately 4 hours to complete the
task.

Figure 2. A tree showing phylogenetic
relationships using a super alignment from 781 orthologous amino acid sequences
of 17 fungal taxa and analyzed by maximum
parsimony. Numbers at the
nodes are bootstrap partitions from 100 nonparametric bootstrap replications.

References.
Abascal, F., et al., 2005. ProtTest: selection of best-fit models of protein evolution. Bioinformatics. 21, 2104-2105.
Altschul, S. F., et al., 1990. Basic local alignment search tool. J Mol Biol. 215, 403-10.
Enright, A. J., et al., 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575-84.
Felsenstein, J., 1981. PHYLIP: Phylogeny inference package (version 3.2). Cladistics. 5, 164-166.
Guindon, S., Gascuel, O., 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 52, 696-704.
Swofford, D. L., 2002. PAUP*. Phylogenetic analysis using parsimony (*and other methods). Sinauer Associates, Sunderland, Massachusetts, U.S.A.
Thompson, J. D., et al., 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25, 4876-82.
van Dongen, S., 2000. Graph Clustering by Flow Simulation. University of Utrecht, Utrecht.